1 Overview

Data is generally organized in a database that can be accessed, managed, and updated by users. Databases are often defined as relational or non-relational.

1.1 Relational Databases

A relational database model organizes data into a set of tables with rows (records) and columns (attributes) with a primary key (sometimes multiple columns) to uniquely identify each row. Primary keys that are found in other tables are called foreign keys and are often used to join the tables together.

SQL (Structured Query Language) is a language used to query and maintain relational databases. Some common relational databases include Microsoft SQL Server, Oracle Database, and MySQL.

1.2 Non-Relational Databases

Non-relational databases, sometimes called NoSQL (non SQL or not only SQL) databases, do not require the structure of rows and columns, but instead is optimized for the type of data it holds. They may contain JSON documents with key/value pairs, graph data with edges and vertices, time series data, etc.

Some common non-relational databases include Apache Cassandra, MongoDB, and Oracle NoSQL.

2 Connecting to Databases in R

We are able to connect directly to a database in R. This is one example of connecting to a database (I used this to connect to an Oracle database).

After setting up the connection to a database (in Windows using ODBC Data Source Administrator), we can connect to the database and query from one (or more) of the tables. After storing the queried data in a dataframe, we can disconnect from the database.

library(DBI)
library(odbc)
con = dbConnect(odbc::odbc(), "<Data Source Name>", UID = "username", PWD = "password")
result.df = dbGetQuery(con, "select * from <Table Name> ")
dbDisconnect(con)

3 Common SQL Commands

We explore some common SQL commands: Link to more commands

3.1 Aggregate Functions

  • SELECT <>(column_1) from table_name WHERE column_1 = ___
    • <COUNT, SUM, MAX, MIN, AVG>
  • Often GROUP BY a column when using aggregate functions
    • Can GROUP BY column name or by 1,2,… relating to column names in SELECT statement
  • WHERE filters rows, HAVING filters groups (often used after GROUP BY)
  • Order of commands:
    • SELECT - DISTINCT - AS - FROM - WHERE - GROUP BY - HAVING - ORDER BY - LIMIT

3.2 Joining Tables

  • SELECT * FROM table_1 JOIN table_2 ON table_1.column = table_2.column
    • If column name same for both tables: SELECT * FROM table_1 JOIN table_2 USING(column)
  • Join types:
    • INNER JOIN: only includes rows that match on condition
    • LEFT JOIN: keeps all rows from first table even if there’s not a match with the second
    • RIGHT JOIN: keeps all rows from second table
    • FULL OUTER JOIN or FULL JOIN: keeps all rows from both tables, matching when possible
  • SELF JOIN: joins table with itself
    • SELECT t1.country AS country1, t2.country AS country2, t1.continent
      FROM table AS t1
      INNER JOIN table AS t2
      ON t1.continent = t2.continent AND t1.country <> t2.country
    • This results in all countries matched with each other within a continent
    • Necessary to alias table
  • CROSS JOIN: each row of the first tables joins with all rows of another
    • Creates combination with lots of repeated values

3.3 Set Operations

Set operations combine columns

  • UNION
    • SELECT column_name(s) FROM table1
      UNION
      SELECT column_name(s) FROM table2
    • Must have same number of columns in SELECT, same data type, same order
    • UNION takes distinct values
  • UNION ALL: same as UNION but allows duplicate values
  • INTERSECT: Only returns duplicate values (values that appear in both tables)
  • EXCEPT: Returns rows in first table that aren’t in second table

3.4 Subqueries

These examples were taken from the SQL courses on Codecademy

3.4.1 Non-Correlated Subqueries

Subquery can be run independently of outer query. Used to complete multi-step transformation.

Example using airports and flights tables:

  • SELECT * FROM flights
    WHERE origin IN (
    SELECT code FROM airport
    WHERE elevation > 2000)
    • Returns flights with origins from airports with elevation greater than 2000
    • Note, the query inside the parenthesis could be run by itself

Subquery with a join: General structure

  • WITH previous_results AS (
    SELECT …
    )
    SELECT *
    FROM previous_results
    JOIN table1 ON previous_results.id = table1.id
    • We could save previous_results as a table instead of executing everything together

3.4.2 Correlated Subqueries

Subquery CANNOT be run independently of outer query

  • Row is processed in outer query
  • For particular row in outer query, subquery is executed

Example using flights tables:

  • SELECT id FROM flights AS f
    WHERE distance > ( SELECT AVG(distance)
    FROM flights WHERE
    carrier = f.carrier)
    • Returns id of flights where the distance for that flight is greater than the average distance of all flights for that carrier