SQL with R

display this

SQLite

SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.

If you want to learn more about SQLite visit the website: https://www.sqlite.org/

Installing SQLite on your mac

  1. Press Command+Space and type Terminal and press enter/return key

  2. Run in Terminal app

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
< /dev/null 2> /dev/null
  1. Run
brew install sqlite

Let us checking now the installation installation

$ sqlite3
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> 

First steps in SQLite

  1. Navigate to the directory with your data,
$ cd /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018
  1. Start SQLite by creating a new database, we call it flightsBD (set of tables)
$ sqlite3 flightsDB
  1. Type .databases to show all currently available databases:
sqlite> .databases
seq  name             file                                                      
---  ---------------  ---------------------------------------------------------------
0    main             /Users/dhafermalouche/Documents/Teaching/BigDataWithR/fligthsDB

You can use R to check the existence of the file in the directory /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018

> file.exists("flightsDB")
[1] TRUE

Importing an csv file

We will show how can we import a csv file like (flight.csv ~156 MB)

  1. Type of the separator in the csv file
sqlite> .separator ","
  1. Importing the csv file as a new table called flightsTB
sqlite> .import flight.csv flightsTB

Let’s notice that it takes few seconds to be loaded

  1. Check the available tables by running the .tables
sqlite> .tables
flightsTB
  1. Let’s now checking the data in SQLite

Let’s inspect the structure of the table using a PRAGMA statement:

sqlite> PRAGMA table_info("flightsTB");
0,YEAR,TEXT,0,,0
1,MONTH,TEXT,0,,0
2,DAY_OF_MONTH,TEXT,0,,0
3,DAY_OF_WEEK,TEXT,0,,0
4,FL_DATE,TEXT,0,,0
5,UNIQUE_CARRIER,TEXT,0,,0
6,AIRLINE_ID,TEXT,0,,0
7,TAIL_NUM,TEXT,0,,0
8,FL_NUM,TEXT,0,,0
9,ORIGIN_AIRPORT_ID,TEXT,0,,0
10,ORIGIN,TEXT,0,,0
11,ORIGIN_CITY_NAME,TEXT,0,
... 

We print the scheme of the table using .scheme (names of the variables, their classes)

sqlite> .schema flightsTB
CREATE TABLE flightsTB(
  "YEAR" TEXT,
  "MONTH" TEXT,
  "DAY_OF_MONTH" TEXT,
  "DAY_OF_WEEK" TEXT,
  "FL_DATE" TEXT,
  "UNIQUE_CARRIER" TEXT,
  "AIRLINE_ID" TEXT,
  "TAIL_NUM" TEXT,
  "FL_NUM" TEXT,
...   output truncated

Connecting SQLite from RStudio

  1. Make sure that the file flightsDB newly created (in RStudio)
> file.exists("flightsDB")
[1] TRUE
  1. Installing the R packages for SQLite
> install.packages("devtools")
> devtools::install_github("RcppCore/Rcpp")
> devtools::install_github("rstats-db/DBI")
> install.packages("RSQLite")
  1. Load the packages
> library(DBI)
> library(RSQLite)
  1. We create a connection with the flightsDB SQLite database:
> con<-dbConnect(RSQLite::SQLite(), "flightsDB")
> con
<SQLiteConnection>
  Path: /Users/dhafermalouche/Documents/Teaching/BigDataWithR/flightsDB
  Extensions: TRUE

Let’s see now how can we manage the SQL databases from RStudio

  • dbListTables() provides information on the available tables in the connected database and columns within a specified table,
> dbListTables(con)
[1] "flightsTB" 
  • You can also remove tables from your database.
> dbRemoveTable(con, "query_2_TB")
> dbListTables(con)
[1] "flightsTB"
  • dbListFields() shows the names of the variables in the table
> dbListFields(con,"flightsTB")
 [1] "YEAR"              "MONTH"             "DAY_OF_MONTH"      "DAY_OF_WEEK"      
 [5] "FL_DATE"           "UNIQUE_CARRIER"    "AIRLINE_ID"        "TAIL_NUM"         
 [9] "FL_NUM"            "ORIGIN_AIRPORT_ID" "ORIGIN"            "ORIGIN_CITY_NAME" 
[13] "ORIGIN_STATE_NM" 
... # truncated output

Let us creating a new table from the other table. We will extract the data on the flights in the month of September.

> query.1 <- dbSendQuery(con, "SELECT * FROM flightsTB WHERE MONTH = 9")
> dbGetStatement(query.1)
[1] "SELECT * FROM flightsTB WHERE MONTH = 9"

Extracting only the first 20 rows

> query.1.res <- fetch(query.1, n=20)
> class(query.1.res)
[1] "data.frame"
> head(query.1.res[,1:4])
  YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK
1 2015     9           22           2
2 2015     9           22           2
3 2015     9           22           2
4 2015     9           22           2
5 2015     9           22           2
6 2015     9           22           2
> dim(query.1.res)
[1] 20 29

Getting informations on the new data

  • We can obtain additional information, for example its full SQL statement, the structure of the results set, and how many rows it returned:
> info <- dbGetInfo(query.1)
> str(info)
List of 4
 $ statement    : chr "SELECT * FROM flightsTB WHERE MONTH = 9"
 $ row.count    : int 40
 $ rows.affected: int 0
 $ has.completed: logi FALSE
  • We can clear the obtained results
> dbClearResult(query.1)
  • Retrieving all records from the month 9.
> query.1 <- dbSendQuery(con,
+                        "SELECT * FROM flightsTB WHERE MONTH = 9")
> ptm=proc.time()
> query.0.res <- fetch(query.1, n=-1)
> proc.time()-ptm
   user  system elapsed 
  8.252   0.256   8.804 
> dim(query.0.res)
[1] 929892     29
> format(object.size(query.0.res),"Mb")
[1] "206.6 Mb"
> class(query.0.res)
[1] "data.frame"

Statictics from the data with SQL statements

Computing average of delays according to Departure and Arrival cities.

> ptm=proc.time()
> query.2 <- dbSendQuery(con, "SELECT ORIGIN_CITY_NAME, DEST_CITY_NAME,
+ AVG(DEP_DELAY) AS 'AVR_DEP_DELAY',AVG(ARR_DELAY) AS 'AVR_DEP_DELAY', 
+                        COUNT(DEP_DELAY) AS 'NUMB'
+                        FROM flightsTB  GROUP BY ORIGIN_CITY_NAME, DEST_CITY_NAME"
+ )
Warning message:
Closing open result set, pending rows 
> proc.time()-ptm
   user  system elapsed 
  1.308   0.544   1.865 
> 
> ptm=proc.time()
> query.2.res <- fetch(query.2, n=-1)
> proc.time()-ptm
   user  system elapsed 
  1.129   0.091   1.240 
> 
> dim(query.2.res)
[1] 3746    5

Application Let’s draw a dynamic scatter plot with highcherter package where the average departure delays is in the x-axis and the average of the arrival delays is in the y-axis. The size of the bullets are proportional to the number of flights and if you click on one bullet a pop-up will be displayed with all the information about it.

> library("highcharter", lib.loc="/Library/Frameworks/R.framework/Versions/3.5/Resources/library")
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
> x <- c("Origin","Destination", "Depart delay", "Arrival delay","Number")
> y1 <- sprintf("{point.%s:.2f}", c("AVR_DEP_DELAY",
+                                   "AVR_ARR_DELAY"))
> y2 <- sprintf("{point.%s:.0f}", c("NUMB"))
> y01<-paste0("{point.","ORIGIN_CITY_NAME","}")
> y02<-paste0("{point.","DEST_CITY_NAME","}")
> y<-c(y01,y02,y1,y2)
> tltip <- tooltip_table(x, y)
> p<-hchart(query.2.res, "scatter", 
+           hcaes(x = AVR_DEP_DELAY, y = AVR_ARR_DELAY, 
+                 size = NUMB),
+           debug = TRUE)
> p<-hchart(query.2.res, "scatter", 
+           hcaes(x = AVR_DEP_DELAY, y = AVR_ARR_DELAY, 
+                 size = NUMB),
+           debug = TRUE)
> p <- p %>%  hc_tooltip(useHTML = TRUE, headerFormat = "", pointFormat = tltip) %>% 
+   hc_size(height = 600)
> p<- p%>%  hc_chart(backgroundColor = "white") %>% 
+   hc_xAxis(title=list(text="Departure Delays"), gridLineWidth = 0) %>% 
+   hc_yAxis(title=list(text="Arrival  Delays"), gridLineWidth = 0)
> p <- p %>% hc_add_theme(hc_theme_flat())
> p

Let’s now export the data into a csv file.

> ptm=proc.time()
> dbWriteTable(conn = con,name =  "query_2_TB",
+              value =  query.2.res[,1:5],
+              overwrite = TRUE, row.names = FALSE)
> proc.time()-ptm
   user  system elapsed 
  0.113   0.007   0.126
> dbListTables(con)
[1] "flightsTB"  "query_2_TB"