SQL with R
SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.
If you want to learn more about SQLite visit the website: https://www.sqlite.org/
Installing SQLite
on your mac
Press Command+Space and type Terminal and press enter/return key
Run in Terminal app
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
< /dev/null 2> /dev/null
brew install sqlite
Let us checking now the installation installation
$ sqlite3
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>
$ cd /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018
$ sqlite3 flightsDB
.databases
to show all currently available databases:sqlite> .databases
seq name file
--- --------------- ---------------------------------------------------------------
0 main /Users/dhafermalouche/Documents/Teaching/BigDataWithR/fligthsDB
You can use R
to check the existence of the file in the directory /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018
> file.exists("flightsDB")
[1] TRUE
We will show how can we import a csv file like (flight.csv
~156 MB)
csv
filesqlite> .separator ","
csv
file as a new table called flightsTB
sqlite> .import flight.csv flightsTB
Let’s notice that it takes few seconds to be loaded
.tables
sqlite> .tables
flightsTB
Let’s inspect the structure of the table using a PRAGMA statement:
sqlite> PRAGMA table_info("flightsTB");
0,YEAR,TEXT,0,,0
1,MONTH,TEXT,0,,0
2,DAY_OF_MONTH,TEXT,0,,0
3,DAY_OF_WEEK,TEXT,0,,0
4,FL_DATE,TEXT,0,,0
5,UNIQUE_CARRIER,TEXT,0,,0
6,AIRLINE_ID,TEXT,0,,0
7,TAIL_NUM,TEXT,0,,0
8,FL_NUM,TEXT,0,,0
9,ORIGIN_AIRPORT_ID,TEXT,0,,0
10,ORIGIN,TEXT,0,,0
11,ORIGIN_CITY_NAME,TEXT,0,
...
We print the scheme of the table using .scheme (names of the variables, their classes)
sqlite> .schema flightsTB
CREATE TABLE flightsTB(
"YEAR" TEXT,
"MONTH" TEXT,
"DAY_OF_MONTH" TEXT,
"DAY_OF_WEEK" TEXT,
"FL_DATE" TEXT,
"UNIQUE_CARRIER" TEXT,
"AIRLINE_ID" TEXT,
"TAIL_NUM" TEXT,
"FL_NUM" TEXT,
... output truncated
RStudio
flightsDB
newly created (in RStudio
)> file.exists("flightsDB")
[1] TRUE
R
packages for SQLite> install.packages("devtools")
> devtools::install_github("RcppCore/Rcpp")
> devtools::install_github("rstats-db/DBI")
> install.packages("RSQLite")
> library(DBI)
> library(RSQLite)
flightsDB
SQLite database:> con<-dbConnect(RSQLite::SQLite(), "flightsDB")
> con
<SQLiteConnection>
Path: /Users/dhafermalouche/Documents/Teaching/BigDataWithR/flightsDB
Extensions: TRUE
Let’s see now how can we manage the SQL databases from RStudio
dbListTables()
provides information on the available tables in the connected database and columns within a specified table,> dbListTables(con)
[1] "flightsTB"
> dbRemoveTable(con, "query_2_TB")
> dbListTables(con)
[1] "flightsTB"
dbListFields()
shows the names of the variables in the table> dbListFields(con,"flightsTB")
[1] "YEAR" "MONTH" "DAY_OF_MONTH" "DAY_OF_WEEK"
[5] "FL_DATE" "UNIQUE_CARRIER" "AIRLINE_ID" "TAIL_NUM"
[9] "FL_NUM" "ORIGIN_AIRPORT_ID" "ORIGIN" "ORIGIN_CITY_NAME"
[13] "ORIGIN_STATE_NM"
... # truncated output
Let us creating a new table from the other table. We will extract the data on the flights in the month of September.
> query.1 <- dbSendQuery(con, "SELECT * FROM flightsTB WHERE MONTH = 9")
> dbGetStatement(query.1)
[1] "SELECT * FROM flightsTB WHERE MONTH = 9"
Extracting only the first 20 rows
> query.1.res <- fetch(query.1, n=20)
> class(query.1.res)
[1] "data.frame"
> head(query.1.res[,1:4])
YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK
1 2015 9 22 2
2 2015 9 22 2
3 2015 9 22 2
4 2015 9 22 2
5 2015 9 22 2
6 2015 9 22 2
> dim(query.1.res)
[1] 20 29
> info <- dbGetInfo(query.1)
> str(info)
List of 4
$ statement : chr "SELECT * FROM flightsTB WHERE MONTH = 9"
$ row.count : int 40
$ rows.affected: int 0
$ has.completed: logi FALSE
> dbClearResult(query.1)
> query.1 <- dbSendQuery(con,
+ "SELECT * FROM flightsTB WHERE MONTH = 9")
> ptm=proc.time()
> query.0.res <- fetch(query.1, n=-1)
> proc.time()-ptm
user system elapsed
8.252 0.256 8.804
> dim(query.0.res)
[1] 929892 29
> format(object.size(query.0.res),"Mb")
[1] "206.6 Mb"
> class(query.0.res)
[1] "data.frame"
Computing average of delays according to Departure and Arrival cities.
> ptm=proc.time()
> query.2 <- dbSendQuery(con, "SELECT ORIGIN_CITY_NAME, DEST_CITY_NAME,
+ AVG(DEP_DELAY) AS 'AVR_DEP_DELAY',AVG(ARR_DELAY) AS 'AVR_DEP_DELAY',
+ COUNT(DEP_DELAY) AS 'NUMB'
+ FROM flightsTB GROUP BY ORIGIN_CITY_NAME, DEST_CITY_NAME"
+ )
Warning message:
Closing open result set, pending rows
> proc.time()-ptm
user system elapsed
1.308 0.544 1.865
>
> ptm=proc.time()
> query.2.res <- fetch(query.2, n=-1)
> proc.time()-ptm
user system elapsed
1.129 0.091 1.240
>
> dim(query.2.res)
[1] 3746 5
Application Let’s draw a dynamic scatter plot with highcherter
package where the average departure delays is in the x-axis and the average of the arrival delays is in the y-axis. The size of the bullets are proportional to the number of flights and if you click on one bullet a pop-up will be displayed with all the information about it.
> library("highcharter", lib.loc="/Library/Frameworks/R.framework/Versions/3.5/Resources/library")
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
> x <- c("Origin","Destination", "Depart delay", "Arrival delay","Number")
> y1 <- sprintf("{point.%s:.2f}", c("AVR_DEP_DELAY",
+ "AVR_ARR_DELAY"))
> y2 <- sprintf("{point.%s:.0f}", c("NUMB"))
> y01<-paste0("{point.","ORIGIN_CITY_NAME","}")
> y02<-paste0("{point.","DEST_CITY_NAME","}")
> y<-c(y01,y02,y1,y2)
> tltip <- tooltip_table(x, y)
> p<-hchart(query.2.res, "scatter",
+ hcaes(x = AVR_DEP_DELAY, y = AVR_ARR_DELAY,
+ size = NUMB),
+ debug = TRUE)
> p<-hchart(query.2.res, "scatter",
+ hcaes(x = AVR_DEP_DELAY, y = AVR_ARR_DELAY,
+ size = NUMB),
+ debug = TRUE)
> p <- p %>% hc_tooltip(useHTML = TRUE, headerFormat = "", pointFormat = tltip) %>%
+ hc_size(height = 600)
> p<- p%>% hc_chart(backgroundColor = "white") %>%
+ hc_xAxis(title=list(text="Departure Delays"), gridLineWidth = 0) %>%
+ hc_yAxis(title=list(text="Arrival Delays"), gridLineWidth = 0)
> p <- p %>% hc_add_theme(hc_theme_flat())
> p
Let’s now export the data into a csv file.
> ptm=proc.time()
> dbWriteTable(conn = con,name = "query_2_TB",
+ value = query.2.res[,1:5],
+ overwrite = TRUE, row.names = FALSE)
> proc.time()-ptm
user system elapsed
0.113 0.007 0.126
> dbListTables(con)
[1] "flightsTB" "query_2_TB"