Large and Massive Data with R

Introduction

What are large and massive data?

  • R is not well-suited for working with data structures larger than about 10-20% of a computer’s RAM.

  • Data exceeding 50% of available RAM are essentially unusable.

  • We consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%, as illustrated in the small sketch below.
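For example, here is a purely illustrative sketch of these thresholds, assuming a hypothetical machine with 16 GB of RAM:

# Thresholds on a hypothetical 16 GB machine
ram_gb <- 16
0.20 * ram_gb   # ~3.2 GB: data above this size is "large" on this machine
0.50 * ram_gb   # ~8 GB: data above this size is "massive"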

In this first lesson we will show how you can use powerful mathematical and data-modeling R packages on large datasets, without the need for distributed computing.

In this lesson we will see how:

  • you can use R packages such as ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance;

  • you can apply statistical methods to large R objects through the biglm and ffbase packages;

  • you can use the data.table package for faster data manipulation.

The ff, ffbase and ffbase2 packages

The title of the ff package is “Memory-efficient storage of large data on disk and fast access functions”. Indeed, this package works by chunking the dataset and storing it on your hard drive, while the ff structure in your R session keeps a mapping to the partitioned dataset.

  • The chunks of raw data are simply binary flat files in native encoding,

  • The ff objects keep the metadata, which describe and link to the created binary files.

  • Installation

>  install.packages("ff")
>  install.packages("ffbase")
>  install.packages("ffbase2")
  • Example: Let us consider a dataset extracted from the Bureau of Transportation Statistics website on flights in the US during the month of September. The size of the csv file is ~5 Mb and it contains about 66,000 records. It is a small file, but we will use it as a first example to show how the ff package works.

Let’s show how we can import it into R.

  1. First I load the ff and ffbase packages and create the directory where the data will be stored:
> system("mkdir air_ffdf")
mkdir: air_ffdf: File exists
  2. I will indicate the path to the newly created directory:
> dir_air=paste0(getwd(),"/air_ffdf")
> dir_air
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf"
> options(fftempdir = dir_air)
  3. Now I can import the airline dataset:
> ptm<-proc.time()
> airline.ff<- read.csv.ffdf(file = "flight_ff.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..65499 (65499)  csv-read=0.351sec ffdf-write=0.067sec
 csv-read=0.351sec  ffdf-write=0.067sec  TOTAL=0.418sec
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  0.353   0.035   0.424

Recall that the size of the file is ~5 Mb. However, we can easily check that the new R object airline.ff has a very small size in memory:

> format(object.size(airline.ff),"Mb")
[1] "0.1 Mb"

We can also notice in our working directory a new directory called air_ffdf, and we can see that it contains files like these:

> list.files(dir_air)
 [1] "ffdf7051495570f.ff" "ffdf705151999fb.ff" "ffdf70516c0c4f7.ff" "ffdf7051cb88764.ff"
 [5] "ffdf70531918fe9.ff" "ffdf7053616d57b.ff" "ffdf70536e939fe.ff" "ffdf70543cc388f.ff"
... Truncated output 
> length(list.files(dir_air))
[1] 15

Every binary file corresponds to one variable in the data

> basename(filename(airline.ff$YEAR))
[1] "ffdf70536e939fe.ff"

Example: Let us now try a larger dataset with a size of ~150 Mb. This file contains ~950,000 records about flights in the US in September 2005.

> system("mkdir air_ffdf_l")
> dir_air_l=paste0(getwd(),"/air_ffdf_l")
> dir_air_l
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf_l"
> options(fftempdir = dir_air_l)
> file.size("flight_large.csv")/1024/1024
[1] 149.0488
> ptm<-proc.time()
> airline.ff_l<- read.csv.ffdf(file = "flight_large.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..951111 (951111)  csv-read=8.92sec ffdf-write=0.694sec
 csv-read=8.92sec  ffdf-write=0.694sec  TOTAL=9.614sec
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  8.887   0.553   9.616 
> dim(airline.ff_l)
[1] 951111     28

Let us compare with read_csv from the readr package:

> library(readr)
> ptm<-proc.time()
> airline_l<-read_csv("flight_large.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  FL_DATE = col_date(format = ""),
  UNIQUE_CARRIER = col_character(),
  TAIL_NUM = col_character(),
  ORIGIN = col_character(),
  ORIGIN_CITY_NAME = col_character(),
  ORIGIN_STATE_NM = col_character(),
  DEST = col_character(),
  DEST_CITY_NAME = col_character(),
  DEST_STATE_NM = col_character(),
  CANCELLATION_CODE = col_character()
)
See spec(...) for full column specifications.
|===================================================================================| 100%  149 MB
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  2.924   0.306   3.179 

But ff objects occupy far less memory:

> format(object.size(airline_l),"Mb")
[1] "138.2 Mb"
> format(object.size(airline.ff_l),"Mb")
[1] "0.5 Mb"

Example 3: We will now try an even larger dataset. The file airline_20MM.csv is also extracted from the same website, but it contains 20 million records and its size is 1.5 Gb. It is then too risky to use the classical approach based on the read_csv command, because your computer may crash. We show below how much time and memory it takes to import this data into R with read.csv.ffdf.

> system("mkdir air_ffdf_20mm")
> dir_air_20mm=paste0(getwd(),"/air_ffdf_20mm")
> dir_air_20mm
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf_20mm"
> options(fftempdir = dir_air_20mm)
> ptm<-proc.time()
> airline.ff_20mm<- read.csv.ffdf(file = "airline_20MM.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..20000000 (20000000)  csv-read=175.176sec ffdf-write=8.6sec
 csv-read=175.176sec  ffdf-write=8.6sec  TOTAL=183.776sec
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
163.075  13.505 183.835

Data manipulations with ff objects

Computing cross tabs

> ptm<-proc.time()
> t1.ff<-table.ff(airline.ff_l$ORIGIN_STATE_NM,airline.ff_l$DEST_STATE_NM)
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  0.439   0.034   0.478 
> 
> format(object.size(t1.ff),"Kb")
[1] "18.6 Kb"

and now let us consider the classic way:

> ptm<-proc.time()
> t1<-table(airline_l$ORIGIN_STATE_NM,airline_l$DEST_STATE_NM)
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  0.135   0.024   0.158 
> 
> format(object.size(t1),"Kb")
[1] "18.6 Kb"

We will then show, using the ffbase and ffbase2 packages, how to

  • discretize a numeric variable
  • use the function ffdfdply
  • extract a subset of the data
  • save and load ff objects and export them to csv files

Discretize a numeric variable

Let’s first show how we can describe a numeric variable:

> library(Hmisc)
> ptm<-proc.time()
> dd=describe(as.data.frame.ffdf(airline.ff_20mm$Distance))
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 19.006   2.602  22.462 
> dd
as.data.frame.ffdf(airline.ff_20mm$Distance) 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50 
   2e+07        0     1531        1      726    582.6      144      199      321      576 
     .75      .90      .95 
     952     1522     1964 

lowest :   11   16   17   18   21, highest: 4243 4431 4433 4502 4962
> max(airline.ff_20mm$Distance)
[1] 4962

Recall that Gmd stands for Gini’s mean difference. It is computed for numeric variables and is a robust measure of dispersion, defined as the mean absolute difference between all pairs of observations.
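As a quick illustration (a minimal sketch on a toy vector, not part of the airline data), Gmd can be computed directly from this definition and compared with the GiniMd() function of Hmisc:

# Gini's mean difference on a toy vector: mean of |x_i - x_j| over all pairs i != j
x <- c(2, 4, 7, 11)
n <- length(x)
sum(abs(outer(x, x, "-"))) / (n * (n - 1))   # manual computation
# library(Hmisc); GiniMd(x)                  # should return the same value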

> ptm<-proc.time()
> dist_cat=cut.ff(airline.ff_20mm$Distance,breaks = c(0,150,300,450,600,750,900,1050,1200,2000,5000),right=F)
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  2.418   0.586   3.035 

Let’s then compute the frequency of each category

> ptm<-proc.time()
> zz=table.ff(dist_cat)
> zz=as.data.frame(zz)
> zz
                 Var1    Freq
1             [0,150) 1045424
2           [150,300) 3429694
3           [300,450) 3582526
4           [450,600) 2512883
5           [600,750) 2174993
6           [750,900) 1663260
7      [900,1.05e+03) 1602199
8  [1.05e+03,1.2e+03)  953791
9     [1.2e+03,2e+03) 2103486
10      [2e+03,5e+03)  931744
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 10.105   0.506  10.640 

I will change the names of the last 4 categories and plot a bar chart using the googleVis package.

> zz$Var1=as.character(zz$Var1)
>  zz$Var1[7:10]=c("[900,1050)","[1050,1200)","[1200,2000)","More than 2000")
> zz
             Var1    Freq
1         [0,150) 1045424
2       [150,300) 3429694
3       [300,450) 3582526
4       [450,600) 2512883
5       [600,750) 2174993
6       [750,900) 1663260
7      [900,1050) 1602199
8     [1050,1200)  953791
9     [1200,2000) 2103486
10 More than 2000  931744
> colnames(zz)[2]="# flights"
> library(googleVis)
> Column <- gvisColumnChart(zz,options=list(legend="none",width=1000, height=600,
+                                           title="Number of flights by distance category (miles)"))
> plot(Column)

How to use the ffdfdply function?

We will show with an example how we can use the ffdfdply function, which allows us to carry out group-wise operations on the data. In the example below we compute, from the newly imported airline_20MM.csv data, the average departure delay by day of the month.

We will also use the summaryBy function from the doBy package to calculate group-wise summary statistics, much like the SUMMARY procedure in SAS.
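To see what summaryBy does on its own, here is a minimal sketch on the built-in mtcars data (an illustration only, unrelated to the airline data):

# Mean mpg for each number of cylinders; the result has a column named mpg.mean
library(doBy)
summaryBy(mpg ~ cyl, data = mtcars, FUN = mean, na.rm = TRUE)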

> library(doBy)
> ptm<-proc.time()
> DepDelayByDayofMonth <- ffdfdply(airline.ff_20mm, 
+                                split = airline.ff_20mm$DayofMonth,
+                                FUN=function(x) {
+                                  summaryBy(DepDelay~DayofMonth, 
+                                            data=x, FUN=mean, na.rm=TRUE)})
2018-10-29 09:55:19, calculating split sizes
2018-10-29 09:55:25, building up split locations
2018-10-29 09:55:38, working on split 1/31, extracting data in RAM of 1 split elements, totalling, 0.06439 GB, while max specified data specified using BATCHBYTES is 0.01562 GB
2018-10-29 09:55:42, ... applying FUN to selected data
 ...Truncated Output 
 
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 92.310  22.068 120.041 
 
> ptm<-proc.time()
> ArrDelayByDayofMonth <- ffdfdply(airline.ff_20mm, 
+                                  split = airline.ff_20mm$DayofMonth,
+                                  FUN=function(x) {
+                                    summaryBy(ArrDelay~DayofMonth, 
+                                              data=x, FUN=mean, na.rm=TRUE)})
2018-10-29 10:13:56, calculating split sizes
2018-10-29 10:14:02, building up split locations
2018-10-29 10:14:14, working on split 1/31, extracting data in RAM of 1 split elements, totalling, 0.06439 GB, while max specified data specified using BATCHBYTES is 0.01562 GB
 ...Truncated Output 

> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 96.844  20.907 120.721 

Let us now construct a data frame with three columns: the day of the month, the average departure delay and the average arrival delay.

> d1=as.data.frame(DepDelayByDayofMonth)
> d1=d1[order(d1$DayofMonth),]
> d2=as.data.frame(ArrDelayByDayofMonth)
> d2=d2[order(d2$DayofMonth),]
> dd=cbind.data.frame(d1$DayofMonth,d1$DepDelay.mean,d2$ArrDelay.mean)
> colnames(dd)=c("Day of the Month","Departure Delay","Arrival Delay")

And here’s the graph with googleVis

> gvisline1 <- gvisAreaChart(dd,"Day of the Month", c("Departure Delay","Arrival Delay"),
+                            options=list(width=1200, height=600,
+                              series="[{targetAxisIndex: 0},
+                              {targetAxisIndex:1}]"
+                            ))
> plot(gvisline1)

Extracting a subset of the data

We will extract from the ff object airline.ff_20mm the subset of flights that are delayed both at departure and at arrival. We will also keep some other variables such as Distance, FlightNum and Year.

> unique(airline.ff_20mm$IsDepDelayed)
ff (open) integer length=2 (2)
[1] [2] 
  0   1 
> unique(airline.ff_20mm$IsArrDelayed)
ff (open) integer length=2 (2)
[1] [2] 
  0   1 
> table.ff(airline.ff_20mm$IsArrDelayed,airline.ff_20mm$IsDepDelayed)
   
          0       1
  0 8788725 1827306
  1 2733591 6650378
> ptm<-proc.time()
> subs1.ff <- subset.ffdf(airline.ff_20mm, IsDepDelayed == 1 & IsArrDelayed==1, 
+                         select = c(Distance, ArrDelay, 
+                                    DepDelay,
+                                    FlightNum,
+                                    Year,
+                                    Month,
+                                    DayofMonth))
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  2.139   1.030   3.226 
> dim(subs1.ff)
[1] 6650378       7

Exercise: Write R code that gives the average arrival and departure delays for every flight number and year, and draw a bubble chart using googleVis. A possible sketch follows.
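Here is one possible sketch (not a unique solution) that mirrors the ffdfdply/summaryBy pattern used earlier. It assumes the subs1.ff object created above; the column names DepDelay.mean and ArrDelay.mean come from summaryBy's naming convention:

# Average delays by flight number and year, then a googleVis bubble chart
library(doBy)
library(googleVis)
delays_by_flight <- ffdfdply(subs1.ff,
                             split = subs1.ff$FlightNum,
                             FUN = function(x) {
                               summaryBy(ArrDelay + DepDelay ~ FlightNum + Year,
                                         data = x, FUN = mean, na.rm = TRUE)})
d <- as.data.frame(delays_by_flight)
d$FlightNum <- as.character(d$FlightNum)   # use a character id for the bubble labels
bubble <- gvisBubbleChart(d, idvar = "FlightNum",
                          xvar = "DepDelay.mean", yvar = "ArrDelay.mean",
                          colorvar = "Year",
                          options = list(width = 1000, height = 600))
plot(bubble)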

Saving and loading ff objects

Now we may need to save the newly created data and load it again later. Exactly as before, we need to create a new directory to store the new binary files.

> system("mkdir subs_ff")
mkdir: subs_ff: File exists
> dir_subs1=paste0(getwd(),"/subs1_ff")
> dir_subs1
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff"
> save.ffdf(subs1.ff,dir = dir_subs1)
> list.files(dir_subs1)
[1] "subs1.ff$ArrDelay.ff"   "subs1.ff$DayofMonth.ff" "subs1.ff$DepDelay.ff"  
[4] "subs1.ff$Distance.ff"   "subs1.ff$FlightNum.ff"  "subs1.ff$Month.ff"     
[7] "subs1.ff$Year.ff"

Let us now remove the newly created data from our R environment and load it again into R

> rm(subs1.ff)
> load.ffdf(dir_subs1)
> ls()
 ... Truncated output 
 
[13] "dir_subs1"            "dist_cat"             "gvisline1"           
[16] "ptm"                  "subs1.ff"             "t1"                  
....

To end this section, we now show how to export the data to a csv file.

> ptm<-proc.time()
> write.csv.ffdf(subs1.ff, "subset1.csv", VERBOSE = TRUE)
write.table.ffdf 1..opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Distance.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$ArrDelay.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$DepDelay.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$FlightNum.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Year.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Month.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$DayofMonth.ff
599186 (599186, 9%)  ffdf-read=0.281sec csv-write=1.748sec
write.table.ffdf 599187..1198372 (599186, 18%)  ffdf-read=0.246sec csv-write=1.718sec
.... Truncated Output 
write.table.ffdf 4194303..4793488 (599186, 72.1%)  ffdf-read=0.423sec csv-write=1.674sec
write.table.ffdf 4793489..5392674 (599186, 81.1%)  ffdf-read=0.474sec csv-write=1.695sec
write.table.ffdf 5392675..5991860 (599186, 90.1%)  ffdf-read=0.235sec csv-write=1.733sec
.... Truncated output 
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 21.566   0.987  23.170 

Performing a GLM with ff objects

In this section I will show an example of fitting a Generalized Linear Model on a massive dataset. This example can be found in the book “Big Data Analytics with R”.

The data we will use is available from the Machine Learning Repository maintained by the University of California, Irvine at http://archive.ics.uci.edu/ml/index.html. It is the Chronic Kidney Disease dataset.

The data is in .arff format (Attribute-Relation File Format), an ASCII text file format that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. To import this data we need to install and load the RWeka package.

We import the data into R

> library(RWeka)
> ckd <- read.arff("ckd_full.arff")
> class(ckd)
[1] "data.frame"
> colnames(ckd)
 [1] "age"   "bp"    "sg"    "al"    "su"    "rbc"   "pc"    "pcc"   "ba"    "bgr"  
[11] "bu"    "sc"    "sod"   "pot"   "hemo"  "pcv"   "wbcc"  "rbcc"  "htn"   "dm"   
[21] "cad"   "appet" "pe"    "ane"   "class"
> dim(ckd)
[1] 400  25

We need to recode the target variable class to a binary variable

> library(ETLUtils)
> ckd$class <- recoder(ckd$class, from = c(1,2), to=c(1,0))
> unique(ckd$class)
[1] 1 0
> table(ckd$class)

  0   1 
150 250 

First Generalized Linear Model

> model0 <- glm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = ckd, 
+               family=binomial(link = "logit"), 
+               na.action = na.omit)
> summary(model0)

Call:
glm(formula = class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, 
    family = binomial(link = "logit"), data = ckd, na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3531  -0.0441  -0.0025   0.0016   3.2946  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 12.6368996  6.2190363   2.032 0.042157 *  
age          0.0238911  0.0327356   0.730 0.465500    
bp           0.1227932  0.0591324   2.077 0.037840 *  
bgr          0.0792933  0.0212243   3.736 0.000187 ***
bu           0.0026834  0.0296290   0.091 0.927838    
rbcc        -0.9518575  0.8656291  -1.100 0.271501    
wbcc         0.0002601  0.0002432   1.070 0.284725    
hemo        -2.3246914  0.6404712  -3.630 0.000284 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 325.910  on 236  degrees of freedom
Residual deviance:  35.121  on 229  degrees of freedom
  (163 observations deleted due to missingness)
AIC: 51.121

Number of Fisher Scoring iterations: 9

Now we will use the bigglm.ffdf() function from the ffbase package to fit the GLM. But bigglm.ffdf can only be applied to ff objects, which is why we will transform the ckd data into an ffdf object. We first need to create a directory where the binary files will be stored.

> library(ffbase)
> system("mkdir ckd_dir")
> dir_ckd=paste0(getwd(),"/ckd_dir")
> dir_ckd
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/ckd_dir"
> options(fftempdir = dir_ckd)
> ckd.ff <- as.ffdf(ckd)
> list.files(dir_ckd)
 [1] "ffdf705103d9817.ff" "ffdf70514602150.ff" "ffdf70514675b1.ff" 
 [4] "ffdf705166a2bc7.ff" "ffdf7051b3fd709.ff" "ffdf7051c1fd6ba.ff"
 [7] "ffdf7051db22782.ff"
... Truncated output 

We then use the bigglm.ffdf function:

> model1 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, 
+                       data = ckd.ff, family=binomial(link = "logit"), 
+                       na.action = na.exclude)
> summary(model1)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = ckd.ff, 
    family = binomial(link = "logit"), na.action = na.exclude)
Sample size =  400 
failed to converge after 8  iterations
               Coef    (95%     CI)     SE      p
(Intercept) 12.6330  0.3275 24.9384 6.1527 0.0401
age          0.0239 -0.0410  0.0888 0.0324 0.4616
bp           0.1228  0.0061  0.2394 0.0583 0.0353
bgr          0.0793  0.0375  0.1210 0.0209 0.0001
bu           0.0027 -0.0558  0.0611 0.0292 0.9268
rbcc        -0.9515 -2.6587  0.7557 0.8536 0.2650
wbcc         0.0003 -0.0002  0.0007 0.0002 0.2780
hemo        -2.3239 -3.5804 -1.0675 0.6282 0.0002

We will now show how this function behaves with massive data. We will expand the data to 8,000,000 observations; the expansion mechanism, based on expand.ffgrid, is first illustrated on a toy example:

> x=ffseq_len(4)
> x
ff (open) integer length=4 (4)
[1] [2] [3] [4] 
  1   2   3   4 
> expand.ffgrid(x, ff(1:2))
ffdf (all open) dim=c(8,2), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
     PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix
Var1         Var1      integer       integer FALSE           FALSE            FALSE
Var2         Var2      integer       integer FALSE           FALSE            FALSE
     PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
Var1                 1                1               1           TRUE
Var2                 2                1               1           TRUE
ffdf data
  Var1 Var2
1    1    1
2    2    1
3    3    1
4    4    1
5    1    2
6    2    2
7    3    2
8    4    2

Let’s do the same with ckd by expanding it to 8,000,000 = 400 × 20,000 rows:

> ckd.ff$id <- ffseq_len(nrow(ckd.ff))
> ckd.ff$id
ff (open) integer length=400 (400)
  [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]       [393] [394] [395] [396] [397] 
    1     2     3     4     5     6     7     8     :   393   394   395   396   397 
[398] [399] [400] 
  398   399   400 
> ptm<-proc.time()
> big.ckd.ff <- expand.ffgrid(ckd.ff$id, ff(1:20000))
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  0.423   0.277   0.815 
> colnames(big.ckd.ff) <- c("id","expand")
> head(big.ckd.ff)
  ffdf (all open) dim=c(8000000,2), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
       PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix
id             Var1      integer       integer FALSE           FALSE            FALSE
expand         Var2      integer       integer FALSE           FALSE            FALSE
       PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
id                     1                1               1           TRUE
expand                 2                1               1           TRUE
ffdf data
           id expand
1           1      1
2           2      1
3           3      1
4           4      1
5           5      1
6           6      1
7           7      1
8           8      1
:           :      :
7999993   393  20000
7999994   394  20000
7999995   395  20000
7999996   396  20000
7999997   397  20000
7999998   398  20000
7999999   399  20000
8000000   400  20000

And finally we merge big.ckd.ff with ckd.ff in order to obtain an expansion of the ckd data to a dataset with 8,000,000 rows:

> ptm<-proc.time()
> big.ckd.ff <- merge.ffdf(big.ckd.ff, ckd.ff, 
+                          by.x="id", by.y="id", 
+                          all.x=TRUE, all.y=FALSE)
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  3.860   2.611   7.100 
> dim(big.ckd.ff)
[1] 8000000      27

Let us then fit the logistic model on the newly created data:

> ptm<-proc.time()
> model2 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, 
+                       data = big.ckd.ff, family=binomial(), 
+                       na.action = na.exclude)
Warning message:
In bigglm.function(formula, data = ffchunk, family = family, ...) :
  ran out of iterations and failed to converge
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 97.958   6.797 105.977 
> summary(model2)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = big.ckd.ff, 
    family = binomial(), na.action = na.exclude)
Sample size =  8e+06 
failed to converge after 8  iterations
               Coef    (95%     CI)     SE p
(Intercept) 12.6330 12.5460 12.7200 0.0435 0
age          0.0239  0.0234  0.0243 0.0002 0
bp           0.1228  0.1219  0.1236 0.0004 0
bgr          0.0793  0.0790  0.0796 0.0001 0
bu           0.0027  0.0023  0.0031 0.0002 0
rbcc        -0.9515 -0.9635 -0.9394 0.0060 0
wbcc         0.0003  0.0003  0.0003 0.0000 0
hemo        -2.3239 -2.3328 -2.3150 0.0044 0

We can notice here that there is a warning about the convergence of the algorithm. There is no exact rule for avoiding this convergence failure, but we can play with two arguments of the bigglm.ffdf function, chunksize and maxit, as we do here:

> ptm<-proc.time()
> model2 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, 
+                       data = big.ckd.ff, family=binomial(), 
+                       na.action = na.exclude, sandwich = TRUE, 
+                       chunksize = 20000, maxit = 40)
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
337.946  44.423 385.623 
> summary(model2)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = big.ckd.ff, 
    family = binomial(), na.action = na.exclude, sandwich = TRUE, 
    chunksize = 20000, maxit = 40)
Sample size =  8e+06 
               Coef    (95%     CI)     SE p
(Intercept) 12.6369 12.5332 12.7406 0.0518 0
age          0.0239  0.0235  0.0243 0.0002 0
bp           0.1228  0.1223  0.1233 0.0002 0
bgr          0.0793  0.0789  0.0796 0.0002 0
bu           0.0027  0.0023  0.0030 0.0002 0
rbcc        -0.9519 -0.9589 -0.9448 0.0035 0
wbcc         0.0003  0.0003  0.0003 0.0000 0
hemo        -2.3247 -2.3375 -2.3118 0.0064 0
Sandwich (model-robust) standard errors

Example: Netflix data

Netflix held the Netflix Prize, an open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor’s Pragmatic Chaos team. Here is the dataset that was used in that competition; it can be downloaded from the Kaggle website.

This data is stored in 4 txt files of ~400-500 Mb each. Our aim is to import these files into R and merge them into a single dataset, so that we can then perform some statistical operations on it.

Here is what the data looks like. The first row contains the movie code, and the following rows contain the customer ID, the rating (a score from 1 to 5) and the date the movie was watched, with all fields separated by a comma. We cannot simply use sep="," because of the rows containing the movie IDs. We will therefore import each file, merge them into one dataset, and then write R code that detects the rows containing the movie IDs and separates them from the rest of the file.

1:                  
1488844,3,2005-09-06
822109,5,2005-05-13 
885013,4,2005-10-19 
30878,4,2005-12-26  
823519,3,2004-05-03 
893988,3,2005-11-17 
124105,4,2004-08-05 

This time we won’t use the ff package; instead we will use the data.table package which is, as we will see, faster. We will also show later how we can manipulate the data with this package to create a single dataset.

> library(data.table)
> rm(list=ls())
> ptm<-proc.time()
> d1<- fread("./Netflix_data/combined_data_1.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 13.614   0.330  14.044 
> ptm<-proc.time()
> d2<- fread("./Netflix_data/combined_data_2.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 23.246   1.890  29.776 
> 
> 
> ptm<-proc.time()
> d3<- fread("./Netflix_data/combined_data_3.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 28.087   1.494  34.298 
> ptm<-proc.time()
> d4<- fread("./Netflix_data/combined_data_4.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 22.398   0.702  23.462 

Let us now stack all these datasets together using the rbind command:

> ptm<-proc.time()
> data_net=do.call(rbind,list(d1,d2,d3,d4))
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
 18.205   0.352  18.638 
> 
> ptm<-proc.time()
> dim(data_net)
[1] 100498277         1
> x_time=proc.time()-ptm
> x_time
   user  system elapsed 
  0.003   0.000   0.003 

Here is code that can help us create one dataset with the columns Movie ID, Customer ID, Rating and Date. Unfortunately this code is too slow to be useful in practice. We will show later how we can improve it using other massive-data tools.

result = data.frame()
i = 1
p = nrow(data_net)
while(i < (p+1)){
  cat("\r", i, "of", (p+1))  # progress counter for the loop
  flush.console()
  # row i holds a movie ID such as "1:"; strip the colon
  x = sub(":", "", as.character(data_net[[1]][i]))
  i = i+1
  ztt = data.frame()
  # collect the rating rows until the next movie-ID row
  while(i < (p+1) && !grepl(":", data_net[[1]][i])){
    zt = unlist(strsplit(as.character(data_net[[1]][i]), ","))
    ztt = rbind.data.frame(ztt, data.frame(t(zt), stringsAsFactors = FALSE),
                           make.row.names = FALSE)
    i = i+1
  }
  ztt$MovieID = rep(x, nrow(ztt))
  colnames(ztt) = c("CustomerID", "Rate", "Date", "MovieID")
  result = rbind.data.frame(result, ztt, make.row.names = FALSE)
}
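For reference, here is a vectorized sketch of the same idea with data.table (only one possible shortcut, not the approach developed later in the course). It assumes the single column produced by fread above is named V1 (the fread default when header = F) and that the installed data.table version provides nafill():

library(data.table)
# movie-ID rows are the ones ending with ":"
is_movie <- grepl(":$", data_net[[1]])
movie_id <- rep(NA_integer_, length(is_movie))
movie_id[is_movie] <- as.integer(sub(":$", "", data_net[[1]][is_movie]))
# carry the last seen movie ID forward, then drop the movie-ID rows
data_net[, MovieID := nafill(movie_id, type = "locf")]
ratings <- data_net[!is_movie]
ratings[, c("CustomerID", "Rate", "Date") := tstrsplit(V1, ",", fixed = TRUE)]
ratings[, V1 := NULL]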