Large and Massive Data with R

## Introduction

What are large and massive data?

• R is not well-suited for working with data structures larger than about 10-20% of a computer’s RAM.

• Data exceeding 50% of available RAM are essentially unusable.

• We consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.

In this first lesson we will show how you can use powerful mathematical and data-modeling R packages on large datasets, without the need for distributed computing.

In this lesson we will see how:

• you can use R packages such as ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance;

• you can apply statistical methods to large R objects through the biglm and ffbase packages;

• you can use the data.table package for faster data manipulation.

## The ff, ffbase and ffbase2 packages

The title of the ff package is Memory-efficient storage of large data on disk and fast access functions. Indeed, this package works by chunking the dataset and storing it on your hard drive; the ff structure in your R session is then a mapping to the partitioned dataset.

• The chunks of raw data are simply binary flat files in native encoding.

• The ff objects keep the metadata, which describe and link to the created binary files.

• Installation

>  install.packages("ff")
>  install.packages("ffbase")
>  install.packages("ffbase2")
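After installation, a typical session loads the packages and points ff's temporary directory at a dedicated folder, as we will do repeatedly below (the folder name ff_chunks here is only an illustration):

```r
library(ff)
library(ffbase)

# store ff's binary chunk files in a dedicated folder
# instead of the default temporary directory
options(fftempdir = file.path(getwd(), "ff_chunks"))
```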
• Example: Let us consider a dataset extracted from the Bureau of Transportation Statistics website: flights in the US in the month of September. The size of the csv file is ~5 Mb and it contains about 66,000 records. It is a small file, but we will use it as an example to show how the ff package works.

Let's show how we can import it into R.

1. First I load the package and create the directory where the data will be stored.
> system("mkdir air_ffdf")
mkdir: air_ffdf: File exists
1. I will indicate the path to the newly created directory
> dir_air=paste0(getwd(),"/air_ffdf")
> dir_air
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf"
> options(fftempdir = dir_air)
1. Now I can import the airline dataset
> ptm<-proc.time()
> x_time=proc.time()-ptm
> x_time
user  system elapsed
0.353   0.035   0.424
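The call that actually creates airline.ff is elided from the transcript above; with ff it would be a sketch along these lines (the filename flight_small.csv is an assumption; the options mirror the larger examples below):

```r
library(ff)
library(ffbase)

# chunked csv import; first.rows = -1 asks read.csv.ffdf
# to determine the chunking from the whole first pass
airline.ff <- read.csv.ffdf(file = "flight_small.csv",
                            VERBOSE = TRUE, first.rows = -1, header = TRUE)
```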

Let's recall that the size of the file is ~5 Mb. However, we can easily check that the new R object airline.ff has a very small size:

> format(object.size(airline.ff),"Mb")
[1] "0.1 Mb"

We can notice in our working directory a new directory called air_ffdf, and we can see that it contains files like these:

> list.files(dir_air)
[1] "ffdf7051495570f.ff" "ffdf705151999fb.ff" "ffdf70516c0c4f7.ff" "ffdf7051cb88764.ff"
[5] "ffdf70531918fe9.ff" "ffdf7053616d57b.ff" "ffdf70536e939fe.ff" "ffdf70543cc388f.ff"
... Truncated output
> length(list.files(dir_air))
[1] 15

Every binary file corresponds to one variable in the data:

> basename(filename(airline.ff$YEAR))
[1] "ffdf70536e939fe.ff"

• Example 2: Let's now try a larger dataset with a size of ~150 Mb. This file contains ~950,000 records about flights in the US in September 2005.

> system("mkdir air_ffdf_l")
> dir_air_l=paste0(getwd(),"/air_ffdf_l")
> dir_air_l
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf_l"
> options(fftempdir = dir_air_l)
> file.size("flight_large.csv")/1024/1024
[1] 149.0488
> ptm<-proc.time()
> airline.ff_l<- read.csv.ffdf(file = "flight_large.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..951111 (951111)  csv-read=8.92sec ffdf-write=0.694sec
 csv-read=8.92sec  ffdf-write=0.694sec  TOTAL=9.614sec
> x_time=proc.time()-ptm
> x_time
user  system elapsed
8.887   0.553   9.616
> dim(airline.ff_l)
[1] 951111     28

Let us compare with read_csv of the readr package:

> library(readr)
> ptm<-proc.time()
> airline_l<-read_csv("flight_large.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  FL_DATE = col_date(format = ""),
  UNIQUE_CARRIER = col_character(),
  TAIL_NUM = col_character(),
  ORIGIN = col_character(),
  ORIGIN_CITY_NAME = col_character(),
  ORIGIN_STATE_NM = col_character(),
  DEST = col_character(),
  DEST_CITY_NAME = col_character(),
  DEST_STATE_NM = col_character(),
  CANCELLATION_CODE = col_character()
)
See spec(...) for full column specifications.
|===================================================================================| 100%  149 MB
> x_time=proc.time()-ptm
> x_time
user  system elapsed
2.924   0.306   3.179

read_csv is faster, but we occupy far less memory with ff objects:

> format(object.size(airline_l),"Mb")
[1] "138.2 Mb"
> format(object.size(airline.ff_l),"Mb")
[1] "0.5 Mb"

• Example 3: We will now try an even larger dataset. The file airline_20MM.csv, also extracted from the same website, contains 20 million records and its size is 1.5 Gb. It is then not possible to use the classical approach of calling read_csv, because your computer may crash.
We will show below how much time and memory it takes to import this data into R.

> system("mkdir air_ffdf_20mm")
> dir_air_20mm=paste0(getwd(),"/air_ffdf_20mm")
> dir_air_20mm
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf_20mm"
> options(fftempdir = dir_air_20mm)
> ptm<-proc.time()
> airline.ff_20mm<- read.csv.ffdf(file = "airline_20MM.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..20000000 (20000000)  csv-read=175.176sec ffdf-write=8.6sec
 csv-read=175.176sec  ffdf-write=8.6sec  TOTAL=183.776sec
> x_time=proc.time()-ptm
> x_time
user  system elapsed
163.075  13.505 183.835

## Data manipulations with ff objects

Computing cross tabs:

> ptm<-proc.time()
> t1.ff<-table.ff(airline.ff_l$ORIGIN_STATE_NM,airline.ff_l$DEST_STATE_NM)
> x_time=proc.time()-ptm
> x_time
user  system elapsed
0.439   0.034   0.478
> format(object.size(t1.ff),"Kb")
[1] "18.6 Kb"

And with the classic way:

> ptm<-proc.time()
> t1<-table(airline_l$ORIGIN_STATE_NM,airline_l$DEST_STATE_NM)
> x_time=proc.time()-ptm
> x_time
user  system elapsed
0.135   0.024   0.158
> format(object.size(t1),"Kb")
[1] "18.6 Kb"

We will then show, using the packages ffbase and ffbase2, how to

• discretize a numeric variable

• use the function ffdfdply

• extract a subset of the data

• save and load ff objects, and export them to csv files

### Discretize a numeric variable

Let's first show how we can describe one numeric variable:

> library(Hmisc)
> ptm<-proc.time()
> dd=describe(as.data.frame.ffdf(airline.ff_20mm$Distance))
> x_time=proc.time()-ptm
> x_time
user  system elapsed
19.006   2.602  22.462
> dd
as.data.frame.ffdf(airline.ff_20mm$Distance)
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50
   2e+07        0     1531        1      726    582.6      144      199      321      576
     .75      .90      .95
     952     1522     1964
lowest :   11   16   17   18   21, highest: 4243 4431 4433 4502 4962
> max(airline.ff_20mm$Distance)
[1] 4962

Let me recall that Gmd stands for Gini's mean difference. It is computed for numeric variables and is a robust measure of dispersion: the mean absolute difference between all pairs of observations.
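As a quick check of that definition, Gmd can be computed by brute force on a toy vector:

```r
x <- c(1, 3, 7)
n <- length(x)

# mean of |x_i - x_j| over all ordered pairs i != j
gmd <- sum(abs(outer(x, x, "-"))) / (n * (n - 1))
gmd   # (2 + 6 + 4) * 2 / 6 = 4
```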

> ptm<-proc.time()
> dist_cat=cut.ff(airline.ff_20mm$Distance,breaks = c(0,150,300,450,600,750,900,1050,1200,2000,5000),right=F)
> x_time=proc.time()-ptm
> x_time
user  system elapsed
2.418   0.586   3.035

Let's then compute the frequency of each category:

> ptm<-proc.time()
> zz=table.ff(dist_cat)
> zz=as.data.frame(zz)
> zz
                 Var1    Freq
1             [0,150) 1045424
2           [150,300) 3429694
3           [300,450) 3582526
4           [450,600) 2512883
5           [600,750) 2174993
6           [750,900) 1663260
7      [900,1.05e+03) 1602199
8  [1.05e+03,1.2e+03)  953791
9     [1.2e+03,2e+03) 2103486
10      [2e+03,5e+03)  931744
> x_time=proc.time()-ptm
> x_time
user  system elapsed
10.105   0.506  10.640

I will change the names of the last 4 categories and plot the bar chart using the googleVis package:

> zz$Var1=as.character(zz$Var1)
> zz$Var1[7:10]=c("[900,1050)","[1050,1200)","[1200,2000)","More than 2000")
> zz
Var1    Freq
1         [0,150) 1045424
2       [150,300) 3429694
3       [300,450) 3582526
4       [450,600) 2512883
5       [600,750) 2174993
6       [750,900) 1663260
7      [900,1050) 1602199
8     [1050,1200)  953791
9     [1200,2000) 2103486
10 More than 2000  931744
> colnames(zz)[2]="# flights"
> Column <- gvisColumnChart(zz,options=list(legend="none",width=1000, height=600,
+                                           title="Number of flights by distance category (miles)"))
> plot(Column)

### How to use the ffdfdply function?

We will show by example how we can use the ffdfdply function, which allows us to carry out grouped operations on ff data. In the example below we compute, from the data imported from the file airline_20MM.csv, the average departure delay by day of the month.

We will also use the summaryBy function from the doBy package to calculate group-wise summary statistics, much like the summary procedure of SAS.
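On an ordinary data frame, summaryBy behaves like this minimal sketch (toy data, not the airline file):

```r
library(doBy)

# two days, three flights; averaging DepDelay per DayofMonth
toy <- data.frame(DayofMonth = c(1, 1, 2), DepDelay = c(10, 20, 5))

# returns one row per group, with a DepDelay.mean column
summaryBy(DepDelay ~ DayofMonth, data = toy, FUN = mean, na.rm = TRUE)
```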

> library(doBy)
> ptm<-proc.time()
> DepDelayByDayofMonth <- ffdfdply(airline.ff_20mm,
+                                split = airline.ff_20mm$DayofMonth,
+                                FUN=function(x) {
+                                  summaryBy(DepDelay~DayofMonth,
+                                            data=x, FUN=mean, na.rm=TRUE)})
2018-10-29 09:55:19, calculating split sizes
2018-10-29 09:55:25, building up split locations
2018-10-29 09:55:38, working on split 1/31, extracting data in RAM of 1 split elements, totalling, 0.06439 GB, while max specified data specified using BATCHBYTES is 0.01562 GB
2018-10-29 09:55:42, ... applying FUN to selected data
...Truncated Output
> x_time=proc.time()-ptm
> x_time
user  system elapsed
92.310  22.068 120.041
> ptm<-proc.time()
> ArrDelayByDayofMonth <- ffdfdply(airline.ff_20mm,
+                                  split = airline.ff_20mm$DayofMonth,
+                                  FUN=function(x) {
+                                    summaryBy(ArrDelay~DayofMonth,
+                                              data=x, FUN=mean, na.rm=TRUE)})
2018-10-29 10:13:56, calculating split sizes
2018-10-29 10:14:02, building up split locations
2018-10-29 10:14:14, working on split 1/31, extracting data in RAM of 1 split elements, totalling, 0.06439 GB, while max specified data specified using BATCHBYTES is 0.01562 GB
...Truncated Output

> x_time=proc.time()-ptm
> x_time
user  system elapsed
96.844  20.907 120.721 

Let us now construct a dataframe with three columns: The day of the Month, the Average of the Departure Delay and the Average of the Arrival Delay.

> d1=as.data.frame(DepDelayByDayofMonth)
> d1=d1[order(d1$DayofMonth),]
> d2=as.data.frame(ArrDelayByDayofMonth)
> d2=d2[order(d2$DayofMonth),]
> dd=cbind.data.frame(d1$DayofMonth,d1$DepDelay.mean,d2$ArrDelay.mean)
> colnames(dd)=c("Day of the Month","Departure Delay","Arrival Delay")

And here's the graph with googleVis:

> gvisline1 <- gvisAreaChart(dd,"Day of the Month", c("Departure Delay","Arrival Delay"),
+                            options=list(width=1200, height=600,
+                                         series="[{targetAxisIndex: 0},
+                                                  {targetAxisIndex:1}]"
+                            ))
> plot(gvisline1)

### Extracting a subset of the dataset

We will extract from the ff object airline.ff_20mm the subset of flights delayed on both arrival and departure. We will also keep some other variables such as Distance, FlightNum, Year...

> unique(airline.ff_20mm$IsDepDelayed)
ff (open) integer length=2 (2)
[1] [2]
0   1
> unique(airline.ff_20mm$IsArrDelayed)
ff (open) integer length=2 (2)
[1] [2]
0   1
> table.ff(airline.ff_20mm$IsArrDelayed,airline.ff_20mm$IsDepDelayed)
        0       1
0 8788725 1827306
1 2733591 6650378
> ptm<-proc.time()
> subs1.ff <- subset.ffdf(airline.ff_20mm, IsDepDelayed == 1 & IsArrDelayed==1,
+                         select = c(Distance, ArrDelay,
+                                    DepDelay,
+                                    FlightNum,
+                                    Year,
+                                    Month,
+                                    DayofMonth))
> x_time=proc.time()-ptm
> x_time
user  system elapsed
2.139   1.030   3.226
> dim(subs1.ff)
[1] 6650378       7

Exercise: Write R code that gives the average arrival and departure delays for every flight number and year. Draw a bubble chart using googleVis.

### Saving and loading ff objects

We may now need to save the newly created data and load it later. Exactly as before, we need to create a new directory to store the new binary files.

> system("mkdir subs_ff")
mkdir: subs_ff: File exists
> dir_subs1=paste0(getwd(),"/subs1_ff")
> dir_subs1
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff"
> save.ffdf(subs1.ff,dir = dir_subs1)
> list.files(dir_subs1)
[1] "subs1.ff$ArrDelay.ff"   "subs1.ff$DayofMonth.ff" "subs1.ff$DepDelay.ff"
[4] "subs1.ff$Distance.ff"   "subs1.ff$FlightNum.ff"  "subs1.ff$Month.ff"
[7] "subs1.ff$Year.ff"

Let us now remove the newly created data from our R environment and load it again into R

> rm(subs1.ff)
> ls()
... Truncated output

[13] "dir_subs1"            "dist_cat"             "gvisline1"
[16] "ptm"                  "subs1.ff"             "t1"
....
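The load call itself does not appear in the transcript; with ffbase it would be a sketch like:

```r
library(ffbase)

# reopen the ffdf saved with save.ffdf, from the same directory
load.ffdf(dir = dir_subs1)
```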

To end this section, we will now show how to export the data to a csv file.

> ptm<-proc.time()
> write.csv.ffdf(subs1.ff, "subset1.csv", VERBOSE = TRUE)
write.table.ffdf 1..
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Distance.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$ArrDelay.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$DepDelay.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$FlightNum.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Year.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Month.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$DayofMonth.ff
 599186 (599186, 9%) ffdf-read=0.281sec csv-write=1.748sec
write.table.ffdf 599187..1198372 (599186, 18%) ffdf-read=0.246sec csv-write=1.718sec
.... Truncated Output
write.table.ffdf 4194303..4793488 (599186, 72.1%) ffdf-read=0.423sec csv-write=1.674sec
write.table.ffdf 4793489..5392674 (599186, 81.1%) ffdf-read=0.474sec csv-write=1.695sec
write.table.ffdf 5392675..5991860 (599186, 90.1%) ffdf-read=0.235sec csv-write=1.733sec
.... Truncated output
> x_time=proc.time()-ptm
> x_time
user  system elapsed
21.566   0.987  23.170

## Performing a GLM with ff objects

In this section I will show an example of performing generalized linear models on a massive dataset. This example can be found in the book "Big Data Analytics with R". The data we will use, the Chronic Kidney Disease dataset, is available from the Machine Learning Repository maintained by the University of California Irvine at http://archive.ics.uci.edu/ml/index.html. The data comes in .arff format (Attribute-Relation File Format): an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. To import this data we need to install and load the RWeka package.

We import the data into R:

> library(RWeka)
> ckd <- read.arff("ckd_full.arff")
> class(ckd)
[1] "data.frame"
> colnames(ckd)
 [1] "age"   "bp"    "sg"    "al"    "su"    "rbc"   "pc"    "pcc"   "ba"    "bgr"
[11] "bu"    "sc"    "sod"   "pot"   "hemo"  "pcv"   "wbcc"  "rbcc"  "htn"   "dm"
[21] "cad"   "appet" "pe"    "ane"   "class"
> dim(ckd)
[1] 400  25

We need to recode the target variable class to a binary variable:

> library(ETLUtils)
> ckd$class <- recoder(ckd$class, from = c(1,2), to=c(1,0))
> unique(ckd$class)
[1] 1 0
> table(ckd$class)
  0   1
150 250

First generalized linear model:

> model0 <- glm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = ckd,
+               family=binomial(link = "logit"),
+               na.action = na.omit)
> summary(model0)
Call:
glm(formula = class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
    family = binomial(link = "logit"), data = ckd, na.action = na.omit)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3531  -0.0441  -0.0025   0.0016   3.2946
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.6368996  6.2190363   2.032 0.042157 *
age          0.0238911  0.0327356   0.730 0.465500
bp           0.1227932  0.0591324   2.077 0.037840 *
bgr          0.0792933  0.0212243   3.736 0.000187 ***
bu           0.0026834  0.0296290   0.091 0.927838
rbcc        -0.9518575  0.8656291  -1.100 0.271501
wbcc         0.0002601  0.0002432   1.070 0.284725
hemo        -2.3246914  0.6404712  -3.630 0.000284 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 325.910  on 236  degrees of freedom
Residual deviance:  35.121  on 229  degrees of freedom
  (163 observations deleted due to missingness)
AIC: 51.121
Number of Fisher Scoring iterations: 9

Now we will use the bigglm.ffdf() function from the ffbase package to perform a GLM. But bigglm.ffdf can be applied only to ff objects, which is why we transform the ckd data frame into an ffdf object. We first need to create a directory where the binary files will be stored.

> library(ffbase)
> system("mkdir ckd_dir")
> dir_ckd=paste0(getwd(),"/ckd_dir")
> dir_ckd
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/ckd_dir"
> options(fftempdir = dir_ckd)
> ckd.ff <- as.ffdf(ckd)
> list.files(dir_ckd)
[1] "ffdf705103d9817.ff" "ffdf70514602150.ff" "ffdf70514675b1.ff"
[4] "ffdf705166a2bc7.ff" "ffdf7051b3fd709.ff" "ffdf7051c1fd6ba.ff"
[7] "ffdf7051db22782.ff"
... Truncated output

We then use the bigglm.ffdf function.
> model1 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
+                       data = ckd.ff, family=binomial(link = "logit"),
+                       na.action = na.exclude)
> summary(model1)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = ckd.ff,
    family = binomial(link = "logit"), na.action = na.exclude)
Sample size =  400
failed to converge after 8 iterations
               Coef    (95%     CI)     SE      p
(Intercept) 12.6330  0.3275 24.9384 6.1527 0.0401
age          0.0239 -0.0410  0.0888 0.0324 0.4616
bp           0.1228  0.0061  0.2394 0.0583 0.0353
bgr          0.0793  0.0375  0.1210 0.0209 0.0001
bu           0.0027 -0.0558  0.0611 0.0292 0.9268
rbcc        -0.9515 -2.6587  0.7557 0.8536 0.2650
wbcc         0.0003 -0.0002  0.0007 0.0002 0.2780
hemo        -2.3239 -3.5804 -1.0675 0.6282 0.0002

We will now show how this function behaves with massive data. We will expand the data to 8,000,000 observations according to the following process:

> x=ffseq_len(4)
> x
ff (open) integer length=4 (4)
[1] [2] [3] [4]
  1   2   3   4
> expand.ffgrid(x, ff(1:2))
ffdf (all open) dim=c(8,2), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
     PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix
Var1         Var1      integer       integer FALSE           FALSE            FALSE
Var2         Var2      integer       integer FALSE           FALSE            FALSE
     PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
Var1                 1                1               1           TRUE
Var2                 2                1               1           TRUE
ffdf data
  Var1 Var2
1    1    1
2    2    1
3    3    1
4    4    1
5    1    2
6    2    2
7    3    2
8    4    2

Let's do the same with ckd by expanding it to 8,000,000 = 400 x 20,000 rows:

> ckd.ff$id <- ffseq_len(nrow(ckd.ff))
> ckd.ff$id
ff (open) integer length=400 (400)
[1] [2] [3] [4] [5] [6] [7] [8]     [393] [394] [395] [396] [397]
  1   2   3   4   5   6   7   8  :    393   394   395   396   397
[398] [399] [400]
  398   399   400
> ptm<-proc.time()
> big.ckd.ff <- expand.ffgrid(ckd.ff$id, ff(1:20000))
> x_time=proc.time()-ptm
> x_time
user  system elapsed
0.423   0.277   0.815
> colnames(big.ckd.ff) <- c("id","expand")
> big.ckd.ff
ffdf (all open) dim=c(8000000,2), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix
id             Var1      integer       integer FALSE           FALSE            FALSE
expand         Var2      integer       integer FALSE           FALSE            FALSE
PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
id                     1                1               1           TRUE
expand                 2                1               1           TRUE
ffdf data
id expand
1           1      1
2           2      1
3           3      1
4           4      1
5           5      1
6           6      1
7           7      1
8           8      1
:           :      :
7999993   393  20000
7999994   394  20000
7999995   395  20000
7999996   396  20000
7999997   397  20000
7999998   398  20000
7999999   399  20000
8000000   400  20000

And finally we merge big.ckd.ff with ckd.ff in order to expand the ckd data into a dataset with 8,000,000 rows:

> ptm<-proc.time()
> big.ckd.ff <- merge.ffdf(big.ckd.ff, ckd.ff,
+                          by.x="id", by.y="id",
+                          all.x=TRUE, all.y=FALSE)
> x_time=proc.time()-ptm
> x_time
user  system elapsed
3.860   2.611   7.100
> dim(big.ckd.ff)
[1] 8000000      27

Let us then perform the logistic model on the newly created data

> ptm<-proc.time()
> model2 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
+                       data = big.ckd.ff, family=binomial(),
+                       na.action = na.exclude)
Warning message:
In bigglm.function(formula, data = ffchunk, family = family, ...) :
ran out of iterations and failed to converge
> x_time=proc.time()-ptm
> x_time
user  system elapsed
97.958   6.797 105.977
> summary(model2)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = big.ckd.ff,
family = binomial(), na.action = na.exclude)
Sample size =  8e+06
failed to converge after 8  iterations
Coef    (95%     CI)     SE p
(Intercept) 12.6330 12.5460 12.7200 0.0435 0
age          0.0239  0.0234  0.0243 0.0002 0
bp           0.1228  0.1219  0.1236 0.0004 0
bgr          0.0793  0.0790  0.0796 0.0001 0
bu           0.0027  0.0023  0.0031 0.0002 0
rbcc        -0.9515 -0.9635 -0.9394 0.0060 0
wbcc         0.0003  0.0003  0.0003 0.0000 0
hemo        -2.3239 -2.3328 -2.3150 0.0044 0

We can notice here that there is a warning about the convergence of the algorithm. There is no exact rule for avoiding this convergence failure, but we can play with two arguments of the bigglm.ffdf function, chunksize and maxit, as we do here:

> ptm<-proc.time()
> model2 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
+                       data = big.ckd.ff, family=binomial(),
+                       na.action = na.exclude, sandwich = TRUE,
+                       chunksize = 20000, maxit = 40)
> x_time=proc.time()-ptm
> x_time
user  system elapsed
337.946  44.423 385.623
> summary(model2)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = big.ckd.ff,
family = binomial(), na.action = na.exclude, sandwich = TRUE,
chunksize = 20000, maxit = 40)
Sample size =  8e+06
Coef    (95%     CI)     SE p
(Intercept) 12.6369 12.5332 12.7406 0.0518 0
age          0.0239  0.0235  0.0243 0.0002 0
bp           0.1228  0.1223  0.1233 0.0002 0
bgr          0.0793  0.0789  0.0796 0.0002 0
bu           0.0027  0.0023  0.0030 0.0002 0
rbcc        -0.9519 -0.9589 -0.9448 0.0035 0
wbcc         0.0003  0.0003  0.0003 0.0000 0
hemo        -2.3247 -2.3375 -2.3118 0.0064 0
Sandwich (model-robust) standard errors

## Example: Netflix data

Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. The dataset used in that competition can be downloaded from the Kaggle website. It is stored in 4 txt files of ~400-500 Mb each, and our aim is to import these files into R and merge them into a single dataset so that we can perform statistical operations on it. Here is what the data look like: a row with the movie code (ending with a colon) is followed by rows containing the customer ID, the rating (a score from 1 to 5) and the date the movie was watched, all separated by commas. We cannot simply use sep="," because of the rows containing the movie IDs. We will therefore import each file, merge them into one dataset, and then write R code that detects the rows containing the movie IDs and separates them from the ratings.

1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
823519,3,2004-05-03
893988,3,2005-11-17
124105,4,2004-08-05

This time we won't use the ff package; we will use the data.table package, which is, as we will see, faster. We will also show later how to make data manipulations with this package to create a single dataset.
> library(data.table)
> rm(list=ls())
> ptm<-proc.time()
> d1<- fread("./Netflix_data/combined_data_1.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user  system elapsed
13.614   0.330  14.044
> ptm<-proc.time()
> d2<- fread("./Netflix_data/combined_data_2.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user  system elapsed
23.246   1.890  29.776
> ptm<-proc.time()
> d3<- fread("./Netflix_data/combined_data_3.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user  system elapsed
28.087   1.494  34.298
> ptm<-proc.time()
> d4<- fread("./Netflix_data/combined_data_4.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user  system elapsed
22.398   0.702  23.462

Let us now merge all these datasets together using the rbind command:

> ptm<-proc.time()
> data_net=do.call(rbind,list(d1,d2,d3,d4))
> x_time=proc.time()-ptm
> x_time
user  system elapsed
18.205   0.352  18.638
> ptm<-proc.time()
> dim(data_net)
[1] 100498277         1
> x_time=proc.time()-ptm
> x_time
user  system elapsed
0.003   0.000   0.003

Here is code that builds a single dataset with four columns: Movie ID, Customer ID, Rating and Date. Unfortunately this code is too slow to be useful; we will show later how to improve it using other massive-data tools.
dd1=data.frame()
i=1
p=nrow(data_net)
while(i < (p+1)){
  # progress counter for the loop
  cat("\r", i, "of", p)
  flush.console()
  x=sub(":","",data_net$V1[i])   # movie ID, stripped of its trailing ":"
  ztt=data.frame()
  i=i+1
  # accumulate rating rows until the next movie-ID row (or the end of the data)
  while(i <= p && !grepl(":",data_net$V1[i],fixed=TRUE)){
    zt=unlist(strsplit(data_net$V1[i],","))
    ztt=rbind.data.frame(ztt,data.frame(t(zt),stringsAsFactors = F),make.row.names = F)
    i=i+1
  }
  ztt$MovieID=rep(x,nrow(ztt))
  colnames(ztt)=c("CustomerID","Rate","Date","MovieID")
  dd1=rbind.data.frame(dd1,ztt,stringsAsFactors = F,make.row.names = F)
}
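For comparison, the same movie-ID split can be expressed without an explicit loop. This vectorized data.table sketch is my own illustration on a toy sample, not the improved solution announced above:

```r
library(data.table)

# toy stand-in for the combined ratings data: movie-ID rows end with ":"
dt <- data.table(V1 = c("1:", "1488844,3,2005-09-06", "822109,5,2005-05-13",
                        "2:", "885013,4,2005-10-19"))

# group rows by the most recent movie-ID row above them,
# then carry that movie ID down onto its rating rows
dt[, grp := cumsum(grepl(":", V1, fixed = TRUE))]
dt[, MovieID := sub(":", "", V1[1], fixed = TRUE), by = grp]

# keep only the rating rows and split them into their three fields
ratings <- dt[!grepl(":", V1, fixed = TRUE)]
ratings[, c("CustomerID", "Rate", "Date") := tstrsplit(V1, ",", fixed = TRUE)]
ratings[, c("V1", "grp") := NULL]
```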