Large and Massive Data with R
What are large and massive data?
R is not well suited to working with data structures larger than about 10-20% of a computer’s RAM, and data exceeding 50% of available RAM are essentially unusable. We therefore consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.
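As a rough illustration of these thresholds, and only as a sketch (the 16 GB of RAM below is an assumption, not a value measured in this lesson), you can compare a file’s size with your machine’s memory before importing it; here we use the flight_large.csv file that appears later in the lesson:
ram_gb  <- 16                                       # assumption: set this to your machine's RAM
file_gb <- file.size("flight_large.csv") / 1024^3   # size of the file, in GB
if (file_gb > 0.5 * ram_gb) {
  cat("massive: more than 50% of RAM\n")
} else if (file_gb > 0.2 * ram_gb) {
  cat("large: more than 20% of RAM\n")
} else {
  cat("small enough to handle in memory\n")
}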
In this first lesson we will show how you can use powerful mathematical and data modeling R packages on large datasets, without the need for distributed computing.
In this lesson we will see how:
- you can use R packages such as ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance;
- you can apply statistical methods to large R objects through the biglm and ffbase packages;
- you can use the data.table package for faster data manipulation methods.
The ff, ffbase and ffbase2 packages
The title of the ff package is “Memory-efficient storage of large data on disk and fast access functions”. Indeed, this package works by chunking the dataset and storing it on your hard drive, while the ff structure in your R session keeps a mapping to the partitioned dataset. The chunks of raw data are simply binary flat files in native encoding, and the ff objects keep the metadata that describe and link to the created binary files.
Installation
> install.packages("ff")
> install.packages("ffbase")
> install.packages("ffbase2")
How the ff package works
Let’s show how we can import a dataset into R.
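Before importing the real data, here is a minimal sketch (assuming the ff package is installed and loaded; the object x below is purely illustrative) of the fact that an ff object is only a handle to values stored on disk:
library(ff)
x <- ff(1:10)    # a tiny ff vector: its values are written to a binary file in fftempdir
filename(x)      # path of the binary file backing x
x[3:5]           # values are read back from disk on demand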
> system("mkdir air_ffdf")
mkdir: air_ffdf: File exists
> dir_air=paste0(getwd(),"/air_ffdf")
> dir_air
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf"
> options(fftempdir = dir_air)
The airline dataset
> ptm<-proc.time()
> airline.ff<- read.csv.ffdf(file = "flight_ff.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..65499 (65499) csv-read=0.351sec ffdf-write=0.067sec
csv-read=0.351sec ffdf-write=0.067sec TOTAL=0.418sec
> x_time=proc.time()-ptm
> x_time
user system elapsed
0.353 0.035 0.424
Let’s recall that the size of the file is ~5 Mb. However, we can easily check that the new R object airline.ff has a very small size:
> format(object.size(airline.ff),"Mb")
[1] "0.1 Mb"
We can now notice in our working directory a new directory called air_ffdf, and we can see that it contains files like these:
> list.files(dir_air)
[1] "ffdf7051495570f.ff" "ffdf705151999fb.ff" "ffdf70516c0c4f7.ff" "ffdf7051cb88764.ff"
[5] "ffdf70531918fe9.ff" "ffdf7053616d57b.ff" "ffdf70536e939fe.ff" "ffdf70543cc388f.ff"
... Truncated output
> length(list.files(dir_air))
[1] 15
Every binary file corresponds to one variable in the data
> basename(filename(airline.ff$YEAR))
[1] "ffdf70536e939fe.ff"
Example. Let’s now try a larger dataset, with a size of ~150 Mb. This file contains ~950,000 records about flights in the US in September 2005.
> system("mkdir air_ffdf_l")
> dir_air_l=paste0(getwd(),"/air_ffdf_l")
> dir_air_l
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf_l"
> options(fftempdir = dir_air_l)
> file.size("flight_large.csv")/1024/1024
[1] 149.0488
> ptm<-proc.time()
> airline.ff_l<- read.csv.ffdf(file = "flight_large.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..951111 (951111) csv-read=8.92sec ffdf-write=0.694sec
csv-read=8.92sec ffdf-write=0.694sec TOTAL=9.614sec
> x_time=proc.time()-ptm
> x_time
user system elapsed
8.887 0.553 9.616
> dim(airline.ff_l)
[1] 951111 28
Let us compare with read_csv
of the readr
package
> library(readr)
> ptm<-proc.time()
> airline_l<-read_csv("flight_large.csv")
Parsed with column specification:
cols(
.default = col_integer(),
FL_DATE = col_date(format = ""),
UNIQUE_CARRIER = col_character(),
TAIL_NUM = col_character(),
ORIGIN = col_character(),
ORIGIN_CITY_NAME = col_character(),
ORIGIN_STATE_NM = col_character(),
DEST = col_character(),
DEST_CITY_NAME = col_character(),
DEST_STATE_NM = col_character(),
CANCELLATION_CODE = col_character()
)
See spec(...) for full column specifications.
|===================================================================================| 100% 149 MB
> x_time=proc.time()-ptm
> x_time
user system elapsed
2.924 0.306 3.179
But we occupy much less memory with ff objects:
> format(object.size(airline_l),"Mb")
[1] "138.2 Mb"
> format(object.size(airline.ff_l),"Mb")
[1] "0.5 Mb"
Example 3. We will now try an even larger dataset. The file airline_20MM.csv is extracted from the same website, but it contains 20 million records and its size is 1.5 Gb. It is then not advisable to import it the classical way with the read_csv command, because your computer may crash. We show below how much time and memory it takes to import this data into R.
> system("mkdir air_ffdf_20mm")
> dir_air_20mm=paste0(getwd(),"/air_ffdf_20mm")
> dir_air_20mm
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/air_ffdf_20mm"
> options(fftempdir = dir_air_20mm)
> ptm<-proc.time()
> airline.ff_20mm<- read.csv.ffdf(file = "airline_20MM.csv",VERBOSE = TRUE,first.rows = -1,header=T)
read.table.ffdf 1..20000000 (20000000) csv-read=175.176sec ffdf-write=8.6sec
csv-read=175.176sec ffdf-write=8.6sec TOTAL=183.776sec
> x_time=proc.time()-ptm
> x_time
user system elapsed
163.075 13.505 183.835
Computing cross tabs with ff objects
> ptm<-proc.time()
> t1.ff<-table.ff(airline.ff_l$ORIGIN_STATE_NM,airline.ff_l$DEST_STATE_NM)
> x_time=proc.time()-ptm
> x_time
user system elapsed
0.439 0.034 0.478
>
> format(object.size(t1.ff),"Kb")
[1] "18.6 Kb"
And now let us consider the classic way:
> ptm<-proc.time()
> t1<-table(airline_l$ORIGIN_STATE_NM,airline_l$DEST_STATE_NM)
> x_time=proc.time()-ptm
> x_time
user system elapsed
0.135 0.024 0.158
>
> format(object.size(t1),"Kb")
[1] "18.6 Kb"
We will then show, using the ffbase and ffbase2 packages, how to use the ffdfdply function and how to subset ff objects and export them to csv files.
Let’s first show how we can describe one numeric variable.
> library(Hmisc)
> ptm<-proc.time()
> dd=describe(as.data.frame.ffdf(airline.ff_20mm$Distance))
> x_time=proc.time()-ptm
> x_time
user system elapsed
19.006 2.602 22.462
> dd
as.data.frame.ffdf(airline.ff_20mm$Distance)
n missing distinct Info Mean Gmd .05 .10 .25 .50
2e+07 0 1531 1 726 582.6 144 199 321 576
.75 .90 .95
952 1522 1964
lowest : 11 16 17 18 21, highest: 4243 4431 4433 4502 4962
> max(airline.ff_20mm$Distance)
[1] 4962
Let me recall that Gmd stands for Gini’s mean difference. It is computed for numeric variables and is a robust measure of dispersion: the mean absolute difference between all pairs of observations.
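As a tiny worked check of that definition (an added illustration, not part of the original output), the Gmd of the vector (2, 5, 9) is the mean of |2 − 5|, |2 − 9| and |5 − 9|, i.e. 14/3 ≈ 4.67:
x <- c(2, 5, 9)
d <- abs(outer(x, x, "-"))     # matrix of pairwise absolute differences
mean(d[lower.tri(d)])          # average over the 3 distinct pairs: 4.666667
# Hmisc::GiniMd(x) should return the same value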
> ptm<-proc.time()
> dist_cat=cut.ff(airline.ff_20mm$Distance,breaks = c(0,150,300,450,600,750,900,1050,1200,2000,5000),right=F)
> x_time=proc.time()-ptm
> x_time
user system elapsed
2.418 0.586 3.035
Let’s then compute the frequency of each category
> ptm<-proc.time()
> zz=table.ff(dist_cat)
> zz=as.data.frame(zz)
> zz
Var1 Freq
1 [0,150) 1045424
2 [150,300) 3429694
3 [300,450) 3582526
4 [450,600) 2512883
5 [600,750) 2174993
6 [750,900) 1663260
7 [900,1.05e+03) 1602199
8 [1.05e+03,1.2e+03) 953791
9 [1.2e+03,2e+03) 2103486
10 [2e+03,5e+03) 931744
> x_time=proc.time()-ptm
> x_time
user system elapsed
10.105 0.506 10.640
I will now change the names of the last 4 categories and plot a bar chart using the googleVis package.
> zz$Var1=as.character(zz$Var1)
> zz$Var1[7:10]=c("[900,1050)","[1050,1200)","[1200,2000)","More than 2000")
> zz
Var1 Freq
1 [0,150) 1045424
2 [150,300) 3429694
3 [300,450) 3582526
4 [450,600) 2512883
5 [600,750) 2174993
6 [750,900) 1663260
7 [900,1050) 1602199
8 [1050,1200) 953791
9 [1200,2000) 2103486
10 More than 2000 931744
> colnames(zz)[2]="# flights"
> library(googleVis)
> Column <- gvisColumnChart(zz,options=list(legend="none",width=1000, height=600,
+ title="Number of flights by distance category (miles)"))
> plot(Column)
How does the ffdfdply function work?
We will show with an example how we can use the ffdfdply function. It allows us to carry out operations on chunks of data. In the example below we compute, from the data imported from the file airline_20MM.csv, the average departure delay by day of the month.
We will also use the summaryBy function from the doBy package to calculate group-wise summary statistics, much like the SUMMARY procedure of SAS.
> library(doBy)
> ptm<-proc.time()
> DepDelayByDayofMonth <- ffdfdply(airline.ff_20mm,
+ split = airline.ff_20mm$DayofMonth,
+ FUN=function(x) {
+ summaryBy(DepDelay~DayofMonth,
+ data=x, FUN=mean, na.rm=TRUE)})
2018-10-29 09:55:19, calculating split sizes
2018-10-29 09:55:25, building up split locations
2018-10-29 09:55:38, working on split 1/31, extracting data in RAM of 1 split elements, totalling, 0.06439 GB, while max specified data specified using BATCHBYTES is 0.01562 GB
2018-10-29 09:55:42, ... applying FUN to selected data
...Truncated Output
> x_time=proc.time()-ptm
> x_time
user system elapsed
92.310 22.068 120.041
> ptm<-proc.time()
> ArrDelayByDayofMonth <- ffdfdply(airline.ff_20mm,
+ split = airline.ff_20mm$DayofMonth,
+ FUN=function(x) {
+ summaryBy(ArrDelay~DayofMonth,
+ data=x, FUN=mean, na.rm=TRUE)})
2018-10-29 10:13:56, calculating split sizes
2018-10-29 10:14:02, building up split locations
2018-10-29 10:14:14, working on split 1/31, extracting data in RAM of 1 split elements, totalling, 0.06439 GB, while max specified data specified using BATCHBYTES is 0.01562 GB
...Truncated Output
> x_time=proc.time()-ptm
> x_time
user system elapsed
96.844 20.907 120.721
Let us now construct a data frame with three columns: the day of the month, the average departure delay and the average arrival delay.
> d1=as.data.frame(DepDelayByDayofMonth)
> d1=d1[order(d1$DayofMonth),]
> d2=as.data.frame(ArrDelayByDayofMonth)
> d2=d2[order(d2$DayofMonth),]
> dd=cbind.data.frame(d1$DayofMonth,d1$DepDelay.mean,d2$ArrDelay.mean)
> colnames(dd)=c("Day of the Month","Departure Delay","Arrival Delay")
And here’s the graph with googleVis
> gvisline1 <- gvisAreaChart(dd,"Day of the Month", c("Departure Delay","Arrival Delay"),
+ options=list(width=1200, height=600,
+ series="[{targetAxisIndex: 0},
+ {targetAxisIndex:1}]"
+ ))
> plot(gvisline1)
We will extract from the ff object airline.ff_20mm the subset of flights that are delayed at both departure and arrival. We will also keep some other variables such as Distance, FlightNum, Year, etc.
> unique(airline.ff_20mm$IsDepDelayed)
ff (open) integer length=2 (2)
[1] [2]
0 1
> unique(airline.ff_20mm$IsArrDelayed)
ff (open) integer length=2 (2)
[1] [2]
0 1
> table.ff(airline.ff_20mm$IsArrDelayed,airline.ff_20mm$IsDepDelayed)
0 1
0 8788725 1827306
1 2733591 6650378
> ptm<-proc.time()
> subs1.ff <- subset.ffdf(airline.ff_20mm, IsDepDelayed == 1 & IsArrDelayed==1,
+ select = c(Distance, ArrDelay,
+ DepDelay,
+ FlightNum,
+ Year,
+ Month,
+ DayofMonth))
> x_time=proc.time()-ptm
> x_time
user system elapsed
2.139 1.030 3.226
> dim(subs1.ff)
[1] 6650378 7
Exercise. Write R code that gives the average arrival and departure delays for every flight number and year, and draw a bubble chart using googleVis.
Saving and loading ff objects
Now we may need to save the newly created data and load it later. Exactly as before, we need to create a new directory to store the new binary files.
> system("mkdir subs_ff")
mkdir: subs_ff: File exists
> dir_subs1=paste0(getwd(),"/subs1_ff")
> dir_subs1
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff"
> save.ffdf(subs1.ff,dir = dir_subs1)
> list.files(dir_subs1)
[1] "subs1.ff$ArrDelay.ff" "subs1.ff$DayofMonth.ff" "subs1.ff$DepDelay.ff"
[4] "subs1.ff$Distance.ff" "subs1.ff$FlightNum.ff" "subs1.ff$Month.ff"
[7] "subs1.ff$Year.ff"
Let us now remove the newly created data from our R
environment and load it again into R
> rm(subs1.ff)
> load.ffdf(dir_subs1)
> ls()
... Truncated output
[13] "dir_subs1" "dist_cat" "gvisline1"
[16] "ptm" "subs1.ff" "t1"
....
To end this section, we now show how to save the data to a csv file.
> ptm<-proc.time()
> write.csv.ffdf(subs1.ff, "subset1.csv", VERBOSE = TRUE)
write.table.ffdf 1..opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Distance.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$ArrDelay.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$DepDelay.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$FlightNum.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Year.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$Month.ff
opening ff /Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/subs1_ff/subs1.ff$DayofMonth.ff
599186 (599186, 9%) ffdf-read=0.281sec csv-write=1.748sec
write.table.ffdf 599187..1198372 (599186, 18%) ffdf-read=0.246sec csv-write=1.718sec
.... Truncated Output
write.table.ffdf 4194303..4793488 (599186, 72.1%) ffdf-read=0.423sec csv-write=1.674sec
write.table.ffdf 4793489..5392674 (599186, 81.1%) ffdf-read=0.474sec csv-write=1.695sec
write.table.ffdf 5392675..5991860 (599186, 90.1%) ffdf-read=0.235sec csv-write=1.733sec
.... Truncated output
> x_time=proc.time()-ptm
> x_time
user system elapsed
21.566 0.987 23.170
Generalized Linear Models with ff objects
In this section I will show an example of fitting Generalized Linear Models on a massive dataset. This example can be found in the book “Big Data Analytics with R”.
The data we will use are available from the Machine Learning Repository maintained by the University of California Irvine at http://archive.ics.uci.edu/ml/index.html. It is the Chronic Kidney Disease Dataset.
The data come in .arff format (Attribute-Relation File Format), an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. To import this data we need to install and load the RWeka package.
We import the data into R
> library(RWeka)
> ckd <- read.arff("ckd_full.arff")
> class(ckd)
[1] "data.frame"
> colnames(ckd)
[1] "age" "bp" "sg" "al" "su" "rbc" "pc" "pcc" "ba" "bgr"
[11] "bu" "sc" "sod" "pot" "hemo" "pcv" "wbcc" "rbcc" "htn" "dm"
[21] "cad" "appet" "pe" "ane" "class"
> dim(ckd)
[1] 400 25
We need to recode the target variable class
to a binary variable
> library(ETLUtils)
> ckd$class <- recoder(ckd$class, from = c(1,2), to=c(1,0))
> unique(ckd$class)
[1] 1 0
> table(ckd$class)
0 1
150 250
First Generalized Linear Model
> model0 <- glm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = ckd,
+ family=binomial(link = "logit"),
+ na.action = na.omit)
> summary(model0)
Call:
glm(formula = class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
family = binomial(link = "logit"), data = ckd, na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3531 -0.0441 -0.0025 0.0016 3.2946
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.6368996 6.2190363 2.032 0.042157 *
age 0.0238911 0.0327356 0.730 0.465500
bp 0.1227932 0.0591324 2.077 0.037840 *
bgr 0.0792933 0.0212243 3.736 0.000187 ***
bu 0.0026834 0.0296290 0.091 0.927838
rbcc -0.9518575 0.8656291 -1.100 0.271501
wbcc 0.0002601 0.0002432 1.070 0.284725
hemo -2.3246914 0.6404712 -3.630 0.000284 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 325.910 on 236 degrees of freedom
Residual deviance: 35.121 on 229 degrees of freedom
(163 observations deleted due to missingness)
AIC: 51.121
Number of Fisher Scoring iterations: 9
Now we will use the bigglm.ffdf() function from the ffbase package to perform a GLM. But bigglm.ffdf can only be applied to ff objects, which is why we will transform the ckd data into an ffdf object. We first need to create a directory where the binary files will be stored.
> library(ffbase)
> system("mkdir ckd_dir")
> dir_ckd=paste0(getwd(),"/ckd_dir")
> dir_ckd
[1] "/Users/dhafermalouche/Documents/Teaching_2018_2019/BigData_2018/ckd_dir"
> options(fftempdir = dir_ckd)
> ckd.ff <- as.ffdf(ckd)
> list.files(dir_ckd)
[1] "ffdf705103d9817.ff" "ffdf70514602150.ff" "ffdf70514675b1.ff"
[4] "ffdf705166a2bc7.ff" "ffdf7051b3fd709.ff" "ffdf7051c1fd6ba.ff"
[7] "ffdf7051db22782.ff"
... Truncated output
We then use the bigglm.ffdf function:
> model1 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
+ data = ckd.ff, family=binomial(link = "logit"),
+ na.action = na.exclude)
> summary(model1)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = ckd.ff,
family = binomial(link = "logit"), na.action = na.exclude)
Sample size = 400
failed to converge after 8 iterations
Coef (95% CI) SE p
(Intercept) 12.6330 0.3275 24.9384 6.1527 0.0401
age 0.0239 -0.0410 0.0888 0.0324 0.4616
bp 0.1228 0.0061 0.2394 0.0583 0.0353
bgr 0.0793 0.0375 0.1210 0.0209 0.0001
bu 0.0027 -0.0558 0.0611 0.0292 0.9268
rbcc -0.9515 -2.6587 0.7557 0.8536 0.2650
wbcc 0.0003 -0.0002 0.0007 0.0002 0.2780
hemo -2.3239 -3.5804 -1.0675 0.6282 0.0002
We will now show how this function behaves with massive data. We will expand the data to 8,000,000 observations according to the following process.
> x=ffseq_len(4)
> x
ff (open) integer length=4 (4)
[1] [2] [3] [4]
1 2 3 4
> expand.ffgrid(x, ff(1:2))
ffdf (all open) dim=c(8,2), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix
Var1 Var1 integer integer FALSE FALSE FALSE
Var2 Var2 integer integer FALSE FALSE FALSE
PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
Var1 1 1 1 TRUE
Var2 2 1 1 TRUE
ffdf data
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 4 2
Let’s do the same with ckd by expanding it to 8,000,000 = 400 x 20,000 observations.
> ckd.ff$id <- ffseq_len(nrow(ckd.ff))
> ckd.ff$id
ff (open) integer length=400 (400)
[1] [2] [3] [4] [5] [6] [7] [8] [393] [394] [395] [396] [397]
1 2 3 4 5 6 7 8 : 393 394 395 396 397
[398] [399] [400]
398 399 400
> ptm<-proc.time()
> big.ckd.ff <- expand.ffgrid(ckd.ff$id, ff(1:20000))
> x_time=proc.time()-ptm
> x_time
user system elapsed
0.423 0.277 0.815
> colnames(big.ckd.ff) <- c("id","expand")
> head(big.ckd.ff)
ffdf (all open) dim=c(8000000,2), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix
id Var1 integer integer FALSE FALSE FALSE
expand Var2 integer integer FALSE FALSE FALSE
PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
id 1 1 1 TRUE
expand 2 1 1 TRUE
ffdf data
id expand
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
: : :
7999993 393 20000
7999994 394 20000
7999995 395 20000
7999996 396 20000
7999997 397 20000
7999998 398 20000
7999999 399 20000
8000000 400 20000
And finally we merge big.ckd.ff with ckd.ff in order to expand the ckd data to 8,000,000 observations.
> ptm<-proc.time()
> big.ckd.ff <- merge.ffdf(big.ckd.ff, ckd.ff,
+ by.x="id", by.y="id",
+ all.x=TRUE, all.y=FALSE)
> x_time=proc.time()-ptm
> x_time
user system elapsed
3.860 2.611 7.100
> dim(big.ckd.ff)
[1] 8000000 27
Let us then fit the logistic model on the newly created data:
> ptm<-proc.time()
> model2 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
+ data = big.ckd.ff, family=binomial(),
+ na.action = na.exclude)
Warning message:
In bigglm.function(formula, data = ffchunk, family = family, ...) :
ran out of iterations and failed to converge
> x_time=proc.time()-ptm
> x_time
user system elapsed
97.958 6.797 105.977
> summary(model2)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = big.ckd.ff,
family = binomial(), na.action = na.exclude)
Sample size = 8e+06
failed to converge after 8 iterations
Coef (95% CI) SE p
(Intercept) 12.6330 12.5460 12.7200 0.0435 0
age 0.0239 0.0234 0.0243 0.0002 0
bp 0.1228 0.1219 0.1236 0.0004 0
bgr 0.0793 0.0790 0.0796 0.0001 0
bu 0.0027 0.0023 0.0031 0.0002 0
rbcc -0.9515 -0.9635 -0.9394 0.0060 0
wbcc 0.0003 0.0003 0.0003 0.0000 0
hemo -2.3239 -2.3328 -2.3150 0.0044 0
We can notice here that there is a warning about the convergence of the algorithm. There is no exact rule to avoid this convergence failure, but we can play with two arguments of the bigglm.ffdf function, chunksize and maxit, as we do here:
> ptm<-proc.time()
> model2 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
+ data = big.ckd.ff, family=binomial(),
+ na.action = na.exclude, sandwich = TRUE,
+ chunksize = 20000, maxit = 40)
> x_time=proc.time()-ptm
> x_time
user system elapsed
337.946 44.423 385.623
> summary(model2)
Large data regression model: bigglm(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo, data = big.ckd.ff,
family = binomial(), na.action = na.exclude, sandwich = TRUE,
chunksize = 20000, maxit = 40)
Sample size = 8e+06
Coef (95% CI) SE p
(Intercept) 12.6369 12.5332 12.7406 0.0518 0
age 0.0239 0.0235 0.0243 0.0002 0
bp 0.1228 0.1223 0.1233 0.0002 0
bgr 0.0793 0.0789 0.0796 0.0002 0
bu 0.0027 0.0023 0.0030 0.0002 0
rbcc -0.9519 -0.9589 -0.9448 0.0035 0
wbcc 0.0003 0.0003 0.0003 0.0000 0
hemo -2.3247 -2.3375 -2.3118 0.0064 0
Sandwich (model-robust) standard errors
Netflix held the Netflix Prize, an open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor’s Pragmatic Chaos team. Here is the dataset that was used in that competition; it can be downloaded from the Kaggle website.
The data are stored in 4 txt files of ~400-500 Mb each, and our aim now is to import these files into R and merge them into a single dataset so that we can perform some statistical operations.
Here is what the data look like. The first row contains the movie code, and the following rows contain the customer ID, the rating (a score from 1 to 5) and the date the movie was watched. All fields are separated by a comma. However, we cannot simply read with sep="," because of the rows containing the movie IDs. We will therefore import each file, merge them into one dataset, and then write R code that detects the rows containing the movie IDs and separates them from the ratings.
1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
823519,3,2004-05-03
893988,3,2005-11-17
124105,4,2004-08-05
This time we won’t use the ff package; instead we will use the data.table package, which is, as we will see, faster. We will also show later how we can manipulate the data with this package to create a single dataset.
> library(data.table)
> rm(list=ls())
> ptm<-proc.time()
> d1<- fread("./Netflix_data/combined_data_1.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user system elapsed
13.614 0.330 14.044
> ptm<-proc.time()
> d2<- fread("./Netflix_data/combined_data_2.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user system elapsed
23.246 1.890 29.776
>
>
> ptm<-proc.time()
> d3<- fread("./Netflix_data/combined_data_3.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user system elapsed
28.087 1.494 34.298
> ptm<-proc.time()
> d4<- fread("./Netflix_data/combined_data_4.txt", header = F,sep = "\t")
|--------------------------------------------------|
|==================================================|
> x_time=proc.time()-ptm
> x_time
user system elapsed
22.398 0.702 23.462
Let us now merge all these datasets together using the rbind command:
> ptm<-proc.time()
> data_net=do.call(rbind,list(d1,d2,d3,d4))
> x_time=proc.time()-ptm
> x_time
user system elapsed
18.205 0.352 18.638
>
> ptm<-proc.time()
> dim(data_net)
[1] 100498277 1
> x_time=proc.time()-ptm
> x_time
user system elapsed
0.003 0.000 0.003
Here is code that could create one dataset with the columns Movie ID, Customer ID, Rating and Date. Unfortunately this code is too slow to be useful to us. We will show later how we can improve on it using other massive-data tools.
dd1 = data.frame()                          # will hold the final dataset
i = 1
p = nrow(data_net)
while (i < p) {
  cat("\r", i, "of", p)                     # progress counter for the loop
  flush.console()
  x = sub(":", "", data_net[[1]][i])        # movie ID taken from a header row such as "1:"
  ztt = data.frame()
  zz = FALSE
  while (zz == FALSE && i < p) {
    i = i + 1
    zz = grepl(":", data_net[[1]][i])       # TRUE when the next movie header row is reached
    if (!zz) {
      zt = unlist(strsplit(as.character(data_net[[1]][i]), ","))
      ztt = rbind.data.frame(ztt, as.data.frame(t(zt), stringsAsFactors = FALSE),
                             make.row.names = FALSE)
    }
  }
  ztt$MovieID = rep(x, nrow(ztt))
  colnames(ztt) = c("CustomerID", "Rate", "Date", "MovieID")
  dd1 = rbind.data.frame(dd1, ztt, stringsAsFactors = FALSE, make.row.names = FALSE)
}
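As a preview of a faster route (a sketch of my own, not necessarily the tool used later in the course; it assumes fread kept its default column name V1 for the single column), the movie-ID rows can be handled in a vectorized way with data.table instead of a row-by-row loop:
net <- copy(data_net)                       # data.table is already loaded; work on a copy
is_id  <- grepl(":$", net$V1)               # TRUE on movie-ID rows such as "1:"
movies <- sub(":$", "", net$V1[is_id])      # the movie codes, in order of appearance
net[, MovieID := movies[cumsum(is_id)]]     # carry each code down to its rating rows
net <- net[!is_id]                          # drop the movie-ID rows themselves
net[, c("CustomerID", "Rate", "Date") := tstrsplit(V1, ",", fixed = TRUE)]
net[, V1 := NULL]                           # keep only MovieID, CustomerID, Rate, Date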