How to Import … with

Data from pdf files (8/2018)?

This tutorial shows how to import data from a PDF document and use that data in R.

Data to be converted

  • Our aim is to obtain the population data in the delegations of Tunis.

  • We download a PDF document published by the General Commission of the Regional Development of Tunisia from the following link:

http://www.cgdr.nat.tn/upload/files/gouvchiffres/gech2015/Tunis.pdf

  • After downloading the report, save the file “Tunis.pdf” in your working directory.

  • We then want to extract the data from the table on page 8.

Using pdftables package

  • This package is easy to install
> install.packages("pdftables")
  • The first thing to do is to save the page containing the table in a separate PDF file, which we name Tunis_pop.pdf here.

  • You also need to sign up at https://pdftables.com and request an API key.

> library(pdftables)
> convert_pdf("Tunis_pop.pdf", "Tunis_pop.csv",api_key = "***********")
Converted Tunis_pop.pdf to Tunis_pop.csv
  • Import the data into R now.
> library(readr)
> Tunis_pop <- read_csv("Tunis_pop.csv")
Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4]
Parsed with column specification:
cols(
  `Le Gouvernorat de Tunis en chiffres` = col_character(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  `ماقرأ يف سنوت ةيلاو` = col_character()
)
> head(Tunis_pop)
# A tibble: 6 x 5
  `Le Gouvernorat de Tunis en chiffres` X2                                        X3    X4    `ماقرأ يف سنوت ةيلاو`
  <chr>                                 <chr>                                     <chr> <chr> <chr>                
1 NA                                    ةيدمتعملا بسح ناكسلا ددعروطت              NA    NA    NA                   
2 NA                                    EVOLUTION DE LA POPULATION PAR DELEGATION NA    NA    NA                   
3 DELEGATION                            2015                                      2014  2013  ةيدمتعملا            
4 Carthage                              24906                                     24216 24100 جاطرق                
5 Tunis Medina                          22009                                     21400 21298 ةنيدملا              
6 Bab Bhar                              37241                                     36210 36037 رحب باب              
  • Now your data is ready to be used in R.
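The imported table still carries two title rows and a header row inside the data, and every column is read as text. A minimal sketch of one way to tidy it is given here, using base R only. The data frame `raw` is a hand-typed stand-in for the first rows shown by head(Tunis_pop) (the Arabic column is left out, and the first title cell is replaced by a placeholder); the column names `delegation` and `pop_2015` etc. are assumptions chosen for illustration.

```r
# Hand-typed stand-in for the first rows of Tunis_pop (assumption: same layout).
raw <- data.frame(
  V1 = c(NA, NA, "DELEGATION", "Carthage", "Tunis Medina", "Bab Bhar"),
  V2 = c("(title row)", "EVOLUTION DE LA POPULATION PAR DELEGATION",
         "2015", "24906", "22009", "37241"),
  V3 = c(NA, NA, "2014", "24216", "21400", "36210"),
  V4 = c(NA, NA, "2013", "24100", "21298", "36037"),
  stringsAsFactors = FALSE
)

# Find the row that holds the real column headers.
header_row <- which(raw$V1 == "DELEGATION")

# Keep only the rows below the header row.
pop <- raw[(header_row + 1):nrow(raw), ]

# Build column names from the header row (hypothetical naming scheme).
names(pop) <- c("delegation", paste0("pop_", unlist(raw[header_row, -1])))

# Convert the population columns from text to numbers.
pop[-1] <- lapply(pop[-1], as.numeric)
rownames(pop) <- NULL

pop
```

With the real file, the same steps would be applied to Tunis_pop instead of `raw`, after checking which row actually contains the headers.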

G-Econ data? (10/2018)

On 8 October 2018 it was announced that Professor William Nordhaus of Yale University had been awarded the 2018 Nobel Memorial Prize in Economic Sciences. He shared the prize with Professor Paul M. Romer, an economist at New York University.

Professor William Nordhaus has spent the better part of four decades trying to persuade governments to address climate change, preferably by imposing a tax on carbon emissions.

His careful work has long since convinced most members of his own profession, and on Monday he was awarded the 2018 Nobel Memorial Prize in Economic Sciences in recognition of that achievement.

He also founded the G-Econ project (https://gecon.yale.edu), whose purpose is to develop a geophysically scaled economic data set (hence, G-Econ). The result is a global data set on economic activity for all terrestrial grid cells. As of September 2009, version G-Econ 3.1 is available; it covers 27,500 terrestrial grid cells and four years (1990, 1995, 2000, and 2005). The G-Econ data set is gridded at a resolution of 1 degree longitude by 1 degree latitude. This is approximately 100 km by 100 km, which is somewhat smaller than the size of the major sub-national political entities for most large countries (e.g., states in the United States, Laender in Germany, or oblasts in Russia) and approximately the same size as the second-level political entities in most countries (e.g., counties in the United States).
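The "approximately 100 km by 100 km" figure can be checked with a short calculation. The sketch below assumes a spherical Earth and takes one degree to be about 111.32 km at the equator (an approximation, not a value from the G-Econ documentation); the function name `cell_width_km` is made up for this example.

```r
# Approximate length of one degree at the equator (assumption: spherical Earth).
km_per_degree <- 111.32

# East-west width of a 1-degree cell shrinks with the cosine of the latitude.
cell_width_km <- function(lat_deg) km_per_degree * cos(lat_deg * pi / 180)

lats <- c(0, 30, 45, 60)
data.frame(
  latitude  = lats,
  height_km = km_per_degree,                  # north-south extent, roughly constant
  width_km  = round(cell_width_km(lats), 1)   # east-west extent
)
```

The north-south extent of a cell stays close to 111 km, while the east-west extent falls to about half of that at 60 degrees of latitude, so "100 km by 100 km" is best read as an order of magnitude.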

The main effort of this research is to create data on gross cell product. In addition, we have merged the economic data with other important demographic and geophysical data such as climate, physical attributes, location indicators, population, and luminosity. This data set is publicly available to all not-for-profit researchers. It will be helpful in environmental and economic studies of energy, environment, and global warming.

Below I present R code that extracts the whole dataset collected by Professor William Nordhaus and his collaborator Professor Xi Chen from Quinnipiac University in the G-Econ research project.

Here is my R code showing how to download the data to your computer. It is important to open a new project in RStudio in an empty folder before running the code below.

If you are not comfortable programming in R, you can still download the 92 files from the G-Econ website to your computer and import them, one by one, into your R environment.

  1. You need to extract the names of the countries listed in the G-Econ Data
> library(rvest)
> library(purrr)
> library(XML)
> library(RCurl)
> url<-"https://gecon.yale.edu/country-listing"
> 
> doc <- getURL(url)
> 
> 
> html <- htmlTreeParse(doc, useInternal = TRUE)
> txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
> txt<-unlist(txt)
> i=grep("\n",txt)
> txt=txt[-i]
> txt=txt[-c(1:23)]
> txt=txt[1:274]
> txt
> i=grepl("^\\s*$", txt)
> countries=txt[i==F]
> 
> countries
  2. Once we have the names of the countries, we extract the link to each file containing the data:
> library(stringr)
> links=vector('list',length(countries))
> zc=gsub(pattern = " ",replacement = "_",tolower(countries))
> zc[85]="uk"
> zc[86]="usa"
> for(i in 1:length(countries)){
+   print(i)
+   u1=paste0("https://gecon.yale.edu/",zc[i])
+   d1<- getURL(u1)
+   
+   matched <- str_match_all(d1, "<a href=\"(.*?)\"")
+   r=grep("_upload",matched[[1]])
+   x=gsub("<a href=\"https://", "", matched[[1]][r[1],1])
+   xpath=paste0("https://",x)
+   xpath=gsub(pattern = "\"",replacement = "",xpath)
+   links[[i]]=xpath
+ }
> links
> links[[85]]="https://gecon.yale.edu/sites/default/files/uk_upload_xi_101509.xls"
> links[[15]]="https://gecon.yale.edu/sites/default/files/central_africa_upload_mm_051905_0.xls"
> links[[86]]="https://gecon.yale.edu/sites/default/files/upload_us_xi_090110.xls"
> links[[75]]="https://gecon.yale.edu/sites/default/files/southafrica_upload_xi_091510.xls"
> links[[21]]="https://gecon.yale.edu/sites/default/files/czech_republic_upload_xi_101509.xls"
> links[[76]]="https://gecon.yale.edu/sites/default/files/south_korea_upload_qa_061005.xls"
> links[[71]]="https://gecon.yale.edu/sites/default/files/saudi_arabia_upload_qa_050205_0.xls"
> links[[64]]="https://gecon.yale.edu/sites/default/files/papua_new_guinea_upload_qa_061305.xls"
> links[[44]]="https://gecon.yale.edu/sites/default/files/lao_pdr_upload_qa_061505_0.xls"
> links[[61]]="https://gecon.yale.edu/sites/default/files/north_korea_upload_qa_051905.xls"
> links[[58]]="https://gecon.yale.edu/sites/default/files/new_zealand_upload_mm_071805.xls"

Notice that some links have to be set by hand because the names of those files do not follow the same naming convention as the others.
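The link extraction in the loop can be illustrated on a small example. The sketch below uses base R only and a made-up HTML fragment with a made-up URL (example.org); the helper name `first_xls_link` is hypothetical. It returns NA when no .xls link is found, which makes it easy to list the pages that need a manual link.

```r
# Return the first link ending in .xls found in an HTML string, or NA if none.
first_xls_link <- function(page) {
  # Pull out every href="..." attribute.
  hrefs <- regmatches(page, gregexpr('href="[^"]*"', page))[[1]]
  # Strip the href=" prefix and the closing quote.
  hrefs <- gsub('^href="|"$', "", hrefs)
  # Keep only links that end in .xls.
  xls <- hrefs[grepl("\\.xls$", hrefs)]
  if (length(xls) == 0) NA_character_ else xls[1]
}

# Made-up page fragment; the real pages may be laid out differently.
page <- paste0(
  '<a href="/about">About</a>',
  '<a href="https://example.org/files/demo_upload.xls">Data</a>'
)

first_xls_link(page)
first_xls_link("<p>no links here</p>")
```

Applied to each downloaded page, `which(is.na(...))` on the results would show which countries still need a link entered by hand.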

  3. Now download all the Excel files:
> for(j in 1:length(links)){
+   u1=links[[j]]
+   # mode = "wb" keeps the binary .xls files intact on Windows
+   download.file(url = u1,destfile = paste0(zc[j],".xls"),mode = "wb")
+ }
  4. Finally, you can import all these files into R:
> library(readxl)
> list.files()
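> i=grep("\\.xls$",list.files())
> length(i)
> length(zc)
> # build the file names from zc so that each file is matched to the right country
> files_xls=paste0(zc,".xls")
> 
> data_list=vector('list',length(zc))
> for(i in 1:length(zc)){
+   data_list[[i]]<- read_excel(files_xls[i])
+   # drop the first row of each sheet
+   data_list[[i]]<-data_list[[i]][-1,]
+ }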
> 
> names(data_list)=zc
> 
> for(i in 1:length(zc)){
+   colnames(data_list[[i]])=tolower(colnames(data_list[[i]]))
+ }
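Once the files are in a named list, it is often convenient to stack them into a single table. The sketch below shows one way to do this in base R on a toy list; the column names (`lat`, `longitude`, `value`, `note`) and the values are invented for the example, and the real G-Econ sheets may not share the same columns, which is why only the columns common to all elements are kept.

```r
# Toy stand-in for data_list (assumption: invented columns and values).
data_list <- list(
  tunisia = data.frame(lat = c(33, 34), longitude = c(9, 9), value = c(1.2, 3.4)),
  uk      = data.frame(lat = 51, longitude = 0, value = 5.6, note = "extra column")
)

# Columns present in every element of the list.
common <- Reduce(intersect, lapply(data_list, names))

# Add the list name as a 'source' column and stack the pieces.
stacked <- do.call(
  rbind,
  Map(function(d, nm) data.frame(source = nm, d[common], stringsAsFactors = FALSE),
      data_list, names(data_list))
)
rownames(stacked) <- NULL

stacked
```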

Enjoy the data :)

GDP data at the level of grid cells? (10/2018)

Downloading the data

The global data is available at the following link:

https://datadryad.org/bitstream/handle/10255/dryad.153899/GDP_PPP_1990_2015_5arcmin_v2.nc?sequence=1

This dataset represents the gross domestic product (GDP) of each grid cell. GDP is given in 2011 international US dollars. The data is derived from GDP per capita (PPP), which is multiplied by the gridded population data HYDE 3.2 (for the years without population data, 1991-1999, values were linearly interpolated at grid scale from the data for 1990 and 2000). The dataset has global extent at 5 arc-min resolution for the 26-year period 1990-2015. A detailed description is given in a linked article, and metadata is provided as an attribute in the NetCDF file itself.
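The construction described above can be sketched for a single grid cell. All numbers below are made up for illustration; only the method (linear interpolation of population between 1990 and 2000, then multiplication by GDP per capita) and the 5 arc-min resolution come from the description.

```r
# Made-up population of one grid cell in the two years with data.
years_known <- c(1990, 2000)
pop_known   <- c(1000, 1500)

# Linear interpolation for every year from 1990 to 2000.
years <- 1990:2000
pop   <- approx(x = years_known, y = pop_known, xout = years)$y

# Made-up GDP per capita (PPP), held constant to keep the example simple.
gdp_per_capita <- rep(8000, length(years))

# GDP of the cell = GDP per capita * population.
gdp_cell <- gdp_per_capita * pop
data.frame(year = years, population = pop, gdp = gdp_cell)

# Size of a global grid at 5 arc-min resolution.
n_lon <- 360 * 60 / 5   # cells along longitude
n_lat <- 180 * 60 / 5   # cells along latitude
c(n_lon, n_lat)
```

The grid size computed at the end gives 4320 by 2160 cells for a full global grid at this resolution.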

Importing the data into R

  • Since the data is in NetCDF (.nc) format, we need the RNetCDF package to import it into R
> library(RNetCDF)
> f1<-open.nc("GDP_PPP_1990_2015_5arcmin_v2.nc",write=TRUE)
  • List the variables in f1
> library(ncdf.tools)
> infoNcdfVars(f1)
  id    name                                  unit n.dims     type n.values range     1.dim
4  3 GDP_PPP constant 2011 international US dollar      3 NC_FLOAT        0       longitude
     2.dim 3.dim dim.id.1 dim.id.2 dim.id.3
4 latitude  time        0        1        2

Extracting the GDP of Tunisia

  • We first extract the content of the GDP_PPP variable.
> x<-var.get.nc(f1, "GDP_PPP")
> dim(x)
[1] 4320 2160   26
  • Extract the longitude, latitude and time variables
> lon <-var.get.nc(f1, "longitude")
> lat <-var.get.nc(f1, "latitude")
> time <- var.get.nc(f1, "time")
  • We will use this function to keep only the points inside Tunisia:
> library(sp)
> library(rworldmap)
> coords2country = function(points)
+ {  
+   countriesSP <- getMap(resolution='low')
+   pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))  
+  
+   indices = over(pointsSP, countriesSP)
+   indices$ADMIN  
+  }
  • Now we write an R function to extract the data
> library(reshape2)   # provides melt()
> extract_gdp<-function(j){
+   x1=x[,,j]
+   rownames(x1)=lon
+   colnames(x1)=lat
+   x1l=melt(x1)
+   colnames(x1l)=c("lon","lat","gdp_ppp")
+   i.lon=which(x1l$lon>=5.707273 & x1l$lon<=15.228166)
+   x1l=x1l[i.lon,]
+   
+   i.lat=which(x1l$lat>=25.78700 & x1l$lat<=43.30721)
+   x1l=x1l[i.lat,]
+   ct<-coords2country(x1l[,1:2])
+   i.tun=which(ct=="Tunisia")
+   x1tun=x1l[i.tun,]
+   x1tun
+ }
  • Apply the previous function to each of the 26 years
> x_gdp_ppp=vector('list',26)
> names(x_gdp_ppp)=time
> ptm=proc.time()
> for(j in 1:26){
+   Sys.sleep(.1)     
+   cat(j, "of 26\r") 
+   flush.console()
+   x_gdp_ppp[[j]]=extract_gdp(j)
+ }
> proc.time()-ptm
   user  system elapsed 
 85.259  11.394  99.761 
> library(plyr)
> x_gdp_ppp_dt=ldply(x_gdp_ppp)
> colnames(x_gdp_ppp_dt)[1]="year"
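As a possible next step, the cell-level values can be summed into one national total per year. The sketch below works on a small invented data frame, `toy`, with the same column names as x_gdp_ppp_dt; the values are made up, and the missing values are included only to show how they are handled.

```r
# Invented stand-in with the same columns as x_gdp_ppp_dt.
toy <- data.frame(
  year    = rep(c(1990, 1991), each = 3),
  lon     = rep(c(9.0, 9.5, 10.0), times = 2),
  lat     = rep(c(34.0, 34.0, 34.5), times = 2),
  gdp_ppp = c(10, NA, 5, 12, NA, 6)
)

# Sum GDP over all cells for each year; rows with missing GDP are dropped
# by the default na.action of the formula interface.
totals <- aggregate(gdp_ppp ~ year, data = toy, FUN = sum)
totals
```

With the real data, the year column comes from the names of the list and may be stored as text, so converting it with as.numeric() before plotting may be needed.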
