How to Import … with
The aim of this tutorial is to show you how you can import data from a pdf document and how you can use this data in R
.
Our aim is to obtain the population data in the delegations of Tunis.
We download a pdf document published by the General Commission of the Regional Development of Tunisia in the following link
http://www.cgdr.nat.tn/upload/files/gouvchiffres/gech2015/Tunis.pdf
After downloading the report, save the file “Tunis.pdf” in your working directory.
We would like then to extract the data the table of page 8.
pdftables
package> install.packages("pdftables")
First thing to do is to save the page containing the table in a separate pdf file. We named here Tunis_pop.pdf
You need also to sign up at https://pdftables.com and ask for an API key
> library(pdftables)
> convert_pdf("Tunis_pop.pdf", "Tunis_pop.csv",api_key = "***********")
Converted Tunis_pop.pdf to Tunis_pop.csv
R
now.> library(readr)
> Tunis_pop <- read_csv("Tunis_pop.csv")
Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4]Parsed with column specification:
cols(
`Le Gouvernorat de Tunis en chiffres` = col_character(),
X2 = col_character(),
X3 = col_character(),
X4 = col_character(),
`ماقرأ يف سنوت ةيلاو` = col_character()
)
> head(Tunis_pop)
# A tibble: 6 x 5
`Le Gouvernorat de Tunis en chiffres` X2 X3 X4 `ماقرأ يف سنوت ةيلاو`
<chr> <chr> <chr> <chr> <chr>
1 NA ةيدمتعملا بسح ناكسلا ددعروطت NA NA NA
2 NA EVOLUTION DE LA POPULATION PAR DELEGATION NA NA NA
3 DELEGATION 2015 2014 2013 ةيدمتعملا
4 Carthage 24906 24216 24100 جاطرق
5 Tunis Medina 22009 21400 21298 ةنيدملا
6 Bab Bhar 37241 36210 36037 رحب باب
R
.It was announced in 10/8/2018 that Professor William Nordhaus from Yale University was awarded the 2018 Nobel Memorial Prize in Economic Sciences. He shared this prize with Professor Paul M. Romer, an economist at New York University
Professor William Nordhaus has spent the better part of four decades trying to persuade governments to address climate change, preferably by imposing a tax on carbon emissions.
His careful work has long since convinced most members of his own profession, and on Monday he was awarded the 2018 Nobel Memorial Prize in Economic Sciences in recognition of that achievement.
He has also founded the G-Econ project (https://gecon.yale.edu) which purpose is to develop a geophysically scaled economic data set (hence, G-Econ). The result is a global data set on economic activity for all terrestrial grid cells. As of September 2009, version G-Econ 3.1 is available, which includes 27,500 terrestrial grid cells and includes four years (1990, 1995, 2000, and 2005). The G-Econ project is gridded data set at a 1 degree longitude by 1 degree latitude resolution. This is approximately 100 km by 100 km, which is somewhat smaller than the size of the major sub-national political entities for most large countries (e.g., states in the United States, Laender in Germany, or oblasts in Russia) and approximately the same size as the second level political entities in most countries (e.g., counties in the United States).
The main effort of this research is to create data on gross cell product. In addition, we have merged the economic data with other important demographic and geophysical data such as climate, physical attributes, location indicators, population, and luminosity. This data set is publicly available to all not-for-profit researchers. It will be helpful in environmental and economic studies of energy, environment, and global warming.
I’m presenting below an R
code that helps to extract the whole dataset collected by Professor William Nordhaus and his collaborator Professor Xi Chen from Quinnipiac University in the G-Econ research project.
Here’s then my R
code on how can you extract the data in your computer. It’s important to open a new project in RStudio in an empty forlder before running the code below.
If you’re not a good R
programmer you can still try to download the 92 files from G-Econ website in your computer and import them, one by one, into your R
environment.
> library(rvest)
> library(purrr)
> library(XML)
> library(RCurl)
> url<-"https://gecon.yale.edu/country-listing"
>
> doc <- getURL(url)
>
>
> html <- htmlTreeParse(doc, useInternal = TRUE)
> txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
> txt<-unlist(txt)
> i=grep("\n",txt)
> txt=txt[-i]
> txt=txt[-c(1:23)]
> txt=txt[1:274]
> txt
> i=grepl("^\\s*$", txt)
> countries=txt[i==F]
>
> countries
> library(stringr)
> links=vector('list',length(countries))
> zc=gsub(pattern = " ",replacement = "_",tolower(countries))
> zc[85]="uk"
> zc[86]="usa"
> for(i in 1:length(countries)){
+ print(i)
+ u1=paste0("https://gecon.yale.edu/",zc[i])
+ d1<- getURL(u1)
+
+ matched <- str_match_all(d1, "<a href=\"(.*?)\"")
+ r=grep("_upload",matched[[1]])
+ x=gsub("<a href=\"https://", "", matched[[1]][r[1],1])
+ xpath=paste0("https://",x)
+ xpath=gsub(pattern = "\"",replacement = "",xpath)
+ links[[i]]=xpath
+ }
> links
> links[[85]]="https://gecon.yale.edu/sites/default/files/uk_upload_xi_101509.xls"
> links[[15]]="https://gecon.yale.edu/sites/default/files/central_africa_upload_mm_051905_0.xls"
> links[[86]]="https://gecon.yale.edu/sites/default/files/upload_us_xi_090110.xls"
> links[[75]]="https://gecon.yale.edu/sites/default/files/southafrica_upload_xi_091510.xls"
> links[[75]]="https://gecon.yale.edu/sites/default/files/south_korea_upload_qa_061005.xls"
> links[[21]]="https://gecon.yale.edu/sites/default/files/czech_republic_upload_xi_101509.xls"
> links[[76]]="https://gecon.yale.edu/sites/default/files/south_korea_upload_qa_061005.xls"
> links[[71]]="https://gecon.yale.edu/sites/default/files/saudi_arabia_upload_qa_050205_0.xls"
> links[[64]]="https://gecon.yale.edu/sites/default/files/papua_new_guinea_upload_qa_061305.xls"
> links[[44]]="https://gecon.yale.edu/sites/default/files/lao_pdr_upload_qa_061505_0.xls"
> links[[61]]="https://gecon.yale.edu/sites/default/files/north_korea_upload_qa_051905.xls"
> links[[58]]="https://gecon.yale.edu/sites/default/files/new_zealand_upload_mm_071805.xls"
Notice that some links need to be changed because the names of the files are not with the same standards as the other files.
> for(j in 1:length(links)){
+ u1=links[[j]]
+ download.file(url = u1,destfile = paste0(zc[j],".xls"))
+ }
R
:> library(readxl)
> list.files()
> i=grep("xls",list.files())
> length(i)
> length(zc)
> files_xls=list.files()[i]
>
> data_list=vector('list',length(zc))
> for(i in 1:length(zc)){
+ data_list[[i]]<- read_excel(files_xls[i])
+ data_list[[i]]<-data_list[[i]][-1,]
+ }
>
> names(data_list)=zc
>
> for(i in 1:length(zc)){
+ colnames(data_list[[i]])=tolower(colnames(data_list[[i]]))
+ }
Enjoy then the data :)
The data on the global is available in the following link:
https://datadryad.org/bitstream/handle/10255/dryad.153899/GDP_PPP_1990_2015_5arcmin_v2.nc?sequence=1
This dataset represents the gross domestic production (GDP) of each grid cell. GDP is given in 2011 international US dollars. The data is derived from GDP per capita (PPP) which is multiplied by gridded population data HYDE 3.2 (the years of population data not available (1991-1999) were linearly interpolated at grid scale based on data from years 1990 and 2000). Dataset has global extent at 5 arc-min resolution for the 26-year period of 1990-2015. Detail description is given in a linked article and metadata is provided as an attribute in the NetCDF file itself.
R
RNetCDF
package to import it into R
> library(RNetCDF)
> f1<-open.nc("GDP_PPP_1990_2015_5arcmin_v2.nc",write=TRUE)
f1
> library(ncdf.tools)
> infoNcdfVars(f1)
id name unit n.dims type n.values range 1.dim
4 3 GDP_PPP constant 2011 international US dollar 3 NC_FLOAT 0 longitude
2.dim 3.dim dim.id.1 dim.id.2 dim.id.3
4 latitude time 0 1 2
GDP_PPP
content variable.> x<-var.get.nc(f1, "GDP_PPP")
> dim(x)
[1] 4320 2160 26
> lon <-var.get.nc(f1, "longitude")
> lat <-var.get.nc(f1, "latitude")
> time <- var.get.nc(f1, "time")
> library(sp)
> library(rworldmap)
> coords2country = function(points)
+ {
+ countriesSP <- getMap(resolution='low')
+ pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))
+
+ indices = over(pointsSP, countriesSP)
+ indices$ADMIN
+ }
R
function to extract the data> extract_gdp<-function(j){
+ x1=x[,,j]
+ rownames(x1)=lon
+ colnames(x1)=lat
+ x1l=melt(x1)
+ colnames(x1l)=c("lon","lat","gdp_ppp")
+ i.lon=which(x1l$lon>=5.707273 & x1l$lon<=15.228166)
+ x1l=x1l[i.lon,]
+
+ i.lat=which(x1l$lat>=25.78700 & x1l$lat<=43.30721)
+ x1l=x1l[i.lat,]
+ ct<-coords2country(x1l[,1:2])
+ i.tun=which(ct=="Tunisia")
+ x1tun=x1l[i.tun,]
+ x1tun
+ }
> x_gdp_ppp=vector('list',26)
> names(x_gdp_ppp)=time
> ptm=proc.time()
> for(j in 1:26){
+ Sys.sleep(.1)
+ cat(j, "of 26\r")
+ flush.console()
+ x_gdp_ppp[[j]]=extract_gdp(j)
+ }
> proc.time()-ptm
user system elapsed
85.259 11.394 99.761
> library(plyr)
> x_gdp_ppp_dt=ldply(x_gdp_ppp)
> colnames(x_gdp_ppp_dt)[1]="year"