Tuesday, 12 May 2015

Global Economic Maps

Introduction
In this post I am going to show how to extract data from web pages in table format, transform them into spatial objects in R and then plot them on maps.


Procedure
For this project we need the following two packages: XML and raster.
The first package is used to extract data from HTML pages, in particular from the sections marked with the tag <table>, which denotes text arranged in table format on HTML pages.
The example below is created using a sample of code from: http://www.w3schools.com/Html/html_tables.asp. If we look at the code in plain text we can see that it is enclosed within the tags <table></table>.
<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
  <tr>
    <td>John</td>
    <td>Doe</td>
    <td>80</td>
  </tr>
</table>

In an HTML page the code above would look like this:
Jill Smith 50
Eve Jackson 94
John Doe 80


The package XML provides a function, readHTMLTable, which extracts only the data written between the two tags <table> and </table>. We can therefore use it to download data from potentially any web page.
The problem is that the data are imported into R as textual strings, and therefore they need some processing before we can actually use them.
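As a quick illustration of the kind of processing involved (a minimal sketch with made-up values, not data from the World Bank pages), a column imported as a factor of strings has to be converted through characters before it can be used as numbers:

```r
# Hypothetical column as it might arrive from readHTMLTable: a factor of strings
x <- factor(c("0.3", "1,234", "2.5"))

as.numeric(x)  # wrong: this returns the internal factor codes, not the values

# Right: go through characters first, and strip any thousands separators
as.numeric(gsub(",", "", paste(x)))  # 0.3 1234 2.5
```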

Before we start importing data from the web, however, we are going to download and import a shapefile with the borders of all the countries in the world, from this page: thematicmapping.org

The code for doing it is the following:

download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")

In the first line we use the function download.file to download the zip archive containing the shapefile. Here we need to specify the destination file, which I gave the same name as the remote file.
At this point we have downloaded a zip file into the working directory, so we need to extract its contents. We can do that using unzip, which requires as arguments the name of the zip file and the directory where to extract its contents; in this case I used getwd() to extract everything into the working directory.
Finally we can open the shapefile using the function of the same name, creating the object polygons, which looks like this:

> polygons
class       : SpatialPolygonsDataFrame 
features    : 246 
extent      : -180, 180, -90, 83.57027  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
variables   : 11
names       : FIPS, ISO2, ISO3,  UN,           NAME,    AREA,    POP2005, REGION, SUBREGION,      LON,     LAT 
min values  :   AA,   AD,  ABW,   4, Ã…land Islands,       0,          0,      0,         0, -102.535, -10.444 
max values  :   ZI,   ZW,  ZWE, 894,       Zimbabwe, 1638094, 1312978855,    150,       155,  179.219,  78.830 

As you can see from the column NAME, some country names are not read correctly in R, and this is something we will need to correct later on in the exercise.

At this point we can start downloading economic data from the web pages of The World Bank. An example of the type of data and the page we are going to query is here: http://data.worldbank.org/indicator/EN.ATM.CO2E.PC

For this exercise we are going to download the following data: CO2 emissions (metric tons per capita), Population in urban agglomerations of more than 1 million, Population density (people per sq. km of land area), Population in largest city, GDP per capita (current US$), GDP (current US$), Adjusted net national income per capita (current US$), Adjusted net national income (current US$), Electric power consumption (kWh per capita), Electric power consumption (kWh), Electricity production (kWh).

The line of code to import this page into R is the following:

CO2_emissions_Tons.per.Capita_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EN.ATM.CO2E.PC")


The object CO2_emissions_Tons.per.Capita_HTML is a list containing two elements:

> str(CO2_emissions_Tons.per.Capita_HTML)
List of 2
 $ NULL:'data.frame':   213 obs. of  4 variables:
  ..$ 
            Country name          : Factor w/ 213 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ 
            2010                  : Factor w/ 93 levels "","0.0","0.1",..: 5 16 52 1 77 17 73 63 15 50 ...
  ..$ 
                                  : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ 
                                  : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
 $ NULL:'data.frame':   10 obs. of  2 variables:
  ..$ V1: Factor w/ 10 levels "Agriculture & Rural Development",..: 1 2 3 4 5 6 7 8 9 10
  ..$ V2: Factor w/ 10 levels "Health","Infrastructure",..: 1 2 3 4 5 6 7 8 9 10

That is because at the very end of the page there is a list of topics that is also displayed using the tag <table>. For this reason, if we want to import only the first table we need to subset the object like so:

CO2_emissions_Tons.per.Capita_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EN.ATM.CO2E.PC")[[1]]

The subsetting call [[1]] at the end of the line extracts only the first table.
The object CO2_emissions_Tons.per.Capita_HTML is a data.frame:

> str(CO2_emissions_Tons.per.Capita_HTML)
'data.frame':   213 obs. of  4 variables:
 $ 
            Country name          : Factor w/ 213 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ 
            2010                  : Factor w/ 93 levels "","0.0","0.1",..: 5 16 52 1 77 17 73 63 15 50 ...
 $ 
                                  : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
 $ 
                                  : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...

It has 4 columns, two of which are empty; we only need the column named "Country name" and the one with data for the year 2010. The problem is that not every dataset has data for the same years: in some cases we only have 2010, as here; in others we have more than one year. Therefore, depending on what we want to achieve, the approach would be slightly different. For example, if we just want to plot the most recent data we could subset each dataset, keeping only the column with the most recent year and excluding the others. Alternatively, if we need data from a particular year we could create a variable and then try to match it to the corresponding column in each dataset.
I will try to show how to achieve both.

The first example finds the most recent year in the data and extracts only those values.
To do this we just need to find the highest number among the column names, with the following line:

CO2.year <- max(as.numeric(names(CO2_emissions_Tons.per.Capita_HTML)),na.rm=T)

This line does several things at once: first it uses the function names() to extract the names of the columns in the data.frame, in textual format. Then it converts these strings into numbers where possible (for example the column named "Country name" returns NA) with the function as.numeric(), which forces the conversion. At this point we have a vector of numbers and NAs, from which we can compute the maximum using the function max() with the option na.rm=T, which excludes the NAs from the computation. The as.numeric() call returns a warning, since it creates NAs, but you should not worry about it.
The result for CO2 emissions is a single number: 2010
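The same chain can be sketched on a toy vector of column names (hypothetical names, standing in for the real ones returned by names()):

```r
# Hypothetical column names, as names() would return them from the data.frame
cols <- c("Country name", "2000", "2010", "")

# as.numeric() turns the non-numeric names into NA (with a warning)
yrs <- suppressWarnings(as.numeric(cols))  # NA 2000 2010 NA

# max() with na.rm=T ignores the NAs and keeps the most recent year
max(yrs, na.rm = TRUE)  # 2010
```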

Once we have identified the most recent year in the dataset, we can identify its column in the data.frame. There are two approaches to do so; the first is based on the function which():

CO2.col <- which(as.numeric(names(CO2_emissions_Tons.per.Capita_HTML))==CO2.year)

This returns a warning, for the same reason explained above, but it does the job. Another possibility is to use the function grep(), which is able to search within character vectors for particular patterns.
We can use this function like so:

CO2.col <- grep(x=names(CO2_emissions_Tons.per.Capita_HTML),pattern=paste(CO2.year))

This function takes two arguments: x and pattern.
The first is the character vector in which to search, the second is the pattern to be identified in x. In this case we take the names of the columns and try to identify the one string that contains the number 2010, which needs to be in character format, hence the use of paste(). Both functions return the position of the matching element in the character vector. In other words, if we have a character vector composed of the elements "Banana", "Pear", "Apple" and we apply these two functions searching for the word "Pear", they return the number 2, which is the position of that word in the vector.
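The fruit example can be written directly in R:

```r
fruits <- c("Banana", "Pear", "Apple")

which(fruits == "Pear")             # 2: exact comparison
grep(x = fruits, pattern = "Pear")  # 2: pattern matching
```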


A second way of extracting data from the economic tables is by defining a particular year and then match it to the data. For this we first need to create a new variable:

YEAR = 2010

The object CO2.year would then be equal to the value assigned to the variable YEAR:

CO2.year <- YEAR

The variable YEAR can also be used to identify the column in the data.frame:

CO2.col <- grep(x=names(CO2_emissions_Tons.per.Capita_HTML),pattern=paste(YEAR))


At this point we have all we need to attach the economic data to the polygon shapefile, matching the country names in the economic tables with the names in the shapefile. As I mentioned before, some names do not match between the two datasets, and in the case of "Western Sahara" the state is not even present in the economic table. For this reason, before we can proceed to create the map we need to modify the polygons object to match the economic table, using the following code:

#Some Country names need to be changed for inconsistencies between datasets
polygons[polygons$NAME=="Libyan Arab Jamahiriya","NAME"] <- "Libya"
polygons[polygons$NAME=="Democratic Republic of the Congo","NAME"] <- "Congo, Dem. Rep."
polygons[polygons$NAME=="Congo","NAME"] <- "Congo, Rep."
polygons[polygons$NAME=="Kyrgyzstan","NAME"] <- "Kyrgyz Republic"
polygons[polygons$NAME=="United Republic of Tanzania","NAME"] <- "Tanzania"
polygons[polygons$NAME=="Iran (Islamic Republic of)","NAME"] <- "Iran, Islamic Rep."
 
#The States of "Western Sahara", "French Guyana" do not exist in the World Bank Database

These are the names that we have to replace manually, otherwise there would be no way to make R understand that they refer to the same states. However, even after these changes we cannot simply link the list of countries in the shapefile with the one in the World Bank data; there are other names for which there is no direct correspondence. For example, on the World Bank website The Bahamas is referred to as "Bahamas, The", while in the polygon shapefile the same country is referred to simply as "Bahamas". If we tried to match the two lists we would obtain an NA. However, for such cases we can again use the function grep() to identify, in the list of World Bank countries, the one string that contains the word "Bahamas", using a line of code like the following:

polygons[row,"CO2"] <- as.numeric(paste(CO2_emissions_Tons.per.Capita_HTML[grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)),CO2.col]))

This code extracts from the World Bank data the CO2 value, for the year we defined earlier, for the country defined in the polygon shapefile at a certain row, which here is simply called "row". This line again contains several nested functions. The innermost is grep():

grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME))

With this line we search the list of countries for the one that corresponds to the name of the country in the polygons shapefile. As mentioned before, the function grep() returns a number, which identifies the element in the vector. We can use this number to extract from the object CO2_emissions_Tons.per.Capita_HTML the corresponding row, and the column corresponding to the value in the object CO2.col (which we created above).

This line works in the case of the Bahamas, because there is only one element in the World Bank list of countries containing that word. However, there are other cases where multiple countries share the same word in their name, for example: "China", "Hong Kong SAR, China" and "Macao SAR, China". If we used the line above with China, since the function grep() looks for any string that contains "China", it would identify all of them, and it would be tricky to automatically identify the correct element in R. To resolve such situations we can add another line specifically created for these cases:

polygons[row,"CO2"] <- as.numeric(paste(CO2_emissions_Tons.per.Capita_HTML[paste(CO2_emissions_Tons.per.Capita_HTML[,1])==paste(polygons[row,]$NAME),CO2.col]))

In this case we have to match the exact words, so we can just use an == clause and extract only the element that corresponds exactly to the word "China". The other two countries, Hong Kong and Macao, are recognized by the grep() function through pattern matching.
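The two situations can be sketched side by side on a made-up vector of country names:

```r
# Hypothetical extract from the World Bank country list
countries <- c("Bahamas, The", "China", "Hong Kong SAR, China", "Macao SAR, China")

grep(x = countries, pattern = "Bahamas")  # 1: a single match, grep() alone is safe
grep(x = countries, pattern = "China")    # 2 3 4: three matches, ambiguous
which(countries == "China")               # 2: the exact == clause disambiguates
```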

Now that we have defined all the rules we need, we can create a loop to automatically extract all the CO2 values for each country in the World Bank data and attach them to the polygon shapefile:

polygons$CO2 <- c()
for(row in 1:length(polygons)){
 if(any(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))&length(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))==1){
  polygons[row,"CO2"] <- as.numeric(paste(CO2_emissions_Tons.per.Capita_HTML[grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)),CO2.col]))
 
 }
 
 if(any(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))&length(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))>1){
  polygons[row,"CO2"] <- as.numeric(paste(CO2_emissions_Tons.per.Capita_HTML[paste(CO2_emissions_Tons.per.Capita_HTML[,1])==paste(polygons[row,]$NAME),CO2.col]))
 
 }
 
}

First of all we create a new column in the object polygons named CO2, which for the time being is an empty vector. Then we start the loop, which iterates from 1 to the number of rows of the object polygons; for each row it needs to extract the CO2 value that corresponds to the country.
Since we have the two situations described above with the country names, we need to define two if statements to guide the loop and avoid errors. We use two functions for this: any() and length().
The function any() examines the output of the grep() function and returns TRUE only if grep() returns a number. If grep() does not find any country in the World Bank data that corresponds to the country in the polygons shapefile, the loop leaves an NA for that row, because both if clauses evaluate to FALSE. The function length() is used to discriminate between the two situations we described above, i.e. we have one line for situations in which grep() returns just one number and another for when grep() identifies multiple countries matching the same pattern.
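The behaviour of the two guard functions can be sketched as follows (toy vectors, not the real data):

```r
# No match: grep() returns an empty integer vector
m <- grep(x = c("France", "Spain"), pattern = "Atlantis")
length(m)  # 0
any(m)     # FALSE: both if clauses are skipped and the row stays NA

# One match: length() == 1, so the first branch runs
length(grep(x = c("France", "Spain"), pattern = "France"))  # 1
```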

Once the loop has finished, the object polygons has a column named CO2, which we can use to plot a map of the CO2 emissions using the function spplot():

spplot(polygons,"CO2",main=paste("CO2 Emissions - Year:",CO2.year),sub="Metric Tons per capita")

The result is the following map, where the function spplot also creates a color scale from the data:



Now we can apply the exact same approach to the other datasets we need to download from the World Bank and create maps of all the economic indexes available. The full script is available below (it was fully tested on 12 May 2015).


 library(raster)  
 library(XML)  
 library(rgdal)
   
 #Change the following line to set the working directory  
 setwd("...")  
   
 #Download, unzip and load the polygon shapefile with the countries' borders  
 download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")  
 unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())  
 polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")  
   
   
 #Read Economic data tables from the World Bank website  
 CO2_emissions_Tons.per.Capita_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EN.ATM.CO2E.PC")[[1]]  
 Population_urban.more.1.mln_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EN.URB.MCTY")[[1]]  
 Population_Density_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EN.POP.DNST")[[1]]  
 Population_Largest_Cities_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EN.URB.LCTY")[[1]]  
 GDP_capita_HTML <- readHTMLTable("http://data.worldbank.org/indicator/NY.GDP.PCAP.CD")[[1]]  
 GDP_HTML <- readHTMLTable("http://data.worldbank.org/indicator/NY.GDP.MKTP.CD")[[1]]  
 Adj_Income_capita_HTML <- readHTMLTable("http://data.worldbank.org/indicator/NY.ADJ.NNTY.PC.CD")[[1]]  
 Adj_Income_HTML <- readHTMLTable("http://data.worldbank.org/indicator/NY.ADJ.NNTY.CD")[[1]]  
 Elect_Consumption_capita_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EG.USE.ELEC.KH.PC")[[1]]  
 Elect_Consumption_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EG.USE.ELEC.KH")[[1]]  
 Elect_Production_HTML <- readHTMLTable("http://data.worldbank.org/indicator/EG.ELC.PROD.KH")[[1]]  
   
   
 #First Approach - Find the most recent data  
 #Maximum Year  
 CO2.year <- max(as.numeric(names(CO2_emissions_Tons.per.Capita_HTML)),na.rm=T)  
 PoplUrb.year <- max(as.numeric(names(Population_urban.more.1.mln_HTML)),na.rm=T)  
 PoplDens.year <- max(as.numeric(names(Population_Density_HTML)),na.rm=T)  
 PoplLarg.year <- max(as.numeric(names(Population_Largest_Cities_HTML)),na.rm=T)  
 GDPcap.year <- max(as.numeric(names(GDP_capita_HTML)),na.rm=T)  
 GDP.year <- max(as.numeric(names(GDP_HTML)),na.rm=T)  
 AdjInc.cap.year <- max(as.numeric(names(Adj_Income_capita_HTML)),na.rm=T)  
 AdjInc.year <- max(as.numeric(names(Adj_Income_HTML)),na.rm=T)  
 EleCon.cap.year <- max(as.numeric(names(Elect_Consumption_capita_HTML)),na.rm=T)  
 EleCon.year <- max(as.numeric(names(Elect_Consumption_HTML)),na.rm=T)  
 ElecProd.year <- max(as.numeric(names(Elect_Production_HTML)),na.rm=T)  
   
   
 #Column Maximum Year  
 CO2.col <- grep(x=names(CO2_emissions_Tons.per.Capita_HTML),pattern=paste(CO2.year))  
 PoplUrb.col <- grep(x=names(Population_urban.more.1.mln_HTML),pattern=paste(PoplUrb.year))  
 PoplDens.col <- grep(x=names(Population_Density_HTML),pattern=paste(PoplDens.year))  
 PoplLarg.col <- grep(x=names(Population_Largest_Cities_HTML),pattern=paste(PoplLarg.year))  
 GDPcap.col <- grep(x=names(GDP_capita_HTML),pattern=paste(GDPcap.year))  
 GDP.col <- grep(x=names(GDP_HTML),pattern=paste(GDP.year))  
 AdjInc.cap.col <- grep(x=names(Adj_Income_capita_HTML),pattern=paste(AdjInc.cap.year))  
 AdjInc.col <- grep(x=names(Adj_Income_HTML),pattern=paste(AdjInc.year))  
 EleCon.cap.col <- grep(x=names(Elect_Consumption_capita_HTML),pattern=paste(EleCon.cap.year))  
 EleCon.col <- grep(x=names(Elect_Consumption_HTML),pattern=paste(EleCon.year))  
 ElecProd.col <- grep(x=names(Elect_Production_HTML),pattern=paste(ElecProd.year))  
   
   
   
 #Second Approach - Find data for specific Years  
 YEAR = 2010  
   
 #Year  
 CO2.year <- YEAR  
 PoplUrb.year <- YEAR  
 PoplDens.year <- YEAR  
 PoplLarg.year <- YEAR  
 GDPcap.year <- YEAR  
 GDP.year <- YEAR  
 AdjInc.cap.year <- YEAR  
 AdjInc.year <- YEAR  
 EleCon.cap.year <- YEAR  
 EleCon.year <- YEAR  
 ElecProd.year <- YEAR  
   
   
 #Column Maximum Year  
 CO2.col <- grep(x=names(CO2_emissions_Tons.per.Capita_HTML),pattern=paste(YEAR))  
 PoplUrb.col <- grep(x=names(Population_urban.more.1.mln_HTML),pattern=paste(YEAR))  
 PoplDens.col <- grep(x=names(Population_Density_HTML),pattern=paste(YEAR))  
 PoplLarg.col <- grep(x=names(Population_Largest_Cities_HTML),pattern=paste(YEAR))  
 GDPcap.col <- grep(x=names(GDP_capita_HTML),pattern=paste(YEAR))  
 GDP.col <- grep(x=names(GDP_HTML),pattern=paste(YEAR))  
 AdjInc.cap.col <- grep(x=names(Adj_Income_capita_HTML),pattern=paste(YEAR))  
 AdjInc.col <- grep(x=names(Adj_Income_HTML),pattern=paste(YEAR))  
 EleCon.cap.col <- grep(x=names(Elect_Consumption_capita_HTML),pattern=paste(YEAR))  
 EleCon.col <- grep(x=names(Elect_Consumption_HTML),pattern=paste(YEAR))  
 ElecProd.col <- grep(x=names(Elect_Production_HTML),pattern=paste(YEAR))  
   
   
   
   
 #Some Country names need to be changed for inconsistencies between datasets  
 polygons[polygons$NAME=="Libyan Arab Jamahiriya","NAME"] <- "Libya"  
 polygons[polygons$NAME=="Democratic Republic of the Congo","NAME"] <- "Congo, Dem. Rep."  
 polygons[polygons$NAME=="Congo","NAME"] <- "Congo, Rep."  
 polygons[polygons$NAME=="Kyrgyzstan","NAME"] <- "Kyrgyz Republic"  
 polygons[polygons$NAME=="United Republic of Tanzania","NAME"] <- "Tanzania"  
 polygons[polygons$NAME=="Iran (Islamic Republic of)","NAME"] <- "Iran, Islamic Rep."  
   
 #The States of "Western Sahara", "French Guyana" do not exist in the World Bank Database  
   
   
 #Now we can start the loop to add the economic data to the polygon shapefile  
 polygons$CO2 <- c()  
 polygons$PoplUrb <- c()  
 polygons$PoplDens <- c()  
 polygons$PoplLargCit <- c()  
 polygons$GDP.capita <- c()  
 polygons$GDP <- c()  
 polygons$AdjInc.capita <- c()  
 polygons$AdjInc <- c()  
 polygons$ElectConsumpt.capita <- c()  
 polygons$ElectConsumpt <- c()  
 polygons$ElectProduct <- c()  
   
   
 for(row in 1:length(polygons)){  
      if(any(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))&length(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))==1){  
           polygons[row,"CO2"] <- as.numeric(paste(CO2_emissions_Tons.per.Capita_HTML[grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)),CO2.col]))  
           polygons[row,"PoplUrb"] <- as.numeric(gsub(",","",paste(Population_urban.more.1.mln_HTML[grep(x=paste(Population_urban.more.1.mln_HTML[,1]),pattern=paste(polygons[row,]$NAME)),PoplUrb.col])))  
           polygons[row,"PoplDens"] <- as.numeric(paste(Population_Density_HTML[grep(x=paste(Population_Density_HTML[,1]),pattern=paste(polygons[row,]$NAME)),PoplDens.col]))  
           polygons[row,"PoplLargCit"] <- as.numeric(gsub(",","",paste(Population_Largest_Cities_HTML[grep(x=paste(Population_Largest_Cities_HTML[,1]),pattern=paste(polygons[row,]$NAME)),PoplLarg.col])))  
           polygons[row,"GDP.capita"] <- as.numeric(gsub(",","",paste(GDP_capita_HTML[grep(x=paste(GDP_capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)),GDPcap.col])))  
           polygons[row,"GDP"] <- as.numeric(gsub(",","",paste(GDP_HTML[grep(x=paste(GDP_HTML[,1]),pattern=paste(polygons[row,]$NAME)),GDP.col])))  
           polygons[row,"AdjInc.capita"] <- as.numeric(gsub(",","",paste(Adj_Income_capita_HTML[grep(x=paste(Adj_Income_capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)),AdjInc.cap.col])))  
           polygons[row,"AdjInc"] <- as.numeric(gsub(",","",paste(Adj_Income_HTML[grep(x=paste(Adj_Income_HTML[,1]),pattern=paste(polygons[row,]$NAME)),AdjInc.col])))  
           polygons[row,"ElectConsumpt.capita"] <- as.numeric(gsub(",","",paste(Elect_Consumption_capita_HTML[grep(x=paste(Elect_Consumption_capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)),EleCon.cap.col])))  
           polygons[row,"ElectConsumpt"] <- as.numeric(gsub(",","",paste(Elect_Consumption_HTML[grep(x=paste(Elect_Consumption_HTML[,1]),pattern=paste(polygons[row,]$NAME)),EleCon.col])))  
           polygons[row,"ElectProduct"] <- as.numeric(gsub(",","",paste(Elect_Production_HTML[grep(x=paste(Elect_Production_HTML[,1]),pattern=paste(polygons[row,]$NAME)),ElecProd.col])))  
      }  
   
      if(any(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))&length(grep(x=paste(CO2_emissions_Tons.per.Capita_HTML[,1]),pattern=paste(polygons[row,]$NAME)))>1){  
           polygons[row,"CO2"] <- as.numeric(paste(CO2_emissions_Tons.per.Capita_HTML[paste(CO2_emissions_Tons.per.Capita_HTML[,1])==paste(polygons[row,]$NAME),CO2.col]))  
           polygons[row,"PoplUrb"] <- as.numeric(gsub(",","",paste(Population_urban.more.1.mln_HTML[paste(Population_urban.more.1.mln_HTML[,1])==paste(polygons[row,]$NAME),PoplUrb.col])))  
           polygons[row,"PoplDens"] <- as.numeric(paste(Population_Density_HTML[paste(Population_Density_HTML[,1])==paste(polygons[row,]$NAME),PoplDens.col]))  
           polygons[row,"PoplLargCit"] <- as.numeric(gsub(",","",paste(Population_Largest_Cities_HTML[paste(Population_Largest_Cities_HTML[,1])==paste(polygons[row,]$NAME),PoplLarg.col])))  
           polygons[row,"GDP.capita"] <- as.numeric(gsub(",","",paste(GDP_capita_HTML[paste(GDP_capita_HTML[,1])==paste(polygons[row,]$NAME),GDPcap.col])))  
           polygons[row,"GDP"] <- as.numeric(gsub(",","",paste(GDP_HTML[paste(GDP_HTML[,1])==paste(polygons[row,]$NAME),GDP.col])))  
           polygons[row,"AdjInc.capita"] <- as.numeric(gsub(",","",paste(Adj_Income_capita_HTML[paste(Adj_Income_capita_HTML[,1])==paste(polygons[row,]$NAME),AdjInc.cap.col])))  
           polygons[row,"AdjInc"] <- as.numeric(gsub(",","",paste(Adj_Income_HTML[paste(Adj_Income_HTML[,1])==paste(polygons[row,]$NAME),AdjInc.col])))  
           polygons[row,"ElectConsumpt.capita"] <- as.numeric(gsub(",","",paste(Elect_Consumption_capita_HTML[paste(Elect_Consumption_capita_HTML[,1])==paste(polygons[row,]$NAME),EleCon.cap.col])))  
           polygons[row,"ElectConsumpt"] <- as.numeric(gsub(",","",paste(Elect_Consumption_HTML[paste(Elect_Consumption_HTML[,1])==paste(polygons[row,]$NAME),EleCon.col])))  
           polygons[row,"ElectProduct"] <- as.numeric(gsub(",","",paste(Elect_Production_HTML[paste(Elect_Production_HTML[,1])==paste(polygons[row,]$NAME),ElecProd.col])))  
 }  
   
 }  
   
   
   
 #Spatial Plots  
 spplot(polygons,"CO2",main=paste("CO2 Emissions - Year:",CO2.year),sub="Metric Tons per capita")  
 spplot(polygons,"PoplUrb",main=paste("Population - Year:",PoplUrb.year),sub="In urban agglomerations of more than 1 million")  
 spplot(polygons,"PoplDens",main=paste("Population Density - Year:",PoplDens.year),sub="People per sq. km of land area")  
 spplot(polygons,"PoplLargCit",main=paste("Population in largest city - Year:",PoplLarg.year))  
 spplot(polygons,"GDP.capita",main=paste("GDP per capita - Year:",GDPcap.year),sub="Currency: USD")  
 spplot(polygons,"GDP",main=paste("GDP - Year:",GDP.year),sub="Currency: USD")  
 spplot(polygons,"AdjInc.capita",main=paste("Adjusted net national income per capita - Year:",AdjInc.cap.year),sub="Currency: USD")  
 spplot(polygons,"AdjInc",main=paste("Adjusted net national income - Year:",AdjInc.year),sub="Currency: USD")  
 spplot(polygons,"ElectConsumpt.capita",main=paste("Electric power consumption per capita - Year:",EleCon.cap.year),sub="kWh per capita")  
 spplot(polygons,"ElectConsumpt",main=paste("Electric power consumption - Year:",EleCon.year),sub="kWh")  
 spplot(polygons,"ElectProduct",main=paste("Electricity production - Year:",ElecProd.year),sub="kWh")  
   

Sunday, 10 May 2015

Exchange data between R and the Google Maps API using Shiny

A couple of years ago I wrote a post about using Shiny to exchange data between the Google Maps API and R: http://r-video-tutorial.blogspot.ch/2013/07/interfacing-r-and-google-maps.html

Back then, as far as I remember, Shiny did not allow a direct exchange of data between javascript and R, therefore I had to improvise and extract data indirectly using an external table. In other words, that work was not really good!!

The new versions of Shiny, however, feature a function to send data directly from javascript to R:
Shiny.onInputChange

This function can be used to communicate any data from the Google Maps API to R. Starting from this, I thought about creating an example where I use the Google Maps API to draw a rectangle on the map, send the coordinates of the rectangle to R, create a grid of random points inside it and then plot them as markers on the map. This way I can exchange data back and forth between the two platforms.
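The R side of this exchange can be sketched as follows. This is not the server.R of the post, just a hedged illustration with hypothetical corner values of how the four coordinates sent by Shiny.onInputChange (which arrive in R as input$NE1, input$NE2, input$SW1, input$SW2) could be turned into random points inside the rectangle:

```r
# Hypothetical corner coordinates, as they would arrive from javascript
ne.lat <- 55.0; ne.lng <-  0.5   # north-east corner (input$NE1, input$NE2)
sw.lat <- 53.0; sw.lng <- -3.5   # south-west corner (input$SW1, input$SW2)

# Draw random points uniformly inside the rectangle
set.seed(1)
pts <- data.frame(lat = runif(50, min = sw.lat, max = ne.lat),
                  lng = runif(50, min = sw.lng, max = ne.lng))

# All points fall within the bounds, ready to be sent back as markers
all(pts$lat >= sw.lat & pts$lat <= ne.lat)  # TRUE
all(pts$lng >= sw.lng & pts$lng <= ne.lng)  # TRUE
```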

For this experiment we do not need a ui.R file, but a custom html page. Thus we need to create a folder named "www" in the shiny-server folder and add an index.html file to it.
Let's look at the HTML and javascript code for this page:

 <!DOCTYPE html>  
 <html>  
 <head>  
 <title>TEST</title>  
   
 <!--METADATA-->    
 <meta name="author" content="Fabio Veronesi">  
 <meta name="copyright" content="©Fabio Veronesi">  
 <meta http-equiv="Content-Language" content="en-gb">  
 <meta charset="utf-8"/>  
   
 <style type="text/css">  
   
 html { height: 100% }  
 body { height: 100%; margin: 0; padding: 0 }  
 #map-canvas { height: 100%; width:100% }  
   
 </style>  
   
      
   
 <script type="text/javascript"  
    src="https://maps.googleapis.com/maps/api/js?&sensor=false&language=en">  
   </script>  
        
 <script type="text/javascript" src="http://google-maps-utility-library-v3.googlecode.com/svn/tags/markerclusterer/1.0/src/markerclusterer.js"></script>  
   
 <script src="https://maps.googleapis.com/maps/api/js?v=3.exp&signed_in=true&libraries=drawing"></script>  
   
   
   
        
 <script type="text/javascript">  
      //We need to create the variables map and cluster before the function  
      var cluster = null;  
      var map = null;  
        
      //This function takes the variable test, which is the json we will create with R and creates markers from it  
      function Cities_Markers() {  
           if (cluster) {  
                cluster.clearMarkers();  
                }  
           var Gmarkers = [];  
           var infowindow = new google.maps.InfoWindow({ maxWidth: 500,maxHeight:500 });  
   
           for (var i = 0; i < test.length; i++) {   
                var lat = test[i][2]  
                var lng = test[i][1]  
                var marker = new google.maps.Marker({  
                     position: new google.maps.LatLng(lat, lng),  
                     title: 'test',  
                     map: map  
                });  
         
           google.maps.event.addListener(marker, 'click', (function(marker, i) {  
                return function() {  
                     infowindow.setContent('test');  
                     infowindow.open(map, marker);  
                }  
                })(marker, i));  
           Gmarkers.push(marker);  
           };  
           cluster = new MarkerClusterer(map,Gmarkers);  
           $("div#field_name").text("Showing Cities");  
      };  
   
   
      //Initialize the map  
      function initialize() {  
           var mapOptions = {  
           center: new google.maps.LatLng(54.12, -2.20),  
           zoom: 5  
      };  
   
      map = new google.maps.Map(document.getElementById('map-canvas'),mapOptions);  
        
   
      //This is the Drawing manager of the Google Maps API. This is the standard code you can find here:https://developers.google.com/maps/documentation/javascript/drawinglayer  
  var drawingManager = new google.maps.drawing.DrawingManager({  
   drawingMode: google.maps.drawing.OverlayType.MARKER,  
   drawingControl: true,  
   drawingControlOptions: {  
    position: google.maps.ControlPosition.TOP_CENTER,  
    drawingModes: [  
     google.maps.drawing.OverlayType.RECTANGLE  
    ]  
   },  
   
   rectangleOptions: {   
    fillOpacity: 0,  
    strokeWeight: 1,  
    clickable: true,  
    editable: false,  
    zIndex: 1  
   }  
        
  });  
   
//This function listen to the drawing manager and after you draw the rectangle it extract the coordinates of the NE and SW corners
  google.maps.event.addListener(drawingManager, 'rectanglecomplete', function(rectangle) {  
   var ne = rectangle.getBounds().getNorthEast();  
      var sw = rectangle.getBounds().getSouthWest();  
   
      //The following code is used to import the coordinates of the NE and SW corners of the rectangle into R  
      Shiny.onInputChange("NE1", ne.lat());  
      Shiny.onInputChange("NE2", ne.lng());  
      Shiny.onInputChange("SW1", sw.lat());  
      Shiny.onInputChange("SW2", sw.lng());  
        
 });  
   
   
  drawingManager.setMap(map);  
    
 }  
   
   
 google.maps.event.addDomListener(window, 'load', initialize);  
</script>  
        
   
        
   
 <script type="application/shiny-singletons"></script>  
 <script type="application/html-dependencies">json2[2014.02.04];jquery[1.11.0];shiny[0.11.1];bootstrap[3.3.1]</script>  
 <script src="shared/json2-min.js"></script>  
 <script src="shared/jquery.min.js"></script>  
 <link href="shared/shiny.css" rel="stylesheet" />  
 <script src="shared/shiny.min.js"></script>  
 <meta name="viewport" content="width=device-width, initial-scale=1" />  
 <link href="shared/bootstrap/css/bootstrap.min.css" rel="stylesheet" />  
 <script src="shared/bootstrap/js/bootstrap.min.js"></script>  
 <script src="shared/bootstrap/shim/html5shiv.min.js"></script>  
 <script src="shared/bootstrap/shim/respond.min.js"></script>  
   
 </head>  
   
   
 <body>  
   
  <div id="json" class="shiny-html-output"></div>  
  <div id="map-canvas"></div>  
    
 </body>  
 </html>  

As you know an HTML page has two main elements: head and body.
In the head we put the style of the page, the metadata and the javascript code. In the body we put the elements that will be visible to the user.

After some basic metadata (written in orange), such as Title, Author and Copyright, we find a style section (in yellow) with the style of the Google Maps API. This is standard code that you can find here, where they explain how to create a simple page with google maps: Getting Started

Below we have some script calls (in blue) where we import the elements needed to run the rest of the code: the script for the Google Maps API itself, the script for the drawing manager, which is used to draw a rectangle onto the map, and the js script that clusters the markers, without which we would have too many overlapping icons.

Afterward we can write the core script of the Google Maps API; here I highlighted the start and the end of the script in red and all the comments in pink so that you can work out the subdivision I made.

First of all we need to declare two variables, map and cluster, as null. This is because these two variables are used in the subsequent function, and if we do not declare them the function will not work. Then we can define a function, which I called Cities_Markers() because I took the code directly from Audioramio. This function takes a json, stored in a variable called test, loops through it and creates a marker for each pair of coordinates in the json. Then it clusters the markers.

Afterward there is the code to initialize the map and the drawing manager. The code for the drawing manager can be found here: Drawing Manager

The crucial part of the whole section is the Listener function. As soon as you draw a rectangle on the map, this code extracts the coordinates of the NE and SW corners and stores them in two variables. Then we can use the function Shiny.onInputChange to transfer these variables from javascript to R.

The final step to allow communication back from R to javascript is to create a div element in the body of the page (in blue) of the class "shiny-html-output" with the ID "json". The ID is what allows Shiny to identify this element.

Now we can look at the server.R script:

 # server.R  
 library(sp)  
 library(RJSONIO)  
    
 shinyServer(function(input, output, session) {  
    
 output$json <- reactive({  
 if(length(input$NE1)>0){  
   
 #From the Google Maps API we have 4 inputs with the coordinates of the NE and SW corners  
 #using these coordinates we can create a polygon  
 pol <- Polygon(coords=matrix(c(input$NE2,input$NE1,input$NE2,input$SW1,input$SW2,input$SW1,input$SW2,input$NE1),ncol=2,byrow=T))  
 polygon <- SpatialPolygons(list(Polygons(list(pol),ID=1)))  
   
 #Then we can use the polygon to create 100 points randomly  
 grid <- spsample(polygon,n=100,type="random")  
   
 #In order to use the function toJSON we first need to create a list  
 lis <- list()  
 for(i in 1:100){  
 lis[[i]] <- list(i,grid$x[i],grid$y[i])  
 }  
   
 #This code creates the variable test directly in javascript to export the grid to the Google Maps API  
 #I have taken this part from:http://stackoverflow.com/questions/26719334/passing-json-data-to-a-javascript-object-with-shiny  
    paste('<script>test=',   
       RJSONIO::toJSON(lis),  
       ';Cities_Markers();',  
       '</script>')  
      }  
  })  
 })  

For this script we need two packages: sp and RJSONIO.
The first is needed to create the polygon and the grid, the second to create the json that we need to export to the webpage.

Shiny communicates with the page using the IDs of the elements in the HTML body. In this case we created a div called "json", and in Shiny we use output$json to send code to this element.
Within the reactive function I first inserted an if statement to prevent the code from running before a polygon has been drawn. As soon as the user draws a polygon onto the map, the four coordinates are transmitted to R and used to create a polygon (in blue). Then we can create a random grid within the polygon area with the function spsample (in orange).

Subsequently we need to create a list with the coordinates of the points, because the function toJSON takes a list as its main argument.
The crucial part of the R script is the one written in red. Here we take the list of coordinates, transform it into json and embed it into the div element as HTML code.
This part was taken from this post: http://stackoverflow.com/questions/26719334/passing-json-data-to-a-javascript-object-with-shiny

This allows R to transmit its results to the Google Maps API as a variable named test, which contains the json. As you can see from the code, right after the json we call the function Cities_Markers(), which takes the variable test and creates the markers on the map.
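To make this concrete, here is a minimal sketch (assuming the RJSONIO package is installed, and using made-up coordinates) of the string that output$json injects into the div:

```r
library(RJSONIO)

# Hypothetical coordinate list, in the same shape as lis above
lis <- list(list(1, 7.45, 46.95), list(2, 8.54, 47.37))

# This is the string Shiny writes into the div with ID "json"
paste('<script>test=', toJSON(lis), ';Cities_Markers();', '</script>')
```

When the browser renders the div, it executes this script, which defines the variable test and immediately calls Cities_Markers().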


Conclusion
We have demonstrated how to exchange data back and forth between R and the Google Maps API using Shiny.


Friday, 8 May 2015

Run Shiny app on a Ubuntu server on the Amazon Cloud

This guide is more for self-reference than anything else. 
Since I struggled for two days to find all the correct settings to complete this task, gathering information from several websites, I decided to write a little guide on this blog so that if I want to do it again in the future and do not remember anything (this happens a lot!!) at least I have something to resuscitate my memory.

I found most of the information and code I used from these websites:
http://tylerhunt.co/2014/03/amazon-web-services-rstudio/
http://www.howtogeek.com/howto/41560/how-to-get-ssh-command-line-access-to-windows-7-using-cygwin/
http://www.rstudio.com/products/shiny/download-server/


Preface
First I would like to point out that this guide assumes you (I am talking to myself of the future) remember how to open an instance in the Amazon Cloud. It is not that difficult, you go to this page:
http://aws.amazon.com/ec2/

you log in (if you remember the credentials) and you should see the "Amazon Web Services" page, where you can select EC2 and launch an instance. Remember to select the correct server from the menu in the top right corner, since last time you ran all the instances from Oregon, and you live in freaking Switzerland!!


Guide
NOTE
Instead of installing Cygwin and covering steps 1 and 2, we can start directly from step 4 and connect to the Ubuntu server using WinSCP, then start PuTTY from WinSCP and it will already be connected.


1) Install Cygwin
This software is needed to communicate with the Ubuntu server.
It is important to follow the instructions on this page (http://www.howtogeek.com/howto/41560/how-to-get-ssh-command-line-access-to-windows-7-using-cygwin/) to install the software correctly.
In particular, during the installation process a "select packages" window appears where we need to find openssh and click on "skip" until a cross appears in the Bin column.

Once Cygwin is installed, right-click its icon and select "run as administrator", then open it.
Now we can run the following line to install ssh:

ssh-host-config

During the process several questions will be asked, the following answers apply:
- Should privilege separation be used? YES
- New local account sshd? YES
- Run ssh as a service? YES
- Enter a value for daemon:  ntsec
- Do you want to use a different name? NO
- Create a new privilege account user? YES  -> then insert a password


After the installation we need to insert the following line to start the sshd service:

net start sshd

Then this line to configure the service:

ssh-user-config

Again it will ask a series of questions. The new version differs from what is written on the website.
Now it asks only whether an SSH2 RSA identity file should be created; the answer is YES.
Then it asks two more questions, regarding DSA files and one other item; the answer to both is NO.



2) Connect to the Amazon Server
Open Cygwin.
Go to the folder where the .pem file is saved, using the following line:

cd D:/<folder>/<folder>



NOTE:
Cygwin does not like folder names with spaces!


Now we need to make sure that the .pem key is not publicly readable, using the following line:

chmod 400 <NAME>.pem

and then we can connect to the ubuntu server using the following line:

ssh -i <NAME>.pem ubuntu@<PUBLIC IP>

This information is provided by Amazon if we click on "Connect" once the instance has been properly launched.
Once we are in, we can install R and Shiny.


3) Install R and Shiny
The first thing to do is set up the root user with the following line:

sudo passwd root

The system will ask to input a password.
Then we can log in using the following line:

su

Now we are logged in as the root user.

Now we need to update everything with the following:

apt-get update

At this point we can install R with the following line:

apt-get install r-base


NOTE:
It may be that during the installation process an older version of R is installed, and this may create problems with some packages.
To solve this problem we need to modify the file /etc/apt/sources.list
We can do this by using WinSCP, but first we need to be sure that we have access to the folder.
We should run the following two lines:

cd /etc/

chmod 777 apt

This gives us access to modify the files in the folder apt via WinSCP (see point 4).

This line of code gives indiscriminate access to the folder, so it is not super secure.

Now we can connect with WinSCP and add the following line at the end of sources.list:

deb http://cran.stat.ucla.edu/bin/linux/ubuntu trusty/

Then we need to first remove the old version of R using:

apt-get remove r-base

or

apt-get remove r-base-dev

Then we need to run once again both the update and the installation calls.
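Putting this note together, the whole upgrade can be sketched as follows (run as root; this assumes the "trusty" release and the UCLA mirror used above, and appends to sources.list directly from the shell instead of editing it through WinSCP):

```shell
# Add the CRAN repository for Ubuntu "trusty" (assumption: Ubuntu 14.04)
echo "deb http://cran.stat.ucla.edu/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list

# Remove the old R version, then update and reinstall
apt-get remove r-base r-base-dev
apt-get update
apt-get install r-base
```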




We can check the amount of disk space left on the server using the following command:

df -h


Then we can start R just by typing R in the console.
At this point we need to install all the packages we would need to run shiny using standard R code:

install.packages("raster")


Now we can exit from R with q() and install shiny using the following line in ubuntu:

sudo su - \
-c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""


Now we need to install gdebi with the following lines (check here for any updates: http://www.rstudio.com/products/shiny/download-server/):

apt-get install gdebi-core
wget http://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.3.0.403-amd64.deb
gdebi shiny-server-1.3.0.403-amd64.deb




4) Transfer file between windows and Ubuntu
We can use WinSCP (http://winscp.net/eng/index.php) for this task.

First of all we need to import the .pem file that is needed for the authentication.
From the "New Site" window we can go to "advanced", then click on "SSH -> Authentication".
From the "Private key file" field we can browse to and open the .pem file. WinSCP needs to convert it into a ppk file, which can be done with the default settings.
We can just click on "Save private key" to save the ppk file, import it again in the same field, and click OK.

Now the Host name is the name of the instance, for example:
ec2-xx-xx-xxx-xxx.eu-west-1.compute.amazonaws.com

The user name is ubuntu and the password is left blank. The protocol is SFTP and the port is 22.

The shiny server is located in the folder /srv/shiny-server

We need to give WinSCP access to this folder using again the command

cd srv
chmod 777 shiny-server






5) Transfer shiny app files on the server
In WinSCP open the folder /srv/shiny-server and create a new folder with the name of your shiny app.
Then transfer the files from your PC to the folder.
Remember to change the address of the files or working directory in the R script with the new links.


6) Allow the port 3838 to access the web
To do this we need to change the rule in the "security groups" menu.
This menu is visible in the main window where all your instances are shown. However, do not access the menu from the panel of the left, that area may be a bit confusing.
Instead select the instance in which you need to add the rule, look at the window at the bottom of the page (the one that shows the name of the instance) and click on the name in light blue near "security group".
This will open the security group menu specific for the instance you selected.
Here we need to add a rule by clicking on "add rule", selecting custom TCP, writing the port number 3838 in the correct field and then selecting "anywhere" in the IP section.

Now if you go to the following page you should see the app:
<PUBLIC IP>:3838/<FOLDER>

It is sometimes necessary to open port 22 as well, using the same procedure, to keep connecting to the server with Cygwin or PuTTY.


7) Stop, start and restart Shiny server

sudo start shiny-server

sudo stop shiny-server

sudo restart shiny-server


8) Installing rgdal
This package requires some tweaks before installation. On this site I found what I needed: http://askubuntu.com/questions/206593/how-to-install-rgdal-on-ubuntu-12-10

Basically from the Ubuntu console just run these three lines of code:
sudo apt-get install aptitude
sudo aptitude install libgdal-dev
sudo aptitude install libproj-dev


Then go back to R and install rgdal normally (with install.packages)



9) Installing rCharts
Before installing rCharts we need to run the lines above to install rgdal (I also ran the first two lines suggested in the comments, but I do not know if it helped). I also ran the following lines from here: http://stackoverflow.com/questions/16363144/install-rcharts-package-on-r-2-15-2

sudo apt-get install libcurl4-openssl-dev
sudo apt-get install openjdk-6-jdk
export LD_LIBRARY_PATH=/usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64/server
R CMD javareconf 
 
but I do not know if they helped or not. 

If we do not do that, "devtools" will not install and therefore we will not be able to install rCharts from GitHub:

require(devtools)
install_github('rCharts', 'ramnathv')


To run rCharts from the Ubuntu server we need to make sure that the links to the javascript libraries are not referring to local files.


NOTE:
You may need to add packages in R after the installation. To do that you always need to access Ubuntu as the root user, so the first thing to do is run:

su

and insert the password. Now you can start R and install the packages. Otherwise the lib folder where the packages are installed would not be accessible.




#UPDATE from Mike Rutter
To make the install easier, add the following PPAs:

sudo apt-add-repository ppa:marutter/rrutter
sudo apt-add-repository ppa:marutter/c2d4u

The first is the same as the CRAN repository, but you don't need to edit "sources.list". The second has over 2,500 R packages ready to install. For example:

sudo apt-get install r-cran-raster r-cran-rgdal r-cran-shiny

will install the R packages mentioned in the post. No need to install the "dev" packages either, as that will be taken care of by apt. And they will be updated via the regular Ubuntu update process.

NOTE
Without updating the file in /etc/apt we can just run the first two lines suggested by Mike, then remove r-base and re-install it to have the updated version.
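As a sketch, the shortcut described in this note boils down to:

```shell
# Add Mike Rutter's PPAs (as suggested above), then reinstall R
sudo apt-add-repository ppa:marutter/rrutter
sudo apt-add-repository ppa:marutter/c2d4u
sudo apt-get update
sudo apt-get remove r-base
sudo apt-get install r-base
```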

Thursday, 7 May 2015

Extract values from numerous rasters in less time

These days I was working on a Shiny app for which computation time is a big problem.
Basically this app takes some coordinates, extracts values from 1036 rasters for these coordinates and makes some computations. 
As far as I can tell (and please correct me if I'm wrong!) there are two ways of doing this task:
1) load all the 1036 rasters and extract the values from each of them in a loop
2) create a stack raster and extract the values only one time

In the first approach it helps if I have all my rasters in one single folder, since in that case I can run the following code:

library(raster)

t0 <- Sys.time()
f <- list.files(getwd()) 
ras <- lapply(f,raster) 
ext <- lapply(ras,extract,MapUTM) 
ext2 <- unlist(ext)
t1 <- Sys.time()


The call to list.files creates a list of all the raster files in the working directory, and with lapply I can read them into R using the package raster.
The call to extract then takes from each raster the values that correspond to the coordinates of the SpatialPoints object named MapUTM. The object ext is a list, therefore I have to change it into a numeric vector for the computations I will do later in the script. The two Sys.time() calls record the start and end of the operation.
This entire operation takes 1.835767 mins.

Since this takes too much time I thought of using a RasterStack. I can just run the following line to create
a RasterStack object with 1036 layers. This is almost instantaneous.

STACK <- stack(ras)
 
The object looks like this:
> STACK
class       : RasterStack 
dimensions  : 1217, 658, 800786, 1036  (nrow, ncol, ncell, nlayers)
resolution  : 1000, 1000  (x, y)
extent      : 165036.8, 823036.8, 5531644, 6748644  (xmin, xmax, ymin, ymax)
coord. ref. : NA 
names       :      Dir_10,      Dir_11,      Dir_12,      Dir_13,      Dir_14,      Dir_15,      Dir_16,      Dir_17,      Dir_18,      Dir_19,      Dir_20,      Dir_21,      Dir_22,      Dir_23,      Dir_24, ... 
min values  :   59.032657,  141.913933,   84.781970,  147.634633,   39.723591,  154.615133,   45.868360,  197.306633,   85.839959,  272.336367,   93.234409,  339.732100,   79.106781,  566.522933,  175.075968, ... 
max values  :  685.689288, 2579.985700,  840.835621, 3575.341167, 1164.557067, 5466.193933, 2213.728126, 5764.541400, 2447.792437, 4485.639133, 1446.003349, 5308.407167, 1650.665136, 5910.945967, 2038.332471, ... 


At this point I can extract the values from all the rasters in one go, with the following line:
ext <- extract(STACK,MapUTM)
  

This has the advantage of creating a numeric vector, but unfortunately this operation is only slightly faster than the previous one, with a total time of 1.57565 mins

At this point, following a suggestion from a colleague, Kirill Müller (http://www.ivt.ethz.ch/people/muelleki), I tested ways of translating the RasterStack into a huge matrix and then querying it to extract values.
I encountered two problems with this approach: the first is the amount of RAM needed to create the matrix, and the second is identifying the exact row to extract from it.
In the package raster I can transform a Raster object into a matrix simply by calling the function as.matrix. However my RasterStack object has 800786 cells and 1036 layers, meaning that I would need to create a 800786x1036 matrix and I do not have enough RAM for that.
I solved this problem using the package ff. I can create a matrix object in R that is associated with a physical object on disk. This approach allowed me to use a minimum amount of RAM and achieve the same results. This is the code I used:

library(ff)
mat <- ff(vmode="double",dim=c(ncell(STACK),nlayers(STACK)),filename=paste0(getwd(),"/stack.ffdata"))
 
for(i in 1:nlayers(STACK)){
mat[,i] <- STACK[[i]][]
}
save(mat,file=paste0(getwd(),"/data.RData"))

With the first line I create an empty matrix with the characteristics above (800786 rows and 1036 columns) and place it on disk.
Then in the loop I fill the matrix column by column, one layer per column. There is probably a better way of doing it, but this does the job and that is all I actually care about. Finally I save the ff object into an RData file on disk, simply because I had difficulties loading the ff object from disk.
This process takes 5 minutes to complete, but it is something you need to do just once and then you can load the matrix from disk and do the rest.
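In later sessions the matrix does not need to be rebuilt. A sketch of restoring it (this assumes both data.RData and stack.ffdata are still in the working directory, since the ff object only stores a reference to the file on disk):

```r
library(ff)

# Restores the ff object "mat", which points to stack.ffdata on disk
load(paste0(getwd(), "/data.RData"))
```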

At this point I had the problem of identifying the correct cell from which to extract all the values. I solved it by creating a new raster and filling it with integers from 1 to the maximum number of cells. I did this using the following two lines:

ID_Raster <- raster(STACK[[1]])
ID_Raster[] <- 1:ncell(STACK[[1]])

Now I can use the extract function on this raster to identify the correct cell and then extract the corresponding values from the ff matrix, with the following lines:

ext_ID <- extract(ID_Raster,MapUTM)
ext2 <- mat[as.numeric(ext_ID),]

If I do the extract this way I complete the process in 2.671 secs, which is of great importance for the Shiny interface.
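The two extract steps above can be wrapped into a small helper; this is only a sketch, assuming ID_Raster and mat exist as created earlier (fast_extract is a hypothetical name):

```r
fast_extract <- function(points, id_raster, ffmat){
  # Identify the cell index of each point, then pull the matching rows
  # from the ff matrix, one column per raster layer
  ids <- extract(id_raster, points)
  ffmat[as.numeric(ids), ]
}

ext2 <- fast_extract(MapUTM, ID_Raster, mat)
```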



Tuesday, 28 April 2015

Downloading and Visualizing Seismic Events from USGS

The unlucky events that took place in Nepal have flooded the web with visualizations of the earthquakes from USGS. They normally visualize earthquakes with a colour scale that depends on the age of the event and a marker size that depends on magnitude. I remembered that some time ago I tested ways of downloading and visualizing data from USGS in the same way in R. So I decided to take those tests back, clean them up and publish them. I hope this will not offend anyone; I do not want to disrespect the tragedy, just share my work.

The USGS provides access to csv files for seismic events recorded in several time frames: the past hour, day, week, and 30 days. For each of these, several choices of significance are provided: users can download all the events in the time frame, or limit their request to events with magnitude above 1.0, 2.5 or 4.5, or to significant events only. The files have standard names so that they are always accessible, and they are updated every 15 minutes with new data.
USGS provides the csv files in links with standard names. For example in this case we are downloading all the data in the last month, so the csv file’s name is: all_month.csv. If we wanted to download only the earthquakes in the last day and with a magnitude above 4.5, we would have used the file name: 4.5_day.csv. The links to all the csv provided by USGS are available here: http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php

For this experiment we need the following packages: sp, plotrix, and raster
In R we can easily import the data by simply calling the read.table function and reading the csv file from the server:
URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
Earthquake_30Days <- read.table(URL, sep = ",", header = T)

This will download all the seismic events in the past 30 days.
Now we can transform the data, which are stored in a data.frame, into a spatial object using the following two lines:
coordinates(Earthquake_30Days)=~longitude+latitude
projection(Earthquake_30Days)=CRS("+init=epsg:4326")

The first line transforms the object Earthquake_30Days into a SpatialPointsDataFrame. The second gives it its proper projection, which is the geographical projection (WGS84) also used by Google Maps.
At this point I want to download the borders of all the countries in the world so that I can plot the seismic events with some geographical references:
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")

These three lines can download the border shapefile from the web, unzip it into the working directory and load it.
In this example I will visualize the earthquakes using the same technique used by the USGS, with a colour that varies with the age of the event and a size that depends on magnitude. So the first thing to do is take care of the time. If we check the format of the time variable in the USGS file we see that it is a bit uncommon:
Earthquake_30Days$time[1]
[1] 2015-04-28T12:20:43.410Z

For this reason I created a function to transform this format into something we can use:
conv.time <- function(vector){
split1 <- strsplit(paste(vector),"T")
split2 <- strsplit(split1[[1]][2],"Z")
fin <- paste0(split1[[1]][1],split2[[1]][1])
paste(as.POSIXlt(fin,format="%Y-%m-%d%H:%M:%OS3"))
}

If I apply this function to the previous example I obtain the following:
conv.time(Earthquake_30Days$time[1])
[1] "2015-04-28 12:20:43"
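As an aside, the same conversion could probably be done in a single call by putting the literal "T" and "Z" inside the format string; this is just an alternative sketch, not the method used below:

```r
as.POSIXct("2015-04-28T12:20:43.410Z", format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
```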

Now I can create a new variable in the object with this time-stamp:
DT <- sapply(Earthquake_30Days$time,FUN=conv.time)
Earthquake_30Days$DateTime <- as.POSIXlt(DT)

Now we can start the tricky part. For plotting the events with a custom colour scale and a custom size scale, we first need to create them. Moreover, we also need to create the thresholds needed for the legend.
For the colour scale we can do all that using the following lines:
days.from.today <- round(c(Sys.time()-Earthquake_30Days$DateTime)/60,0)
colour.scale <- color.scale(days.from.today,color.spec="rgb",extremes=c("red","blue"),alpha=0.5)
colors.DF <- data.frame(days.from.today,color.scale(days.from.today,color.spec="rgb",extremes=c("red","blue")))
colors.DF <- colors.DF[with(colors.DF, order(colors.DF[,1])), ]
colors.DF$ID <- 1:nrow(colors.DF)
breaks <- seq(1,nrow(colors.DF),length.out=10)

In the first line I calculate the age of the event as the difference between the system time and the event time-stamp. In the second line I create the colour scale with the function in plotrix, from red to blue and with a certain transparency.
Then I need to create the thresholds for the legend. I first create a data.frame with age and colours, then I order it by age and insert an ID column. At this point I can create the thresholds by simply using the seq function.
I do the same thing with the size thresholds:
size.DF <- data.frame(Earthquake_30Days$mag,Earthquake_30Days$mag/5)
size.DF <- size.DF[with(size.DF, order(size.DF[,1])), ]
size.DF$ID <- 1:nrow(size.DF)
breaks.size <- seq(0,max(Earthquake_30Days$mag/5),length.out=5)

Then I can plot the whole thing:

tiff(filename="Earthquake_Map.tif",width=7000,height=4000, res=300)

#Plot
plot(polygons)
plot(Earthquake_30Days, col= colour.scale, cex=Earthquake_30Days$mag/5, pch=16, add=T)

#Title and Legend
title("Earthquakes in the last 30 days",cex.main=3)
legend.pos <- list(x=-28.52392,y=-20.59119)
rect(xleft=legend.pos$x-5, ybottom=legend.pos$y-30, xright=legend.pos$x+30, ytop=legend.pos$y+10, col="white", border=NA)
legendg(legend.pos,legend=c(round(colors.DF[colors.DF$ID %in% round(breaks,0),1],2)),fill=paste(colors.DF[colors.DF$ID %in% round(breaks,0),2]),bty="n",bg=c("white"),y.intersp=0.75,title="Age",cex=0.8) 
text(x=legend.pos$x+5,y=legend.pos$y+5,"Legend:")
legend(x=legend.pos$x+15,y=legend.pos$y,legend=breaks.size[2:5]*5,pch=points(rep(legend.pos$x+15,4),c(legend.pos$y-6,legend.pos$y-9,legend.pos$y-12,legend.pos$y-15),pch=16,cex=breaks.size[2:5]),cex=0.8,bty="n",bg=c("white"),y.intersp=1.1,title="Magnitude") 

dev.off()

I divided the magnitude by 5 so that the bubbles are not too big. The position of the legends depends on the image: if you decrease the area plotted on the map their location will change, and you can adjust it using geographical coordinates.
The result is the following image:
Download the image here: Earthquake_Map.jpg

The full script is available here: Script

Thursday, 11 December 2014

Accessing, cleaning and plotting NOAA Temperature Data

In my previous post I said that my students are using data from NOAA for their research.
NOAA in fact provides daily averages of several environmental parameters for thousands of weather stations scattered across the globe, completely free.
In detail, NOAA provides the following data:
  • Mean temperature for the day in degrees Fahrenheit to tenths
  • Mean dew point for the day in degrees Fahrenheit to tenths
  • Mean sea level pressure for the day in millibars to tenths
  • Mean station pressure for the day in millibars to tenths
  • Mean visibility for the day in miles to tenths
  • Mean wind speed for the day in knots to tenths
  • Maximum sustained wind speed reported for the day in knots to tenths
  • Maximum wind gust reported for the day in knots to tenths
  • Maximum temperature reported during the day in Fahrenheit to tenths
  • Minimum temperature reported during the day in Fahrenheit to tenths
  • Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths
  • Snow depth in inches to tenths

A description of the dataset can be found here: GSOD_DESC.txt
All these data are available from 1929 to 2014.

The problem with these data is that they require some processing before they can be used for any sort of computation.
For example, the station IDs and coordinates are available in a text file external to the data files, so each data file needs to be cross-referenced with this text file to extract the coordinates of the weather station.
Moreover, the data files are supplied either as one single .tar file or as a series of .gz files which, once extracted, give a series of .op text files that can then be opened in R.
All these steps would require some sort of processing before the data can be used in R. For example, one could download the data files and extract them with 7zip.

In this blog post I will provide a way of downloading the NOAA data, cleaning them (meaning defining outliers and extracting the coordinates of each location), and then plotting them to observe their location and spatial pattern.

The first thing we need to do is loading the necessary packages:

library(raster)
library(XML)
library(plotrix)


Then I can define my working directory:

setwd("C:/")

At this point I can download and read the coordinates file from this address: isd-history.txt

The only way of reading it is by using the function read.fwf, which is used to read "fixed width format" tables in R.
Typical entries of this table are provided as examples below:

024160 99999 NO DATA NO DATA
024284 99999 MORA SW +60.958 +014.511 +0193.2 20050116 20140330
024530 99999 GAVLE/SANDVIKEN AIR FORCE BAS SW +60.717 +017.167 +0016.0 20050101 20140403


Here the columns are named as follow:
  • USAF = Air Force station ID. May contain a letter in the first position
  • WBAN = NCDC WBAN number
  • CTRY = FIPS country ID
  • ST = State for US stations
  • LAT = Latitude in thousandths of decimal degrees
  • LON = Longitude in thousandths of decimal degrees
  • ELEV = Elevation in meters
  • BEGIN = Beginning Period Of Record (YYYYMMDD)
  • END = Ending Period Of Record (YYYYMMDD)

As you can see, the length of each line is identical but its content varies substantially from one line to the other. Therefore there is no way of reading this file with the read.table function.
However, in read.fwf we can define the length of each column in the dataset so that we can create a data.frame out of it.
I used the following widths:
  • 6 - USAF
  • 1 - White space
  • 5 - WBAN
  • 1 - White space
  • 38 - CTRY and ST
  • 7 - LAT
  • 1 - White space
  • 8 - LON
  • 9 - White space, plus Elevation, plus another White space
  • 8 - BEGIN
  • 1 - White space
  • 8 - END

After this step I created a data.frame with just USAF, WBAN, LAT and LON.
The two lines of code for achieving all this are presented below:


coords.fwt <- read.fwf("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/isd-history.txt",widths=c(6,1,5,1,38,7,1,8,9,8,1,8),sep=";",skip=21,fill=T)
coords <- data.frame(ID=paste(as.factor(coords.fwt[,1])),WBAN=paste(as.factor(coords.fwt[,3])),Lat=as.numeric(paste(coords.fwt$V6)),Lon=as.numeric(paste(coords.fwt$V8)))

As you can see, I can work directly with the remote file; there is no need to store a local copy.

After this step I can download the data files for a particular year and extract them. I can use the base function download.file (which comes from R's utils package, not from XML) to download the .tar file, then the function untar to extract its contents.
The code for doing so is the following:

#Download Measurements
year = 2013
dir.create(paste(getwd(),"/Data",sep=""))
download.file(paste("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/",year,"/gsod_",year,".tar",sep=""),destfile=paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),method="auto",mode="wb")

#Extract Measurements
dir.create(paste(getwd(),"/Data/Files",sep=""))
untar(paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),exdir=paste(getwd(),"/Data/Files",sep=""))

I created a variable year so that I can change the year under investigation and the script will still work in the same way.
I also included a call to dir.create to create a new folder to store the 12,511 data files included in the .tar file.
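
Since the download URL and the destination path are both composed with paste, changing year updates everything at once. A quick sketch of the string composition (no download is performed here):

```r
year <- 2013

# The FTP address and the local destination are both built from the year
url  <- paste("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/", year, "/gsod_", year, ".tar", sep = "")
dest <- paste("Data/gsod_", year, ".tar", sep = "")

url   # "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2013/gsod_2013.tar"
dest  # "Data/gsod_2013.tar"
```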
At this point I created a loop to iterate through the file names.


#Create Data Frame
files <- list.files(paste(getwd(),"/Data/Files/",sep=""))

classes <- c(rep("factor",3),rep("numeric",14),rep("factor",3),rep("numeric",2))

t0 <- Sys.time()
station <- data.frame(Lat=numeric(),Lon=numeric(),TempC=numeric())
for(i in 1:length(files)){

data <- read.table(gzfile(paste(getwd(),"/Data/Files/",files[i],sep=""),open="rt"),sep="",header=F,skip=1,colClasses=classes)
if(paste(unique(data$V1))!=paste("999999")){
coords.sub <- coords[paste(coords$ID)==paste(unique(data$V1)),]

 if(nrow(data)>(365/2)){
  ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
  ST2 <- ST1[ST1$Tcount>12,]
  ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
  station[i,] <- ST3
 }
} else {
coords.sub <- coords[paste(coords$WBAN)==paste(unique(data$V2)),]

 if(nrow(data)>(365/2)&coords.sub$Lat!=0&!is.na(coords.sub$Lat)){
  ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
  ST2 <- ST1[ST1$Tcount>12,]
  ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
  station[i,] <- ST3
 }

}
print(i)
flush.console()
}
t1 <- Sys.time()
t1-t0

I also included a timing check (t0 and t1) so that I could measure the total time required to execute the loop, which is around 8 minutes.
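
The timing pattern is just two Sys.time() calls around the work; subtracting them yields a difftime object. A trivial sketch:

```r
t0 <- Sys.time()
Sys.sleep(0.2)   # stand-in for the loop body
t1 <- Sys.time()

elapsed <- t1 - t0
elapsed          # a difftime of roughly 0.2 seconds
```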

Now let's look at the loop in a bit more detail. I start by opening each data file listed in the vector files using the function read.table coupled with gzfile, which decompresses the .gz file into a text connection so that it can be read as a table.
Then I set up an if statement, because in some cases a station can be identified by its USAF value, while in other cases USAF is equal to 999999 and the station needs to be identified by its WBAN value.
Inside this statement I put another if statement to exclude stations without coordinates or with fewer than 6 months of data.
I also included another quality check: alongside each measurement, NOAA provides the number of observations used to compute the daily average. I used this information to exclude daily averages computed from fewer than 12 hourly observations.
In this case I was only interested in temperature data, so my data.frames extract only that variable from the data files (and convert it from Fahrenheit to Celsius). However, it would be fairly easy to focus on a different environmental variable.
The loop fills the empty data.frame named station, created just before the loop starts.
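
The core of the quality filtering can be illustrated on a toy data.frame: convert Fahrenheit to Celsius, drop daily means based on 12 or fewer hourly observations, then average. The values here are made up:

```r
# Hypothetical daily values for one station: 50, 68 and 77 °F,
# based on 24, 10 and 20 hourly observations respectively
ST1 <- data.frame(TempC  = (c(50, 68, 77) - 32) / 1.8,  # 10, 20, 25 °C
                  Tcount = c(24, 10, 20))

ST2 <- ST1[ST1$Tcount > 12, ]            # keeps only the 24- and 20-observation days
round(mean(ST2$TempC, na.rm = TRUE), 2)  # (10 + 25) / 2 = 17.5
```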
At this point I can exclude the NAs from the data.frame, use the package sp to assign coordinates and plot them on top of country polygons:


Temperature.data <- na.omit(station)
coordinates(Temperature.data)=~Lon+Lat

dir.create(paste(getwd(),"/Data/Polygons",sep=""))
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile=paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""))
unzip(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""),exdir=paste(getwd(),"/Data/Polygons",sep=""))
polygons <- shapefile(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.shp",sep=""))

plot(polygons)
points(Temperature.data,pch="+",cex=0.5,col=color.scale(Temperature.data$TempC,color.spec="hsv"))

The result is the plot presented below:


The nice thing about this script is that I just need to change the variable year at the beginning of the script to change the year under investigation.
For example, let's say we want to look at how many stations met the criteria I set here in 1940. I just change the year to 1940, run the script and wait for the results. In this case the wait is not long, because at that time there were only a few stations, mostly in the USA; see below:

However, if we skip forward 10 years to 1950, the picture changes.



The Second World War was just over, and now we have weather stations all across the countries that were heavily involved in it: there are several stations in the UK, Germany, Australia, Japan, Turkey, Palestine, Iran and Iraq.
A similar pattern can be seen if we take a look at the dataset available in 1960:


Here the striking addition is the numerous stations in Vietnam and its surroundings. The reason is probably easy to understand.

Sometimes we forget that our work can be used for good, but it can also be very useful in bad times!!

Here is the complete code to reproduce this experiment:







library(raster)
library(gstat)
library(XML)
library(plotrix)


setwd("D:/Wind Speed Co-Kriging - IPA Project")

#Extract Coordinates for each Station
coords.fwt <- read.fwf("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/isd-history.txt",widths=c(6,1,5,1,38,7,1,8,9,8,1,8),sep=";",skip=21,fill=T)
coords <- data.frame(ID=paste(as.factor(coords.fwt[,1])),WBAN=paste(as.factor(coords.fwt[,3])),Lat=as.numeric(paste(coords.fwt$V6)),Lon=as.numeric(paste(coords.fwt$V8)))


#Download Measurements
year = 2013
dir.create(paste(getwd(),"/Data",sep=""))
download.file(paste("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/",year,"/gsod_",year,".tar",sep=""),destfile=paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),method="auto",mode="wb")
#Extract Measurements
dir.create(paste(getwd(),"/Data/Files",sep=""))
untar(paste(getwd(),"/Data/","gsod_",year,".tar",sep=""),exdir=paste(getwd(),"/Data/Files",sep=""))


#Create Data Frame
files <- list.files(paste(getwd(),"/Data/Files/",sep=""))

classes <- c(rep("factor",3),rep("numeric",14),rep("factor",3),rep("numeric",2))

t0 <- Sys.time()
station <- data.frame(Lat=numeric(),Lon=numeric(),TempC=numeric())
for(i in 1:length(files)){

data <- read.table(gzfile(paste(getwd(),"/Data/Files/",files[i],sep=""),open="rt"),sep="",header=F,skip=1,colClasses=classes)
if(paste(unique(data$V1))!=paste("999999")){
coords.sub <- coords[paste(coords$ID)==paste(unique(data$V1)),]

 if(nrow(data)>(365/2)){
  ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
  ST2 <- ST1[ST1$Tcount>12,]
  ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
  station[i,] <- ST3
 }
} else {
coords.sub <- coords[paste(coords$WBAN)==paste(unique(data$V2)),]

 if(nrow(data)>(365/2)&coords.sub$Lat!=0&!is.na(coords.sub$Lat)){
  ST1 <- data.frame(TempC=(data$V4-32)/1.8,Tcount=data$V5)
  ST2 <- ST1[ST1$Tcount>12,]
  ST3 <- data.frame(Lat=coords.sub$Lat,Lon=coords.sub$Lon,TempC=round(mean(ST2$TempC,na.rm=T),2))
  station[i,] <- ST3
 }

}
print(i)
flush.console()
}
t1 <- Sys.time()
t1-t0

Temperature.data <- na.omit(station)
coordinates(Temperature.data)=~Lon+Lat

dir.create(paste(getwd(),"/Data/Polygons",sep=""))
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile=paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""))
unzip(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.zip",sep=""),exdir=paste(getwd(),"/Data/Polygons",sep=""))
polygons <- shapefile(paste(getwd(),"/Data/Polygons/","TM_WORLD_BORDERS_SIMPL-0.3.shp",sep=""))


jpeg(paste(getwd(),"/Data/",year,".jpg",sep=""),2000,1500,res=300)
plot(polygons,main=paste("NOAA Database ",year,sep=""))
points(Temperature.data,pch="+",cex=0.5,col=color.scale(Temperature.data$TempC,color.spec="hsv"))
dev.off()

Wednesday, 10 December 2014

World Point Grid

These days I am following a couple of master projects dealing with renewable energy potentials at the global scale. We wanted to try and compute these potentials using only free data and see how far we could go.
For wind speed there are plenty of resources freely available on the web. In particular my students downloaded the daily averages from NOAA: http://gis.ncdc.noaa.gov/geoportal/catalog/search/resource/details.jsp?id=gov.noaa.ncdc%3AC00516

One of the problems they encountered was that these data are obviously available only for discrete locations, where the meteorological stations are placed, so if we want to create a map from them we need some form of estimation. Time was limited, so we opted for one of the simplest methods out there, i.e. kriging. Using other forms of estimation optimized for wind, such as dispersion or other physical models, would have been almost impossible to pull off at the global scale. These models need too much time; we are talking about months of computation to estimate wind speed in one single country.
Anyway, ETH is an ESRI development centre and therefore the students normally do their work in ArcGIS; this project was no exception. For this reason we used Geostatistical Analyst in ArcGIS for the interpolation. The process was fast considering that we were dealing with a global-scale interpolation; even when we used two covariates and tested co-kriging, the entire process completed in less than 10 minutes.
The problem is that the output of Geostatistical Analyst is a sort of contour raster, which cannot be directly used to export its values onto a point grid. In this project we were planning to work at a 1 km scale, meaning that we would have needed a mean wind speed value for each point on a 1 km grid. In theory this can be done directly from ArcGIS: you can use the function "Create Fishnet" to create the point grid and then the prediction function in Geostatistical Analyst to estimate an interpolated value for each point in the grid. I said in theory because in practice creating a point grid for the whole world is almost impossible with standard PCs. In our student labs we have 4 GB of RAM in each PC, and with that there is no way to create the grid. We also tried a sort of "raw parallelization", meaning that we used 10 PCs, each creating one small part of the grid, and then extracted just the part that overlaps with land (excluding the oceans). Even with this process (which is by no means practical, elegant or fast) we were unable to finish. Even 1/10 of the world grid is enough to fill 4 GB of RAM.
In the end we decided to work at a 10 km scale because we simply did not have time to try alternative routes. However, the whole thing made me think about possible ways of creating a point grid for the whole world in R. We could use the function spsample to create point grids from polygon shapefiles, but unless you have a ton of RAM there is no way of doing it in one go. So I thought about iterating through the country polygons, creating a small grid for each and appending its coordinates to an external TXT file. Even this process did not work: as soon as the loop reached Argentina, the process filled the 8 GB of RAM of my PC and stopped.
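
As a small illustration of spsample, here is a sketch on a made-up unit-square polygon rather than a real country (requires the sp package, which raster loads):

```r
library(sp)

# A 1x1 degree square polygon
sq <- SpatialPolygons(list(Polygons(list(
        Polygon(cbind(c(0, 1, 1, 0), c(0, 0, 1, 1)))), ID = "sq")))

# Regular grid of points with 0.25-degree spacing; a fixed offset
# makes the result reproducible (by default the offset is random)
pts <- spsample(sq, cellsize = 0.25, type = "regular", offset = c(0.5, 0.5))

head(as.data.frame(pts))  # two coordinate columns, as written out by the loop
```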
The solution was to use even smaller-scale polygons. I downloaded the Natural Earth 1:10m cultural dataset with states and provinces and iterated through that. The process takes quite a long time using one CPU, but it can probably be parallelized. The final size of the file, with just two columns holding the coordinates, is 4.8 GB. It can be used by loading it as an ff data.frame (read.table.ffdf) or by loading only small chunks of it iteratively when you need it for estimation.
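
A minimal sketch of the ff route, using a small temporary file instead of the 4.8 GB grid (assumes the ff package is installed):

```r
library(ff)

# Write a toy two-column coordinate file, space-separated like the real one
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(lon = runif(100), lat = runif(100)),
            tmp, sep = " ", row.names = FALSE, col.names = FALSE)

# Read it back as an ff data.frame; for the real file this keeps
# most of the data on disk instead of in RAM
grid <- read.table.ffdf(file = tmp, sep = " ")
dim(grid)  # 100 rows, 2 columns
```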
The R script I created takes care of the whole process. It downloads the dataset from Natural Earth, excludes Antarctica (which is too big), runs the loop and saves a TXT file in the working directory.

This is my code, I hope it can help you somehow:


library(raster)
library(ff)


setwd("C:/")

download.file("http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_1_states_provinces.zip",destfile="ne_10m_admin_1_states_provinces.zip")
unzip("ne_10m_admin_1_states_provinces.zip",exdir="StateProvinces")

setwd("C:/StateProvinces")

polygons <- shapefile(list.files(pattern=".shp"))
polygons <- polygons[! paste(polygons$name) %in% paste("Antarctica"),]

names(polygons)

i=1
while(i<=nrow(polygons)){
if(paste(polygons[i,]$name)!=paste("NA")){
 grids <- try(spsample(polygons[i,],cellsize=0.01,type="regular"))

 if(inherits(grids, "try-error"))
    {
      i = i+1
    } else {

 write.table(as.data.frame(grids),"Grid_1Km.txt",sep=" ",row.names=F,col.names=F,append=T)
 
 print(i)
 flush.console()
 i=i+1 }
} else { i = i+1 }
}

#file.remove("Grid_1Km.txt")

grid <- read.table.ffdf(file="Grid_1Km.txt",nrows=1000,sep=" ")
names(grid)