R tutorial for Spatial Statistics

Thursday, 21 February 2019

Shiny App to access NOAA data

Now that the US Government shutdown is over, it is time to download NOAA weather daily summaries in bulk and store them somewhere safe so that at the next shutdown we do not need to worry.

Below is the code to download data for a series of years:

NOAA_BulkDownload <- function(Year, Dir){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/",Year,"/gsod_",Year,".tar")
  download.file(URL, destfile=paste0(Dir,"/gsod_",Year,".tar"),
                method="auto",mode="wb")
  
  if(dir.exists(paste0(Dir,"/NOAA Data"))==FALSE){dir.create(paste0(Dir,"/NOAA Data"))}
  
  untar(paste0(Dir,"/gsod_",Year,".tar"), 
        exdir=paste0(Dir,"/NOAA Data"))
}

An example on how to use this function is below:

Date <- 1980:2019
lapply(Date, NOAA_BulkDownload, Dir="C:/Users/fabio.veronesi/Desktop/New folder")

Theoretically, the process can be parallelized using parLappy, but I have not tested it.

Once we have all the file in one folder we can create the Shiny app to query these data.
The app will have a dashboard look with two tabs: one with a Leaflet map showing the location of the weather stations (markers are shown only at a certain zoom level to decrease loading time and RAM usage), below:

The other tab will allow the creation of the time-series (each file represents only 1 year, so we need to bind several files together to get the full period we are interested in) and it will also do some basic data cleaning, e.g. turn T from F to C, or snow depth from inches to mm. Finally, from this tab users can view the final product and download a cleaned csv.

The code for ui and server scripts is on my GitHub:
https://github.com/fveronesi/NOAA_ShinyApp

Saturday, 16 February 2019

Weather Forecast from MET Office

This is another function I wrote to access the MET office API and obtain a 5-day ahead weather forecast:

METDataDownload <- function(stationID, product, key){
  library("RJSONIO") #Load Library
  library("plyr")
  library("dplyr")
  library("lubridate")
  connectStr <- paste0("http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/json/",stationID,"?res=",product,"&key=",key)
  
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  
  #Station
  LocID <- data.json$SiteRep$DV$Location$`i`
  LocName <- data.json$SiteRep$DV$Location$name
  Country <- data.json$SiteRep$DV$Location$country
  Lat <- data.json$SiteRep$DV$Location$lat
  Lon <- data.json$SiteRep$DV$Location$lon
  Elev <- data.json$SiteRep$DV$Location$elevation
  
  Details <- data.frame(LocationID = LocID,
                        LocationName = LocName,
                        Country = Country,
                        Lon = Lon,
                        Lat = Lat,
                        Elevation = Elev)
  #Parameters
  param <- do.call("rbind",data.json$SiteRep$Wx$Param)
  
  #Forecast
  if(product == "daily"){
    dates <- unlist(lapply(data.json$SiteRep$DV$Location$Period, function(x){x$value}))
    DayForecast <- do.call("rbind", lapply(data.json$SiteRep$DV$Location$Period, function(x){x$Rep[[1]]}))
    NightForecast <- do.call("rbind", lapply(data.json$SiteRep$DV$Location$Period, function(x){x$Rep[[2]]}))
    colnames(DayForecast)[ncol(DayForecast)] <- "Type"
    colnames(NightForecast)[ncol(NightForecast)] <- "Type"
    
    ForecastDF <- plyr::rbind.fill.matrix(DayForecast, NightForecast) %>%
      as_tibble() %>%
      mutate(Date = as.Date(rep(dates, 2))) %>%
      mutate(Gn = as.numeric(Gn),
             Hn = as.numeric(Hn),
             PPd = as.numeric(PPd),
             S = as.numeric(S),
             Dm = as.numeric(Dm),
             FDm = as.numeric(FDm),
             W = as.numeric(W),
             U = as.numeric(U),
             Gm = as.numeric(Gm),
             Hm = as.numeric(Hm),
             PPn = as.numeric(PPn),
             Nm = as.numeric(Nm),
             FNm = as.numeric(FNm))
    
    
  } else {
    dates <- unlist(lapply(data.json$SiteRep$DV$Location$Period, function(x){x$value}))
    Forecast <- do.call("rbind", lapply(lapply(data.json$SiteRep$DV$Location$Period, function(x){x$Rep}), function(x){do.call("rbind",x)}))
    colnames(Forecast)[ncol(Forecast)] <- "Hour"
    
    DateTimes <- seq(ymd_hms(paste0(as.Date(dates[1])," 00:00:00")),ymd_hms(paste0(as.Date(dates[length(dates)])," 21:00:00")), "3 hours")
    
    if(nrow(Forecast)<length(DateTimes)){
      extra_lines <- length(DateTimes)-nrow(Forecast)
      for(i in 1:extra_lines){
        Forecast <- rbind(rep("0", ncol(Forecast)), Forecast)
      }
    }
    
    ForecastDF <- Forecast %>%
      as_tibble() %>%
      mutate(Hour = DateTimes) %>%
      filter(D != "0") %>%
      mutate(F = as.numeric(F),
             G = as.numeric(G),
             H = as.numeric(H),
             Pp = as.numeric(Pp),
             S = as.numeric(S),
             T = as.numeric(T),
             U = as.numeric(U),
             W = as.numeric(W))
    
  }
  
  
  list(Details, param, ForecastDF)
  
}

The API key can be obtained for free at this link:
https://www.metoffice.gov.uk/datapoint/api

Once we have an API key we can simply insert the station ID and the type of product we want to obtain the forecast. We can select between two products: daily and 3hourly

To obtain the station ID we need to use another query and download an XML with all stations names and ID:

library(xml2)

url = paste0("http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/daily/sitelist?key=",key)
XML_StationList <- read_xml(url)

write_xml(XML_StationList, "StationList.xml")

This will save an XML, which we can then open with a txt editor (e.g. Notepad++).

The function can be used as follows:

METDataDownload(stationID=3081, product="daily", key)

It will return a list with 3 elements:

Station info: Name, ID, Lon, Lat, Elevation
Parameter explanation
Weather forecast: tibble format

I have not tested it much, so if you find any bug you are welcome to tweak it on GitHub:

https://github.com/fveronesi/METOfficeForecast

Geocoding function

This is a very simple function to perform geocoding using the Google Maps API:

getGeoCode <- function(gcStr, key)  {
  library("RJSONIO") #Load Library
  gcStr <- gsub(' ','%20',gcStr) #Encode URL Parameters
  #Open Connection
  connectStr <- paste0('https://maps.googleapis.com/maps/api/geocode/json?address=',gcStr, "&key=",key) 
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  #Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK")   {
    lat <- data.json["results.geometry.location.lat"]
    lng <- data.json["results.geometry.location.lng"]
    gcodes <- c(lat, lng)
    names(gcodes) <- c("Lat", "Lng")
    return (gcodes)
  }
}

Essentially, users need to get an API key from google and then use as an input (string) for the function. The function itself is very simple, and it is an adaptation of some code I found on-line (unfortunately I did not write down where I found the original version so I do not have a way to reference the source, sorry!!).

geoCodes <- getGeoCode(gcStr="11 via del piano, empoli", key)

To use the function we simply need to include an address, and it will return its coordinates in WGS84.
It can be used in a mutate call within dplyr and it is reasonably fast.

The repository is here:
https://github.com/fveronesi/RGeocode.r

Friday, 15 June 2018

Spreadsheet Data Manipulation in R

Today I decided to create a new repository on GitHub where I am sharing code to do spreadsheet data manipulation in R.

The first version of the repository and R script is available here: SpreadsheetManipulation_inR

As an example I am using a csv freely available from the IRS, the US Internal Revenue Service.
https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi

This spreadsheet has around 170'000 rows and 131 columns.

Please feel free to request new functions to be added or add functions and code yourself directly on GitHub.

Sunday, 25 March 2018

Data Visualization Website with Shiny

My second Shiny app is dedicated to data visualization.
Here users can simply upload any csv or txt file and create several plots:

Histograms (with option for faceting)

Barchart (with error bars, and option for color with dodging and faceting)

BoxPlots (with option for faceting)

Scatterplots (with options for color, size and faceting)

TimeSeries

Error bars in barcharts are computed with the mean_se function in ggplot2, which computes error bars as mean ± standard error. When the color option is set, barcharts are plotted one next to the other for each color (option dodging).

For scatterplots, if the option for faceting is provided each plot will include a linear regression lines.

Some examples are below:

For the time being there is no option for saving plots, apart from saving the image from the screen. However, I would like to implement an option to have plots in tiff at 300dpi, but all the code I tried so far did not work. I will keep trying.

The app can be accessed here: https://fveronesi.shinyapps.io/DataViz/

The R code is available here: https://github.com/fveronesi/Shiny_DataViz

Sunday, 11 March 2018

Street Crime UK - Shiny App

Introduction

This is a shiny app to visualize heat maps of Street Crimes across Britain from 2010-12 to 2018-01 and test their spatial pattern.

The code for both ui.R and server.R is available from my GitHub at: https://github.com/fveronesi/StreetCrimeUK_Shiny

Usage

Please be aware that this apps downloads data from my personal Dropbox once it starts and every time the user changes some of the settings. This was the only work-around I could think of to use external data in shinyapps.io for free. However, this also makes the app a bit slow, so please be patient.

Users can select a date with two sliders (I personally do not like the dateInput tool), then a crime type and click Draw Map to update the map with new data. I also included a option to plot the Ripley K-function (function Kest in package spatstat) and the p-value of the quadrat.test (again from spatstat). Both tools work using the data shown within the screen area, so their results change as users interact with the map. The Ripley K function shows a red dashed line with the expected nearest neighbour distribution of points that are randomly distributed in space (i.e. follow a Poisson distribution). The black line is the one computed from the points shown on screen. If the black line is above the red means the observations shown on the map are clustered, while if it is below the red line means the crimes are scattered regularly in space. A more complete overview of the Ripley K function is available at this link from ESRI.

The p-value from the quadrat test is testing a null hypothesis that the crimes are scattered randomly in space, against an alternative that they are clustered. If the p-value is below 0.05 (significance level of 5%) we can accept the alternative hypothesis that our data are clustered. Please be aware that this test does not account for regularly space crimes.

NOTE

Please not that the code here is not reproducible straight away. The app communicates with my Dropbox, though the package rdrop2, which requires a token to download data from Dropbox. More info github.com/karthik/rdrop2.

I am sharing the code to potentially use a taken downloaded from elsewhere, but the url that points to my Dropbox will clearly not be shared.

Preparing the dataset

Csv files with crime data can be downloaded directly from the data.police.uk website. Please check the dates carefully, since each of these files contains more that one years of monthly data. The main issue with these data is that they are divided by local police forces, so for example we will have a csv for each month from the Bedfordshire Police, which only covers that part of the country. Moreover, these csv contain a lot of data, not only coordinates; they also contain the type of crimes, plus other details, which we do not need and which makes the full collection a couple of Gb in size.

For these reasons I did some pre-processing, First of all I extracted all csv files into a folder named "CrimeUK" and then I ran the code below:

lista = list.files("E:/CrimesUK",pattern="street",recursive=T,include.dirs=T,full.names=T,ignore.case = T)

for(i in lista){
  DF = read.csv(i)

   write.table(data.frame(LAT=DF$Latitude, LON=DF$Longitude, TYPE=DF$Crime.type),
               file=paste0("E:/CrimesUK/CrimesUK",substr(paste(DF$Month[1]),1,4),"_",substr(paste(DF$Month[1]),6,7),".csv"),
               sep=",",row.names=F,col.names=F, append=T)
   print(i)
}

Here I first create a list of all csv files, with full link, searching inside all sub directory. Then I started a for loop to iterate through the files. The loop simply loads each file and than save part of its contents (namely coordinates and crime type) into new csv named after using year and month. This will help me identify which files to download from Dropbox, based on user inputs.

Once I had these files I simply uploded them to my Dropbox.

The link to test the app is:

fveronesi.shinyapps.io/CrimeUK/

A snapshot of the screen is below:

Tuesday, 25 July 2017

Experiment designs for Agriculture

This post is more for personal use than anything else. It is just a collection of code and functions to produce some of the most used experimental designs in agriculture and animal science.

I will not go into details about these designs. If you want to know more about what to use in which situation you can find material at the following links:

Design of Experiments (Penn State): https://onlinecourses.science.psu.edu/stat503/node/5

Statistical Methods for Bioscience (Wisconsin-Madison): http://www.stat.wisc.edu/courses/st572-larget/Spring2007/

R Packages to create several designs are presented here: https://cran.r-project.org/web/views/ExperimentalDesign.html

A very good tutorial about the use of the package Agricolae can be found here:
https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf

Complete Randomized Design

This is probably the most common design, and it is generally used when conditions are uniform, so we do not need to account for variations due for example to soil conditions.

In R we can create a simple CRD with the function expand.grid and then with some randomization:

 TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))  
 Data.CRD = TR.Structure[sample(1:nrow(TR.Structure),nrow(TR.Structure)),]  
   
 Data.CRD = cbind(PlotN=1:nrow(Data.CRD), Data.CRD[,-1])  
   
 write.csv(Data.CRD, "CompleteRandomDesign.csv", row.names=F)

The first line create a basic treatment structure, with rep that identifies the number of replicate, that looks like this:

 > TR.Structure  
   rep Treatment1 Treatment2  
 1  1     A     A  
 2  2     A     A  
 3  3     A     A  
 4  1     B     A  
 5  2     B     A  
 6  3     B     A  
 7  1     A     B  
 8  2     A     B  
 9  3     A     B  
 10  1     B     B  
 11  2     B     B  
 12  3     B     B  
 13  1     A     C  
 14  2     A     C  
 15  3     A     C  
 16  1     B     C  
 17  2     B     C  
 18  3     B     C

The second line randomizes the whole data.frame to obtain a CRD, then we add with cbind a column at the beginning with an ID for the plot, while also eliminating the columns with rep.

Add Control

To add a Control we need to write two separate lines, one for the treatment structure and the other for the control:

 TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))  
 CR.Structure = expand.grid(rep=1:3, Treatment1=c("Control"), Treatment2=c("Control"))  
   
 Data.CCRD = rbind(TR.Structure, CR.Structure)

This will generate the following table:

 > Data.CCRD  
   rep Treatment1 Treatment2  
 1  1     A     A  
 2  2     A     A  
 3  3     A     A  
 4  1     B     A  
 5  2     B     A  
 6  3     B     A  
 7  1     A     B  
 8  2     A     B  
 9  3     A     B  
 10  1     B     B  
 11  2     B     B  
 12  3     B     B  
 13  1     A     C  
 14  2     A     C  
 15  3     A     C  
 16  1     B     C  
 17  2     B     C  
 18  3     B     C  
 19  1  Control  Control  
 20  2  Control  Control  
 21  3  Control  Control

As you can see the control is totally excluded from the rest. Now we just need to randomize, again using the function sample:

 Data.CCRD = Data.CCRD[sample(1:nrow(Data.CCRD),nrow(Data.CCRD)),]  
   
 Data.CCRD = cbind(PlotN=1:nrow(Data.CCRD), Data.CCRD[,-1])  
   
 write.csv(Data.CCRD, "CompleteRandomDesign_Control.csv", row.names=F)

Block Design with Control

The starting is the same as before. The difference starts when we need to randomize, because in CRD we randomize over the entire table, but with blocks, we need to do it by block.

 TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))  
 CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))  
   
 Data.CBD = rbind(TR.Structure, CR.Structure)  
   
 Block1 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]  
 Block2 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]  
 Block3 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]  
   
 Data.CBD = rbind(Block1, Block2, Block3)  
   
 BlockID = rep(1:nrow(Block1),3)  
   
 Data.CBD = cbind(Block = BlockID, Data.CBD)  
   
 write.csv(Data.CBD, "BlockDesign_Control.csv", row.names=F)

As you can see from the code above, we've created three objects, one for each block, where we used the function sample to randomize.

Other Designs with Agricolae

The package agricolae includes many designs, which I am sure will cover all your needs in terms of setting up field and lab experiments.

We will look at some of them, so first let's install the package:

 install.packages("agricolae")  
 library(agricolae)

The main syntax for design in agricolae is the following:

 Trt1 = c("A","B","C")  
 design.crd(trt=Trt1, r=3)

The result is the output below:

 > design.crd(trt=Trt1, r=3)  
 $parameters  
 $parameters$design  
 [1] "crd"  
   
 $parameters$trt  
 [1] "A" "B" "C"  
   
 $parameters$r  
 [1] 3 3 3  
   
 $parameters$serie  
 [1] 2  
   
 $parameters$seed  
 [1] 1572684797  
   
 $parameters$kinds  
 [1] "Super-Duper"  
   
 $parameters[[7]]  
 [1] TRUE  
   
   
 $book  
  plots r Trt1  
 1  101 1  A  
 2  102 1  B  
 3  103 2  B  
 4  104 2  A  
 5  105 1  C  
 6  106 3  A  
 7  107 2  C  
 8  108 3  C  
 9  109 3  B

As you can see the function takes only one argument for treatments and another for replicates. Therefore, if we need to include a more complex treatment structure we first need to work on them:

 Trt1 = c("A","B","C")  
 Trt2 = c("1","2")  
 Trt3 = c("+","-")  
   
 TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))  
 TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))  
 TRT.Control = c(TRT, rep("Control", 3))

As you can see we have now three treatments, which are merged into unique strings within the function sapply:

 > TRT  
  [1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"

Then we need to include the control, and then we can use the object TRT.Control with the function design.crd, from which we can directly obtain the data.frame with $book:

 > design.crd(trt=TRT.Control, r=3)$book  
   plots r TRT.Control  
 1  101 1     A2+  
 2  102 1     B1+  
 3  103 1   Control  
 4  104 1     B2+  
 5  105 1     A1+  
 6  106 1     C2+  
 7  107 2     A2+  
 8  108 1     C2-  
 9  109 2   Control  
 10  110 1     B2-  
 11  111 3   Control  
 12  112 1   Control  
 13  113 2     C2-  
 14  114 2   Control  
 15  115 1     C1+  
 16  116 2     C1+  
 17  117 2     B2-  
 18  118 1     C1-  
 19  119 2     C2+  
 20  120 3     C2-  
 21  121 1     A2-  
 22  122 2     C1-  
 23  123 2     A1+  
 24  124 3     C1+  
 25  125 1     B1-  
 26  126 3   Control  
 27  127 3     A1+  
 28  128 2     B1+  
 29  129 2     B2+  
 30  130 3     B2+  
 31  131 1     A1-  
 32  132 2     B1-  
 33  133 2     A2-  
 34  134 1   Control  
 35  135 3     C2+  
 36  136 2   Control  
 37  137 2     A1-  
 38  138 3     B1+  
 39  139 3   Control  
 40  140 3     A2-  
 41  141 3     A1-  
 42  142 3     A2+  
 43  143 3     B2-  
 44  144 3     C1-  
 45  145 3     B1-

A note about this design is that, since we repeated the string "Control" 3 times when creating the treatment structure, the design basically has additional repetition for the control. If this is what you want to do fine, otherwise you need to change from:

TRT.Control = c(TRT, rep("Control", 3))

to:

TRT.Control = c(TRT, "Control")

This will create a design with 39 lines, and 3 controls.

Other possible designs are:

 #Random Block Design  
 design.rcbd(trt=TRT.Control, r=3)$book  
   
 #Incomplete Block Design  
 design.bib(trt=TRT.Control, r=7, k=3)  
   
 #Split-Plot Design  
 design.split(Trt1, Trt2, r=3, design=c("crd"))  
   
 #Latin Square  
 design.lsd(trt=TRT.tmp)$sketch

Others not included above are: Alpha designs, Cyclic designs, Augmented block designs, Graeco - latin square designs, Lattice designs, Strip Plot Designs, Incomplete Latin Square Design

Update 26/07/2017 - Plotting your Design

Today I received an email from Kevin Wright, creator of the package desplot.

This is a very cool package that allows you to plot your design with colors and text, so that it becomes quite informative for the reader. On the link above you will find several examples on how to plot designs for existing datasets. In this paragraph I would like to focus on how to create cool plots when we are designing our experiments.

Let's look at some code:

 install.packages("desplot")  
 library(desplot)

#Complete Randomized Design  
 CRD = design.crd(trt=TRT.Control, r=3)$book  
 CRD = CRD[order(CRD$r),]  
 CRD$col = CRD$r  
 CRD$row = rep(1:13,3)

desplot(form=TRT.Control ~ col+row, data=CRD, text=TRT.Control, out1=col, out2=row,   
      cex=1, main="Complete Randomized Design")

After installing the package desplot I created an example for plotting the comple randomized design we created above.

To use the function desplot we first need to include in the design columns and rows, so that the function knows what to plot and where. For this I first ordered the data.frame based on the column r, which stands for replicates. Then I added a column named col, with values equal to r (I could use the column r, but I wanted to make clear the procedure), and another named row. Here I basically repeated a vector from 1 to 13 (which is the total number of treatments per replicate), 3 times (i.e. the number of replicates).

The function desplot returns the following plot, which I think is very informative:

We could do the same with the random block design:

 #Random Block Design  
 RBD = design.rcbd(trt=TRT.Control, r=6)$book  
 RBD = RBD[order(RBD$block),]  
 RBD$col = RBD$block  
 RBD$row = rep(1:13,6)  
   
 desplot(form=block~row+col, data=RBD, text=TRT.Control, col=TRT.Control, out1=block, out2=row, cex=1, main="Randomized Block Design")

thus obtaining the following image:

Final Note

For repeated measures and crossover designs I think we can create designs simply using again the function expand.grid and including time and subjects, as I did in my previous post about Power Analysis. However, there is also the package Crossover that deals specifically with crossover design and on this page you can find more packages that deal with clinical designs: https://cran.r-project.org/web/views/ClinicalTrials.html