R tutorial for Spatial Statistics<br />
I'm Dr. Fabio Veronesi, data scientist at WRC plc. This is my personal blog, where I share R code regarding plotting, descriptive statistics, inferential statistics, Shiny apps, and spatio-temporal statistics with an eye to the GIS world.<br />
<br />
<h3>Shiny App to access NOAA data (2019-02-21)</h3>
Now that the US Government shutdown is over, it is time to download NOAA weather daily summaries in bulk and store them somewhere safe, so that at the next shutdown we do not need to worry.<br />
<br />
Below is the code to download data for a series of years:<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid blue; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">NOAA_BulkDownload <- function(Year, Dir){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/", Year, "/gsod_", Year, ".tar")
  download.file(URL, destfile = paste0(Dir, "/gsod_", Year, ".tar"),
                method = "auto", mode = "wb")
  if(!dir.exists(paste0(Dir, "/NOAA Data"))){ dir.create(paste0(Dir, "/NOAA Data")) }
  untar(paste0(Dir, "/gsod_", Year, ".tar"),
        exdir = paste0(Dir, "/NOAA Data"))
}
</pre>
</div>
<br />
<br />
An example of how to use this function is below:
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid blue; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">Years <- 1980:2019
lapply(Years, NOAA_BulkDownload, Dir = "C:/Users/fabio.veronesi/Desktop/New folder")
</pre>
</div>
<br />
Theoretically, the process can be parallelized with parLapply from the parallel package, but I have not tested it.
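For reference, the call pattern with parLapply would look like the sketch below; I demonstrate it on a trivial function rather than on the actual downloads (which I have not run in parallel), and the cluster size and target folder are placeholders.

```r
library(parallel)

cl <- makeCluster(2)  # placeholder size; detectCores() - 1 is a common choice
# The real call would be:
# parLapply(cl, 1980:2019, NOAA_BulkDownload, Dir = "path/to/folder")
# The same mechanism on a trivial function, to show the pattern:
res <- parLapply(cl, 1:4, function(x) x^2)
stopCluster(cl)
unlist(res)
```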
<br />
<br />
Once we have all the files in one folder, we can create the Shiny app to query these data.
<br />
The app will have a dashboard look with two tabs. The first contains a Leaflet map showing the locations of the weather stations (markers appear only past a certain zoom level, to decrease loading time and RAM usage):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVlR937p5LGJxkxC3up05-RZ3A4b0q_wGw1itG4AcTzJteAisUUmzzcq0GvT0o4u78AcA0vKUIh9iYWdRCFa01b1ZL0fbL8eMSGEKjRCgH2kZfF5ca0mH76Az8kzHk4J2bJQW122hwClIc/s1600/Noaa_Screenshot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="787" data-original-width="1600" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVlR937p5LGJxkxC3up05-RZ3A4b0q_wGw1itG4AcTzJteAisUUmzzcq0GvT0o4u78AcA0vKUIh9iYWdRCFa01b1ZL0fbL8eMSGEKjRCgH2kZfF5ca0mH76Az8kzHk4J2bJQW122hwClIc/s640/Noaa_Screenshot.jpg" width="640" /></a></div>
<br />
<br />
<br />
The other tab allows the creation of time-series (each file covers only one year, so we need to bind several files together to obtain the full period we are interested in) and it also does some basic data cleaning, e.g. converting temperature from Fahrenheit to Celsius, or snow depth from inches to mm. Finally, from this tab users can view the final product and download a cleaned csv.<br />
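The two unit conversions mentioned above are one-liners; a minimal sketch, with helper names that are mine rather than the ones used in the app:

```r
f_to_c <- function(f) (f - 32) * 5 / 9  # temperature, Fahrenheit to Celsius
in_to_mm <- function(x) x * 25.4        # snow depth, inches to millimetres

f_to_c(c(32, 212))   # 0 100
in_to_mm(c(1, 10))   # 25.4 254.0
```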
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh78Wo_0lBm3apHnFVH5kxyV9-0sAp6SP9vFDM8fh9rusJ4zHNu6e2OBvEdlz3rAFhyphenhyphenv5yM2NA8S6QdskIT1oMh_egWmrGmvGR9HJPMf3HSE6Z11uXQdMQd66NRgRfO5ZBHa_bD_cJCOrMx/s1600/Noaa_Screenshot_2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="787" data-original-width="1600" height="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh78Wo_0lBm3apHnFVH5kxyV9-0sAp6SP9vFDM8fh9rusJ4zHNu6e2OBvEdlz3rAFhyphenhyphenv5yM2NA8S6QdskIT1oMh_egWmrGmvGR9HJPMf3HSE6Z11uXQdMQd66NRgRfO5ZBHa_bD_cJCOrMx/s640/Noaa_Screenshot_2.jpg" width="640" /></a></div>
<br />
<br />
The code for ui and server scripts is on my GitHub:<br />
<a href="https://github.com/fveronesi/NOAA_ShinyApp" target="_blank">https://github.com/fveronesi/NOAA_ShinyApp</a><br />
<br />
<h3>Weather Forecast from MET Office (2019-02-16)</h3>
This is another function I wrote, to access the MET Office API and obtain a 5-day-ahead weather forecast:<br />
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">METDataDownload <- function(stationID, product, key){
  library("RJSONIO") #Load Library
  library("plyr")
  library("dplyr")
  library("lubridate")
  connectStr <- paste0("http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/json/", stationID, "?res=", product, "&key=", key)
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)

  #Station
  LocID <- data.json$SiteRep$DV$Location$`i`
  LocName <- data.json$SiteRep$DV$Location$name
  Country <- data.json$SiteRep$DV$Location$country
  Lat <- data.json$SiteRep$DV$Location$lat
  Lon <- data.json$SiteRep$DV$Location$lon
  Elev <- data.json$SiteRep$DV$Location$elevation
  Details <- data.frame(LocationID = LocID,
                        LocationName = LocName,
                        Country = Country,
                        Lon = Lon,
                        Lat = Lat,
                        Elevation = Elev)

  #Parameters
  param <- do.call("rbind", data.json$SiteRep$Wx$Param)

  #Forecast
  if(product == "daily"){
    dates <- unlist(lapply(data.json$SiteRep$DV$Location$Period, function(x){x$value}))
    DayForecast <- do.call("rbind", lapply(data.json$SiteRep$DV$Location$Period, function(x){x$Rep[[1]]}))
    NightForecast <- do.call("rbind", lapply(data.json$SiteRep$DV$Location$Period, function(x){x$Rep[[2]]}))
    colnames(DayForecast)[ncol(DayForecast)] <- "Type"
    colnames(NightForecast)[ncol(NightForecast)] <- "Type"
    ForecastDF <- plyr::rbind.fill.matrix(DayForecast, NightForecast) %>%
      as_tibble() %>%
      mutate(Date = as.Date(rep(dates, 2))) %>%
      mutate(Gn = as.numeric(Gn),
             Hn = as.numeric(Hn),
             PPd = as.numeric(PPd),
             S = as.numeric(S),
             Dm = as.numeric(Dm),
             FDm = as.numeric(FDm),
             W = as.numeric(W),
             U = as.numeric(U),
             Gm = as.numeric(Gm),
             Hm = as.numeric(Hm),
             PPn = as.numeric(PPn),
             Nm = as.numeric(Nm),
             FNm = as.numeric(FNm))
  } else {
    dates <- unlist(lapply(data.json$SiteRep$DV$Location$Period, function(x){x$value}))
    Forecast <- do.call("rbind", lapply(lapply(data.json$SiteRep$DV$Location$Period, function(x){x$Rep}), function(x){do.call("rbind", x)}))
    colnames(Forecast)[ncol(Forecast)] <- "Hour"
    DateTimes <- seq(ymd_hms(paste0(as.Date(dates[1]), " 00:00:00")), ymd_hms(paste0(as.Date(dates[length(dates)]), " 21:00:00")), "3 hours")
    if(nrow(Forecast) < length(DateTimes)){
      extra_lines <- length(DateTimes) - nrow(Forecast)
      for(i in 1:extra_lines){
        Forecast <- rbind(rep("0", ncol(Forecast)), Forecast)
      }
    }
    ForecastDF <- Forecast %>%
      as_tibble() %>%
      mutate(Hour = DateTimes) %>%
      filter(D != "0") %>%
      mutate(F = as.numeric(F),
             G = as.numeric(G),
             H = as.numeric(H),
             Pp = as.numeric(Pp),
             S = as.numeric(S),
             T = as.numeric(T),
             U = as.numeric(U),
             W = as.numeric(W))
  }
  list(Details, param, ForecastDF)
}
</pre>
</div>
<br />
<br />
The API key can be obtained for free at this link:<br />
<a href="https://www.metoffice.gov.uk/datapoint/api" target="_blank">https://www.metoffice.gov.uk/datapoint/api</a><br />
<br />
Once we have an API key, we can simply insert the station ID and the type of product for which we want the forecast. We can select between two products: daily and 3hourly.<br />
<br />
To obtain the station ID we need to run another query and download an XML file with all station names and IDs:<br />
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">library(xml2)
url = paste0("http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/daily/sitelist?key=",key)
XML_StationList <- read_xml(url)
write_xml(XML_StationList, "StationList.xml")
</pre>
</div>
<br />
<br />
This will save an XML file, which we can then open with a text editor (e.g. Notepad++) to look up the ID of the station we need.
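Alternatively, the ID can be pulled out with a base-R regular expression; the snippet below works on a one-line excerpt, since the exact attribute layout of the real sitelist may differ:

```r
# A single <Location> element, roughly as it appears in StationList.xml
xml_line <- '<Location elevation="25.0" id="3772" latitude="51.479" longitude="-0.449" name="Heathrow"/>'

m  <- regmatches(xml_line, regexpr('id="[0-9]+"', xml_line))  # id="3772"
id <- gsub('[^0-9]', '', m)                                   # "3772"
```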
<br />
<br />
The function can be used as follows:<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">METDataDownload(stationID=3081, product="daily", key)
</pre>
</div>
<br />
It will return a list with 3 elements:
<br />
<br />
<ol>
<li>Station info: ID, Name, Country, Lon, Lat, Elevation</li>
<li>Parameter explanation</li>
<li>Weather forecast: tibble format</li>
</ol>
<div>
I have not tested it much, so if you find any bug you are welcome to tweak it on GitHub:</div>
<div>
<a href="https://github.com/fveronesi/METOfficeForecast" target="_blank">https://github.com/fveronesi/METOfficeForecast</a></div>
Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com0tag:blogger.com,1999:blog-1442302563171663500.post-17977523744284896512019-02-16T09:51:00.000+01:002019-02-16T09:51:06.059+01:00Geocoding functionThis is a very simple function to perform geocoding using the Google Maps API:
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">getGeoCode <- function(gcStr, key) {
  library("RJSONIO") #Load Library
  gcStr <- gsub(' ', '%20', gcStr) #Encode URL Parameters
  #Open Connection
  connectStr <- paste0('https://maps.googleapis.com/maps/api/geocode/json?address=', gcStr, "&key=", key)
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  #Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"] == "OK") {
    lat <- data.json["results.geometry.location.lat"]
    lng <- data.json["results.geometry.location.lng"]
    gcodes <- c(lat, lng)
    names(gcodes) <- c("Lat", "Lng")
    return(gcodes)
  }
}
</pre>
</div>
<br />
Essentially, users need to get an API key from Google and then pass it as an input (string) to the function. The function itself is very simple, and it is an adaptation of some code I found online (unfortunately I did not write down where I found the original version, so I have no way to reference the source, sorry!!).<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">geoCodes <- getGeoCode(gcStr="11 via del piano, empoli", key)
</pre>
</div>
<br />
To use the function we simply need to provide an address, and it will return its coordinates in WGS84.<br />
It can be used in a mutate call within dplyr, and it is reasonably fast.<br />
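A sketch of such a mutate call is below; a stub stands in for getGeoCode so no API key is needed, and rowwise is used because the function is not vectorised:

```r
library(dplyr)

# Stub with the same return shape as getGeoCode (named Lat/Lng vector)
fakeGeoCode <- function(addr) c(Lat = 43.72, Lng = 10.94)

addresses <- tibble(Address = c("first address", "second address"))

geocoded <- addresses %>%
  rowwise() %>%
  mutate(Lat = fakeGeoCode(Address)["Lat"],
         Lng = fakeGeoCode(Address)["Lng"]) %>%
  ungroup()
```

With the real function, fakeGeoCode is simply replaced by getGeoCode(Address, key).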
<br />
The repository is here:<br />
<a href="https://github.com/fveronesi/RGeocode.r" target="_blank">https://github.com/fveronesi/RGeocode.r</a><br />
<br />
<h3>Spreadsheet Data Manipulation in R (2018-06-15)</h3>
Today I decided to create a new repository on GitHub where I am sharing code to do spreadsheet data manipulation in R.<br />
<br />
The first version of the repository and R script is available here: <a href="https://github.com/fveronesi/SpreadsheetManipulation_inR" target="_blank">SpreadsheetManipulation_inR</a><br />
<br />
As an example I am using a csv freely available from the IRS, the US Internal Revenue Service.<br />
<a href="https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi" target="_blank">https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi</a><br />
<br />
This spreadsheet has around 170'000 rows and 131 columns.<br />
<br />
Please feel free to request new functions to be added or add functions and code yourself directly on GitHub.<br />
<br />
<br />
<h3>Data Visualization Website with Shiny (2018-03-25)</h3>
My second Shiny app is dedicated to data visualization.<br />
Here users can simply upload any csv or txt file and create several plots:<br />
<br />
<ul>
<li style="box-sizing: border-box;">Histograms (with option for faceting)</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Barcharts (with error bars, and options for color with dodging and faceting)</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Boxplots (with option for faceting)</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Scatterplots (with options for color, size and faceting)</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Time series</li>
</ul>
<br />
<div>
<br /></div>
<div>
Error bars in barcharts are computed with the mean_se function in ggplot2, which computes error bars as mean ± standard error. When the color option is set, the bars for each color are plotted side by side (dodging).</div>
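To make the error-bar definition concrete, this is mean ± one standard error computed by hand (ggplot2's mean_se returns the same three values as a data frame):

```r
x  <- c(2, 4, 6, 8)
se <- sd(x) / sqrt(length(x))  # standard error of the mean

c(ymin = mean(x) - se, y = mean(x), ymax = mean(x) + se)
```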
<div>
<br /></div>
<div>
For scatterplots, if the option for faceting is provided, each plot will include a linear regression line.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Some examples are below:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEia-aYHZbm3riYEiwRf7HlYhAp1bXddYdsVuBWwZocwHWCgmyxVmorWF9Fg5JS2O7ylBqAS-BNM19yEtofq3f03VjZeV-F8rgSPS_807DdYDhHSOdm5_yV48vblkUt7ZufMn6YKAsCjaDZj/s1600/Untitled-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="603" data-original-width="1351" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEia-aYHZbm3riYEiwRf7HlYhAp1bXddYdsVuBWwZocwHWCgmyxVmorWF9Fg5JS2O7ylBqAS-BNM19yEtofq3f03VjZeV-F8rgSPS_807DdYDhHSOdm5_yV48vblkUt7ZufMn6YKAsCjaDZj/s320/Untitled-1.jpg" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiISVfmR9daHCsVJf-CNJ5G21eKZfnaMQBm3AJQL8buRDsro78yedqZr8XMJT67ONWuMV6NK3TpJKxBez0xsF0yQzRBsn58rL5nUj10BDdJVVukWtD29-okE2BPE6zdOiJ5r-NTK1h0lKNP/s1600/Untitled-2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="606" data-original-width="1336" height="145" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiISVfmR9daHCsVJf-CNJ5G21eKZfnaMQBm3AJQL8buRDsro78yedqZr8XMJT67ONWuMV6NK3TpJKxBez0xsF0yQzRBsn58rL5nUj10BDdJVVukWtD29-okE2BPE6zdOiJ5r-NTK1h0lKNP/s320/Untitled-2.jpg" width="320" /></a></div>
<div>
<br /></div>
<div>
For the time being there is no option for saving plots, apart from saving the image from the screen. However, I would like to implement an option to export plots as TIFF at 300 dpi, but none of the code I have tried so far worked. I will keep trying.</div>
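For reference, outside of Shiny the following usually produces a 300 dpi TIFF with ggplot2; I cannot confirm it solves the problem inside the app itself:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# width/height in inches; compression is passed through to the tiff device
ggsave("plot.tiff", plot = p, device = "tiff", dpi = 300,
       width = 8, height = 6, units = "in", compression = "lzw")
```

Inside a Shiny app the same call would typically sit in the content function of a downloadHandler.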
<div>
<br /></div>
<div>
The app can be accessed here: <a href="https://fveronesi.shinyapps.io/DataViz/" target="_blank">https://fveronesi.shinyapps.io/DataViz/</a></div>
<div>
<br /></div>
<div>
The R code is available here: <a href="https://github.com/fveronesi/Shiny_DataViz" target="_blank">https://github.com/fveronesi/Shiny_DataViz</a></div>
<div>
<br /></div>
<h3>Street Crime UK - Shiny App (2018-03-11)</h3>
<h3 style="border-bottom: 1px solid rgb(234, 236, 239); box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; line-height: 1.25; margin-bottom: 16px; margin-top: 24px; padding-bottom: 0.3em;">
Introduction</h3>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
This is a Shiny app to visualize heat maps of street crimes across Britain from 2010-12 to 2018-01 and test their spatial patterns.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
The code for both <i><u>ui.R</u></i> and <i><u>server.R</u></i> is available from my <b>GitHub </b>at: <a href="https://github.com/fveronesi/StreetCrimeUK_Shiny" target="_blank">https://github.com/fveronesi/StreetCrimeUK_Shiny</a></div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
<br /></div>
<h3 style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Usage</h3>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Please be aware that this app downloads data from my personal Dropbox when it starts and every time the user changes some of the settings. This was the only workaround I could think of to use external data on shinyapps.io for free. However, <b>this also makes the app a bit slow, so please be patient.</b></div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Users can select a date with two sliders (I personally do not like the <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">dateInput</code> tool), then select a crime type and click <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">Draw Map</code> to update the map with new data. I also included an option to plot the Ripley K-function (function <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">Kest</code> in package <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">spatstat</code>) and the <em style="box-sizing: border-box;">p-value</em> of the <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">quadrat.test</code> (again from <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">spatstat</code>). Both tools work on the data shown within the screen area, so their results change as users interact with the map. 
The Ripley K function plot shows a red dashed line with the expected value of the K function for points that are randomly distributed in space (i.e. that follow a Poisson distribution). The black line is the one computed from the points shown on screen. If the black line is above the red one, the observations shown on the map are clustered; if it is below the red line, the crimes are regularly spaced. A more complete overview of the Ripley K function is available at this <a href="http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-multi-distance-spatial-cluster-analysis-ripl.htm" rel="nofollow" style="box-sizing: border-box; color: #0366d6; text-decoration-line: none;">link from ESRI</a>.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
The <em style="box-sizing: border-box;">p-value</em> from the <a href="https://www.rdocumentation.org/packages/spatstat/versions/1.55-0/topics/quadrat.test" rel="nofollow" style="box-sizing: border-box; color: #0366d6; text-decoration-line: none;">quadrat test</a> tests the null hypothesis that the crimes are scattered randomly in space, against the alternative that they are clustered. If the <em style="box-sizing: border-box;">p-value</em> is below 0.05 (significance level of 5%), we can reject the null hypothesis in favour of the alternative that our data are clustered. Please be aware that this test does not account for regularly spaced crimes.</div>
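Both tools can be tried outside the app on one of spatstat's built-in point patterns; redwood is a classic clustered example, so its K function rises above the theoretical Poisson curve and the quadrat test returns a small p-value:

```r
library(spatstat)

data(redwood)        # clustered point pattern shipped with spatstat
K  <- Kest(redwood)  # plot(K) draws the observed vs. theoretical curves
pv <- quadrat.test(redwood)$p.value
pv                   # small for this clustered pattern
```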
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
<br /></div>
<h3 style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
<i>NOTE</i></h3>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Please note that the code here is not reproducible straight away. The app communicates with my Dropbox through the package <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">rdrop2</code>, which requires a token to download data from Dropbox. More info at <a href="https://github.com/karthik/rdrop2" style="box-sizing: border-box; color: #0366d6; text-decoration-line: none;">github.com/karthik/rdrop2</a>.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
I am sharing the code so that it can potentially be used with a token obtained elsewhere, but the <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">url</code> that points to my Dropbox will clearly not be shared.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
<br /></div>
<h3 style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Preparing the dataset</h3>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Csv files with crime data can be downloaded directly from the <a href="https://data.police.uk/data/archive/" rel="nofollow" style="box-sizing: border-box; color: #0366d6; text-decoration-line: none;">data.police.uk</a> website. Please check the dates carefully, since each of these files contains more than one year of monthly data. The main issue with these data is that they are divided by local police force, so for example we will have a csv for each month from the Bedfordshire Police, which only covers that part of the country. Moreover, these csv files contain a lot of data, not only coordinates: they also contain the type of crime, plus other details we do not need, which makes the full collection a couple of GB in size.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
For these reasons I did some pre-processing. First of all I extracted all csv files into a folder named "CrimeUK" and then I ran the code below:</div>
<pre lang="{r}" style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: 1.45; margin-bottom: 16px; overflow: auto; padding: 16px; word-wrap: normal;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: inherit; margin: 0px; overflow: visible; padding: 0px; word-break: normal; word-wrap: normal;">lista = list.files("E:/CrimesUK", pattern = "street", recursive = T, include.dirs = T, full.names = T, ignore.case = T)
for(i in lista){
  DF = read.csv(i)
  write.table(data.frame(LAT = DF$Latitude, LON = DF$Longitude, TYPE = DF$Crime.type),
              file = paste0("E:/CrimesUK/CrimesUK", substr(paste(DF$Month[1]), 1, 4), "_", substr(paste(DF$Month[1]), 6, 7), ".csv"),
              sep = ",", row.names = F, col.names = F, append = T)
  print(i)
}
</code></pre>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Here I first create a list of all csv files, with full paths, searching inside all sub-directories. Then I start a <code style="background-color: rgba(27, 31, 35, 0.05); border-radius: 3px; box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; margin: 0px; padding: 0.2em 0.4em;">for</code> loop to iterate through the files. The loop simply loads each file and then saves part of its contents (namely coordinates and crime type) into a new csv named after the year and month. This helps me identify which files to download from Dropbox, based on user inputs.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
Once I had these files I simply uploaded them to my Dropbox.</div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
<br /></div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
The link to test the app is:</div>
<h3>
<a href="https://fveronesi.shinyapps.io/CrimeUK/" target="_blank">fveronesi.shinyapps.io/CrimeUK/</a></h3>
<br />
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
A snapshot of the screen is below:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://github.com/fveronesi/StreetCrimeUK_Shiny/raw/master/Screenshot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="800" height="358" src="https://github.com/fveronesi/StreetCrimeUK_Shiny/raw/master/Screenshot.jpg" width="640" /></a></div>
<div style="box-sizing: border-box; color: #24292e; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px;">
<br /></div>
<h3>Experiment designs for Agriculture (2017-07-25)</h3>
<div dir="ltr">
This post is more for personal use than anything else. It is just a collection of code and functions to produce some of the most used experimental designs in agriculture and animal science. </div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
I will not go into details about these designs. If you want to know more about what to use in which situation you can find material at the following links:</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Design of Experiments (Penn State): <a href="https://onlinecourses.science.psu.edu/stat503/node/5" target="_blank">https://onlinecourses.science.psu.edu/stat503/node/5</a><br />
<br />
Statistical Methods for Bioscience (Wisconsin-Madison): <a href="http://www.stat.wisc.edu/courses/st572-larget/Spring2007/" target="_blank">http://www.stat.wisc.edu/courses/st572-larget/Spring2007/</a></div>
<div dir="ltr">
<br />
R Packages to create several designs are presented here: <a href="https://cran.r-project.org/web/views/ExperimentalDesign.html" target="_blank">https://cran.r-project.org/web/views/ExperimentalDesign.html</a></div>
<div dir="ltr">
<br />
A very good tutorial about the use of the package Agricolae can be found here:<br />
<a href="https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf" target="_blank">https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf</a><br />
<br />
<br />
<h4>
Completely Randomized Design</h4>
</div>
<div>
This is probably the most common design, and it is generally used when conditions are uniform, so we do not need to account for variations due for example to soil conditions. </div>
<div>
In R we can create a simple CRD with the function expand.grid and then with some randomization:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))
Data.CRD = TR.Structure[sample(1:nrow(TR.Structure),nrow(TR.Structure)),]
Data.CRD = cbind(PlotN=1:nrow(Data.CRD), Data.CRD[,-1])
write.csv(Data.CRD, "CompleteRandomDesign.csv", row.names=F)
</code></pre>
<div>
<br /></div>
<div>
The first line creates a basic treatment structure, in which rep identifies the replicate number; it looks like this:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > TR.Structure
rep Treatment1 Treatment2
1 1 A A
2 2 A A
3 3 A A
4 1 B A
5 2 B A
6 3 B A
7 1 A B
8 2 A B
9 3 A B
10 1 B B
11 2 B B
12 3 B B
13 1 A C
14 2 A C
15 3 A C
16 1 B C
17 2 B C
18 3 B C
</code></pre>
<br />
The second line randomizes the rows of the data.frame to obtain a CRD; then with cbind we add a plot ID column at the beginning, while also dropping the rep column.<br />
<br />
<br />
<h3>
Add Control</h3>
</div>
<div>
To add a Control we need to write two separate lines, one for the treatment structure and the other for the control:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))
CR.Structure = expand.grid(rep=1:3, Treatment1=c("Control"), Treatment2=c("Control"))
Data.CCRD = rbind(TR.Structure, CR.Structure)
</code></pre>
<div>
<br /></div>
<div>
This will generate the following table:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > Data.CCRD
rep Treatment1 Treatment2
1 1 A A
2 2 A A
3 3 A A
4 1 B A
5 2 B A
6 3 B A
7 1 A B
8 2 A B
9 3 A B
10 1 B B
11 2 B B
12 3 B B
13 1 A C
14 2 A C
15 3 A C
16 1 B C
17 2 B C
18 3 B C
19 1 Control Control
20 2 Control Control
21 3 Control Control
</code></pre>
<br />
As you can see the control is kept completely separate from the factorial treatment structure. Now we just need to randomize, again using the function sample:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> Data.CCRD = Data.CCRD[sample(1:nrow(Data.CCRD),nrow(Data.CCRD)),]
Data.CCRD = cbind(PlotN=1:nrow(Data.CCRD), Data.CCRD[,-1])
write.csv(Data.CCRD, "CompleteRandomDesign_Control.csv", row.names=F)
</code></pre>
<br />
<br />
<h3>
Block Design with Control</h3>
</div>
<div>
The starting point is the same as before. The difference comes when we need to randomize: in a CRD we randomize over the entire table, while with blocks we need to do it block by block.</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))
CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))
Data.CBD = rbind(TR.Structure, CR.Structure)
Block1 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Block2 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Block3 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Data.CBD = rbind(Block1, Block2, Block3)
BlockID = rep(1:3, each=nrow(Block1))
Data.CBD = cbind(Block = BlockID, Data.CBD)
write.csv(Data.CBD, "BlockDesign_Control.csv", row.names=F)
</code></pre>
<div>
<br /></div>
<div>
As you can see from the code above, we've created three objects, one for each block, where we used the function sample to randomize.<br />
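The three near-identical Block lines can also be generated in a loop. This is just an alternative sketch of the same block-by-block randomization, not part of the original code:

```r
# Block design with control: same randomization as above, but the three
# blocks are built with lapply instead of three copy-pasted lines
TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))
CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))
Base = rbind(TR.Structure, CR.Structure)

Blocks = lapply(1:3, function(b){
  cbind(Block=b, Base[sample(1:nrow(Base), nrow(Base)),])
})
Data.CBD = do.call(rbind, Blocks)
nrow(Data.CBD)            # 21 plots: 7 entries x 3 blocks
table(Data.CBD$Block)     # 7 plots in each block
```

The advantage is that changing the number of blocks now means changing a single number in lapply.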
<br />
<br />
<h3>
Other Designs with Agricolae</h3>
</div>
<div>
The package agricolae includes many designs, which I am sure will cover all your needs in terms of setting up field and lab experiments.</div>
<div>
We will look at some of them, so first let's install the package:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("agricolae")
library(agricolae)
</code></pre>
<div>
<br /></div>
<div>
The main syntax for design in agricolae is the following:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> Trt1 = c("A","B","C")
design.crd(trt=Trt1, r=3)
</code></pre>
<br />
The result is the output below:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > design.crd(trt=Trt1, r=3)
$parameters
$parameters$design
[1] "crd"
$parameters$trt
[1] "A" "B" "C"
$parameters$r
[1] 3 3 3
$parameters$serie
[1] 2
$parameters$seed
[1] 1572684797
$parameters$kinds
[1] "Super-Duper"
$parameters[[7]]
[1] TRUE
$book
plots r Trt1
1 101 1 A
2 102 1 B
3 103 2 B
4 104 2 A
5 105 1 C
6 106 3 A
7 107 2 C
8 108 3 C
9 109 3 B
</code></pre>
<br />
As you can see the function takes only one argument for treatments and another for replicates. Therefore, if we need to include a more complex treatment structure we first need to build it ourselves:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> Trt1 = c("A","B","C")
Trt2 = c("1","2")
Trt3 = c("+","-")
TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))
TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))
TRT.Control = c(TRT, rep("Control", 3))
</code></pre>
<br />
As you can see we now have three treatments, which are merged into unique strings using the function sapply:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > TRT
[1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"
</code></pre>
<br />
Once the control is included, we can pass the object TRT.Control to the function design.crd, from which we can directly obtain the data.frame with $book:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > design.crd(trt=TRT.Control, r=3)$book
plots r TRT.Control
1 101 1 A2+
2 102 1 B1+
3 103 1 Control
4 104 1 B2+
5 105 1 A1+
6 106 1 C2+
7 107 2 A2+
8 108 1 C2-
9 109 2 Control
10 110 1 B2-
11 111 3 Control
12 112 1 Control
13 113 2 C2-
14 114 2 Control
15 115 1 C1+
16 116 2 C1+
17 117 2 B2-
18 118 1 C1-
19 119 2 C2+
20 120 3 C2-
21 121 1 A2-
22 122 2 C1-
23 123 2 A1+
24 124 3 C1+
25 125 1 B1-
26 126 3 Control
27 127 3 A1+
28 128 2 B1+
29 129 2 B2+
30 130 3 B2+
31 131 1 A1-
32 132 2 B1-
33 133 2 A2-
34 134 1 Control
35 135 3 C2+
36 136 2 Control
37 137 2 A1-
38 138 3 B1+
39 139 3 Control
40 140 3 A2-
41 141 3 A1-
42 142 3 A2+
43 143 3 B2-
44 144 3 C1-
45 145 3 B1-
</code></pre>
<br />
A note about this design: since we repeated the string "Control" 3 times when creating the treatment structure, and each treatment is then replicated 3 times by design.crd, the design ends up with nine control plots (you can count them in the output above). If this is what you want, fine; otherwise you need to change from:<br />
<br />
<pre style="background: rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 646.469px;"><code style="word-wrap: normal;">TRT.Control = c(TRT, rep("Control", 3)) </code></pre>
<br />
to:<br />
<br />
<pre style="background: rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 646.469px;"><code style="word-wrap: normal;">TRT.Control = c(TRT, "Control") </code></pre>
<br />
This will create a design with 39 rows and only 3 control plots.<br />
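We can double-check the size of this design by rebuilding it from scratch and counting the rows (a quick sketch, assuming the agricolae package is installed):

```r
library(agricolae)

# Rebuild the treatment structure so the check is self-contained
Trt1 = c("A","B","C")
Trt2 = c("1","2")
Trt3 = c("+","-")
TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))
TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))
TRT.Control = c(TRT, "Control")   # 13 labels, Control included only once

CRD = design.crd(trt=TRT.Control, r=3)$book
nrow(CRD)                          # 39 plots in total
sum(CRD$TRT.Control=="Control")    # 3 control plots
```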
<br />
<br />
Other possible designs are:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> #Random Block Design
design.rcbd(trt=TRT.Control, r=3)$book
#Incomplete Block Design
design.bib(trt=TRT.Control, r=7, k=3)
#Split-Plot Design
design.split(Trt1, Trt2, r=3, design=c("crd"))
#Latin Square
design.lsd(trt=TRT.tmp)$sketch
</code></pre>
<br />
Others not included above are: alpha designs, cyclic designs, augmented block designs, Graeco-Latin square designs, lattice designs, strip-plot designs and incomplete Latin square designs.<br />
<br />
<br />
<h3>
Update 26/07/2017 - Plotting your Design</h3>
<div>
Today I received an email from <a href="https://github.com/kwstat/desplot" target="_blank">Kevin Wright</a>, creator of the package <a href="https://rawgit.com/kwstat/desplot/master/vignettes/desplot_examples.html" target="_blank">desplot</a>.</div>
<div>
This is a very cool package that allows you to plot your design with colors and text, making it quite informative for the reader. At the link above you will find several examples of how to plot designs for existing datasets. In this paragraph I would like to focus on how to create such plots when we are designing our experiments.</div>
<div>
Let's look at some code:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("desplot")
library(desplot)
</code></pre>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">#Complete Randomized Design
CRD = design.crd(trt=TRT.Control, r=3)$book
CRD = CRD[order(CRD$r),]
CRD$col = CRD$r
CRD$row = rep(1:13,3)
</code></pre>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">desplot(form=TRT.Control ~ col+row, data=CRD, text=TRT.Control, out1=col, out2=row,
cex=1, main="Complete Randomized Design")
</code></pre>
<div>
<br />
After installing the package desplot I created an example for plotting the complete randomized design we created above.<br /></div>
<br /></div>
To use the function desplot we first need to include columns and rows in the design, so that the function knows what to plot and where. For this I first ordered the data.frame by the column r, which stands for replicate. Then I added a column named col, with values equal to r (I could have used the column r directly, but I wanted to make the procedure clear), and another named row, in which I repeated a vector from 1 to 13 (the number of treatments per replicate) 3 times (i.e. the number of replicates).<br />
<br />
The function desplot returns the following plot, which I think is very informative:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVAYk7QVFogEj_3GMaRT_aSO4VRvVHC66vVae9Q3ahePIgdn6e90Pskm-tuv-T0ladMdLsZ4WbMYgwMZMJjZ-qB82ubALhyphenhyphen7ktanoAdlWQp35CAMXy-FuG1LRLycEcKZXpuVfUJB-GNLwo/s1600/CRD.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVAYk7QVFogEj_3GMaRT_aSO4VRvVHC66vVae9Q3ahePIgdn6e90Pskm-tuv-T0ladMdLsZ4WbMYgwMZMJjZ-qB82ubALhyphenhyphen7ktanoAdlWQp35CAMXy-FuG1LRLycEcKZXpuVfUJB-GNLwo/s400/CRD.jpeg" width="400" /></a></div>
<br />
We could do the same with the random block design:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> #Random Block Design
RBD = design.rcbd(trt=TRT.Control, r=6)$book
RBD = RBD[order(RBD$block),]
RBD$col = RBD$block
RBD$row = rep(1:13,6)
desplot(form=block~row+col, data=RBD, text=TRT.Control, col=TRT.Control, out1=block, out2=row, cex=1, main="Randomized Block Design")
</code></pre>
<br />
thus obtaining the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnumKFeyQ6z8kr_Ka13bWw_5XKUun1gUHI1Ye-gf1wG4n2nH-ko0Lt1Pr7yI0MJLP9QRdz2f2-UyEomXugxfoet0OKTjZ3auInq5QLJOfDsuf9PXT2nT2PklAEfWP7121hyGVc_RZr1HUt/s1600/RBD.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="865" data-original-width="971" height="569" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnumKFeyQ6z8kr_Ka13bWw_5XKUun1gUHI1Ye-gf1wG4n2nH-ko0Lt1Pr7yI0MJLP9QRdz2f2-UyEomXugxfoet0OKTjZ3auInq5QLJOfDsuf9PXT2nT2PklAEfWP7121hyGVc_RZr1HUt/s640/RBD.jpeg" width="640" /></a></div>
<br />
<br />
<h3>
Final Note</h3>
</div>
<div>
For repeated measures and crossover designs I think we can create designs simply by again using the function expand.grid and including time and subjects, as I did in my previous post about <a href="http://r-video-tutorial.blogspot.co.uk/2017/07/power-analysis-and-sample-size.html" target="_blank">Power Analysis</a>. However, there is also the package <a href="https://cran.r-project.org/web/packages/Crossover/index.html" target="_blank">Crossover</a> that deals specifically with crossover designs, and the following page lists more packages that deal with clinical designs: <a href="https://cran.r-project.org/web/views/ClinicalTrials.html" target="_blank">https://cran.r-project.org/web/views/ClinicalTrials.html</a></div>
Power analysis and sample size calculation for Agriculture (Fabio Veronesi, published 2017-07-21, updated 2018-03-08)<div class="separator" style="clear: both; text-align: center;">
</div>
Power analysis is extremely important in statistics, since it allows us to calculate the probability that our experiment will produce realistic results. Researchers sometimes underestimate this aspect and are only interested in obtaining significant p-values. The problem with this is that a significance level of 0.05 does not necessarily mean that what you are observing is real.<br />
In the book "Statistics Done Wrong" by Alex Reinhart (which you can read for free here: <a href="https://www.statisticsdonewrong.com/" target="_blank">https://www.statisticsdonewrong.com/</a>) this problem is discussed with an example where we can clearly see that a significance of 0.05 does not mean we have a 5% chance of getting it wrong; in fact, we can have closer to a 30% chance of obtaining unrealistic results. This is because there are two types of error we can incur (for example when performing an ANOVA): type I (rejecting a null hypothesis when it is actually true) and type II (accepting a null hypothesis when it is actually false).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://datasciencedojo.com/wp-content/uploads/type1and2error.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="322" data-original-width="482" height="266" src="https://datasciencedojo.com/wp-content/uploads/type1and2error.gif" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image taken from: https://datasciencedojo.com/</td></tr>
</tbody></table>
<br />
The probability of incurring a type I error is indicated by α (the significance level) and usually takes a value of 5%; this means we accept a scenario where we have a 5% chance of rejecting the null hypothesis when it is actually true. If we are not happy with this, we can further reduce this probability by decreasing α (for example to 1%). On the contrary, the probability of incurring a type II error is expressed by β, which usually takes a value of 20% (meaning a power of 80%). This means we are happy to work assuming a 20% chance of accepting the null hypothesis when it is actually false.<br />
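To get a feel for how α and power drive sample size in the simplest case, we can use base R's power.t.test for a two-group comparison (a minimal sketch; the assumed difference of half a standard deviation is just an illustrative value):

```r
# Two-sample t-test: subjects per group needed to detect a difference of
# 0.5 standard deviations with alpha = 0.05 and power = 0.8 (i.e. beta = 0.2)
res = power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
ceiling(res$n)  # subjects per group, rounded up -> 64
```

Lowering α or raising the required power both increase the number of subjects needed, which is exactly the trade-off described above.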
If our experiment is not designed properly we cannot be sure whether we actually incurred one of these two errors. In other words, if we run a bad experiment and obtain a non-significant p-value, it may be that we incurred a type II error, meaning that in reality our treatment works but its effect cannot be detected by our experiment. However, it may also be that we obtained a significant p-value but incurred a type I error, and if we repeat the experiment we will find different results.<br />
The only way we can be sure we are running a good experiment is by running a power analysis. By definition, power is the probability of obtaining statistical significance (not necessarily a small p-value, but at least a realistic outcome). Power analysis can be used before an experiment, to test whether our design has a good chance of succeeding (<i>a priori</i>), or after, to test whether the results we obtained were realistic.<br />
<br />
<br />
<h3>
Update 17/11/2017<br />How many subjects to compute a robust mean?</h3>
This is a question I sometimes get when talking with students who are planning descriptive experiments. How many subjects do I need for the mean value I compute to be robust?<br />
<br />
The answer is provided by Berkowitz here: <a href="http://www.columbia.edu/~mvp19/RMC/M6/M6.doc" target="_blank">www.columbia.edu/~mvp19/RMC/M6/M6.doc</a><br />
<br />
The simplified formula to compute the minimum number of samples is:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfzO6NKoCEGsjRv9nQruPJsr0-pW1tr5MVfwNLvEAfP8t10R-TmN5CkYmYOlf2l_QTt39oAupZxRtUE-5SS6tfzmqlqaB4pE3PneJVSIL8vLxA_4S-oNEavSQHI6VT4ywkAIqrxbYq2fyY/s1600/Fig.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="196" data-original-width="1600" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfzO6NKoCEGsjRv9nQruPJsr0-pW1tr5MVfwNLvEAfP8t10R-TmN5CkYmYOlf2l_QTt39oAupZxRtUE-5SS6tfzmqlqaB4pE3PneJVSIL8vLxA_4S-oNEavSQHI6VT4ywkAIqrxbYq2fyY/s640/Fig.png" width="640" /></a></div>
<span style="font-family: "calibri" , sans-serif; font-size: 11.0pt; line-height: 107%;"><br /></span>
<span style="font-family: "calibri" , sans-serif; font-size: 11.0pt; line-height: 107%;">where SD is the standard deviation and SE is the standard error. These values can be obtained from previous experiments or from the literature.</span><br />
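In R this is a one-liner; the SD and SE values below are made-up numbers for illustration:

```r
# Minimum number of samples so that the standard error of the mean does not
# exceed SE, given a standard deviation SD: since SE = SD/sqrt(n), n = (SD/SE)^2
SD = 10  # assumed value, e.g. from a previous experiment or the literature
SE = 2   # the precision we want for our mean
n.min = ceiling((SD/SE)^2)
n.min  # 25
```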
<br />
<h3>
Effect Size</h3>
<div>
A simple and effective definition of effect size is provided in the book "Power Analysis for Experimental Research" by Bausell & Li. They say: </div>
<div>
"effect size is nothing more than a standardized measure of the size of the mean difference(s) among the study’s groups or of the strength of the relationship(s) among its variables".</div>
<div>
<br /></div>
<div>
Despite its simple definition the calculation of the effect size is not always straightforward, and many indexes have been proposed over the years. Bausell & Li propose the following definition, in line with what was proposed by Cohen in his "Statistical Power Analysis for the Behavioral Sciences":</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1cffDgO-YVQROmP-LR9SmZKxuIgBT9JImQy4wAr_uVkmp_ewN1lchvTRiMUBMnWxZs_ZUpvso3b-Yb4QJHMYUdYe-qmJbV-v1oOGJx05CKaNpoR-4-O3A5HAVoBD47TPAk2m8PTuDB-Gi/s1600/Eq_ES.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="199" data-original-width="1600" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1cffDgO-YVQROmP-LR9SmZKxuIgBT9JImQy4wAr_uVkmp_ewN1lchvTRiMUBMnWxZs_ZUpvso3b-Yb4QJHMYUdYe-qmJbV-v1oOGJx05CKaNpoR-4-O3A5HAVoBD47TPAk2m8PTuDB-Gi/s640/Eq_ES.png" width="640" /></a></div>
where ES is the effect size (in Cohen this is referred to as d). In this equation, Ya is the mean of the measures for treatment A, and Yb is the mean for treatment B. The denominator is the pooled standard deviation, which is computed as follows:</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja1ZSUlgI7TEvOD9eiGLfIm9XDJI2sZbF6YC8o9PGA4JuCUdI0faTpU9LLlaLtVZLyokGp3is8VTuE6mFkON3XyCQVQ2gKriQk9XnALhQIGwtHpZVm3eTLtlRQSWznlOAbJT-9ifvERw75/s1600/Sd_Pooled.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="1600" height="82" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja1ZSUlgI7TEvOD9eiGLfIm9XDJI2sZbF6YC8o9PGA4JuCUdI0faTpU9LLlaLtVZLyokGp3is8VTuE6mFkON3XyCQVQ2gKriQk9XnALhQIGwtHpZVm3eTLtlRQSWznlOAbJT-9ifvERw75/s640/Sd_Pooled.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
where SDa and SDb are the standard deviations, and na and nb the numbers of samples, for treatments A and B.<br />
<br />
This is the main definition, but each piece of software or function tends to use related, though not identical, indexes. We will look at how to calculate the effect size case by case.<br />
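The two equations above can be wrapped into a small helper function. This is just a sketch, and the vectors a and b below are made-up samples:

```r
# Cohen's d: difference in means divided by the pooled standard deviation
cohen.d = function(a, b){
  pooled.sd = sqrt((((length(a)-1)*sd(a)^2) + ((length(b)-1)*sd(b)^2)) /
                   (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled.sd
}

# Hypothetical example: two groups whose means differ by one pooled SD
a = c(10, 12, 14)
b = c(8, 10, 12)
cohen.d(a, b)  # 1
```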
<br />
<br />
<h3>
One-Way ANOVA </h3>
</div>
<h4>
Sample size</h4>
<div>
For simple models the power calculation can be performed with the package pwr:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(pwr)
</code></pre>
<br /></div>
<div>
In the previous post (<a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models</a>) we worked on a dataset where we tested the impact on yield of 6 levels of nitrogen. Let's assume that we need to run a similar experiment and we would like to know how many samples we should collect (or how many plants we should use in the glass house) for each level of nitrogen. To calculate this we need to do a power analysis.<br />
<br />
To compute the sample size required to reach good power we can run the following line of code:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pwr.anova.test(k=6, f=0.25, sig.level=0.05, power=0.8)
</code></pre>
<br />
Let's describe the options starting from the end. The option power specifies the power we require for our experiment; in general, this can be set to 0.8, as mentioned above. The option sig.level is the significance level α, and usually we are happy to accept 5%. Another option is k, the number of groups in our experiment; in this case we have 6 groups.<br />
Finally we have the option f, which is the effect size. As I mentioned above, there are many indexes to express the effect size and f is one of them.<br />
According to Cohen, f can be expressed as:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQB64KKOJjmYAMesV84jXMIMfPSsdjy61j5dg8sMLAvu8VyELF0xTxD4dv27DcjTdeIthcJUfUDhSg69SUy0izt768Kgsye5oJa67ZcL8FfkPv-F_EqBmJCwHTekQQxTphC2PCdzEuPpOG/s1600/Eq1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="166" data-original-width="1600" height="65" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQB64KKOJjmYAMesV84jXMIMfPSsdjy61j5dg8sMLAvu8VyELF0xTxD4dv27DcjTdeIthcJUfUDhSg69SUy0izt768Kgsye5oJa67ZcL8FfkPv-F_EqBmJCwHTekQQxTphC2PCdzEuPpOG/s640/Eq1.png" width="640" /></a></div>
<br />
where the numerator is the standard deviation of the effects that we want to test and the denominator is the common standard deviation. For two means, as in the equation we have seen above, f is simply equal to:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWdIAPYx25vA2ylwH1GZFqfcveL1kgUEJud59suMh60R6347NaCUly0RNFrxyAfi-WU2Ug51Xg5VJ_1DUz2pl5RvHf1hu-obBkloXeei6EXbZ1mAb4NddP-XtIY7CuYJ8oj7ntVhu52Ru0/s1600/F_from_d.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="196" data-original-width="1600" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWdIAPYx25vA2ylwH1GZFqfcveL1kgUEJud59suMh60R6347NaCUly0RNFrxyAfi-WU2Ug51Xg5VJ_1DUz2pl5RvHf1hu-obBkloXeei6EXbZ1mAb4NddP-XtIY7CuYJ8oj7ntVhu52Ru0/s640/F_from_d.png" width="640" /></a></div>
<br />
<br />
Clearly, before running the experiment we do not really know what the effect size will be. In some cases we may have an idea, for example from previous experiments or a pilot study. However, most of the time we do not have a clue. In such cases we can use the classification suggested by Cohen, who considered the following values for f:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXEIiaEYKzTbfGKmioYmdMGoHaBdCwGBAuRhJHijqsOp-QXcaxFRw3nZMlh5qwB0e0954_5-lT6csLLCQHekMFANRTJg0KW6KIhp3WnlLezrwkX_gOrcmz6F8Q7zUi7CvBDXlHU3Z-vGkz/s1600/Table_f.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="362" data-original-width="1600" height="144" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXEIiaEYKzTbfGKmioYmdMGoHaBdCwGBAuRhJHijqsOp-QXcaxFRw3nZMlh5qwB0e0954_5-lT6csLLCQHekMFANRTJg0KW6KIhp3WnlLezrwkX_gOrcmz6F8Q7zUi7CvBDXlHU3Z-vGkz/s640/Table_f.png" width="640" /></a></div>
The general rule is that if we do not know anything about our experiment we should use a medium effect size, so in this case 0.25. This was suggested by Bausell & Li, and it is based on a review of 302 studies in the social and behavioral sciences. For this reason it may well be that the effect size of your experiment is different. However, if you do not have any additional information, this is the only guidance the literature offers.<br />
<br />
The function above returns the following output:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > pwr.anova.test(k=6, f=0.25, sig.level=0.05, power=0.8)
Balanced one-way analysis of variance power calculation
k = 6
n = 35.14095
f = 0.25
sig.level = 0.05
power = 0.8
NOTE: n is number in each group
</code></pre>
<br />
In this example we would need 36 samples for each nitrogen level to achieve a power of 80% with a significance of 5%.<br />
<br />
<br />
<h3>
Power Calculation</h3>
<div>
As I mentioned above, sometimes we have a dataset we collected assuming we could reach good power but we are not actually sure if that is the case. In those instances what we can do is the <i>a posteriori </i>power analysis, where we basically compute the power for a model we already fitted.</div>
<div>
<br /></div>
<div>
As you remember, in the previous post about linear models we fitted the following model:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod1 = aov(yield ~ nf, data=dat)
</code></pre>
<br /></div>
To compute the power we achieved here we first need to calculate the effect size. As discussed above we have several options: d, f and another index called partial eta squared.<br />
Let's start with d, which can be calculated simply from the means and standard deviations of two groups, for example N0 (control) and N5:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> Control = dat[dat$nf=="N0","yield"]
Treatment1 = dat[dat$nf=="N5","yield"]
numerator = (mean(Treatment1)-mean(Control))
denominator = sqrt((((length(Treatment1)-1)*sd(Treatment1)^2)+((length(Control)-1)*sd(Control)^2))/(length(Treatment1)+length(Control)-2))
d = numerator/denominator
</code></pre>
<br />
This code simply computes the numerator (the difference in means) and the denominator (the pooled standard deviation) and then computes Cohen's d; to use it with your own data you just need to change the vectors stored in the objects Control and Treatment1. Here the effect size comes out at 0.38.<br />
<br />
Again Cohen provides reference values for d, so that we can determine how large our effect is; they are presented below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYhgr8OpUnL2eN9kGqz64mNac45cdueIvOHoQ644FPgwItBczp_7UDxIaRe_SobQiElZ_haIDPE_9rWzYJKjZQ3HRSfkvLu37MgQgs2rz9dfJXNSKzAACL77D9PYXiefPmFpxPZmgZHqQG/s1600/Table_d.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="362" data-original-width="1600" height="144" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYhgr8OpUnL2eN9kGqz64mNac45cdueIvOHoQ644FPgwItBczp_7UDxIaRe_SobQiElZ_haIDPE_9rWzYJKjZQ3HRSfkvLu37MgQgs2rz9dfJXNSKzAACL77D9PYXiefPmFpxPZmgZHqQG/s640/Table_d.png" width="640" /></a></div>
From this table we can see that our effect size is actually small, and not medium as we assumed for the <i>a priori</i> analysis. This is important because if we ran the experiment with 36 samples per group we might end up with unreliable results simply due to low power. For this reason, in my opinion we should always be a bit conservative and include some additional replicates or blocks, to account for potential differences between our assumptions and reality.<br />
<br />
The function to compute power is again pwr.anova.test, in which the effect size is expressed as f. There are two ways of obtaining f. The first is to take the d value we just calculated and halve it, so in this case f = 0.38/2 = 0.19. However, this gives the specific effect size for the comparison between N0 and N5, not for the full set of treatments.<br />
<br />
NOTE:<br />
At this link there is an Excel file that you can use to convert between indexes of effect size:<br />
<a href="http://www.stat-help.com/spreadsheets/Converting%20effect%20sizes%202012-06-19.xls" target="_blank">http://www.stat-help.com/spreadsheets/Converting%20effect%20sizes%202012-06-19.xls</a><br />
<br />
<br />
Another way to get a fuller picture is by using the partial Eta Squared, which can be calculated using the sum of squares:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB34QpoAtbk5YvI4L2R3iWTOpeAeDFfxjHgrBJyyd6R7DGzm85Y_hi5-ZTjY6KWYF3QPrI1-ih8UWnmSWiaXDdnqVDt6Zb4shqa5pQ-n4VCHBXwJWkA-sMO8duNXioZrr4WF_k1nVEMZA8/s1600/EtaSquared.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="174" data-original-width="1600" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB34QpoAtbk5YvI4L2R3iWTOpeAeDFfxjHgrBJyyd6R7DGzm85Y_hi5-ZTjY6KWYF3QPrI1-ih8UWnmSWiaXDdnqVDt6Zb4shqa5pQ-n4VCHBXwJWkA-sMO8duNXioZrr4WF_k1nVEMZA8/s640/EtaSquared.png" width="640" /></a></div>
<br />
This gives the average effect size across all the treatments we applied, not just N5 compared to N0.<br />
To compute the partial eta squared we first need to access the ANOVA table, with the function anova:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(mod1)
 Analysis of Variance Table
 
 Response: yield
             Df  Sum Sq Mean Sq F value    Pr(>F)    
 nf           5   23987  4797.4  12.396 6.075e-12 ***
 Residuals 3437 1330110   387.0                      
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
From this table we can extract the sum of squares for the treatment (i.e. nf) and the sum of squares of the residuals and then solve the equation above:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > EtaSQ = 23987/(23987+1330110)
> print(EtaSQ)
[1] 0.01771439
</code></pre>
<br />
As for the other indexes, eta squared also has its own interpretation table:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf3tFA2ihgFYSbg8zpmaY-Id9pn2ophWZtw8VAqUiiZp3C7Yoo1-5Mj6_o_LYZSVpgtmGTwNkcW5LIkehVLvsQuFX3GClgA-YvRnThWWJ4FoVlMYzJQfTkhRPULmTKq292fRAw2h-ImYwl/s1600/Table_eta.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="381" data-original-width="1600" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf3tFA2ihgFYSbg8zpmaY-Id9pn2ophWZtw8VAqUiiZp3C7Yoo1-5Mj6_o_LYZSVpgtmGTwNkcW5LIkehVLvsQuFX3GClgA-YvRnThWWJ4FoVlMYzJQfTkhRPULmTKq292fRAw2h-ImYwl/s640/Table_eta.png" width="640" /></a></div>
The relation between f and eta squared is the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwdocYo55YGC6cPMmKYC-GjsSNK5bnp6VmdpfC0X9V86zNIhR_uPnC8SDo9u7r92nLRh__yVj0SKFL4-wmLRgBbJu70jsIrAD3nU0aGZgyQ_YTbNE4VzOATRx6651CfjPaLxgjPMs1BGpQ/s1600/f_from_eta.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="262" data-original-width="1600" height="104" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwdocYo55YGC6cPMmKYC-GjsSNK5bnp6VmdpfC0X9V86zNIhR_uPnC8SDo9u7r92nLRh__yVj0SKFL4-wmLRgBbJu70jsIrAD3nU0aGZgyQ_YTbNE4VzOATRx6651CfjPaLxgjPMs1BGpQ/s640/f_from_eta.png" width="640" /></a></div>
<br />
so to compute the f related to the full treatment we can simply do the following:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > f = sqrt(EtaSQ / (1-EtaSQ))
> print(f)
[1] 0.1342902
</code></pre>
<br />
So now we have everything we need to calculate the power of our model:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > pwr.anova.test(k=6, n=571, f=f, sig.level=0.05)
Balanced one-way analysis of variance power calculation
k = 6
n = 571
f = 0.1342902
sig.level = 0.05
power = 1
NOTE: n is number in each group
</code></pre>
<br />
To compute the power we run the function pwr.anova.test again, but this time without specifying the option power; instead we supply the option n, the number of samples per group.<br />
As you may remember from the previous post, this was an unbalanced design, so the number of samples per group is not constant. We could use a vector as input for n, with the sample size of each group; in that case the function returns a power value for each group. However, what I did here is use the lowest number, so that we are sure to reach good power even for the smallest group.<br />
<br />
As you can see, even with this small effect size we still reach a power of 1, meaning 100%. This is because the sample size is more than adequate to detect even such a small effect. You could run the sample size calculation again to see what the minimum sample requirement would be for the observed effect size.<br />
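As a sketch of that check, we can feed the f computed above (0.1342902) back into pwr.anova.test, this time asking for the sample size (assuming the pwr package is installed):

```r
library(pwr)

# a priori sample size for the observed (small) effect size
f.obs <- 0.1342902
pwr.anova.test(k = 6, f = f.obs, sig.level = 0.05, power = 0.8)
```

The returned n is the required number of samples per group, which for such a small f is considerably larger than the 36 per group we obtained for a medium effect.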
<br />
<br />
<h3>
Linear Model</h3>
</div>
<div>
The method we have seen above is only valid for one-way ANOVAs. For more complex models, which may simply be ANOVAs with two treatments, we should use the function specific for linear models.</div>
<div>
<h4>
Sample Size</h4>
To calculate the sample size for this analysis we can refer once again to the package pwr, but now use the function pwr.f2.test.<br />
Using this function is slightly more complex because here we start reasoning in terms of degrees of freedom for the F ratio, which can be obtained using the following equation:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2_hZAzTkufDFE6j8-PbXABlYtWN8yTFdoCyWPIK5cLWxJElUD-eNEtJl8t1g4wPAMHeR1MQyk7KoiCZT0iLdB4KWK8S_CSTzgrkLr2lTJVi67jB2gHA7RWn48WtR1qS47BzMmZvk8JIOw/s1600/FRatio.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="98" data-original-width="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2_hZAzTkufDFE6j8-PbXABlYtWN8yTFdoCyWPIK5cLWxJElUD-eNEtJl8t1g4wPAMHeR1MQyk7KoiCZT0iLdB4KWK8S_CSTzgrkLr2lTJVi67jB2gHA7RWn48WtR1qS47BzMmZvk8JIOw/s1600/FRatio.JPG" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From: https://cnx.org/contents/crKQFJtQ@14/F-Distribution-and-One-Way-ANO</td></tr>
</tbody></table>
where MS between is the mean square variance between groups and MS within is the mean square variance within each group.<br />
These two terms have the following equations (again from: https://cnx.org/contents/crKQFJtQ@14/F-Distribution-and-One-Way-ANO) :<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHXWF2sLfz4lZqo_H4EQh7SN9TVahXth7_L5giMqdC1op_Pa9VZ4HiZF-qtJBVh0i37mNayeFQASxfE_t5g9T1rGwudobRvMvKmUDzVQZQIlNDLKcZnldh3kE0jVRiL7DCEOZAY6eEWilD/s1600/MSbetween.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="72" data-original-width="443" height="52" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHXWF2sLfz4lZqo_H4EQh7SN9TVahXth7_L5giMqdC1op_Pa9VZ4HiZF-qtJBVh0i37mNayeFQASxfE_t5g9T1rGwudobRvMvKmUDzVQZQIlNDLKcZnldh3kE0jVRiL7DCEOZAY6eEWilD/s320/MSbetween.JPG" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqsKcD23PWO7NzABrKzKW2vdvXlSJwmLS-YHknYT1CTVLGP1UXlMLMW14nSF_t0JUIvskMi6xO4cQgQLsHCoO0Okyv3wyZCvbUBf_af3kjmQA8WZGdaYIySALMiroCiF_7cB2Q6b7qJJaw/s1600/MSwithin.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="74" data-original-width="397" height="59" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqsKcD23PWO7NzABrKzKW2vdvXlSJwmLS-YHknYT1CTVLGP1UXlMLMW14nSF_t0JUIvskMi6xO4cQgQLsHCoO0Okyv3wyZCvbUBf_af3kjmQA8WZGdaYIySALMiroCiF_7cB2Q6b7qJJaw/s320/MSwithin.JPG" width="320" /></a></div>
<br />
The degrees of freedom we need to consider are the denominators of the last two equations. For an <i>a priori</i> power analysis we need to input the option u, with the degrees of freedom of the numerator of the F ratio, thus of MS between. As you can see, for a one-way ANOVA this can be computed as k-1.<br />
For more complex models we need to calculate the degrees of freedom ourselves. This is not difficult, because we can generate dummy datasets in R with the specific treatment structure we require, so that R computes the degrees of freedom for us.<br />
We can generate a dummy dataset very easily with the function expand.grid:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > data = expand.grid(rep=1:3, FC1=c("A","B","C"), FC2=c("TR1","TR2"))
> data
    rep FC1 FC2
 1    1   A TR1
 2    2   A TR1
 3    3   A TR1
 4    1   B TR1
 5    2   B TR1
 6    3   B TR1
 7    1   C TR1
 8    2   C TR1
 9    3   C TR1
 10   1   A TR2
 11   2   A TR2
 12   3   A TR2
 13   1   B TR2
 14   2   B TR2
 15   3   B TR2
 16   1   C TR2
 17   2   C TR2
 18   3   C TR2
</code></pre>
<br />
Working with expand.grid is very simple: we just need to specify the levels for each treatment and the number of replicates (or blocks), and the function generates a dataset with every combination.<br />
Now we just need to add the dependent variable, which we can draw randomly from a normal distribution:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> data$Y = rnorm(nrow(data))
</code></pre>
<br />
Now our dataset is ready so we can fit a linear model to it and generate the ANOVA table:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > mod.pilot = lm(Y ~ FC1*FC2, data=data)
> anova(mod.pilot)
 Analysis of Variance Table
 
 Response: Y
           Df  Sum Sq Mean Sq F value Pr(>F)
 FC1        2  0.8627  0.4314  0.3586 0.7059
 FC2        1  3.3515  3.3515  2.7859 0.1210
 FC1:FC2    2  1.8915  0.9458  0.7862 0.4777
 Residuals 12 14.4359  1.2030
</code></pre>
<br />
Since this is a dummy dataset all the sum of squares and the other values are meaningless. We are only interested in looking at the degrees of freedom.<br />
Now we can run pwr.f2.test to calculate the sample size, as follows:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power=0.8)
</code></pre>
<div>
<br />
The first option in the function is u, which represents the degrees of freedom of the numerator of the F ratio. This is related to the degrees of freedom of the component we want to focus on. As you probably noticed from the model, we are trying to see if there is an interaction between the two treatments. From the ANOVA table above we can see that the degrees of freedom of the interaction are equal to 2, so that is what we include as u.<br />
Other options are again power and significance level, which we already discussed. Moreover, in this function the effect size is f2, which is again different from the f we've seen before. f2 also has its own table:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibSWWGV1lriSBZtFYEvR6x_IdfEAKOSr53w-hgN6w4ktX2unywY0cbbVtbAbP-IZuMo7RXGYI1mzWIrKIoWGFZr5Fb3Hfqnk3PBGPIUAZ7vt7SUkXLKv-lDoFYnvfR8AiJIW3Lu2wyzEBc/s1600/Table_f2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="362" data-original-width="1600" height="144" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibSWWGV1lriSBZtFYEvR6x_IdfEAKOSr53w-hgN6w4ktX2unywY0cbbVtbAbP-IZuMo7RXGYI1mzWIrKIoWGFZr5Fb3Hfqnk3PBGPIUAZ7vt7SUkXLKv-lDoFYnvfR8AiJIW3Lu2wyzEBc/s640/Table_f2.png" width="640" /></a></div>
Since we assume we have no idea about the real effect size we use a medium value for the <i>a priori</i> testing.<br />
<br />
The function returns the following table:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power=0.8)
Multiple regression power calculation
u = 2
v = 38.68562
f2 = 0.25
sig.level = 0.05
power = 0.8
</code></pre>
<br />
As you can see, what the function actually provides is the degrees of freedom for the denominator of the F test (v), which results in 38.68, so 39 since we always round up.<br />
If we look at the equation to compute MS within, we can see that its degrees of freedom are given by n-k, meaning that to transform the degrees of freedom into a sample size we need to add back what we calculated before for the option u. The sample size is then equal to n = v + u + 1, so in this case 39 + 2 + 1 = 42.<br />
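The same arithmetic can be sketched directly in R (assuming the pwr package is installed; the object name n.total is mine):

```r
library(pwr)

# re-run the calculation and convert v into a total sample size
res <- pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power = 0.8)
n.total <- ceiling(res$v) + res$u + 1  # v rounded up, plus u, plus 1
n.total
```

This returns 42, matching the calculation above.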
<br />
This is not the number of samples per group but it is the <b>total number of samples</b>.<br />
<br />
<br />
Another way of looking at the problem is to compute the total power of our model, and not just the power we have to discriminate between levels of one of the treatments (as we saw above). To do so we can still use the function pwr.f2.test, but with some differences. The first is that we need to compute u using all elements in the model, so basically sum the degrees of freedom in the ANOVA table, or count all the coefficients in the model minus the intercept:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> u = length(coef(mod3))-1
</code></pre>
<br />
Another difference is in how we compute the effects size f2. Before we used its relation with partial eta square, now we can use its relation with the R2 of the model:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7wFQj2Siifnk5n3TEcPC1LreCklQAZoRNrB-gGGH6BYEz-4ATKyh3PBKKjWnsd9TEzCFqoL47GfhZLmfQXkh8AItVrUhKc8z8hdryWEa3KwEko4dvYrLfcw16fBrRcdZdZ1F2cce5dd9K/s1600/Eq_f2_r2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="189" data-original-width="1600" height="74" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7wFQj2Siifnk5n3TEcPC1LreCklQAZoRNrB-gGGH6BYEz-4ATKyh3PBKKjWnsd9TEzCFqoL47GfhZLmfQXkh8AItVrUhKc8z8hdryWEa3KwEko4dvYrLfcw16fBrRcdZdZ1F2cce5dd9K/s640/Eq_f2_r2.png" width="640" /></a></div>
<br />
With these additional elements we can compute the power of the model.<br />
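A minimal sketch of the whole procedure on simulated data (the dataset and model here are toy stand-ins for the model in the post, and the pwr package is assumed to be installed):

```r
library(pwr)

# toy data standing in for the dataset used in the post
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 2 * df$x1 + rnorm(100)
mod <- lm(y ~ x1 + x2, data = df)

u  <- length(coef(mod)) - 1    # all coefficients minus the intercept
R2 <- summary(mod)$r.squared   # R squared of the model
f2 <- R2 / (1 - R2)            # effect size from its relation with R2
v  <- nrow(df) - u - 1         # denominator degrees of freedom
pwr.f2.test(u = u, v = v, f2 = f2, sig.level = 0.05)
```

The same lines, with mod replaced by the fitted model and df by the real dataset, give the total power of the model.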
<br />
<br />
<h4>
Power Calculation</h4>
</div>
<div>
Now we look at estimating the power for a model we've already fitted, which can be done with the same function.</div>
<div>
We will work with one of the models we used in the post about <a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models</a>:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod3 = lm(yield ~ nf + bv, data=dat)
</code></pre>
<br />
Once again we first need to calculate the observed effect size as the eta squared, using again the sum of squares:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > Anova(mod3, type="III")
Anova Table (Type III tests)
 
 Response: yield
              Sum Sq   Df  F value    Pr(>F)    
 (Intercept) 747872    1 2877.809 < 2.2e-16 ***
 nf           24111    5   18.555 < 2.2e-16 ***
 bv          437177    1 1682.256 < 2.2e-16 ***
 Residuals   892933 3436                        
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
In this example I used the function Anova (with option type="III") from the package car, just to remind you that if you have an unbalanced design, like in this case, you should use type III sums of squares.<br />
From this table we can obtain the sum of squares we need to compute the eta squared, for example for nf we will use the following code:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > EtaSQ = 24111/(24111+892933)
> EtaSQ
[1] 0.02629209
</code></pre>
<br />
Then we need to transform this into f2 (or f squared), which is what the pwr.f2.test function uses:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > f2 = EtaSQ / (1-EtaSQ)
> f2
[1] 0.02700203
</code></pre>
<br />
The only thing left is to calculate the value of v, i.e. the denominator degrees of freedom. This is equal to n (the number of samples) - u - 1, but a quick way of obtaining this number is to look at the ANOVA table above and take the degrees of freedom of the residuals, i.e. 3436.<br />
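An even quicker way to extract this value programmatically is the base function df.residual, sketched here on a built-in dataset:

```r
# df.residual() returns the residual degrees of freedom of a fitted model
mod <- lm(mpg ~ factor(cyl), data = mtcars)
df.residual(mod)  # 32 samples - 2 treatment df - 1 = 29
```

With the model above, df.residual(mod3) returns 3436 directly, without reading it off the ANOVA table.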
<br />
Now we have everything we need to obtain the observed power:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > pwr.f2.test(u = 5, v = 3436, f2 = f2, sig.level = 0.05)
Multiple regression power calculation
u = 5
v = 3436
f2 = 0.02700203
sig.level = 0.05
power = 1
</code></pre>
<br />
which again returns a very high power, since we have a lot of samples.<br />
<br />
<br />
<br />
<h3>
Generalized Linear Models</h3>
<div>
For GLM we need to install the package lmSupport:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> #install.packages("lmSupport")
library(lmSupport)
</code></pre>
<br />
<h4>
Sample Size</h4>
For calculating the sample size for GLM we can use the same procedure we used for linear models.<br />
<br />
<br />
<h4>
Power Calculation</h4>
<div>
For this example we are going to use one of the models we discussed in the post about <a href="http://r-video-tutorial.blogspot.co.uk/2017/07/generalized-linear-models-and-mixed.html" target="_blank">GLM</a>, using the dataset beall.webworms (n = 1300) from the package agridat:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat = beall.webworms
pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))
</code></pre>
<br />
Once again we would need to compute effect size and degrees of freedom. As before, we can use the function anova to generate the data we need:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(pois.mod2)
Analysis of Deviance Table
 
 Model: poisson, link: log
 
 Response: y
 
 Terms added sequentially (first to last)
 
            Df Deviance Resid. Df Resid. Dev
 NULL                        1299     1955.9
 block      12  122.040      1287     1833.8
 spray       1  188.707      1286     1645.2
 lead        1   42.294      1285     1602.8
 spray:lead  1    4.452      1284     1598.4
</code></pre>
<br />
Let's say we are interested in the interaction between spray and lead; its degrees of freedom are 1, so this is our u. In the same row we also find the residual degrees of freedom, so v is 1284.<br />
The other thing we need is the effect size, which we can compute with the function modelEffectSizes from the package lmSupport:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > modelEffectSizes(pois.mod2)
glm(formula = y ~ block + spray * lead, family = c("poisson"), 
     data = dat)
 
 Coefficients
            SSR df pEta-sqr dR-sqr
 block 122.0402 12   0.0709     NA
 spray 142.3487  1   0.0818 0.0849
 lead   43.7211  1   0.0266 0.0261
 
 Sum of squared errors (SSE): 1598.4
 Sum of squared total (SST): 1675.9
</code></pre>
<br />
This function calculates the partial eta squared, and it also works for lm models. As you can see it does not provide the eta squared for the interaction, but just to be on the safe side we can use the lowest of the values provided for spray and lead (0.03).<br />
Now that we have the observed eta squared we can use the function modelPower:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">> modelPower(u=1, v=1284, alpha=0.05, peta2=0.03)
Results from Power Analysis
pEta2 = 0.030
u = 1
v = 1284.0
alpha = 0.050
power = 1.000
</code></pre>
<br />
This function can also take the option f2, as we've seen before for the package pwr. However, since computing the partial eta squared is generally easier, we can use the option peta2 and supply this index directly.<br />
Once again our power is very high.<br />
<br />
<br />
<h4>
Note 12/12/2017</h4>
Please note that the line above only works with the older version of the package lmSupport (version 2.9.8). The new version features a different syntax.<br />
You can download the old version from here: <a href="https://cran.r-project.org/src/contrib/Archive/lmSupport/lmSupport_2.9.8.tar.gz" target="_blank">https://cran.r-project.org/src/contrib/Archive/lmSupport/lmSupport_2.9.8.tar.gz</a><br />
<br />
<br />
<h3>
Linear Mixed Effects Models</h3>
<div>
For power analysis with mixed effects models we would need to install the following packages:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> #install.packages("simr")
library(simr)
</code></pre>
<br />
In this example we will be working with models fitted with the package lme4, but what is discussed here should work also with models fitted with nlme.<br />
<br />
<h4>
Sample Size</h4>
</div>
<div>
<i>A priori</i> power analysis for mixed effects models is not easy. There are packages that provide functions for this (e.g. simstudy and longpower), but they seem geared towards the medical sciences and I found them difficult to use. For this reason, I decided that probably the easiest way to test the power of an experiment that requires a mixed effects model (e.g. involving clustering or repeated measures) is to again use a dummy dataset and simulation. However, please be advised that I am not 100% sure of the validity of this procedure.</div>
<div>
<br /></div>
<div>
To create the dummy dataset we can use the same procedure we employed above, with expand.grid:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> data = expand.grid(subject=1:5, treatment=c("Tr1", "Tr2", "Tr3"))
data$Y = numeric(nrow(data))
</code></pre>
<div>
<br /></div>
<div>
In this case we are simulating a simple experiment with 5 subjects, 3 treatments and a within-subject design, like a crossover.<br />
As you can see, Y has not yet been drawn from a normal distribution; for the time being it is just a vector of zeroes. We need to create data for each treatment as follows:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> data[data$treatment=="Tr1","Y"] = rnorm(nrow(data[data$treatment=="Tr1",]), mean=20, sd=1)
data[data$treatment=="Tr2","Y"] = rnorm(nrow(data[data$treatment=="Tr2",]), mean=20.5, sd=1)
data[data$treatment=="Tr3","Y"] = rnorm(nrow(data[data$treatment=="Tr3",]), mean=21, sd=1)
</code></pre>
<br /></div>
<div>
In these lines I created three samples from normal distributions whose means differ by half their standard deviation. This (with SD = 1) gives an effect size (d) of 0.5, so medium.<br />
<br />
Now we can create the model:<br />
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod1 = lmer(Y ~ treatment + (1|subject), data=data)
summary(mod1)
</code></pre>
<div>
<br />
and then test its power with the function powerSim from the package simr. This function runs 1000 simulations and provides a measure of the power of the experiment:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > powerSim(mod1, alpha=0.05)
Power for predictor 'treatment', (95% confidence interval):
25.90% (23.21, 28.73)
Test: Likelihood ratio
Based on 1000 simulations, (84 warnings, 0 errors)
alpha = 0.05, nrow = 15
Time elapsed: 0 h 3 m 2 s
nb: result might be an observed power calculation
Warning message:
In observedPowerWarning(sim) :
This appears to be an "observed power" calculation
</code></pre>
<br /></div>
From this output we can see that our power is very low, so we probably need to increase the number of subjects and then try again the simulation.<br />
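One way to explore how many subjects we would need is the function extend from simr, which enlarges the dummy dataset, combined with powerSim or powerCurve. A sketch is below (assuming lme4 and simr are installed; nsim is kept small here only to limit runtime, 1000 simulations would give more stable estimates):

```r
library(lme4)
library(simr)

# recreate the dummy crossover dataset from above
set.seed(1)
data = expand.grid(subject=1:5, treatment=c("Tr1", "Tr2", "Tr3"))
data$Y = numeric(nrow(data))
data[data$treatment=="Tr1","Y"] = rnorm(5, mean=20, sd=1)
data[data$treatment=="Tr2","Y"] = rnorm(5, mean=20.5, sd=1)
data[data$treatment=="Tr3","Y"] = rnorm(5, mean=21, sd=1)
mod1 = lmer(Y ~ treatment + (1|subject), data=data)

# enlarge the design to 20 subjects and re-estimate power
mod1.ext = extend(mod1, along="subject", n=20)
powerSim(mod1.ext, alpha=0.05, nsim=20)

# or compute power at several sample sizes at once:
# powerCurve(mod1.ext, along="subject", nsim=20)
```

powerCurve in particular is useful to find the smallest number of subjects that reaches the desired 80% power.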
<br />
Let's now look at repeated measures. In this case we not only have the effect size to account for in the data, but also the correlation in time between measures.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(mvtnorm)
sigma <- matrix(c(1, 0.5, 0.5, 0.5,
0.5, 1, 0.5, 0.5,
0.5, 0.5, 1, 0.5,
0.5, 0.5, 0.5 ,1 ), ncol=4, byrow=T)
data = expand.grid(subject=1:4, treatment=c("Tr1", "Tr2", "Tr3"), time=c("t1","t2","t3","t4"))
data$Y = numeric(nrow(data))
T1 = rmvnorm(4, mean=rep(20, 4), sigma=sigma)
T2 = rmvnorm(4, mean=rep(20.5, 4), sigma=sigma)
T3 = rmvnorm(4, mean=rep(21, 4), sigma=sigma)
# each row of T1, T2 and T3 is one subject's correlated series over the four times
for(i in 1:4){
data[data$subject==i&data$treatment=="Tr1","Y"] = T1[i,]
data[data$subject==i&data$treatment=="Tr2","Y"] = T2[i,]
data[data$subject==i&data$treatment=="Tr3","Y"] = T3[i,]
}
modRM = lmer(Y ~ treatment + (time|subject), data=data)
summary(modRM)
powerSim(modRM, alpha=0.05)
</code></pre>
<br />
In this case we need to use the function rmvnorm to draw, from a normal distribution, samples with a certain correlation. For this example I followed the approach suggested by <a href="https://twitter.com/williamahuber?lang=en" target="_blank">William Huber</a> here: <a href="https://stats.stackexchange.com/questions/24257/how-to-simulate-multivariate-outcomes-in-r/24271#24271" target="_blank">https://stats.stackexchange.com/questions/24257/how-to-simulate-multivariate-outcomes-in-r/24271#24271</a><br />
<br />
Basically, if we assume a correlation of 0.5 between time samples (which is what the software <a href="http://www.gpower.hhu.de/en.html" target="_blank">G*Power</a> assumes for repeated measures), we first need to create the symmetric matrix sigma. This allows rmvnorm to produce values from distributions with standard deviation equal to 1 and correlation 0.5.<br />
A more elegant approach is the one suggested by Ben Amsel on his blog: <a href="https://cognitivedatascientist.com/2015/12/14/power-simulation-in-r-the-repeated-measures-anova-5/" target="_blank">https://cognitivedatascientist.com/2015/12/14/power-simulation-in-r-the-repeated-measures-anova-5/</a><br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> sigma = 1 # population standard deviation
rho = 0.5 # correlation between repeated measures
# create k x k matrix populated with sigma
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))
# compute covariance between measures
Sigma <- t(S) * S * rho
# put the variances on the diagonal
diag(Sigma) <- sigma^2
</code></pre>
<br />
The result is the same, but this way you can specify different values for the SD and the correlation.<br />
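To check that this construction behaves as intended, we can draw a large number of samples from the covariance matrix with rmvnorm and verify that the empirical correlations and standard deviations are close to the values we specified (a quick sketch, assuming the mvtnorm package is installed; the matrix construction is repeated here so the snippet is self-contained):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(mvtnorm)

sigma = 1
rho = 0.5
S = matrix(rep(sigma, 4), ncol=4, nrow=4)
Sigma = t(S) * S * rho
diag(Sigma) = sigma^2

set.seed(1)
draws = rmvnorm(5000, mean=rep(20, 4), sigma=Sigma)
round(cor(draws), 2)          # off-diagonal values should be close to 0.5
round(apply(draws, 2, sd), 2) # should be close to 1
</code></pre>
<br />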
<br />
The other elements the function needs are the mean values, for which I used the same as before. This should guarantee a difference of around half a standard deviation between treatments.<br />
The rest of the procedure is the same we used before, with no changes.<br />
<br />
As I said before, I am not sure if this is the correct way of computing power for linear mixed effects models. It may be completely or partially wrong; if you know how to do this, or if you have comments, please do not hesitate to write them below.<br />
<br />
<br />
<h3>
Power Analysis</h3>
As we have seen with the <i>a priori</i> analysis, computing the power of mixed effects models is actually very easy with the function powerSim.<br />
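For example, assuming the model modRM fitted above is still in the workspace, a call along these lines runs the simulation (a sketch; nsim controls the number of Monte Carlo iterations, and a lower value speeds up the run at the cost of precision):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(simr)

# assuming modRM is the lmer model fitted in the a priori section
powerSim(modRM, nsim=200, alpha=0.05)
</code></pre>
<br />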
<br />
<br />
<br />
<br />
<h3>
References</h3>
PWR package Vignette: <a href="https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html" target="_blank">https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html</a><br />
<br />
William E. Berndtson (1991). "A simple, rapid and reliable method for selecting or assessing the number of replicates for animal experiments"<br />
<a href="http://scholars.unh.edu/cgi/viewcontent.cgi?article=1312&context=nhaes" target="_blank">http://scholars.unh.edu/cgi/viewcontent.cgi?article=1312&context=nhaes</a><br />
<br />
<b>NOTE</b>:<br />
<i>This paper is what some of my colleagues, who deal particularly with animal experiments, use to calculate how many subjects to include in their experiments. The method it presents is based on the coefficient of variation (CV%), which is also often used in agriculture to estimate the number of replicates needed.</i><br />
<i><br /></i>
<i><br /></i>
<br />
<br />
Berkowitz J. "Sample Size Estimation" - <a href="http://www.columbia.edu/~mvp19/RMC/M6/M6.doc" target="_blank">http://www.columbia.edu/~mvp19/RMC/M6/M6.doc</a><br />
<br />
This document gives you some rules of thumb to determine the sample size for a number of experiments.<br />
<br />
<br />
<h4>
Update 26/07/2017</h4>
<div>
For computing effect sizes automatically you also have the option to use the package <a href="https://cran.r-project.org/web/packages/sjstats/index.html" target="_blank">sjstats</a>. This has functions to compute eta-squared, partial eta-squared and others, but it also has an option to print a comprehensive ANOVA table with everything you get from a normal call to anova, plus the effect sizes.</div>
<div>
You can find some examples in this blog post from the author of the package, <span style="background-color: white;">Daniel Lüdecke:</span></div>
<div>
<span style="background-color: white;"><a href="https://strengejacke.wordpress.com/2017/07/25/effect-size-statistics-for-anova-tables-rstats/" target="_blank">https://strengejacke.wordpress.com/2017/07/25/effect-size-statistics-for-anova-tables-rstats/</a></span></div>
<br /></div>
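As a quick sketch (assuming the sjstats package is installed and provides the function anova_stats), this is how it could be applied to an aov fit, for example on the simulated data.frame used earlier in this post:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(sjstats)

# hypothetical example on the simulated data.frame "data" from above
fit.aov = aov(Y ~ treatment, data=data)
anova_stats(fit.aov) # ANOVA table plus eta-squared and other effect sizes
</code></pre>
<br />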
<div>
<br />
<br />
<h3>
Final Note about the use of CV% </h3>
</div>
<div>
As I mentioned above, CV% together with the percentage difference between means is one of the indexes used to estimate the number of replicates needed to run experiments. For this reason I decided to create some code to test whether power analysis and the method based on CV% provide similar results.</div>
<div>
Below is the function I created for this:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> d.ES = function(n, M, SD, DV){
M1=M
M2=M+(SD/DV)
PC.DIFF = (abs(M1-M2)/((M1+M2)/2))*100
numerator = (mean(M2)-mean(M1))
denominator = sqrt((((n-1)*SD^2)+((n-1)*SD^2))/(n+n-2))
ES=numerator/denominator
samp = sapply(ES, function(x){pwr.anova.test(k=2, f=x/2, sig.level=0.05, power=0.8)$n})
CV1=SD/M1
return(list(EffectSize=ES, PercentageDifference=PC.DIFF, CV.Control=CV1*100, n.samples=samp))
}
</code></pre>
<br />
This function takes 4 arguments: number of samples (n), mean of the control (M), standard deviation (here we assume the standard deviation to be identical between groups), and DV, which indicates the fraction of the standard deviation added to the control mean to obtain the mean of the treatment. If DV is equal to 2 then the mean of the treatment will be the mean of the control plus half a standard deviation.<br />
<br />
The equation for the percentage of difference was taken from: <a href="https://www.calculatorsoup.com/calculators/algebra/percent-difference-calculator.php" target="_blank">https://www.calculatorsoup.com/calculators/algebra/percent-difference-calculator.php</a><br />
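To see the percentage-difference equation at work, take a control mean of 20, an SD of 5 and DV equal to 8 (as in the first call below): the treatment mean becomes 20 + 5/8 = 20.625, and the percentage difference is:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> M1 = 20
M2 = 20 + 5/8
(abs(M1-M2)/((M1+M2)/2))*100
# [1] 3.076923
</code></pre>
<br />
This matches the fifth entry of $PercentageDifference in the output below (the one for SD = 5).<br />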
<br />
Now we can use this function to estimate the effect size, the percentage difference between means, the CV% and the number of samples suggested by power analysis (assuming an ANOVA with 2 groups).<br />
<br />
The first example looks at changing the standard deviation, and keeping everything else constant:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > d.ES(n=10, M=20, SD=seq(1, 15, by=1), DV=8)
$EffectSize
[1] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000 0.16666667
[7] 0.14285714 0.12500000 0.11111111 0.10000000 0.09090909 0.08333333
[13] 0.07692308 0.07142857 0.06666667
$PercentageDifference
[1] 0.623053 1.242236 1.857585 2.469136 3.076923 3.680982 4.281346 4.878049
[9] 5.471125 6.060606 6.646526 7.228916 7.807808 8.383234 8.955224
$CV.Control
[1] 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75
$n.samples
[1] 10.54166 39.15340 86.88807 153.72338 239.65639 344.68632
[7] 468.81295 612.03614 774.35589 955.77211 1156.28483 1375.89403
[13] 1614.59972 1872.40183 2149.30043
</code></pre>
<br />
If you look at the tables presented in the paper by Berndtson you will see that the results are similar in terms of the number of samples.<br />
<br />
Larger differences appear when we look at changes in the mean, while everything else stays constant:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > d.ES(n=10, M=seq(1,25,1), SD=5, DV=8)
$EffectSize
[1] 0.125
$PercentageDifference
[1] 47.619048 27.027027 18.867925 14.492754 11.764706 9.900990 8.547009
[8] 7.518797 6.711409 6.060606 5.524862 5.076142 4.694836 4.366812
[15] 4.081633 3.831418 3.610108 3.412969 3.236246 3.076923 2.932551
[22] 2.801120 2.680965 2.570694 2.469136
$CV.Control
[1] 500.00000 250.00000 166.66667 125.00000 100.00000 83.33333 71.42857
[8] 62.50000 55.55556 50.00000 45.45455 41.66667 38.46154 35.71429
[15] 33.33333 31.25000 29.41176 27.77778 26.31579 25.00000 23.80952
[22] 22.72727 21.73913 20.83333 20.00000
$n.samples
[1] 1005.615
</code></pre>
<br />
In this case the mean of the treatment is again the mean of the control plus 1/8 of the standard deviation, which is fixed at 5. Since the difference in means and the standard deviation are both constant, the effect size stays constant at 0.125, which is very small.<br />
However, both the percentage difference and the CV% change quite a bit, and therefore the estimates from Berndtson could differ.<br />
<br />
<br />
<br />Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com1tag:blogger.com,1999:blog-1442302563171663500.post-72413725324638367922017-07-15T14:06:00.002+02:002017-07-24T16:33:05.900+02:00Generalized Additive Models and Mixed-Effects in Agriculture<h2>
Introduction</h2>
In the previous post I explored the use of linear models in the forms most commonly used in agricultural research.<br />
Clearly, when we talk about linear models we implicitly assume that all relations between the dependent variable y and the predictors x are linear. In fact, in a linear model we can specify different shapes for the relation between y and x, for example by including polynomials (read for example: https://datascienceplus.com/fitting-polynomial-regression-r/). However, we can only do that in cases where we can clearly see a particular shape of the relation, for example quadratic. The problem is that in many cases we can see from a scatterplot that the points follow a non-linear pattern, but it is difficult to understand its form. Moreover, in a linear model the interpretation of polynomial coefficients becomes more difficult, and this may decrease their usefulness.<br />
An alternative approach is provided by Generalized Additive Models (GAMs), which allow us to fit models with non-linear smoothers without specifying a particular shape a priori.<br />
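To get an intuition for what a smoother does, we can simulate data with a shape that would be hard to guess a priori and let the smoother find it (a minimal sketch, assuming the mgcv package introduced below):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(mgcv)

set.seed(1)
x = seq(0, 2*pi, length.out=200)
y = sin(x) + rnorm(200, sd=0.3)

# no shape is specified for the relation between y and x
mod = gam(y ~ s(x))
plot(mod, residuals=TRUE, pch=20) # the smoother recovers the sine shape
</code></pre>
<br />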
<br />
I will not go into much detail about the theory behind GAMs. You can refer to these two books (freely available online) to learn more:<br />
<br />
<span style="background-color: white; color: #222222; font-family: "arial" , sans-serif; font-size: 13px;">Wood, S.N., 2017. </span><i style="background-color: white; color: #222222; font-family: Arial, sans-serif; font-size: 13px;">Generalized additive models: an introduction with R</i><span style="background-color: white; color: #222222; font-family: "arial" , sans-serif; font-size: 13px;">. CRC press.</span><br />
<span style="background-color: white; font-size: 13px;"><span style="color: #222222; font-family: "arial" , sans-serif;">http://reseau-mexico.fr/sites/reseau-mexico.fr/files/igam.pdf</span></span><br />
<span style="background-color: white; font-size: 13px;"><span style="color: #222222; font-family: "arial" , sans-serif;"><br /></span></span>
<span style="background-color: white; color: #222222; font-family: "arial" , sans-serif; font-size: 13px;">Crawley, M.J., 2012. </span><i style="background-color: white; color: #222222; font-family: Arial, sans-serif; font-size: 13px;">The R book</i><span style="background-color: white; color: #222222; font-family: "arial" , sans-serif; font-size: 13px;">. John Wiley & Sons.</span><br />
<span style="background-color: white; font-size: 13px;"><span style="color: #222222; font-family: "arial" , sans-serif;">https://www.cs.upc.edu/~robert/teaching/estadistica/TheRBook.pdf</span></span><br />
<span style="background-color: white; font-size: 13px;"><span style="color: #222222; font-family: "arial" , sans-serif;"><br /></span></span>
<br />
<h2>
</h2>
<h2>
Some Background</h2>
<div>
As mentioned above, GAMs are more powerful than the other linear models we have seen in previous posts, since they allow us to include non-linear smoothers in the mix. In mathematical terms GAMs solve the following equation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWFswW3hUkz8BBAzSjgWFgyKfDZ9AGuHcxXHPay9ZLewL7iWsi0-IPZvQ5NNconXPlSWLP-oxuFCOdWu9ZwLeF-9b6gq4HI_w4gsKxw-VUbhe5sa6QunfMIbKSA-AtHQqI88CIC7wE4EVz/s1600/Eq1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="101" data-original-width="1600" height="40" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWFswW3hUkz8BBAzSjgWFgyKfDZ9AGuHcxXHPay9ZLewL7iWsi0-IPZvQ5NNconXPlSWLP-oxuFCOdWu9ZwLeF-9b6gq4HI_w4gsKxw-VUbhe5sa6QunfMIbKSA-AtHQqI88CIC7wE4EVz/s640/Eq1.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
It may seem like a complex equation, but actually it is pretty simple to understand. The first thing to notice is that with GAMs we are not necessarily estimating the response directly, i.e. we are not modelling y itself. In fact, as with GLMs we have the possibility to use link functions to model non-normal response variables (and thus perform Poisson or logistic regression). Therefore, the term g(μ) is simply the transformation of y needed to "linearize" the model. When we are dealing with a normally distributed response this term is simply replaced by y.<br />
Now we can explore the second part of the equation, where we have two terms: a parametric part and a non-parametric part. In a GAM we can include all the parametric terms we can include in lm or glm, for example linear or polynomial terms. The second part is the non-parametric smoother that will be fitted automatically, and this is the key feature of GAMs.<br />
To better understand the difference between the two parts of the equation we can explore an example. Let's say we have a response variable (normally distributed) and two predictors x1 and x2. We look at the data and we observe a clear linear relation between x1 and y, but a complex curvilinear pattern between x2 and y. Because of this we decide to fit a generalized additive model that in this particular case will take the following equation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2-NlNrSVpH6pVH26ddOtOqulDaAipZS9ZZ63q-jwCRGJhfFUOIDVDtpXyLlNvwa_vS8WS55fJr65lHIa-VbKFLKX1QMgBINQyjFeZHdgVWY_UkU0fL6sdGecmMbuyC3QrGOyO3THhMOTy/s1600/Eq2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="95" data-original-width="1600" height="38" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2-NlNrSVpH6pVH26ddOtOqulDaAipZS9ZZ63q-jwCRGJhfFUOIDVDtpXyLlNvwa_vS8WS55fJr65lHIa-VbKFLKX1QMgBINQyjFeZHdgVWY_UkU0fL6sdGecmMbuyC3QrGOyO3THhMOTy/s640/Eq2.png" width="640" /></a></div>
Since y is normal we do not need the link function g(). We model x1 with a linear term, with intercept beta zero and coefficient beta one. However, since we observed a curvilinear relation between x2 and y, we also include a non-parametric smoothing function to model x2.</div>
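In mgcv syntax (introduced below) this hypothetical model would be written as follows, where df, x1 and x2 are assumed names:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # x1 enters as a parametric (linear) term, x2 through a smoother s()
mod = gam(y ~ x1 + s(x2), data=df)
</code></pre>
<br />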
<br />
<br />
<h2>
Practical Example</h2>
<div>
In this tutorial we will work once again with the package agridat so that we can work directly with real data in agriculture. Other packages we will use are ggplot2, moments, pscl and MuMIn:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(agridat)
library(ggplot2)
library(moments)
library(pscl)
library(MuMIn)
</code></pre>
<br /></div>
In R there are two packages for fitting generalized additive models: gam and mgcv. Here I will talk about the package mgcv. For an overview of GAMs from the package gam you can refer to this post: https://datascienceplus.com/generalized-additive-models/<br />
<br />
The first thing we need to do is install the package mgcv:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("mgcv")
library(mgcv)
</code></pre>
<br />
Now we can load once again the dataset lasrosas.corn, with measures of yield based on nitrogen treatments, plus topographic position and brightness value (for more info please take a look at my previous post: <a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models (lm, ANOVA and ANCOVA) in Agriculture</a>). Then we can use the function pairs to plot all variables in scatterplots, colored by topographic position:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat = lasrosas.corn
attach(dat)
pairs(dat[,4:9], lower.panel = NULL, col=topo)
</code></pre>
<br />
This produces the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNHTthtN5aUzVR21KVv_tpqmnDQxSdIAA4cTtDDMEi4o2mjAIyKr2FYQrEUnO3NfZWWQH-h7aLZ_JkhVLmr9G5zMGQicHiqcKH7aIJZEC7I7Bln52bj-RUGFDhGpLblXHi8y5H_9JqchZ3/s1600/Fig1.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="877" data-original-width="1216" height="459" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNHTthtN5aUzVR21KVv_tpqmnDQxSdIAA4cTtDDMEi4o2mjAIyKr2FYQrEUnO3NfZWWQH-h7aLZ_JkhVLmr9G5zMGQicHiqcKH7aIJZEC7I7Bln52bj-RUGFDhGpLblXHi8y5H_9JqchZ3/s640/Fig1.jpeg" width="640" /></a></div>
<br />
In the previous post we only fitted linear models to these data, and therefore the relations between yield and all other predictors were always modelled as lines. However, if we look at the scatterplot between yield and bv, we can clearly see a pattern that does not really look linear, with some blue dots that deviate from the main cloud. If these blue dots were not present we would be happy to model this relation as linear. In fact, we can show that by focusing on this plot and removing the level W from topo:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> par(mfrow=c(1,2))
plot(yield ~ bv, pch=20, data=dat, xlim=c(100,220))
plot(yield ~ bv, pch=20, data=dat[dat$topo!="W",], xlim=c(100,220))
</code></pre>
<br />
which creates the following plot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoYVTubkE0_UPEylZT8W7kvWBB9v0X_GFjV96Q6IOOZ4ajoPkekGcCzPO_Tn5QTp1aI9MhTFE24PqSLNm_M4yiKlMebSQpf-6em_bX8Ak_jrvo6MYfpHIVQ9oRTjj-NpZ72Is_IYm3FkCt/s1600/Fig2.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="1133" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoYVTubkE0_UPEylZT8W7kvWBB9v0X_GFjV96Q6IOOZ4ajoPkekGcCzPO_Tn5QTp1aI9MhTFE24PqSLNm_M4yiKlMebSQpf-6em_bX8Ak_jrvo6MYfpHIVQ9oRTjj-NpZ72Is_IYm3FkCt/s640/Fig2.jpeg" width="640" /></a></div>
From this plot it is clear that the level W is an anomaly compared to the rest of the dataset. However, even removing this level does not really produce a linear pattern, but rather a quadratic one. For this reason, if we wanted the best possible model of yield we would probably need to split the data by topographic position. However, for this post we are only interested in showing the use of GAMs. Therefore, we will keep all levels of topo and try to model the relation between yield and bv with a non-parametric smoother.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod.lm = lm(yield ~ bv, data=dat)
mod.quad = lm(yield ~ bv + I(bv^2), data=dat)
mod.gam = gam(yield ~ s(bv), data=dat)
</code></pre>
<br />
Here we are testing three models: a standard linear model, a linear model with a quadratic term, and finally a GAM. We do that because we are not sure which model is best, and we want to make sure we do not overfit our data.<br />
We can compare these models in the same way we explored in previous posts: by calculating the Akaike Information Criterion (AIC) and with an F test.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > AIC(mod.lm, mod.quad, mod.gam)
df AIC
mod.lm 3.000000 29005.32
mod.quad 4.000000 28924.18
mod.gam 7.738304 28853.72
> anova(mod.lm, mod.quad, mod.gam, test="F")
Analysis of Variance Table
Model 1: yield ~ bv
Model 2: yield ~ bv + I(bv^2)
Model 3: yield ~ s(bv)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3441.0 917043
2 3440.0 895165 1.0000 21879 85.908 < 2.2e-16 ***
3 3436.3 875130 3.7383 20034 21.043 3.305e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
The AIC suggests that the GAM is slightly more accurate than the other two, even with more degrees of freedom. The F test again shows a significant difference between models, suggesting that we should use the more complex model.<br />
<br />
We can look at the difference in fitting of the three models graphically first using the standard plotting function and then with ggplot2:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> plot(yield ~ bv, pch=20)
abline(mod.lm,col="blue",lwd=2)
lines(50:250,predict(mod.gam, newdata=data.frame(bv=50:250)),col="red",lwd=2)
lines(50:250,predict(mod.quad, newdata=data.frame(bv=50:250)),col="green",lwd=2)
</code></pre>
<br />
This produces the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidKu94M-nInjgvJ7UkNJJ02sNi1KA7XyAM0cLv_a_5aVa8jxL4eaooGYQtsdjkj5EWu-wsTpxEYIxya_9bzDiZJzSY1Le-kW1Ue52k-Aeer_9iKjN0beURtFq8AeeFnsOeIpjWNouX_IBE/s1600/Fig3.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidKu94M-nInjgvJ7UkNJJ02sNi1KA7XyAM0cLv_a_5aVa8jxL4eaooGYQtsdjkj5EWu-wsTpxEYIxya_9bzDiZJzSY1Le-kW1Ue52k-Aeer_9iKjN0beURtFq8AeeFnsOeIpjWNouX_IBE/s640/Fig3.jpeg" width="640" /></a></div>
<br />
The same can be achieved with ggplot2:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> ggplot(data=dat, aes(x=bv, y=yield)) +
geom_point(aes(col=topo)) +
geom_smooth(method = "lm", se = F, col="red")+
geom_smooth(method="gam", formula=y~s(x), se = F, col="blue") +
stat_smooth(method="lm", formula=y~x+I(x^2), se = F, col="green")
</code></pre>
<br />
which produces the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBiOB33HexQtPuzivIWoqckZbpj3uJGoGZnsrbyWil5t-DVDrGNEaoZ9Afm5GrwXPChe3zZEwoH5FHDS0wJwvl_89YgkLnEdgDMS1X3R3K728C8NrCWmn4EmIIbusu9I8gcW9sPe2rIwf7/s1600/Fig4.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBiOB33HexQtPuzivIWoqckZbpj3uJGoGZnsrbyWil5t-DVDrGNEaoZ9Afm5GrwXPChe3zZEwoH5FHDS0wJwvl_89YgkLnEdgDMS1X3R3K728C8NrCWmn4EmIIbusu9I8gcW9sPe2rIwf7/s640/Fig4.jpeg" width="640" /></a></div>
<br />
This second image is even more informative because, when we use a categorical variable to color the dots, ggplot2 automatically creates a legend for it, so we know which level causes the shift in the data (i.e. W).<br />
<br />
As you can see, none of these lines fits the data perfectly, since the large cloud around a yield of 100 and a bv of 180 is not captured. However, the blue line of the non-parametric smoother seems to better catch the violet dots on the left and also bends when it reaches the cloud, mimicking the quadratic behavior we observed before.<br />
<br />
With GAMs we can still use the function summary to look at the model in detail:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod.gam)
Family: gaussian
Link function: identity
Formula:
yield ~ s(bv)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 69.828 0.272 256.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(bv) 5.738 6.919 270.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.353 Deviance explained = 35.4%
GCV = 255.17 Scale est. = 254.68 n = 3443
</code></pre>
<br />
The interpretation is similar to linear models, and probably a bit easier than with GLMs, since with GAMs we also get an R squared directly from the summary output. As you can see the smooth term is highly significant, and we can see its estimated degrees of freedom (around 6) and its F and p values. At the bottom of the output we see a numerical value for GCV, which stands for Generalized Cross Validation score. This is what the model minimizes by default, and it is given by the equation below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjxt_SGL6fnDPxyJvGmTChHAWQfutTPGEeOeVozjRscwGgDN4uOvA_jlelrc0gb68XlvaEY9TrASJnw-DtLxCf2SFKTJ_0NK0HByvFmSC06bz-EzO6agNfBw4oArT6luoEo-TpvryBQ8mO/s1600/Eq3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="175" data-original-width="1600" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjxt_SGL6fnDPxyJvGmTChHAWQfutTPGEeOeVozjRscwGgDN4uOvA_jlelrc0gb68XlvaEY9TrASJnw-DtLxCf2SFKTJ_0NK0HByvFmSC06bz-EzO6agNfBw4oArT6luoEo-TpvryBQ8mO/s640/Eq3.png" width="640" /></a></div>
<br />
<div>
where D is the deviance, n is the number of samples, and df the effective degrees of freedom of the model. For more info please refer to Wood's book. I read online, in an answer on StackOverflow, that GCV may produce undersmoothing (and thus overfitting); I am not completely sure about this because I have not found any mention of it in the official documentation. However, just in case, later on I will show you how to fit the smoother with REML, which according to that answer should solve the issue.</div>
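Fitting the smoother with REML only requires changing the method argument of gam (a quick sketch, using the same model fitted above):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # the smoothing parameter is now estimated by REML instead of GCV
mod.gam.reml = gam(yield ~ s(bv), data=dat, method="REML")
summary(mod.gam.reml)
</code></pre>
<br />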
<div>
<br /></div>
<h3>
Include more parameters</h3>
<br />
Now that we have a general idea about what function to fit for bv we can add more predictors and try to create a more accurate model.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > #Include more predictors
> mod.lm = lm(yield ~ nitro + nf + topo + bv)
> mod.gam = gam(yield ~ nitro + nf + topo + s(bv), data=dat)
>
> #Comparison R Squared
> summary(mod.lm)$adj.r.squared
[1] 0.5211237
> summary(mod.gam)$r.sq
[1] 0.5292042
</code></pre>
<br />
In the code above we are comparing two models with all the predictors we have in the dataset. As you can see there is not much difference between the two models in terms of R squared, so both models are able to explain pretty much the same amount of variation in yield.<br />
<br />
However, as you remember from above, we clearly noticed changes in the yield–bv pattern depending on topo, and we also noticed that excluding certain topographic categories would probably change the behavior of the curve. We can include this new hypothesis in the model by using the option by within s, which fits a separate non-parametric smoother to each level of topo.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > mod.gam2 = gam(yield ~ nitro + nf * topo + s(bv, by=topo), data=dat)
>
> summary(mod.gam2)$r.sq
[1] 0.5612617
</code></pre>
<br />
As you can see, if we fit a curve to each subset of the plot above we can increase the R squared, and therefore explain more variation in yield.<br />
We can further explore the differences between models with the functions AIC and anova, as we've seen in previous posts:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > AIC(mod.lm, mod.gam, mod.gam2)
df AIC
mod.lm 12.00000 27815.23
mod.gam 18.60852 27763.83
mod.gam2 42.22616 27548.63
>
> #F test
> anova(mod.lm, mod.gam, mod.gam2, test="F")
Analysis of Variance Table
Model 1: yield ~ nitro + nf + topo + bv
Model 2: yield ~ nitro + nf + topo + s(bv)
Model 3: yield ~ nitro + nf * topo + s(bv, by = topo)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3432.0 645661
2 3425.4 633656 6.6085 12005 10.525 1.512e-12 ***
3 3401.8 587151 23.6176 46505 11.408 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
The AIC is lower for mod.gam2, and the F test suggests it is significantly different from the others, meaning that we should use the more complex model.<br />
<br />
Another way of assessing the accuracy of our two models is to use some diagnostic plots. Let's start with the model with a non-parametric smoother fitted to the whole dataset (mod.gam):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> plot(mod.gam, all.terms=F, residuals=T, pch=20)
</code></pre>
<br />
which produces the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGR0uBoEfCISkJg4HyIMl3LYrS0Vs2cus-3bLY7pjh6Ol1Blvicwkfrf5D7p3fa_wpXFE6-9wMSOs4q1HNxYWRdaMCiA1b63IjeOppgNQV-Nx7OOccuqRQxPUAhOQftzwEGedfqon9Jkyw/s1600/Fig5.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGR0uBoEfCISkJg4HyIMl3LYrS0Vs2cus-3bLY7pjh6Ol1Blvicwkfrf5D7p3fa_wpXFE6-9wMSOs4q1HNxYWRdaMCiA1b63IjeOppgNQV-Nx7OOccuqRQxPUAhOQftzwEGedfqon9Jkyw/s640/Fig5.jpeg" width="640" /></a></div>
This plot can be interpreted exactly like the fitted vs. residuals plot we produced in the post about linear models (see here: <a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models (lm, ANOVA and ANCOVA) in Agriculture</a>). For the model to be good we would expect this line to be horizontal and the spread to be more or less homogeneous (an exception is when dealing with time-series; please see here: <a href="https://petolau.github.io/Analyzing-double-seasonal-time-series-with-GAM-in-R/" target="_blank">Analyzing-double-seasonal-time-series-with-GAM-in-R</a>). However, this is clearly not the case, and this strongly suggests our model is not a good one.<br />
Now let's take a look at the same plot for mod.gam2, the one where we fitted a curve for each level of topo:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> par(mfrow=c(2,2))
plot(mod.gam2, all.terms=F, residuals=T, pch=20, page=1)
</code></pre>
<br />
In this case we need to use the function par to create 4 sub-plots within the plotting window. This is because now the model fits a curve for each of the four categories in topo, so four plots will be created.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZFZPz3DTSpgn5XIH5zSwe0YHL2xEmIAE7qaESssGXUP-TRRRDX2oTp8nQfZ67qth01ibhI3ZzTtD1omfK0c8Nug2EyHiA0gQeCLCLj6O_PENW-tOdV5WtvK7-ZHhyphenhyphenUgFWGwTQiORYTw7m/s1600/Fig6.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="815" data-original-width="867" height="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZFZPz3DTSpgn5XIH5zSwe0YHL2xEmIAE7qaESssGXUP-TRRRDX2oTp8nQfZ67qth01ibhI3ZzTtD1omfK0c8Nug2EyHiA0gQeCLCLj6O_PENW-tOdV5WtvK7-ZHhyphenhyphenUgFWGwTQiORYTw7m/s640/Fig6.jpeg" width="640" /></a></div>
The result is clearly much better. All lines are more or less horizontal, even though in some cases the spread of the confidence intervals is uneven. However, this model is clearly a step forward in terms of accuracy compared to mod.gam.<br />
<br />
Another useful function for producing diagnostic plots is gam.check:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> par(mfrow=c(2,2))
gam.check(mod.gam2)
</code></pre>
<br />
which creates the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj95Hm7JakyKBkh6jHupu2pG0cng8nOZML1VMqyGXd9MqPiBxeLl3XiiUM1o-tCpbxc6G1lSTTcMQbXz4OTE2W_54Ry-QDHVsq_2GjbReKm3-Myn63zpzf37RrgV2NkhMCTiFicjqf2DuF1/s1600/Fig7.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj95Hm7JakyKBkh6jHupu2pG0cng8nOZML1VMqyGXd9MqPiBxeLl3XiiUM1o-tCpbxc6G1lSTTcMQbXz4OTE2W_54Ry-QDHVsq_2GjbReKm3-Myn63zpzf37RrgV2NkhMCTiFicjqf2DuF1/s640/Fig7.jpeg" width="640" /></a></div>
<br />
This shows plots similar to those we saw in the previous post about linear models. Again we are aiming at normally distributed residuals. Moreover, the residuals vs. fitted plot should show a cloud centered around 0 with more or less equal variance throughout the range of fitted values, which is approximately what we see here.<br />
<br />
<br />
<h3>
Changing Parameters</h3>
<div>
The function s, used to fit non-parametric smoothers, can take a series of options that change its behavior. We will now look at some of them:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod.gam1 = gam(yield ~ s(bv), data=dat)
mod.gam2 = gam(yield ~ s(bv), data=dat, gamma=1.4)
mod.gam3 = gam(yield ~ s(bv), data=dat, method="REML")
mod.gam4 = gam(yield ~ s(bv, bs="cr"), data=dat) #All options for bs at help(smooth.terms)
mod.gam5 = gam(yield ~ s(bv, bs="ps"), data=dat)
mod.gam6 = gam(yield ~ s(bv, k=2), data=dat)
</code></pre>
<div>
<br />
The first line is the standard use, without any options, and we will use it just for comparison. The second call (mod.gam2) changes the gamma, which increases the "penalty" per increment in degrees of freedom. Its default value is 1, but Wood suggests that increasing it to 1.4 can reduce over-fitting (p. 227 of Wood's book, link at the top of the page). The third model fits the GAM using REML instead of the standard GCV score, which should provide a more robust fit. The fourth and fifth models use the option bs within the function s to change the way the curve is fitted. In mod.gam4, cr stands for cubic regression spline, while in mod.gam5 ps stands for P-splines. There are several options available for bs and you can look at them with help(smooth.terms). Each of these options comes with advantages and disadvantages; to know more about this topic you can read p. 222 of Wood's book.<br />
The final model (mod.gam6) changes the basis dimension of the curve, with which we can select the maximum degrees of freedom (the default value is 10). In this case we are basically telling R to fit a quadratic curve.<br />
We can plot all the lines generated from the models above to have an idea of individual impacts:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> plot(yield ~ bv, pch=20)
lines(50:250,predict(mod.gam1, newdata=data.frame(bv=50:250)),col="blue",lwd=2)
lines(50:250,predict(mod.gam2, newdata=data.frame(bv=50:250)),col="red",lwd=2)
lines(50:250,predict(mod.gam3, newdata=data.frame(bv=50:250)),col="green",lwd=2)
lines(50:250,predict(mod.gam4, newdata=data.frame(bv=50:250)),col="yellow",lwd=2)
lines(50:250,predict(mod.gam5, newdata=data.frame(bv=50:250)),col="orange",lwd=2)
lines(50:250,predict(mod.gam6, newdata=data.frame(bv=50:250)),col="violet",lwd=2)
</code></pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTih9GbRsF8oyJU4bOAUUh-jGIMXxy9OqgQPF_xc_nH_hMJM5zQ53T53x9ijwibW0JXCnT-4B504ofQTlUYFhJdHpbeXoPpPgnIr1V_9yfaZ7aXnrDg-YcZXAZAv3l0jCinUGYXIHgxLIF/s1600/Fig8.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTih9GbRsF8oyJU4bOAUUh-jGIMXxy9OqgQPF_xc_nH_hMJM5zQ53T53x9ijwibW0JXCnT-4B504ofQTlUYFhJdHpbeXoPpPgnIr1V_9yfaZ7aXnrDg-YcZXAZAv3l0jCinUGYXIHgxLIF/s640/Fig8.jpeg" width="640" /></a></div>
<br />
As you can see the violet line is basically a quadratic curve, while the rest are quite complex in shape. In particular, the orange line created with P-splines looks like it is overfitting the data, while the others look generally the same.<br />
<br />
<br />
<br />
<h3>
Count Data - Poisson Regression</h3>
</div>
</div>
<div>
GAMs can be used with all the distributions and link functions we explored for GLMs (<a href="http://r-video-tutorial.blogspot.co.uk/2017/07/generalized-linear-models-and-mixed.html" target="_blank">Generalized Linear Models</a>). To explore this we are going to use another dataset from the package agridat, named mead.cauliflower, which records the number of leaves on cauliflower plants at different times.</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat = mead.cauliflower
str(dat)
attach(dat)
pairs(dat, lower.panel = NULL)
</code></pre>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheS-ENDNASLObnCBLh_Ei6HZIWbllDyz3G-C8ycRcjAlo_epBo0arn__EhLkC9dck3WX54-hKdv9-qOUwGNyv3I3WxPpXbKEudfVA5EelJZF3AMPuGi1YDoMadf9xO5LNCKip0AOkdCI_r/s1600/Fig9.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="636" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheS-ENDNASLObnCBLh_Ei6HZIWbllDyz3G-C8ycRcjAlo_epBo0arn__EhLkC9dck3WX54-hKdv9-qOUwGNyv3I3WxPpXbKEudfVA5EelJZF3AMPuGi1YDoMadf9xO5LNCKip0AOkdCI_r/s640/Fig9.jpeg" width="640" /></a></div>
From the pairs plot it seems that a linear model would probably describe the relation between leaves and the variable degdays pretty well. However, since we are talking about GAMs, we will try to fit a generalized additive model and see how it compares to the standard GLM.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pois.glm = glm(leaves ~ year + degdays, data=dat, family=c("poisson"))
pois.gam = gam(leaves ~ year + s(degdays), data=dat, family=c("poisson"))
</code></pre>
<br />
As you can see there are only minor differences in syntax between the two lines. We are still using the option family to specify that we want the Poisson distribution for the error term, plus the log link (used by default, so we do not need to specify it).<br />
To compare the models we can again use AIC and anova:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > AIC(pois.glm, pois.gam)
df AIC
pois.glm 3.000000 101.4505
pois.gam 3.431062 101.1504
>
> anova(pois.glm, pois.gam)
Analysis of Deviance Table
Model 1: leaves ~ year + degdays
Model 2: leaves ~ year + s(degdays)
Resid. Df Resid. Dev Df Deviance
1 11.000 6.0593
2 10.569 4.8970 0.43106 1.1623
</code></pre>
<br />
Both results suggest that a GAM is not really needed for this dataset, since it is only slightly different from the GLM. We could also compare the R squared of the two models, using the functions for computing it for GLMs that we tested in the previous post:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > pR2(pois.glm)
llh llhNull G2 McFadden r2ML r2CU
-47.7252627 -132.3402086 169.2298918 0.6393744 0.9999944 0.9999944
> r.squaredLR(pois.glm)
[1] 0.9999944
attr(,"adj.r.squared")
[1] 0.9999944
>
> summary(pois.gam)$r.sq
[1] 0.9663474
</code></pre>
<br />
For overdispersed data we have the option to use either the quasipoisson or the negative binomial distribution:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pois.gam.quasi = gam(leaves ~ year + s(degdays), data=dat, family=c("quasipoisson"))
pois.gam.nb = gam(leaves ~ year + s(degdays), data=dat, family=nb())
</code></pre>
<br />
For more info about the use of the negative binomial please look at this article:<br />
<a href="http://astrostatistics.psu.edu/su07/R/html/mgcv/html/gam.neg.bin.html" target="_blank">GAMs with the negative binomial distribution</a><br />
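Before switching family, it can be useful to check whether the data are actually overdispersed. Below is a minimal sketch of one common heuristic, assuming the packages mgcv and agridat and the pois.gam model fitted as above; the dispersion ratio and its rule-of-thumb interpretation are a general convention, not part of the original analysis:<br />
<br />

```r
library(agridat)  # for mead.cauliflower
library(mgcv)

dat <- mead.cauliflower
pois.gam <- gam(leaves ~ year + s(degdays), data=dat, family=poisson)

# The sum of squared Pearson residuals divided by the residual degrees
# of freedom estimates the dispersion: values well above 1 hint at
# overdispersion, in which case quasipoisson or nb() may be preferable.
sum(residuals(pois.gam, type="pearson")^2) / pois.gam$df.residual
```

<br />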
<br />
<br />
<h3>
Logistic Regression</h3>
</div>
<div>
Since we can use all the families available for GLMs, we can also use GAMs with binary data; the syntax is again very similar to what we used for GLMs:<br />
<br /></div>
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat = johnson.blight
str(dat)
attach(dat)
logit.glm = glm(blight ~ rain.am + rain.ja + precip.m, data=dat, family=binomial)
logit.gam = gam(blight ~ s(rain.am, rain.ja,k=5) + s(precip.m), data=dat, family=binomial)
</code></pre>
<br /></div>
<div>
As you can see we are using an interaction between rain.am and rain.ja in the model, plus another smooth curve fitted only to precip.m.<br />
We can compare the two models as follows:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(logit.glm, logit.gam, test="Chi")
Analysis of Deviance Table
Model 1: blight ~ rain.am + rain.ja + precip.m
Model 2: blight ~ s(rain.am, rain.ja, k = 5) + s(precip.m)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 21 20.029
2 21 20.029 1.0222e-05 3.4208e-06 6.492e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> AIC(logit.glm, logit.gam)
df AIC
logit.glm 4.00000 28.02893
logit.gam 4.00001 28.02895
</code></pre>
<br />
Despite the nearly identical AIC values, the fact that the anova test is significant suggests we should use the more complex model, i.e. the GAM.<br />
<br />
In the package mgcv there is also a function dedicated to the visualization of the curve on the response variable:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> vis.gam(logit.gam, view=c("rain.am", "rain.ja"), type="response")
</code></pre>
<br />
this creates the following 3D plot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1_Az-LFdYLju_AgL7Njfu_O5jD7iRE-OagBdyUPur26cclZrT9PVLhsSob0eQhyvucqzJPoKwoTnv_ablAK5UI3lRnAbOGGn6Mn-UuNv6K_sVlS7Bof1JUvfnrddrQPvaWhMWSTBOC2NI/s1600/Fig10.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1_Az-LFdYLju_AgL7Njfu_O5jD7iRE-OagBdyUPur26cclZrT9PVLhsSob0eQhyvucqzJPoKwoTnv_ablAK5UI3lRnAbOGGn6Mn-UuNv6K_sVlS7Bof1JUvfnrddrQPvaWhMWSTBOC2NI/s640/Fig10.jpeg" width="640" /></a></div>
This shows the response on the z axis and the two variables included in the interaction. Since this plot is a bit difficult to interpret we can also plot it as contours:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> vis.gam(logit.gam, view=c("rain.am", "rain.ja"), type="response", plot.type="contour")
</code></pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl9QWtn6tdk8JJLNkR9TW_TnNgw-641R5E1ocRHEewoBafIppO82qH2TdIf4HEv4EvmWdC-ABKDkLiVQC-p18MNIVdqJxC0csIGSBY2Yrhc14j3bRgehLZoMrn8opf0KDK5EIUNVYZxeVG/s1600/Fig11.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="880" data-original-width="900" height="624" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl9QWtn6tdk8JJLNkR9TW_TnNgw-641R5E1ocRHEewoBafIppO82qH2TdIf4HEv4EvmWdC-ABKDkLiVQC-p18MNIVdqJxC0csIGSBY2Yrhc14j3bRgehLZoMrn8opf0KDK5EIUNVYZxeVG/s640/Fig11.jpeg" width="640" /></a></div>
This allows us to see how the probability of blight changes depending only on the interaction between rain.am and rain.ja.<br />
<br />
<br />
<h3>
Generalized Additive Mixed Effects Models</h3>
</div>
<div>
In the package mgcv there is the function gamm, which allows fitting generalized additive mixed effects models, with a syntax taken from the package nlme. However, compared to what we saw in the post about <a href="http://r-video-tutorial.blogspot.co.uk/2017/07/linear-mixed-effects-models-in.html" target="_blank">Mixed-Effects Models</a> there are some changes we need to make.</div>
<div>
Let's focus again on the dataset lasrosas.corn, which has a column year that we can consider a possible source of additional random variation. The code below imports the dataset and then transforms the variable year from numeric to factor:<br />
<br /></div>
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat = lasrosas.corn
dat$year = as.factor(paste(dat$year))
</code></pre>
<br />
We will start by looking at a random intercept model. If this was not a GAM with mixed effects, but a simpler linear mixed effects model, the code to fit it would be the following:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> LME = lme(yield ~ nitro + nf + topo + bv, data=dat, random=~1|year)
</code></pre>
<br />
This is probably the same line of code we used in the previous post. In the package nlme this same model can be fitted using a list as input for the option random. Look at the code below:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> LME1 = lme(yield ~ nitro + nf + topo + bv, data=dat, random=list(year=~1))
</code></pre>
<br />
Here in the list we are creating a new element year, which takes the value ~1, indicating that its random effect applies only to the intercept.<br />
We can use the anova function to see that LME and LME1 are in fact the same model:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(LME, LME1)
Model df AIC BIC logLik
LME 1 13 27138.22 27218.05 -13556.11
LME1 2 13 27138.22 27218.05 -13556.11
</code></pre>
<br />
I showed you this alternative syntax with list because in gamm this is the only syntax we can use. So for fitting a GAM with random intercept for year we should use the following code:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> gam.ME = gamm(yield ~ nitro + nf + topo + s(bv), data=dat, random=list(year=~1))
</code></pre>
<br />
The object gam.ME is a list with two components, a mixed effects model and a GAM. To check their summaries we need two lines:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> summary(gam.ME[[1]])
summary(gam.ME[[2]])
</code></pre>
<br />
<br />
Now we can see the code to fit a random slope and intercept model. Again we need to use the syntax with a list:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> gam.ME2 = gamm(yield ~ nitro + nf + topo + s(bv), data=dat, random=list(year=~1, year=~nf))
</code></pre>
<br />
Here we are including two random effects, one for just the intercept (year=~1) and another for random slope and intercept for each level of nf (year=~nf).</div>
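As a hedged sketch (assuming the models gam.ME and gam.ME2 fitted above), the two random-effects structures can be compared through the lme components of the gamm output, which behave like ordinary nlme fits; the named elements $lme and $gam are equivalent to the [[1]] and [[2]] indexing used earlier:<br />
<br />

```r
# Compare the random intercept model with the random slope and
# intercept model through their lme components
anova(gam.ME$lme, gam.ME2$lme)

# The smooth terms of the GAM component can be inspected as usual
plot(gam.ME2$gam, residuals=TRUE, pch=20)
```

<br />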
Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com6tag:blogger.com,1999:blog-1442302563171663500.post-49218749484846662302017-07-10T10:07:00.000+02:002017-07-14T22:05:18.323+02:00Assessing the Accuracy of our models (R Squared, Adjusted R Squared, RMSE, MAE, AIC)<h2>
Assessing the accuracy of our model</h2>
<div>
<div class="MsoNormal" style="text-align: justify;">
There are several ways to check the accuracy of our models: some are printed directly by R within the summary output, while others are just as easy to calculate with specific functions. Please take a look at my previous post for more info on the code.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
R Squared</h3>
</div>
<div>
<div class="MsoNormal" style="text-align: justify;">
This is probably the most commonly used statistic: it allows us to understand the percentage of variance in the target variable explained by the model. It can be computed as the ratio of the regression sum of squares to the total sum of squares. This is one of the standard measures of accuracy that R prints out, through the function summary, for linear models and ANOVAs.<o:p></o:p></div>
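As an illustration, the R squared can be computed by hand from the sums of squares and checked against the value printed by summary. This is a hypothetical sketch: the model mod and the lasrosas.corn data are used here only as an example.<br />
<br />

```r
library(agridat)

dat <- lasrosas.corn
mod <- lm(yield ~ nitro + topo + bv, data=dat)  # hypothetical example fit

SS.res <- sum(residuals(mod)^2)                 # residual sum of squares
SS.tot <- sum((dat$yield - mean(dat$yield))^2)  # total sum of squares
R2 <- 1 - SS.res / SS.tot

# This should match the value printed by summary(mod)
R2
summary(mod)$r.squared
```

<br />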
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Adjusted R Squared</h3>
</div>
<div>
<div class="MsoNormal" style="text-align: justify;">
This is a form of R-squared that
is adjusted for the number of predictors in the model. It can be computed as
follows:<o:p></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBdu66DMOrkuTi15yJlT8n7pOfk-JGtXOcLfAu7jE6s1aWAMU9AnumxmYSaL0NgGtOjxn78vLAMPmsLUOxLhca7Uy7F1ler5DSgd4THVbdMW0Q6o3MVbTwEKN78OiQitdLb0Qcn-az5Fae/s1600/Eq10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="129" data-original-width="1600" height="50" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBdu66DMOrkuTi15yJlT8n7pOfk-JGtXOcLfAu7jE6s1aWAMU9AnumxmYSaL0NgGtOjxn78vLAMPmsLUOxLhca7Uy7F1ler5DSgd4THVbdMW0Q6o3MVbTwEKN78OiQitdLb0Qcn-az5Fae/s640/Eq10.png" width="640" /></a></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
Where R2 is the R squared of the model, n is the sample size and p is the number of terms (or predictors) in the model. This index is extremely useful to determine whether our model is overfitting the data. This happens particularly when the sample size is small: in such cases, if we fill the model with more predictors we may end up increasing the R squared simply because the model starts adapting to the noise (or random error) rather than properly describing the data. It is generally a good sign if the adjusted R squared is similar to the standard R squared.</div>
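The formula above can be translated directly into R and checked against the adjusted R squared that summary reports. Again a hedged sketch, with a hypothetical lm fit on the lasrosas.corn data used only for illustration:<br />
<br />

```r
library(agridat)

dat <- lasrosas.corn
mod <- lm(yield ~ nitro + topo + bv, data=dat)  # hypothetical example fit

R2 <- summary(mod)$r.squared
n  <- nrow(dat)               # sample size
p  <- length(coef(mod)) - 1   # number of terms, excluding the intercept

adj.R2 <- 1 - (1 - R2) * (n - 1) / (n - p - 1)

# Compare with the value computed by R
adj.R2
summary(mod)$adj.r.squared
```

<br />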
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Root Mean Squared Deviation or Root Mean Squared Error</h3>
</div>
<div>
<div class="MsoNormal" style="text-align: justify;">
The previous indexes measure the amount of variance in the target variable that can be explained by our model. This is a good indication, but in some cases we are more interested in quantifying the error in the same measuring unit as the variable. In such cases we need to compute indexes that average the residuals of the model. The problem is that residuals are both positive and negative, and their distribution should be fairly symmetrical (this is actually one of the assumptions of most linear models, so if this is not the case we should be worried). This means that their average will always be zero. So we need other indexes to quantify the average residual, for example by averaging the squared residuals:<o:p></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY_SzL_y4dQQksNN0QN6fvkhzWwbozle77AbgumF8Tom5PpV_neZZedlvjhHPMCY9S_nHiOM5pD8aJ7VdnmetYSYsnp7QtHcLeHDcmzybJUm9ww89Uhru5z6wfB1w8YMJszXZ7ZAmOg8WK/s1600/Eq11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="172" data-original-width="1600" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY_SzL_y4dQQksNN0QN6fvkhzWwbozle77AbgumF8Tom5PpV_neZZedlvjhHPMCY9S_nHiOM5pD8aJ7VdnmetYSYsnp7QtHcLeHDcmzybJUm9ww89Uhru5z6wfB1w8YMJszXZ7ZAmOg8WK/s640/Eq11.png" width="640" /></a></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
This is the square root of the mean of the squared residuals, with Yhat_t being the estimated value at point t, Y_t the observed value at t, and n the sample size. The RMSE has the same measuring unit as the variable y.<o:p></o:p></div>
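The RMSE can be computed in one line from the model residuals. A minimal sketch, again assuming a hypothetical lm fit on the lasrosas.corn data:<br />
<br />

```r
library(agridat)

dat <- lasrosas.corn
mod <- lm(yield ~ nitro + topo + bv, data=dat)  # hypothetical example fit

# Root mean squared error: square root of the mean of squared residuals
RMSE <- sqrt(mean(residuals(mod)^2))
RMSE
```

<br />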
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Mean Squared Deviation or Mean Squared Error</h3>
</div>
<div>
This is simply the numerator of the previous equation, but it is not used often. The issue with both the RMSE and the MSE is that, since they square the residuals, they tend to be more affected by extreme values. This means that even if our model explains the large majority of the variation in the data very well, a few large discrepancies between observed and predicted values will inflate the RMSE. Since these large residuals may be caused by potential outliers, this issue may lead to an overestimation of the error.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
Mean Absolute Deviation or Mean Absolute Error</h3>
<div>
<div class="MsoNormal" style="text-align: justify;">
To solve the problem with potential outliers, we can use the mean absolute error, where we average the absolute
value of the residuals:<o:p></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqC2HFG970yLjgtRAcaPB2QplMaIFLT_BAGR66Msai0ph4mPdRh9bzD8R7tPUQe8vNJ9f_vt638CHZBXSgA5y5I2M5Y9ILufzeEBzdrAfLbCFkLSazq3DPnwyOzmESkZDAqOAghlPEjehD/s1600/Eq12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="124" data-original-width="1600" height="48" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqC2HFG970yLjgtRAcaPB2QplMaIFLT_BAGR66Msai0ph4mPdRh9bzD8R7tPUQe8vNJ9f_vt638CHZBXSgA5y5I2M5Y9ILufzeEBzdrAfLbCFkLSazq3DPnwyOzmESkZDAqOAghlPEjehD/s640/Eq12.png" width="640" /></a></div>
<div class="MsoNormal" style="text-align: justify;">
This index is more
robust against large residuals. Since RMSE is still widely used, even though
its problems are well known, it is always better to calculate and present both in
a research paper.<o:p></o:p></div>
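The MAE is computed in the same way, simply replacing the square with the absolute value. A sketch under the same hypothetical fit:<br />
<br />

```r
library(agridat)

dat <- lasrosas.corn
mod <- lm(yield ~ nitro + topo + bv, data=dat)  # hypothetical example fit

# Mean absolute error: average of the absolute residuals
MAE <- mean(abs(residuals(mod)))
MAE
```

<br />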
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Akaike Information Criterion</h3>
</div>
<div>
<div class="MsoNormal" style="text-align: justify;">
This is another popular index, which we have used in previous posts to compare different models. It is very popular because it penalizes the fit for the number of predictors in the model, thus accounting for overfitting. It can be computed simply as follows:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinGFvs3YN9Ia19a0uQm6IVsDFvMboejInWhJTuwO9gCXZ5HijfyF_iEQoRgSuUP-T1CnD-B4pZNWsN8KjYOgw-Grwq-aI3D9s5ZyRL4U3sec47bgeIQ86krWNlcGLzjaX5Z3arZvXl6fh4/s1600/Eq13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="81" data-original-width="1600" height="32" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinGFvs3YN9Ia19a0uQm6IVsDFvMboejInWhJTuwO9gCXZ5HijfyF_iEQoRgSuUP-T1CnD-B4pZNWsN8KjYOgw-Grwq-aI3D9s5ZyRL4U3sec47bgeIQ86krWNlcGLzjaX5Z3arZvXl6fh4/s640/Eq13.png" width="640" /></a></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal">
Where again p is the number of terms in
the model.<o:p></o:p></div>
</div>
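In practice we rarely compute the AIC by hand, since R provides the function AIC for most model objects. Note that R's definition is based on the log-likelihood (AIC = -2*logLik + 2k, with k the number of estimated parameters), which for Gaussian models is closely related to the RMSE-based formula above. A sketch with a hypothetical lm fit:<br />
<br />

```r
library(agridat)

dat <- lasrosas.corn
mod <- lm(yield ~ nitro + topo + bv, data=dat)  # hypothetical example fit

# For lm, k counts all estimated parameters: the coefficients plus the
# residual variance.
k <- length(coef(mod)) + 1
-2 * as.numeric(logLik(mod)) + 2 * k

# This should match the built-in function
AIC(mod)
```

<br />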
<div class="MsoNormal">
<o:p></o:p></div>
Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com0tag:blogger.com,1999:blog-1442302563171663500.post-21080779189215772992017-07-10T10:02:00.001+02:002018-01-08T13:08:55.946+01:00Linear Mixed Effects Models in AgricultureThis post was originally part of my previous post about linear models. However, I later decided to split it into several texts because it was effectively too long and complex to navigate.<br />
If you struggle to follow the code in this page please refer to this post (for example for the necessary packages): <a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models (lm, ANOVA and ANCOVA) in Agriculture</a><br />
<br />
<br />
<h2>
</h2>
<h2>
</h2>
<h2>
Linear Mixed-Effects Models</h2>
<div>
<div class="MsoNormal">
This class of models is used to account for more than one source of random variation. For example, assume we have a dataset where we are trying to model yield as a function of nitrogen levels, but the data were collected in many different farms. In this case, each farm would need to be considered a cluster, and the model would need to take this clustering into account. Another common set of experiments where linear mixed-effects models are used is repeated measures, where time provides an additional source of correlation between measures. For these models we do not need to worry as much about the assumptions of the previous models, since mixed-effects models are quite robust against violations of them. For example, for unbalanced designs with blocking, these methods should probably be used instead of the standard ANOVA.<o:p></o:p></div>
<div class="MsoNormal">
<br />
<div class="MsoNormal">
At the beginning of this tutorial we explored the equation
that underpins the linear model:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT8dOO1QqGeYJMQF_X95bODIFv9MKKA3EKNGXIq7_YwD3ltsTP4D8CUIj7AZU5BJt_2vqN0EosngqrQHo0VD2m-waIOSPtzmgf0x8z3uEuR3zOUtMXo6YEz0o_VRxH3pRRyFmBG8rPzy93/s1600/Eq1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT8dOO1QqGeYJMQF_X95bODIFv9MKKA3EKNGXIq7_YwD3ltsTP4D8CUIj7AZU5BJt_2vqN0EosngqrQHo0VD2m-waIOSPtzmgf0x8z3uEuR3zOUtMXo6YEz0o_VRxH3pRRyFmBG8rPzy93/s640/Eq1.png" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This equation can be divided into two components: the
fixed and random effects. Fixed effects are the variables we are
using to explain the model. These may be factorial (in ANOVA), continuous, or a
mix of the two (ANCOVA), and they can also be the blocks used in our design.
The other component in the equation is the random effect, which captures a
level of uncertainty that is difficult to account for in the model. For example,
when we work with yield we might see differences between plants grown in
similar soils and conditions. These may be related to the seeds or to other
factors, and are part of the within-subject variation that we cannot explain.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
There are times however where in the data there are multiple
sources of random variation. For example, data may be clustered in separate
fields or separate farms. This will provide an additional source of random
variation that needs to be taken into account in the model. To do so the
standard equation can be amended in the following way:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8g1VMvMdAbaiZdPje5-XFxqTnhH0IINw4kZyjmhiKTRVcVbeaum64GAHMLNrpCCFteiA1hNKMnqANqlTnXtgWlWm-hd-pfxh29UmgZlOM_6V3DSGdRCGySiQQbLhNof3bwpP0xo5lNVK0/s1600/Eq6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="28" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8g1VMvMdAbaiZdPje5-XFxqTnhH0IINw4kZyjmhiKTRVcVbeaum64GAHMLNrpCCFteiA1hNKMnqANqlTnXtgWlWm-hd-pfxh29UmgZlOM_6V3DSGdRCGySiQQbLhNof3bwpP0xo5lNVK0/s640/Eq6.png" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This is referred to as a random intercept
model, where the random variation is split into a cluster-specific variation <i>u</i> and the standard error term. Effectively, this model assumes that each cluster only has an effect on the intercept of the linear model, while the slope stays the same. In other words, we assume that data collected at different farms will have the same correlation pattern but will be shifted; see the image below (source: <a href="http://zoonek2.free.fr/UNIX/48_R/14.html" target="_blank">Mixed Models</a>):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjzlnJxh6igfoQXjhr6JJarpJ-k2J-CeD8pgD1BLlqVAJWy5RG4klOR-kx18829B7bJ7FUv8jJL0IxtJ0Fd84LjVWVQe7F_OgYaT4N-osAXZofHYgx4lmPzhmcpm6VnYA3tF2N2mwlgsXR/s1600/g1096.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjzlnJxh6igfoQXjhr6JJarpJ-k2J-CeD8pgD1BLlqVAJWy5RG4klOR-kx18829B7bJ7FUv8jJL0IxtJ0Fd84LjVWVQe7F_OgYaT4N-osAXZofHYgx4lmPzhmcpm6VnYA3tF2N2mwlgsXR/s400/g1096.png" width="400" /></a></div>
<br />
<br />
A more
complex form, normally used for repeated measures, is the random slope
and intercept model:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLNUBznHdN2CBvXfj6wDoJy0KcPe1nu8M11EowAHDOq33ct7mHPZt4PdWP_5dQMv31Kr9BuSJenYxIlf0D-VC9kc0Nu6P5v4qY8Yqdtkw241SIbfePCTIyLE3mzBEqHCrTNbqQ4V-2Mg9k/s1600/Eq7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLNUBznHdN2CBvXfj6wDoJy0KcPe1nu8M11EowAHDOq33ct7mHPZt4PdWP_5dQMv31Kr9BuSJenYxIlf0D-VC9kc0Nu6P5v4qY8Yqdtkw241SIbfePCTIyLE3mzBEqHCrTNbqQ4V-2Mg9k/s640/Eq7.png" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Here we add a new source of random variation <i>v</i> related to time <i>T</i>. In this case we assume that the random variation affects not only the intercept of the linear model, but also its slope. The image below, from the same website, should again clarify things:<o:p></o:p><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0qBTKSNeu8vSybkfhww7nhTZvtXVmN0EZq9_ymyS1HF3_TbJ5NYBzj088MwLpblmGyr6pggPniGMGEWASPzYeIzfxHrHuf2uct7eUbp4fdivqNDHNF6C7_yUlAAkzPj5HS9vxJzfVVDTY/s1600/g1097.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0qBTKSNeu8vSybkfhww7nhTZvtXVmN0EZq9_ymyS1HF3_TbJ5NYBzj088MwLpblmGyr6pggPniGMGEWASPzYeIzfxHrHuf2uct7eUbp4fdivqNDHNF6C7_yUlAAkzPj5HS9vxJzfVVDTY/s400/g1097.png" width="400" /></a></div>
<br />
<br /></div>
<div class="MsoNormal">
<div class="separator" style="clear: both; text-align: center;">
</div>
As a general rule we can use plotting to determine whether, and which, random effects to use when modelling our data. In the examples above, a simple xy plot with colour would provide a lot of information. Alternatively, we could use ggplot2 and the function facet_wrap to divide our scatterplots by factor and see if there are changes only in the intercept or also in the slope.</div>
<div class="MsoNormal">
<br /></div>
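As a sketch of this exploratory step, the code below simulates some clustered data (all names and values here are hypothetical, not from the dataset used later) and plots one scatterplot per cluster with facet_wrap:

```r
# Hypothetical, simulated example: three farms share the same slope for
# nitrogen but differ in their intercepts (a random-intercept situation)
library(ggplot2)

set.seed(1)
df <- data.frame(farm = rep(c("A", "B", "C"), each = 20),
                 nitrogen = rep(seq(0, 100, length.out = 20), 3))
# each farm gets its own intercept, but the same slope of 0.3
df$yield <- 50 + c(A = 0, B = 10, C = -5)[df$farm] +
  0.3 * df$nitrogen + rnorm(60, sd = 3)

# roughly parallel lines across panels suggest a random intercept is
# enough; clearly different slopes would point to a random slope as well
ggplot(df, aes(x = nitrogen, y = yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ farm)
```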
<h3>
Random Intercept Model for Clustered Data</h3>
<div>
<div class="MsoNormal">
In the following examples we will use the function lme in the package nlme, so please install and/or load the package first. For this example we are using the same dataset lasrosas.corn from the package agridat that we used in the previous post: <a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models in Agriculture</a><br />
<br />
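As a reminder, the setup from the previous post can be reproduced as follows (a sketch; install the packages first if needed):

```r
# Packages used in this section; agridat provides the example dataset
library(nlme)
library(agridat)

dat <- lasrosas.corn
str(dat)  # includes yield, nf, bv, topo, rep and year
```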
To explain the syntax for fitting linear mixed-effects models
in R to clustered data, we will assume that the factorial variable rep in our
dataset describes some clusters. To fit a mixed-effects model we are
going to use the function <span class="CodeChar">lme</span> from the
package <span class="CodeChar">nlme</span>. This function can also work with
unbalanced designs:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
</div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)
</code></pre>
<br />
<div class="MsoNormal">
The syntax is very similar to all the models we fitted
before, with a general formula describing our target variable yield and all the
treatments, which are the fixed effects of the model. Then we have the option
random, which allows us to include an additional random component for the
clustering factor rep. In this case the ~1 indicates that the random effect
will be associated with the intercept.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
Once again we can use the function summary to explore our
results:</div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(lme1)
Linear mixed-effects model fit by REML
Data: dat
AIC BIC logLik
27648.36 27740.46 -13809.18
Random effects:
Formula: ~1 | rep
(Intercept) Residual
StdDev: 0.798407 13.3573
Fixed effects: yield ~ nf + bv * topo
Value Std.Error DF t-value p-value
(Intercept) 327.3304 14.782524 3428 22.143068 0
nfN1 3.9643 0.788049 3428 5.030561 0
nfN2 5.2340 0.790104 3428 6.624449 0
nfN3 5.4498 0.789084 3428 6.906496 0
nfN4 7.5286 0.789551 3428 9.535320 0
nfN5 7.7254 0.789111 3428 9.789976 0
bv -1.4685 0.085507 3428 -17.173569 0
topoHT -233.3675 17.143956 3428 -13.612232 0
topoLO -251.9750 20.967003 3428 -12.017693 0
topoW -146.4066 16.968453 3428 -8.628162 0
bv:topoHT 1.1945 0.097696 3428 12.226279 0
bv:topoLO 1.4961 0.123424 3428 12.121624 0
bv:topoW 0.7873 0.097865 3428 8.044485 0
</code></pre>
<br />
<div class="MsoNormal">
We can also use the function <span class="CodeChar">Anova</span>
to display the ANOVA table:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > Anova(lme2, type=c("III"))
Analysis of Deviance Table (Type III tests)
Response: yield
Chisq Df Pr(>Chisq)
(Intercept) 752.25 1 < 2.2e-16 ***
nf 155.57 5 < 2.2e-16 ***
bv 291.49 1 < 2.2e-16 ***
topo 236.52 3 < 2.2e-16 ***
year 797.13 1 < 2.2e-16 ***
bv:topo 210.38 3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal">
We might be interested in understanding if fitting a more
complex model provides any advantage in terms of accuracy, compared with a
model where no additional random effect is included. To do so we can compare
this new model with mod6, which we created with the <span class="CodeChar">gls</span>
function and includes the same treatment structure. We can do that with the function <span class="CodeChar">anova</span>, specifying both models:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(lme1, mod6)
Model df AIC BIC logLik Test L.Ratio p-value
lme1 1 15 27648.36 27740.46 -13809.18
mod6 2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275
</code></pre>
<br />
<div class="MsoNormal">
As you can see there is a decrease in AIC for the model
fitted with <span class="CodeChar">lme</span>, and the difference is
significant (p-value below 0.05). Therefore this new model where clustering is
accounted for is better than the one without an additional random effect, even
though only slightly. In this case we would need to decide if fitting a more
complex model (which is probably more difficult to explain to readers) is the
best way.<o:p></o:p><br />
<br />
Another way to assess the accuracy of GLS and mixed-effects models is through pseudo R-squared values: indexes that can be interpreted like the normal R-squared but are calculated differently, since in these more complex models we do not compute sums of squares.<br />
There are two useful functions for this, both included in the package MuMIn:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(MuMIn)
> r.squaredLR(mod6)
[1] 0.5469906
attr(,"adj.r.squared")
[1] 0.5470721
> r.squaredGLMM(lme1)
R2m R2c
0.5459845 0.5476009
</code></pre>
<br /></div>
<div class="MsoNormal">
The first function, r.squaredLR, can be used for GLS models and provides both an R-squared and an adjusted R-squared. The second function, r.squaredGLMM, is specific to mixed-effects models and provides two measures: R2m and R2c. The first reports the R-squared of the model with just the fixed effects, while the second reports the R-squared of the full model (fixed plus random effects).<br />
In this case we can see again that the R-squared values are similar between models and, most importantly, R2c is only slightly different from R2m, which means that including the random effect barely improves the accuracy.</div>
<div class="MsoNormal">
<br /></div>
<h3>
Random Intercept and Slope for repeated measures</h3>
<div>
<div class="MsoNormal">
If we collected data at several time steps we are looking at
a repeated measures analysis, which in most cases can be treated with a random slope and intercept model. Again, we cannot just assume that because we collected data over time we need a random slope and intercept; we always need to do some plotting first and take a closer look at our data. In cases like this, where we are dealing with a factorial variable, we may be forced to rely on barcharts divided by year, and it may be difficult to understand whether we need a model this complex. In such cases the only way may be to compare different models with anova (this function can be used with more than two models if needed).<br />
<br />
The code to create such a model is the following:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat)
summary(lme2)
Anova(lme2, type=c("III"))
</code></pre>
<br />
<div class="MsoNormal">
The syntax is very similar to what we wrote before, except
that now the random component includes both time and clusters. Again we can use
<span class="CodeChar">summary</span> to get more info about the model. We can also use again the function <span class="CodeChar">anova</span> to compare this with the previous model:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(lme1, lme2)
Model df AIC BIC logLik Test L.Ratio p-value
lme1 1 15 27648.36 27740.46 -13809.18
lme2 2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001
Warning message:
In anova.lme(lme1, lme2) :
fitted objects with different fixed effects. REML comparisons are not meaningful.
</code></pre>
<br />
<div class="MsoNormal">
From this output it is clear that the new model is better
than the one before, and the difference is highly significant. If this happens it is generally better to adopt the more complex model.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
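As the warning above points out, REML fits of models with different fixed effects are not directly comparable. A sketch of a more rigorous comparison, refitting both models with maximum likelihood:

```r
# REML likelihoods are only comparable between models with the same
# fixed effects, so refit both models with maximum likelihood (ML)
lme1.ml <- update(lme1, method = "ML")
lme2.ml <- update(lme2, method = "ML")

# likelihood-ratio comparison of the ML fits
anova(lme1.ml, lme2.ml)
```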
<div class="MsoNormal">
We can extract only the effects for the random components
using the function ranef:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > ranef(lme2)
(Intercept) year
R1 -0.3468601 -1.189799e-07
R2 -0.5681688 -1.973702e-07
R3 0.9150289 3.163501e-07
</code></pre>
<br />
<div class="MsoNormal">
This tells us the changes in yield for each cluster and time
step.<br />
<br />
We can also do the same for the fixed effects, and this will return the
coefficients of the model:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > fixef(lme2)
(Intercept) nfN1 nfN2 nfN3 nfN4
-1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00
nfN5 bv topoHT topoLO topoW
7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02
year bv:topoHT bv:topoLO bv:topoW
5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01
</code></pre>
<br />
<div class="MsoNormal">
To have an idea of their confidence interval we can use the
function intervals (package nlme):<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > intervals(lme2, which = "fixed")
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04
nfN1 2.526139e+00 3.918006e+00 5.309873e+00
nfN2 3.736625e+00 5.132136e+00 6.527648e+00
nfN3 3.974809e+00 5.368513e+00 6.762216e+00
nfN4 6.070018e+00 7.464542e+00 8.859065e+00
nfN5 6.245584e+00 7.639337e+00 9.033089e+00
bv -1.469793e+00 -1.318391e+00 -1.166989e+00
topoHT -2.353450e+02 -2.049979e+02 -1.746508e+02
topoLO -2.692026e+02 -2.321431e+02 -1.950836e+02
topoW -1.436741e+02 -1.136168e+02 -8.355954e+01
year 5.414742e+00 5.818826e+00 6.222911e+00
bv:topoHT 8.547273e-01 1.027686e+00 1.200644e+00
bv:topoLO 1.165563e+00 1.383705e+00 1.601846e+00
bv:topoW 4.264933e-01 5.998379e-01 7.731826e-01
attr(,"label")
[1] "Fixed effects:"
</code></pre>
<br />
<br />
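Once we are happy with the model we can also use it for prediction; a minimal sketch (the new-data values below are hypothetical, and factor levels must match those in the original dataset):

```r
# Hypothetical new observation; level = 0 gives the population-level
# prediction, level = 1 adds the rep-specific random effects
new.dat <- data.frame(nf = "N1", bv = 170, topo = "W",
                      year = 2001, rep = "R1")

predict(lme2, newdata = new.dat, level = 0:1)
```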
<h2>
Syntax with lme4</h2>
<div>
Another popular package for fitting mixed-effects models is lme4, with its function lmer.</div>
<div>
For example, to fit the model with random intercept (what we called lme1) we would use the following syntax in lme4:</div>
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)
>
> summary(lmer1)
Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ nf + bv * topo + (1 | rep)
Data: dat
REML criterion at convergence: 27618.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.4267 -0.7767 -0.1109 0.7196 3.6892
Random effects:
Groups Name Variance Std.Dev.
rep (Intercept) 0.6375 0.7984
Residual 178.4174 13.3573
Number of obs: 3443, groups: rep, 3
Fixed effects:
Estimate Std. Error t value
(Intercept) 327.33043 14.78252 22.143
nfN1 3.96433 0.78805 5.031
nfN2 5.23400 0.79010 6.624
nfN3 5.44980 0.78908 6.906
nfN4 7.52862 0.78955 9.535
nfN5 7.72537 0.78911 9.790
bv -1.46846 0.08551 -17.174
topoHT -233.36750 17.14396 -13.612
topoLO -251.97500 20.96700 -12.018
topoW -146.40655 16.96845 -8.628
bv:topoHT 1.19446 0.09770 12.226
bv:topoLO 1.49609 0.12342 12.122
bv:topoW 0.78727 0.09786 8.044
Correlation matrix not shown by default, as p = 13 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
</code></pre>
<br />
There are several differences between nlme and lme4 and I am not sure which is actually better. In my experience lme4 is probably the most popular, but nlme is used, for example, to fit generalized additive mixed-effects models in the package mgcv. For this reason the best thing is probably to learn how to use both packages.<br />
<br />
As you can see from the summary above, in this table there are no p-values, so it is a bit difficult to know which levels are significant for the model. We can solve this by installing and loading the package lmerTest. If we load lmerTest and run the same function again, we obtain the following summary table:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)
>
> summary(lmer1)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of freedom [
lmerMod]
Formula: yield ~ nf + bv * topo + (1 | rep)
Data: dat
REML criterion at convergence: 27618.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.4267 -0.7767 -0.1109 0.7196 3.6892
Random effects:
Groups Name Variance Std.Dev.
rep (Intercept) 0.6375 0.7984
Residual 178.4174 13.3573
Number of obs: 3443, groups: rep, 3
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 327.33043 14.78252 3411.00000 22.143 < 2e-16 ***
nfN1 3.96433 0.78805 3428.00000 5.031 5.14e-07 ***
nfN2 5.23400 0.79010 3428.00000 6.624 4.03e-11 ***
nfN3 5.44980 0.78908 3428.00000 6.906 5.90e-12 ***
nfN4 7.52862 0.78955 3428.00000 9.535 < 2e-16 ***
nfN5 7.72537 0.78911 3428.00000 9.790 < 2e-16 ***
bv -1.46846 0.08551 3428.00000 -17.174 < 2e-16 ***
topoHT -233.36750 17.14396 3429.00000 -13.612 < 2e-16 ***
topoLO -251.97500 20.96700 3430.00000 -12.018 < 2e-16 ***
topoW -146.40655 16.96845 3430.00000 -8.628 < 2e-16 ***
bv:topoHT 1.19446 0.09770 3429.00000 12.226 < 2e-16 ***
bv:topoLO 1.49609 0.12342 3430.00000 12.122 < 2e-16 ***
bv:topoW 0.78727 0.09786 3430.00000 8.044 1.33e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation matrix not shown by default, as p = 13 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
</code></pre>
<br />
As you can see, the p-values are now shown and we can assess the significance of each term.<br />
<br />
All the functions we used above with lme can also be used with the packages lme4 and lmerTest. For example, we can produce the ANOVA table with the function anova, or compute the R-squared with the function r.squaredGLMM.<br />
<br />
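For example, assuming lmer1 has been fitted as above, a quick sketch:

```r
# the same helper functions work on lmer fits
anova(lmer1)          # ANOVA table (with lmerTest loaded, includes p-values)
ranef(lmer1)          # cluster-specific random intercepts

library(MuMIn)
r.squaredGLMM(lmer1)  # marginal (R2m) and conditional (R2c) R-squared
```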
<br />
When we are dealing with random slope and intercept we would use the following syntax:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> lmer2 = lmer(yield ~ nf + bv * topo + year + (year|rep), data=dat)
</code></pre>
<br />Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com0tag:blogger.com,1999:blog-1442302563171663500.post-33766771028450501522017-07-10T10:00:00.003+02:002018-02-07T16:14:14.943+01:00Generalized Linear Models and Mixed-Effects in AgricultureAfter publishing my previous post, I realized that it was way too long and so I decided to split it in 2-3 parts. If you think something is missing in the explanation here it may be related to the fact that this was originally part of the previous post (http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html), so please look there first (otherwise please post your question in the comment section and I will try to answer).<br />
<br />
<br />
<h2>
Dealing with non-normal data – Generalized Linear Models</h2>
<div>
<div class="MsoNormal">
As you remember, when we first introduced the simple linear
model (<a href="http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html" target="_blank">Linear Models in Agriculture</a>) we defined a set of assumptions that need to be met to apply this model.
In the same post, we talked about methods to deal with deviations from the assumptions of
independence, equality of variances and balanced designs, and the fact that,
particularly if our dataset is large, we may reach robust results even if our
data are not perfectly normal. However, there are datasets for which the target
variable has a completely different distribution from the normal, which means that the error terms will not be normally distributed either.<br />
In these
cases we need to change our modelling method and employ generalized linear
models (GLM). Common scenarios where GLM should be considered are studies where the variable of interest is binary, for example presence or
absence of a species, or where we are interested in modelling counts, for
example the number of insects present in a particular location. In these cases,
where the target variable is not continuous but rather discrete or categorical,
the assumption of normality is usually not met. In this section we will focus
on the two scenarios mentioned above, but GLM can be used to deal with data
distributed in many different ways, and we will introduce how to deal with more
general cases.<o:p></o:p></div>
<div class="MsoNormal">
<br />
<br /></div>
<h3>
Count Data</h3>
</div>
<div>
<div class="MsoNormal">
Data of this type, i.e. counts or rates, are characterized
by the fact that their lower bound is always zero. This does not fit well with
a normal linear model, where the regression line may well estimate negative
values. For this type of variable we can employ a Poisson Regression, which fits
the following model:<o:p></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ6vOM_0K0D2RZx5-stdGlVi2qWd5LigENee82hhVrR-bKbCnli0n3x-SZSsaQUoT3_Rx8mbksPhdUsOMVdEpVuVevh9R2fxrpqvYx_acksEyV87Bn8rBtyyBF1hSaFVtYnS7WYFTer0B3/s1600/Eq8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ6vOM_0K0D2RZx5-stdGlVi2qWd5LigENee82hhVrR-bKbCnli0n3x-SZSsaQUoT3_Rx8mbksPhdUsOMVdEpVuVevh9R2fxrpqvYx_acksEyV87Bn8rBtyyBF1hSaFVtYnS7WYFTer0B3/s640/Eq8.png" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
As you can see the equation is very
similar to the standard linear model; the difference is that, to ensure that all
Y are positive (since we cannot have negative values for count data), we are
estimating the log of <i>Y</i>. This is referred to as the link function, meaning the transformation of y needed to ensure linearity of the response. For the models we are going to look at below, which are probably the most common, the link function is called implicitly, meaning that glm picks the right function for us and we do not have to specify it. However, we can do so if needed.<o:p></o:p><br />
<br />
From this equation you may think that, instead of using glm, we could log-transform y and run a normal lm. The problem is that lm assumes a normal error term with constant variance (as we saw with the fitted-versus-residuals plot), but for this model that assumption would be violated. That is why we need to use glm.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In R fitting this model is very easy. For
this example we are going to use another dataset available in the package <span class="CodeChar">agridat</span>, named <span class="CodeChar">beall.webworms</span>,
which represents counts of webworms in a beet field, with insecticide
treatments:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > dat = beall.webworms
> str(dat)
'data.frame': 1300 obs. of 7 variables:
$ row : int 1 2 3 4 5 6 7 8 9 10 ...
$ col : int 1 1 1 1 1 1 1 1 1 1 ...
$ y : int 1 0 1 3 6 0 2 2 1 3 ...
$ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ...
$ trt : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
</code></pre>
<br />
<div class="MsoNormal">
We can check the distribution of our data with the function <span class="CodeChar">hist</span>:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms")
</code></pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmBUuMYfiyI_ZONliSz1Kw8g9wbkSehDKyA2SWvyOVyfCCyl_JxLCrytosLWGHq_-6ZLqTYi7HAgzLHlU3yWJf3_iYJ13cdujjJ2RLxpxjemhqOBxY-A5H5xhAabFZMsb5SEuQ05-TjaS4/s1600/Fig9.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmBUuMYfiyI_ZONliSz1Kw8g9wbkSehDKyA2SWvyOVyfCCyl_JxLCrytosLWGHq_-6ZLqTYi7HAgzLHlU3yWJf3_iYJ13cdujjJ2RLxpxjemhqOBxY-A5H5xhAabFZMsb5SEuQ05-TjaS4/s400/Fig9.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="MsoNormal">
We are going to fit a simple model first to see how to
interpret its results, and then compare it with a more complex model:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pois.mod = glm(y ~ trt, data=dat, family=c("poisson"))
</code></pre>
<br />
<div class="MsoNormal">
As you can see the model features a new option called family, where you specify the distribution of the error term, in this case a Poisson distribution. We could also specify a log link function, as we saw before, but this is the default for the Poisson family so there is no need to include it.<br />
<br />
Once again the function summary will show some useful
details about this model:<o:p></o:p></div>
<br />
Once again the function summary will show some useful
details about this model:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(pois.mod)
Call:
glm(formula = y ~ trt, family = c("poisson"), data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6733 -1.0046 -0.9081 0.6141 4.2771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.33647 0.04688 7.177 7.12e-13 ***
trtT2 -1.02043 0.09108 -11.204 < 2e-16 ***
trtT3 -0.49628 0.07621 -6.512 7.41e-11 ***
trtT4 -1.22246 0.09829 -12.438 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1955.9 on 1299 degrees of freedom
Residual deviance: 1720.4 on 1296 degrees of freedom
AIC: 3125.5
Number of Fisher Scoring iterations: 6
</code></pre>
<br />
<div class="MsoNormal">
<br />
<h3>
Update 08/12/2017</h3>
<h4>
Note on interpretation</h4>
<div>
To interpret the coefficients of the model we need to remember that this GLM uses a log link function. The estimates are therefore on the log scale: for example, the -1.02 for T2 corresponds to exp(-1.02)=0.36 on the response scale.</div>
<div>
In terms of interpretation, we can say that the expected number of worms for T2 is 0.36 times that for T1 (each coefficient is always relative to the reference level). So there is a decrease, which is why the coefficient is negative. </div>
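As a quick check, we can back-transform all the treatment coefficients into rate ratios. The numbers below are copied from the summary output above, so the snippet is self-contained:

```r
# log-scale coefficients copied from summary(pois.mod) above
coefs <- c(trtT2 = -1.02043, trtT3 = -0.49628, trtT4 = -1.22246)

# exponentiating gives rate ratios relative to the reference level T1
round(exp(coefs), 2)  # T2 ~ 0.36, T3 ~ 0.61, T4 ~ 0.29 times the T1 count
```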
<div>
<br /></div>
<div>
More info here: <a href="https://stats.stackexchange.com/questions/234057/interpretation-of-slope-estimate-of-poisson-regression" target="_blank">https://stats.stackexchange.com/questions/234057/interpretation-of-slope-estimate-of-poisson-regression</a></div>
<div>
<br /></div>
<br />
<br />
The first valuable piece of information relates to the residuals
of the model, which should be symmetrical, as for any normal linear model. From
this output we can see that the minimum and maximum, as well as the first and third
quartiles, are roughly comparable, so this assumption seems satisfied. Then we can see that
the variable trt (i.e. the treatment factor) is highly significant for the model,
with very low p-values. The statistical test in this case is not a t-test, as in the output of the function lm, but a Wald test (<a href="http://www.blackwellpublishing.com/specialarticles/jcn_10_774.pdf" target="_blank">Wald Test</a>). This test computes the probability that the coefficient is 0: if the p-value is significant, the chances that the coefficient is actually zero are very low, so the variable should be included in the model since it has an effect on y.<br />
Another important piece of information is the deviance, particularly the residual deviance. As a general rule, this value should be lower than, or in line with, the residual degrees of freedom for the model to be good. In this case the residual deviance is high (even though not dramatically so), which may suggest that the explanatory power of the model is low. We will see below how to obtain a p-value for the goodness of fit of the model.<br />
<br />
We can use the function plot to obtain more info about how the model fits our data:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> par(mfrow=c(2,2))
plot(pois.mod)
</code></pre>
<br />
This creates the following plot, where the four outputs are included in the same image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG2PYIXYpNN9iH6dsJIeTHRZqXV9J6qtd4HYG-bMrjqVoTWoTBEPaFW5MI9Ck7S5g-wlYfStLt2UvvQJ-r90pVQIXAuO2yNziGg1if33rnDCEsqmF2RvPiFtLhB4GwuuERgxxMIKHql890/s1600/poisson.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="867" data-original-width="962" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG2PYIXYpNN9iH6dsJIeTHRZqXV9J6qtd4HYG-bMrjqVoTWoTBEPaFW5MI9Ck7S5g-wlYfStLt2UvvQJ-r90pVQIXAuO2yNziGg1if33rnDCEsqmF2RvPiFtLhB4GwuuERgxxMIKHql890/s400/poisson.jpeg" width="400" /></a></div>
<br />
These plots tell us a lot about the goodness of fit of the model. The first image, in the top-left corner, is the same we created for lm (i.e. residuals versus fitted values). Again this does not show any trend, just a general underestimation. Then we have the normal QQ plot, where we see that the residuals are not normal, which violates one of the assumptions of the model. Even though the error term follows a non-normal distribution, by specifying a link function and a different family we are effectively "linearizing" the model; therefore, we still expect approximately normal residuals.<br />
<br />
The effects of the treatments are all negative and relative to the first
level T1, meaning for example that a change from T1 to T2 will decrease the
log of the count by 1.02. We can check this effect by estimating values for T1 and T2
with the function <span class="CodeChar">predict</span>, and the option <span class="CodeChar">newdata</span>:<br />
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > predict(pois.mod, newdata=data.frame(trt=c("T1","T2")))
1 2
0.3364722 -0.6839588
</code></pre>
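Note that predict returns values on the link (log) scale by default; exponentiating them gives the expected counts, which is what predict with type="response" would return directly. Using the two numbers above:

```r
# predictions on the link (log) scale, copied from the predict() output above
eta <- c(T1 = 0.3364722, T2 = -0.6839588)

# back-transform to the response scale: expected worm counts per plot
round(exp(eta), 2)  # T1 = 1.40, T2 = 0.50
```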
<br />
<div class="MsoNormal">
Other important pieces of information are the null and residual deviances. The null deviance refers to an intercept-only model (a constant, with no variables), so the drop from null to residual deviance measures how much our variables improve on that baseline. The residual deviance itself can also be used as a goodness-of-fit test: if the model fits well, it should follow a chi-squared distribution with the residual degrees of freedom.<br />
We can compute the p-value of this goodness-of-fit test with
the following line:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > 1-pchisq(deviance(pois.mod), df.residual(pois.mod))
[1] 1.709743e-14
</code></pre>
<br />
<div class="MsoNormal">
This p-value is very low, meaning that we reject the hypothesis that the model fits the data well: in line with the high residual deviance noted above, there is evidence of some lack of fit. This model is therefore probably not the best possible one, and we can use the
AIC to compare it with other models. For example, we could include more
variables:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))
</code></pre>
<br />
<div class="MsoNormal">
How does this new model compare with the previous one?<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > AIC(pois.mod, pois.mod2)
df AIC
pois.mod 4 3125.478
pois.mod2 16 3027.438
</code></pre>
<br />
<div class="MsoNormal">
As you can see the second model has a lower AIC, meaning
that it fits the data better than the first.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
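We can also test pois.mod directly against the intercept-only (null) model with a likelihood-ratio test on the difference between the null and residual deviances. The values below are copied from the summary output shown earlier:

```r
# null and residual deviances and degrees of freedom from summary(pois.mod)
lr.stat <- 1955.9 - 1720.4   # drop in deviance achieved by trt
df.diff <- 1299 - 1296       # degrees of freedom used by trt

1 - pchisq(lr.stat, df.diff)  # essentially zero: trt improves on the null model
```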
<div class="MsoNormal">
One of the assumptions of the Poisson distribution is that
its mean and variance have the same value. We can check by simply comparing
mean and variance of our data:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > mean(dat$y)
[1] 0.7923077
> var(dat$y)
[1] 1.290164
</code></pre>
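A crude way to quantify the mismatch is the variance-to-mean ratio, which for a Poisson variable should be close to 1. Using the two values from the output above:

```r
# values copied from the output above
m <- 0.7923077  # mean(dat$y)
v <- 1.290164   # var(dat$y)

v / m  # about 1.63, clearly above 1, hinting at overdispersion
```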
<br />
<div class="MsoNormal">
In cases such as this, when the variance is larger than the
mean (we talk about <b>overdispersed count data</b>), we should employ
different methods, for example a quasipoisson distribution:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson"))
</code></pre>
<br />
<div class="MsoNormal">
The summary function provides us with the dispersion
parameter, which for a Poisson distribution should be 1:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(pois.mod3)
Call:
glm(formula = y ~ trt, family = c("quasipoisson"), data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6733 -1.0046 -0.9081 0.6141 4.2771
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.33647 0.05457 6.166 9.32e-10 ***
trtT2 -1.02043 0.10601 -9.626 < 2e-16 ***
trtT3 -0.49628 0.08870 -5.595 2.69e-08 ***
trtT4 -1.22246 0.11440 -10.686 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 1.35472)
Null deviance: 1955.9 on 1299 degrees of freedom
Residual deviance: 1720.4 on 1296 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
</code></pre>
<br />
<div class="MsoNormal">
Since the dispersion parameter is 1.35, we can conclude that
our data are not terribly overdispersed, so a Poisson regression may still
be appropriate for this dataset.<br />
<br />
<br />
<h4>
Update 28/07/2017 - Overdispersion Test</h4>
<div>
In the package AER there is a function to directly test for overdispersion. The procedure to do so is quite simple:</div>
<div>
<br /></div>
<div>
First of all we install the package:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("AER")
library(AER)
</code></pre>
<div>
<br />
Then we run the following line, which tests whether the dispersion parameter is greater than 1 (the Poisson regression assumes it is exactly 1):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > dispersiontest(pois.mod, alternative="greater")
Overdispersion test
data: pois.mod
z = 6.0532, p-value = 7.101e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.350551
</code></pre>
<br />
As you can see the alternative hypothesis is that the dispersion parameter is greater than 1. Since the p-value is very low we reject the null hypothesis in favour of this alternative, and should therefore consider other forms of modelling. Since the estimated dispersion is still close to 1, I think we can use the quasipoisson family. However, if it were much higher it would be better to use the negative binomial family with the function glm.nb in the package MASS, see below.<br />
<br /></div>
<br />
<br />
Another way of directly comparing the two models is with the analysis of deviance, which can be performed with the function anova:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(pois.mod, pois.mod2, test="Chisq")
Analysis of Deviance Table
Model 1: y ~ trt
Model 2: y ~ block + spray * lead
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1296 1720.4
2 1284 1598.4 12 122.04 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
This test compares the residual deviances of the two models to see whether they are different and calculates a p-value. In this case the p-value is highly significant, meaning that the models are different. Since we already compared the AICs, we can conclude that pois.mod2 is significantly (low p-value) better (lower AIC) than pois.mod.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
However, there are cases where the data are strongly overdispersed, i.e. the variance is much larger than the mean. In those cases we need to employ the negative binomial regression, with the function glm.nb available in the package MASS:</div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(MASS)
NB.mod1 = glm.nb(y ~ trt, data=dat)
</code></pre>
<br />
<br />
<b>NOTE:</b><br />
For GLMs it is also possible to compute pseudo R-squared values to ease the interpretation of their accuracy. This can be done with the function pR2 from the package pscl. Please see below (Logistic Regression section) for an example of the use of this function.<br />
<br />
<br />
<h3>
Logistic Regression</h3>
<div>
<div class="MsoNormal">
Another popular form of regression that can be tackled with
GLM is the logistic regression, where the variable of interest is binary (0 or 1, presence or absence and any other binary outcome). In this case the
regression model takes the following equation:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikf4CtztBbDtBTMLUm5FvY8XEtLhneaMQu8qiGGQfH8be_z8EdpJ73wFcTluk7fYCtU-FOLkM-HkQ19y8NeY46-YtvLSO7dgh36nD0PAsz4PE5yAFXsHm386Ks5oRYRXIjzgDGgPyE_ofs/s1600/Eq9.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="131" data-original-width="1600" height="51" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikf4CtztBbDtBTMLUm5FvY8XEtLhneaMQu8qiGGQfH8be_z8EdpJ73wFcTluk7fYCtU-FOLkM-HkQ19y8NeY46-YtvLSO7dgh36nD0PAsz4PE5yAFXsHm386Ks5oRYRXIjzgDGgPyE_ofs/s640/Eq9.png" width="640" /></a></div>
<div class="MsoNormal">
Again, the equation is identical to the
standard linear model, but what we are computing from this model is the log of
the odds that one of the two outcomes will occur, also referred to as the logit function.<br />
<br />
<h4>
Update 07/02/2018</h4>
<div>
By reviewing some literature I realized that the term logistic regression can be confusing. Sometimes it is used interchangeably to indicate both models with a binary outcome and models that involve proportions. However, this is probably not totally correct, because binary outcomes follow a Bernoulli distribution, and in such cases we should probably talk about Bernoulli regression. In R there is no distinction between the two, and both models can be fitted with the option family="binomial", but in other software, e.g. Genstat, there is. In Genstat, to run a regression with a binary outcome you would select a Bernoulli distribution, while for proportions you would select a Binomial distribution. </div>
<div>
<br /></div>
<div>
Regarding the link function, even though the logit is probably the most commonly used, some authors also employ the probit transformation. According to Geyer (2003, http://www.stat.umn.edu/geyer/5931/mle/glm.pdf) both are legitimate transformations and they should not differ much in terms of fit. The regression coefficients will be different when using the probit compared to the logit; however, if we use the function predict (as suggested here) and do not rely directly on the coefficients, we should come up with very similar results.</div>
<div>
<br /></div>
</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For this example we are going to use another
dataset available in the package <span class="CodeChar">agridat</span>
called <span class="CodeChar">johnson.blight</span>,
where the binary variable of interest is the presence or absence of blight
(either 0 or 1) in potatoes:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > dat = johnson.blight
> str(dat)
'data.frame': 25 obs. of 6 variables:
$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
$ area : int 0 0 0 0 50 810 120 40 0 0 ...
$ blight : int 0 0 0 0 1 1 1 1 0 0 ...
$ rain.am : int 8 9 9 6 16 10 12 10 11 8 ...
$ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ...
$ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...
</code></pre>
<br />
<div class="MsoNormal">
In R fitting this model is very easy. In this case we are trying to see if the presence of blight is related to the number of rainy days in April and May (column rain.am):<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod9 = glm(blight ~ rain.am, data=dat, family=binomial)
</code></pre>
<br />
<div class="MsoNormal">
We are now using the binomial distribution for a logistic
regression. To check the model we can rely again on summary:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod9)
Call:
glm(formula = blight ~ rain.am, family = binomial, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9395 -0.6605 -0.3517 1.0228 1.6048
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.9854 2.0720 -2.406 0.0161 *
rain.am 0.4467 0.1860 2.402 0.0163 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.617 on 24 degrees of freedom
Residual deviance: 24.782 on 23 degrees of freedom
AIC: 28.782
Number of Fisher Scoring iterations: 5
</code></pre>
<br />
<div class="MsoNormal">
This table is very similar to the one created for count
data, so much of the discussion above applies here too. The main difference is
in the way we interpret the coefficients, because here we are modelling
the logit (log-odds) of the probability, so 0.4467
(the coefficient for rain.am) is not the actual increase in probability associated with an
increase in rain. However, what we can say just by looking at the coefficients
is that rain has a positive effect on blight, meaning that more rain increases
the chances of finding blight in potatoes. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To estimate probabilities we need to use the function
predict (we could do it manually: <a href="https://www.youtube.com/watch?v=eX2sY2La4Ew&t" target="_blank">Logistic Regression</a> but this is easier):<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > predict(mod9, type="response")
1 2 3 4 5 6 7
0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722
8 9 10 11 12 13 14
0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141
15 16 17 18 19 20 21
0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722
22 23 24 25
0.48214935 0.59273722 0.98109229 0.89680283
</code></pre>
<br />
<div class="MsoNormal">
This calculates the probability associated with the values
of rain in the dataset. To know the probability associated with new values of
rain we can again use predict with the option newdata:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> >predict(mod9,newdata=data.frame(rain.am=c(15)),type="response")
1
0.8475443
</code></pre>
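As a check, we can reproduce this predicted probability by hand, applying the inverse logit to the linear predictor built from the coefficients in the summary above:

```r
# coefficients copied from summary(mod9)
b0 <- -4.9854  # intercept
b1 <- 0.4467   # slope for rain.am

# plogis() is the inverse logit, 1 / (1 + exp(-x))
plogis(b0 + b1 * 15)  # about 0.847, matching predict()
```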
<br />
<div class="MsoNormal">
This tells us that when rain is equal to 15 days between April and May, we have an 84%
chance of finding blight (i.e. of finding a 1) in potatoes.<o:p></o:p><br />
<br />
We could use the same method to compute probabilities for a series of values of rain, to see at what threshold the probability of blight rises above 50%:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> prob.NEW = predict(mod9,newdata=data.frame(rain.am=1:30),type="response")
plot(1:30, prob.NEW, xlab="Rain", ylab="Probability of Blight")
abline(h=0.5)
</code></pre>
<br />
As you can see we are using once again the function predict, but in this case we are estimating the probabilities for increasing values of rain. Then we are plotting the results:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOSRBV5LhzunsfIMOo1uIgAf33Hbn8HfuXRtULvL_Xo5uGs4jo2qrwdgs6cHKXpLTP30EQ63L5OPt0rEL78qCFzwxSbGmeUazK4zOdUWuULBo_yMN5vj4yzwpsNHPwb39ol5oDuaKlVQG1/s1600/Probability.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOSRBV5LhzunsfIMOo1uIgAf33Hbn8HfuXRtULvL_Xo5uGs4jo2qrwdgs6cHKXpLTP30EQ63L5OPt0rEL78qCFzwxSbGmeUazK4zOdUWuULBo_yMN5vj4yzwpsNHPwb39ol5oDuaKlVQG1/s400/Probability.jpeg" width="400" /></a></div>
From this plot it is clear that we reach a 50% probability at around 12 rainy days between April and May.<br />
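The threshold can also be computed analytically: the probability is exactly 0.5 when the linear predictor is zero, i.e. when rain.am equals -intercept/slope (coefficients copied from the summary above):

```r
b0 <- -4.9854; b1 <- 0.4467  # coefficients from summary(mod9)

-b0 / b1  # about 11.2 rainy days, close to the value read off the plot
```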
<br />
<h4>
Update 26/07/2017</h4>
Another, simpler way to create the plot above is with the function plotPredy from the package rcompanion:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("rcompanion")
library(rcompanion)
plotPredy(data = dat,
y = blight,
x = rain.am,
model = mod9,
type = "response",
xlab = "Rain",
ylab = "Blight")
</code></pre>
<br />
which creates the following plot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyDz45LNHiyD2MD4kuTcT6At0RCZJTDLR6LuQXEdxjAAS4d6VHrtjHXpV-sgVX-QE_rXc4Dc2twU4E4d7xsqilpuvNHVtBLpcZVrFGhhgQU8gzI5UHVCvwbVjpwcLyuXeKt0aUwwBdVwXy/s1600/Probability2.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyDz45LNHiyD2MD4kuTcT6At0RCZJTDLR6LuQXEdxjAAS4d6VHrtjHXpV-sgVX-QE_rXc4Dc2twU4E4d7xsqilpuvNHVtBLpcZVrFGhhgQU8gzI5UHVCvwbVjpwcLyuXeKt0aUwwBdVwXy/s400/Probability2.jpeg" width="400" /></a></div>
<br />
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To assess the accuracy of the model we can use two
approaches. The first is based on the deviances listed in the summary. The
residual deviance compares this model with the saturated model, i.e. the one that fits the data
perfectly. So we can calculate the following p-value (using the residual deviance and degrees of freedom
from the summary table):<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > 1-pchisq(24.782, 23)
[1] 0.3616226
</code></pre>
<br />
<div class="MsoNormal">
Since this is higher than 0.05, we cannot reject the hypothesis
that this model fits the data as well as the saturated model; therefore our model
seems to be good.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
We can repeat the same procedure for the null deviance,
which tests whether the intercept-only model already fits the data:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > 1-pchisq(34.617, 24)
[1] 0.07428544
</code></pre>
<br />
<div class="MsoNormal">
Since this is also not significant, we cannot reject the hypothesis that even the model with no predictors fits the data, which suggests (contrary to
what we obtained before) that our model may not add very much.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
An additional, and probably easier to understand, way to
assess the accuracy of a logistic model is to calculate a pseudo R-squared, which can
be done by installing the package <span class="CodeChar">pscl</span>:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("pscl")
library(pscl)
</code></pre>
<br />
<div class="MsoNormal">
Now we can run the following function:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > pR2(mod9)
llh llhNull G2 McFadden r2ML r2CU
-12.3910108 -17.3086742 9.8353268 0.2841155 0.3252500 0.4338984
</code></pre>
<br />
<div class="MsoNormal">
From this we can see that our model explains around 30-40% of
the variation in blight, which is not particularly good. We can use this index
to compare models, as we did with the AIC.<o:p></o:p><br />
Each of these R-squared values is computed in a different way, and you can read the documentation to learn more. In general, the most commonly reported are McFadden's (which however tends to be conservative) and the r2ML. This paper gives a complete overview and comparison of various pseudo R-squared measures: <a href="http://www.glmj.org/archives/articles/Smith_v39n2.pdf" target="_blank">http://www.glmj.org/archives/articles/Smith_v39n2.pdf</a><br />
<br />
<br />
<h4>
Update 26/07/2017</h4>
In the package rcompanion there is the function nagelkerke, which computes other pseudo R squared, like McFadden, Cox and Snell and Nagelkerke:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > nagelkerke(mod9)
$Models
Model: "glm, blight ~ rain.am, binomial, dat"
Null: "glm, blight ~ 1, binomial, dat"
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.284116
Cox and Snell (ML) 0.325250
Nagelkerke (Cragg and Uhler) 0.433898
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
-1 -4.9177 9.8353 0.0017119
</code></pre>
<br />
<br />
<h4>
Update 26/07/2016 - Proportions</h4>
<div>
Proportions are bounded between 0 and 1, and so they can be handled with the binomial family. In fact, one way of modelling proportions is to use the same glm code we saw for the logistic regression.</div>
<div>
Let's look at one example:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > data = crowder.seeds
> str(data)
'data.frame': 21 obs. of 5 variables:
$ plate : Factor w/ 21 levels "P1","P10","P11",..: 1 12 15 16 17 18 19 20 21 2 ...
$ gen : Factor w/ 2 levels "O73","O75": 2 2 2 2 2 1 1 1 1 1 ...
$ extract: Factor w/ 2 levels "bean","cucumber": 1 1 1 1 1 1 1 1 1 1 ...
$ germ : int 10 23 23 26 17 8 10 8 23 0 ...
$ n : int 39 62 81 51 39 16 30 28 45 4 ...
</code></pre>
<div>
<br /></div>
We can load the dataset crowder.seeds from agridat. Here the variable germ is the number of seeds that germinated, while n is the total number of seeds. Thus we can obtain the proportion of seeds that germinated and try to model it.<br />
The syntax to do so is the following:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > mod1 = glm(cbind(germ, n) ~ gen + extract, data=data, family="binomial")
> summary(mod1)
Call:
glm(formula = cbind(germ, n) ~ gen + extract, family = "binomial",
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5431 -0.5006 -0.1852 0.3968 1.4796
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0594 0.1326 -7.989 1.37e-15 ***
genO75 0.1128 0.1311 0.860 0.39
extractcucumber 0.5232 0.1233 4.242 2.22e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33.870 on 20 degrees of freedom
Residual deviance: 14.678 on 18 degrees of freedom
AIC: 104.65
Number of Fisher Scoring iterations: 4
</code></pre>
<br /></div>
<div class="MsoNormal">
Notice the use of cbind within the formula to model the proportions. Strictly, glm with family binomial expects a two-column matrix of successes and failures, so the response should be cbind(germ, n - germ), with the non-germinated seeds in the second column.<br />
<br />
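Note that glm with family binomial expects the two-column response to hold successes and failures, i.e. cbind(germ, n - germ). Here is a minimal sketch of that form using a small invented toy dataset (not crowder.seeds):

```r
# toy data, invented for illustration only
toy <- data.frame(germ = c(10, 23, 8, 10),
                  n    = c(39, 62, 16, 30),
                  gen  = c("O75", "O75", "O73", "O73"))

# first column: germinated (successes); second: non-germinated (failures)
m <- glm(cbind(germ, n - germ) ~ gen, data = toy, family = binomial)
coef(m)  # intercept and genO75 effect, on the logit scale
```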
Another technique you could use when dealing with proportions is the beta-regression. This can be fitted with the function betareg, in the package betareg. You can find more info at the following links:<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.8394&rep=rep1&type=pdf" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.8394&rep=rep1&type=pdf</a><br />
<br />
A good tutorial on beta regression written by Salvatore S. Mangiafico can be found here:<br />
<a href="http://rcompanion.org/handbook/J_02.html" target="_blank">http://rcompanion.org/handbook/J_02.html</a><br />
<br />
The sample code to perform a beta regression on these data is the following:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> #Beta-Regression
 library(betareg)  #load the package providing betareg()
 y.transf.betareg <- function(y){
n.obs <- sum(!is.na(y))
(y * (n.obs - 1) + 0.5) / n.obs
}
y.transf.betareg(data$germ/data$n)
mod2 = betareg(y.transf.betareg(data$germ/data$n) ~ gen + extract, data=data)
summary(mod2)
</code></pre>
<br />
I had to transform the proportions first, using a correction suggested here: <a href="https://stackoverflow.com/questions/26385617/proportion-modeling-betareg-errors" target="_blank">https://stackoverflow.com/questions/26385617/proportion-modeling-betareg-errors</a><br />
This is because one sample had the value 0, and betareg does not work with values of exactly 0 or 1.<br />
<br /></div>
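The squeeze above is easy to verify on toy proportions; this sketch repeats the function to show that exact 0s and 1s are pulled strictly inside (0, 1):

```r
# Transformation used above: squeeze proportions into the open interval (0, 1)
y.transf.betareg <- function(y){
  n.obs <- sum(!is.na(y))
  (y * (n.obs - 1) + 0.5) / n.obs
}

y <- c(0, 0.25, 0.5, 1)   # toy proportions, including the problematic 0 and 1
yt <- y.transf.betareg(y)
range(yt)                 # 0.125 to 0.875: betareg can now fit these values
```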
<h3>
Dealing with other distributions and transformation</h3>
<div>
<div class="MsoNormal">
As mentioned, GLM can be used to fit linear models not only in the two scenarios described above, but whenever the data do not comply with the normality assumption. For example, we can look at another dataset available in agridat, where the variable of interest is slightly non-normal:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > dat = hughes.grapes
> str(dat)
'data.frame': 270 obs. of 6 variables:
$ block : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ...
$ trt : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ...
$ vine : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ...
$ shoot : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ diseased: int 1 2 0 0 3 0 7 0 1 0 ...
$ total : int 14 12 12 13 8 9 8 10 14 10 ...
</code></pre>
<br />
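The skewness value quoted below comes from the function skewness in the package moments, loaded earlier; as a base-R sketch (shown on toy numbers), the same moment-based statistic can be computed by hand:

```r
# Moment-based sample skewness, equivalent to moments::skewness
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew(c(1, 2, 3))       # symmetric data: skewness 0
skew(c(1, 1, 1, 10))   # long right tail: positive skewness
```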
<div class="MsoNormal">
The variable total has a skewness of 0.73, suggesting that a transformation could probably bring it close to a normal distribution. However, for the sake of the discussion we will assume it cannot be transformed. Our problem now is to identify the best distribution for our data; to do so we can use the function descdist in the package fitdistrplus, which we already loaded:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> descdist(dat$total, discrete = FALSE)
</code></pre>
<br />
<div class="MsoNormal">
this returns the following plot:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUJ-ONnrbhXUPj7TjRWbIkMjfQmCC9PRru_dAhEsUCdQCLVxZ7vxFFtSEWq4CuT7QxqxEphB8OGi8eOkPoDsOMVsIrKE4s7UWT6JVJcSfWlMdeUTk9NCFW8ZppNXkhnIuxb0R1zD7RHMEU/s1600/Fig10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUJ-ONnrbhXUPj7TjRWbIkMjfQmCC9PRru_dAhEsUCdQCLVxZ7vxFFtSEWq4CuT7QxqxEphB8OGi8eOkPoDsOMVsIrKE4s7UWT6JVJcSfWlMdeUTk9NCFW8ZppNXkhnIuxb0R1zD7RHMEU/s400/Fig10.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="MsoNormal">
Here we can see that our data (blue dot) are close to the normal distribution, and perhaps even closer to a gamma distribution. We can check this further using another function from the same package:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> plot(fitdist(dat$total,distr="gamma"))
</code></pre>
<br />
<div class="MsoNormal">
which creates the following plot:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU2Dj71-ezVyaFW_Bxgox4QczU_cTUfE5xYx7HU7wki7O1VPkTFwsd2RmXTfwTD9vjNFAvQwNt9GQqp26sUbWv5zsndHEDyHgfBg__91cEtUIDyRrVDKZcuG-N9I51nNCSQB4fkExMpc48/s1600/Fig11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU2Dj71-ezVyaFW_Bxgox4QczU_cTUfE5xYx7HU7wki7O1VPkTFwsd2RmXTfwTD9vjNFAvQwNt9GQqp26sUbWv5zsndHEDyHgfBg__91cEtUIDyRrVDKZcuG-N9I51nNCSQB4fkExMpc48/s400/Fig11.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="MsoNormal">
From this we can see that in fact our data seem to be close
to a gamma distribution, so now we can proceed with modelling:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity))
</code></pre>
<br />
<div class="MsoNormal">
in the option family we included the name of the distribution, plus a link function, which relates the mean of the response to the linear predictor (in this case identity leaves the data on their original scale).<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This is how we model other types of data that do not fit a normal distribution. Other families supported by glm are:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
binomial, gaussian, Gamma, inverse.gaussian, poisson, quasi,
quasibinomial, quasipoisson<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Other possible link functions (whose availability depends on the family) are:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
logit, probit, cauchit, cloglog, identity, log, sqrt,
1/mu^2, inverse.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
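To illustrate how family and link combine, here is a sketch on simulated data (the values are invented for illustration): with Gamma(link = identity) the coefficients stay on the original scale of the response, while link = log puts them on a multiplicative scale.

```r
set.seed(42)
x <- runif(200, 1, 10)
mu <- 2 + 3 * x                             # mean is linear on the identity scale
y <- rgamma(200, shape = 5, rate = 5 / mu)  # gamma noise with mean mu

mod_id  <- glm(y ~ x, family = Gamma(link = identity))  # no transformation
mod_log <- glm(y ~ x, family = Gamma(link = log))       # multiplicative scale
coef(mod_id)   # slope interpretable directly: close to the true value 3
```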
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<h3>
Generalized Linear Mixed Effects models</h3>
<div>
<div class="MsoNormal">
As with linear models, linear mixed effects models need to comply with normality. If our data deviate too much we need to apply the generalized form, which is available in the package lme4:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("lme4")
library(lme4)
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
For this example we will use
again the dataset <span class="CodeChar">johnson.blight</span>:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat = johnson.blight
</code></pre>
<br />
<div class="MsoNormal">
Now we can fit a GLMM with a random effect for area, and compare it with a model with only fixed effects:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod10 = glm(blight ~ precip.m, data=dat, family="binomial")
mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial")
> AIC(mod10, mod11)
df AIC
mod10 2 37.698821
mod11 3 9.287692
</code></pre>
<br />
<div class="MsoNormal">
As you can see this new model reduces the AIC substantially.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The same function can be used for Poisson regression, but it does not handle quasipoisson overdispersed data. However, lme4 provides the function glmer.nb for negative binomial mixed effects models. Its syntax is the same as glmer, except that glmer.nb does not take a family argument.</div>
<div class="MsoNormal">
<br /></div>
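One way to check whether overdispersion is an issue before reaching for glmer.nb is to fit a fixed-effects quasipoisson model and inspect its dispersion parameter; a base-R sketch on simulated counts (no real dataset involved):

```r
set.seed(5)
x <- runif(300)
mu <- exp(1 + 2 * x)
y <- rnbinom(300, mu = mu, size = 2)   # counts with variance well above the mean

qp <- glm(y ~ x, family = quasipoisson)
summary(qp)$dispersion   # far above 1: overdispersed, a negative binomial model is warranted
```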
<div class="MsoNormal">
<br /></div>
Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com0tag:blogger.com,1999:blog-1442302563171663500.post-23051489023577192642017-06-28T14:54:00.004+02:002018-02-05T15:45:47.033+01:00Linear Models (lm, ANOVA and ANCOVA) in Agriculture<div class="separator" style="clear: both; text-align: center;">
</div>
As part of my new role as Lecturer in Agri-data analysis at Harper Adams University, I found myself applying a lot of techniques based on linear modelling. I also noticed that there is a lot of confusion among researchers about which technique should be used in each instance and how to interpret the model. For this reason I started reading material from books and online to create a reference tutorial that researchers can use. This post is the result of my work so far and I will keep updating it with new information.<br />
<br />
Please feel free to comment, provide feedback and constructive criticism!!<br />
<br />
<br />
<br />
<h2 name="theoreti">
Theoretical Background - Linear Model and ANOVA</h2>
<h3>
Linear Model</h3>
<div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB">The classic
linear model forms the basis for ANOVA (with categorical treatments) and ANCOVA
(which deals with continuous explanatory variables). Its basic equation is the
following:<o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"><br /></span></div>
<div class="MsoNormal" style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4GtwB_AZEsdHCpEuNEjRsIW4XMLgByV5udoKDi33NWaNwbyWKfgerqBMigKNOZHZZCQl7-WdV_RZ1FcRZLLYM7PbRb8kUhDBphqEUPomkfyGycioPg9bby0_ij9SLqUoqLIeQkj7vXFvw/s1600/Eq1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="31" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4GtwB_AZEsdHCpEuNEjRsIW4XMLgByV5udoKDi33NWaNwbyWKfgerqBMigKNOZHZZCQl7-WdV_RZ1FcRZLLYM7PbRb8kUhDBphqEUPomkfyGycioPg9bby0_ij9SLqUoqLIeQkj7vXFvw/s640/Eq1.png" width="640" /></a></div>
where β_0 is the intercept (i.e. the value of y when x is zero), and β_1 is the slope for the variable x, which indicates the change in y as a function of changes in x. For example, if the slope is +0.5, we can say that for each unit increase in x, y increases by 0.5. Please note that the slope can also be negative. The last element of the equation is the random error term, which we assume to be normally distributed with mean zero and constant variance. </div>
<div class="MsoNormal" style="text-align: justify;">
<span lang="EN-GB"></span></div>
<div class="MsoNormal" style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
This equation can be expanded to accommodate more than one explanatory variable x:</div>
<div class="MsoNormal" style="text-align: justify;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAkS8scY9Ew5rkFAIsC1a0w5CWi7OfRX4qUEQWmAjjMjRU6Lj4krEozAaXmDcWeihxO8x4wnqbO1gGALROdwEjkemxeQlFVkazrnaGxsRtxjcnjK-dF9zeijLUQWF1yZAvVbaGq0P7S8gE/s1600/Eq2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAkS8scY9Ew5rkFAIsC1a0w5CWi7OfRX4qUEQWmAjjMjRU6Lj4krEozAaXmDcWeihxO8x4wnqbO1gGALROdwEjkemxeQlFVkazrnaGxsRtxjcnjK-dF9zeijLUQWF1yZAvVbaGq0P7S8gE/s640/Eq2.png" width="640" /></a></div>
In this case the interpretation is a bit more complex because for example the coefficient β_2 provides the slope for the explanatory variable x_2. This means that for a unit variation of x_2 the target variable y changes by the value of β_2, if the other explanatory variables are kept constant. </div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
In case our model includes interactions, the linear equation would be changed as follows:</div>
<div class="MsoNormal" style="text-align: justify;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjboBJW-FgDzOlSADGfSEGUChiQj8oJbHV3ei2sm2h28fndLyqHqThQisO2z5Jwt-0itshKwky3nmjGXl6mdC9VHg8JpJs6Gdi7HL6TEeabUhbhjFUCdjf0V-asSeY2Dp1vOGYiUp0owCwZ/s1600/Eq3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjboBJW-FgDzOlSADGfSEGUChiQj8oJbHV3ei2sm2h28fndLyqHqThQisO2z5Jwt-0itshKwky3nmjGXl6mdC9VHg8JpJs6Gdi7HL6TEeabUhbhjFUCdjf0V-asSeY2Dp1vOGYiUp0owCwZ/s640/Eq3.png" width="640" /></a></div>
notice the interaction term between x_1 and x_2. In this case the interpretation becomes extremely difficult just by looking at the model. </div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
In fact, if we rewrite the equation focusing for example on x_1:</div>
<div class="MsoNormal" style="text-align: justify;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidn4bcYOfRDAgsgDdNlOtu3c4Ng7RehjS4lh3A18QDW5T_uwTthvrrGL7MCHLL_ITD6xzxE1QcZGPO-vYfLlN1Hee05aQZ0HIWcMzcUYPTbO1sqtYjGbxeEky2aU5d9UIZx-Cs11vgq6RC/s1600/Eq4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="1600" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidn4bcYOfRDAgsgDdNlOtu3c4Ng7RehjS4lh3A18QDW5T_uwTthvrrGL7MCHLL_ITD6xzxE1QcZGPO-vYfLlN1Hee05aQZ0HIWcMzcUYPTbO1sqtYjGbxeEky2aU5d9UIZx-Cs11vgq6RC/s640/Eq4.png" width="640" /></a></div>
we can see that its slope becomes affected by the value of x_2 (Yan &amp; Su, 2009). For this reason, the only way to actually determine how x_1 changes Y, when the other terms are kept constant, is to use the equation with new values of x_1. </div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
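The point about interactions can be made concrete with a small simulation (the coefficients are invented for illustration): we predict at new values of x_1 while holding x_2 fixed, and recover the x_1 slope at that particular x_2.

```r
set.seed(10)
d <- data.frame(x1 = runif(100), x2 = runif(100))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + 4 * d$x1 * d$x2 + rnorm(100, sd = 0.1)

m <- lm(y ~ x1 * x2, data = d)
# The effect of a unit change in x1 depends on x2: here we hold x2 at 0.5
p <- predict(m, newdata = data.frame(x1 = c(0, 1), x2 = 0.5))
diff(p)   # close to 2 + 4 * 0.5 = 4, the slope of x1 at x2 = 0.5
```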
<div class="MsoNormal" style="text-align: justify;">
This linear model can be applied to continuous target variables; in this case we would talk about an ANCOVA for exploratory analysis, or a linear regression if the objective is to create a predictive model.</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
ANOVA</h3>
<div>
<div>
The analysis of variance is based on the linear model presented above; the only difference is that its reference point is the mean of the dataset. When we described the equations above we said that to interpret the results of the linear model we would look at the slope term, which indicates the rate of change in Y if we change one variable and keep the rest constant. The ANOVA instead calculates the effect of each treatment based on the grand mean, which is the mean of the variable of interest. </div>
<div>
<br /></div>
<div>
In mathematical terms ANOVA solves the following equation (Williams, 2004):</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjriFDVClXuzildrSxJ2IS_79By2ih6zp2G5x_Mv3GXUqpxxKoj8Xswvbvsi8F_Y1IxrhCpsuGScL5yPREjuIwAWBHFXs6cy84LOo3WwY4VAWKEwE1OlfFucpQCl5tEZZYDZtMMgFa5oLrE/s1600/Eq5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="1600" height="32" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjriFDVClXuzildrSxJ2IS_79By2ih6zp2G5x_Mv3GXUqpxxKoj8Xswvbvsi8F_Y1IxrhCpsuGScL5yPREjuIwAWBHFXs6cy84LOo3WwY4VAWKEwE1OlfFucpQCl5tEZZYDZtMMgFa5oLrE/s640/Eq5.png" width="640" /></a></div>
</div>
<div>
where y is the effect of treatment τ_j on group j, while μ is the grand mean (i.e. the mean of the whole dataset). From this equation it is clear that the effects calculated by the ANOVA do not refer to unit changes in the explanatory variables, but are all related to changes around the grand mean. </div>
</div>
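This decomposition is easy to verify numerically: on simulated groups (values invented for illustration), the treatment effects are the group means minus the grand mean, and, weighted by group size, they sum to zero.

```r
set.seed(1)
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 7))   # two simulated treatment groups
g <- factor(rep(c("A", "B"), each = 10))

mu  <- mean(y)                   # grand mean
tau <- tapply(y, g, mean) - mu   # treatment effects relative to the grand mean
sum(tau * table(g))              # 0 (up to floating point): effects balance around mu
```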
<div>
<br /></div>
<div>
<br /></div>
<h2>
Examples of ANOVA and ANCOVA in R</h2>
<div>
<div class="MsoNormal" style="text-align: justify;">
For this example we are going to
use one of the datasets from the package <span class="CodeChar">agridat</span>, available on CRAN:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("agridat")
</code></pre>
<br />
We also need other packages for the examples below. If some of these are not installed on your system, please use the function install.packages again (replacing the name within quotation marks according to your needs) to install them.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(agridat)
library(ggplot2)
library(plotrix)
library(moments)
library(car)
library(fitdistrplus)
library(nlme)
library(multcomp)
library(epade)
library(lme4)
</code></pre>
<br />
Now we can load the dataset lasrosas.corn, which has more than 3400 observations of corn yield from a field in Argentina, plus several explanatory variables, both factorial (or categorical) and continuous.<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > dat = lasrosas.corn
> str(dat)
'data.frame': 3443 obs. of 9 variables:
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
$ lat : num -33.1 -33.1 -33.1 -33.1 -33.1 ...
$ long : num -63.8 -63.8 -63.8 -63.8 -63.8 ...
$ yield: num 72.1 73.8 77.2 76.3 75.5 ...
$ nitro: num 132 132 132 132 132 ...
$ topo : Factor w/ 4 levels "E","HT","LO",..: 4 4 4 4 4 4 4 4 4 4 ...
$ bv : num 163 170 168 177 171 ...
$ rep : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...
$ nf : Factor w/ 6 levels "N0","N1","N2",..: 6 6 6 6 6 6 6 6 6 6 ...
</code></pre>
<br />
Important for the purpose of this tutorial are the target variable yield, which is what we are trying to model, and the explanatory variables: topo (topographic factor), bv (brightness value, a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have rep, the blocking factor. <br />
<br />
<h3 style="text-align: justify;">
Checking Assumptions</h3>
<h2 style="text-align: justify;">
<o:p></o:p></h2>
<div>
<div class="MsoNormal" style="text-align: justify;">
Since we are planning to use an
ANOVA we first need to check that our data fit its assumptions. ANOVA
relies on the following assumptions:<o:p></o:p></div>
<ul>
<li>Independence, in terms of independence of the error term</li>
<li>Normality of the response variable (y) </li>
<li>Normality of the error term (i.e. residuals).</li>
<li>Equality of variances between groups</li>
<li>Balanced design (i.e. all groups have the same number of samples)</li>
</ul>
<div class="MsoNormal" style="text-align: justify;">
NOTE:<br />
Normality of the response variable is a contested point and not all authors agree on it. In my reading I found that some authors explicitly talk about normality of the response variable, while others only talk about normality of the errors. In the R Book the author states only normality of errors as an assumption, but says that the ANOVA can be applied to random variables, which in a way should imply normality of the response.<br />
<br />
Let’s see how we can test for
them in R. Clearly we are talking about environmental data, so the assumption of
independence is not met, because data are autocorrelated with distance.
Theoretically speaking, for spatial data ANOVA cannot be employed and more
robust methods should be used (e.g. REML); however, over the years ANOVA has
been widely used for the analysis of environmental data and this is accepted by the
community. That does not mean it is the correct method though, and later
on in this tutorial we will see the function to perform linear modelling with
REML.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
The assumption of equality of variances is the easiest to assess, using the function <span class="CodeChar">tapply</span>:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > tapply(dat$yield, INDEX=dat$nf, FUN=var)
N0 N1 N2 N3 N4 N5
438.5448 368.8136 372.8698 369.6582 366.5705 405.5653
</code></pre>
<br />
In this case we used tapply to calculate the variance of yield for each subgroup (i.e. each level of nitrogen). There is some variation between groups but in my opinion it is not substantial. Now we can shift our focus to normality. There are tests to check for normality, but the ANOVA is flexible (particularly when the dataset is big) and can still produce correct results even when its assumptions are violated to a certain degree. For this reason, it is good practice to check normality with descriptive analysis alone, without any statistical test. For example, we could start by plotting the histogram of yield:
<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> hist(dat$yield, main="Histogram of Yield", xlab="Yield (quintals/ha)")
</code></pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKAjSbZhvgdfWYMWKYG_u7Pw9aDYCPR746_8EAFP23BSR9TT485EWqtealRbmALpwzhrhrjyR2Q84LRfxW4nfWu3g8gWc7LS0NrmosaovES2JQB2J-SlhiFrQCTRWMw0HbqGlnhyphenhyphent66m8R/s1600/Fig1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="397" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKAjSbZhvgdfWYMWKYG_u7Pw9aDYCPR746_8EAFP23BSR9TT485EWqtealRbmALpwzhrhrjyR2Q84LRfxW4nfWu3g8gWc7LS0NrmosaovES2JQB2J-SlhiFrQCTRWMw0HbqGlnhyphenhyphent66m8R/s400/Fig1.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
By looking at this image it seems
that our data are more or less normally distributed. Another plot we could
create is the QQplot (<a href="http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm</a>):<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> qqnorm(dat$yield, main="QQplot of Yield")
qqline(dat$yield)
</code></pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkztBhWnh1Gt7aRE_uS0OUonc5Yy3CF0QobpgZkpEv-EUv5n-KJSeQ3zQELolKjVvT9Kxsl2nyve4DRDd8a9TFUoU7CweGOkkkm0pyVS05Jnhj7h9AJk4zWI9uAB3kurMiQsnqR3J09nW1/s1600/Fig2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkztBhWnh1Gt7aRE_uS0OUonc5Yy3CF0QobpgZkpEv-EUv5n-KJSeQ3zQELolKjVvT9Kxsl2nyve4DRDd8a9TFUoU7CweGOkkkm0pyVS05Jnhj7h9AJk4zWI9uAB3kurMiQsnqR3J09nW1/s400/Fig2.png" width="400" /></a></div>
<div class="MsoNormal" style="text-align: justify;">
For normally distributed data the
points should all be on the line. This is clearly not the case but again the
deviation is not substantial. The final element we can calculate is the skewness
of the distribution, with the function <span class="CodeChar">skewness</span>
in the package <span class="CodeChar">moments</span>:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > skewness(dat$yield)
[1] 0.3875977
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
According to Webster and Oliver (2007),
if the skewness is below 0.5 we can consider the deviation from normality not
big enough to warrant transforming the data. Moreover, according to Witte and Witte (2009),
if we have more than 10 samples per group we should not worry too much about
violating the assumption of normality or equality of variances.</div>
<div class="MsoNormal" style="text-align: justify;">
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
To see how many samples we have
for each level of nitrogen we can use once again the function <span class="CodeChar">tapply</span>:</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > tapply(dat$yield, INDEX=dat$nf, FUN=length)
N0 N1 N2 N3 N4 N5
573 577 571 575 572 575
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
As you can see we have definitely
more than 10 samples per group, but our design is not balanced (i.e. some groups
have more samples than others). This implies that the standard ANOVA cannot be used,
because the usual way of calculating the sum of squares is not appropriate
for unbalanced designs (look here for more info: <a href="http://goanna.cs.rmit.edu.au/~fscholer/anova.php">http://goanna.cs.rmit.edu.au/~fscholer/anova.php</a>).
<o:p></o:p><br />
<br />
The same function tapply can be used to check the assumption of equality of variances: we just need to replace the function length with var in the option FUN.</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
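The order-dependence of the standard (sequential, Type I) sums of squares under an unbalanced design can be seen directly in a small simulation (the factors and effects here are invented for illustration):

```r
set.seed(3)
a <- factor(rep(c("a1", "a2"), times = c(8, 12)))   # unbalanced factor
b <- factor(c(rep("b1", 6), rep("b2", 2), rep("b1", 4), rep("b2", 8)))
y <- rnorm(20) + (a == "a2") + 2 * (b == "b2")

# Sequential sums of squares for 'a' change with the fitting order
ss_ab <- anova(lm(y ~ a + b))["a", "Sum Sq"]
ss_ba <- anova(lm(y ~ b + a))["a", "Sum Sq"]
c(ss_ab, ss_ba)   # different values: order matters when the design is unbalanced
```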
<div class="MsoNormal" style="text-align: justify;">
In summary, even though the
descriptive analysis suggests that our data are close to normal and
have equal variances, our design is unbalanced, so the standard way of doing
ANOVA cannot be used. In other words, we cannot use the function <span class="CodeChar">aov</span>
for this dataset. However, since this is a tutorial we are still going to start
by applying the normal ANOVA with <span class="CodeChar">aov</span>.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
ANOVA with <span style="font-family: "courier new";">aov</span></h3>
<h2 style="text-align: justify;">
<o:p></o:p></h2>
<div>
<div class="MsoNormal" style="text-align: justify;">
The first thing we need to do is
think about the hypothesis we would like to test. For example, we could be
interested in looking at nitrogen levels and their impact on yield. Let’s start
with some plotting to better understand our data:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean)
StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN= std.error)
BP = barplot(means.nf, ylim=c(0,max(means.nf)+10))
segments(BP, means.nf - (2*StdErr.nf), BP,
means.nf + (2*StdErr.nf), lwd = 1.5)
arrows(BP, means.nf - (2*StdErr.nf), BP,
means.nf + (2*StdErr.nf), lwd = 1.5, angle = 90,
code = 3, length = 0.05)
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
This code first uses the function
<span class="CodeChar">tapply</span> to compute the mean and
standard error of the mean for yield in each nitrogen group. Then it plots the
means as bars and creates error bars using the standard error (please remember
that with a normal distribution the mean ± twice the standard error provides an
approximate 95% confidence interval). The result is the following
image:<o:p></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8cAgR82LXGQmVTyihk_ceXdaYsZgZCuYOKPmGhZapkmOa2kZLHoJiDBsab5EIQMD-E2i6nv87Oa00Ss1nTIlouj9GAEg04tSe2eW-FvYO6gjXsXOLswwmzDhnX6TGFmqbuYxhfNc_VWk4/s1600/Fig3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8cAgR82LXGQmVTyihk_ceXdaYsZgZCuYOKPmGhZapkmOa2kZLHoJiDBsab5EIQMD-E2i6nv87Oa00Ss1nTIlouj9GAEg04tSe2eW-FvYO6gjXsXOLswwmzDhnX6TGFmqbuYxhfNc_VWk4/s400/Fig3.png" width="400" /></a></div>
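For reference, the standard error used for the bars is (as computed by plotrix's std.error, to my understanding) just the sample standard deviation divided by the square root of the sample size; a base-R sketch on toy numbers:

```r
x <- c(2, 4, 6, 8)              # toy sample
se <- sd(x) / sqrt(length(x))   # standard error of the mean
ci <- mean(x) + c(-2, 2) * se   # the +/- 2 SE interval drawn as error bars
se
```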
<div class="MsoNormal" style="text-align: justify;">
By plotting our data we can start
figuring out the relation between nitrogen levels and yield. In
particular, there is an increase in yield with higher levels of nitrogen.
However, some of the error bars overlap, which may suggest that the
corresponding means are not significantly different. For example, N0 and N1
have error bars very close to overlapping, but probably not overlapping, so N1
may provide a yield significantly different from N0; the remaining levels are
all probably significantly different from N0. Among the higher levels, the
intervals overlap most of the time, so their differences are probably not
significant.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
We could formulate the hypothesis
that nitrogen significantly affects yield and that the means of the subgroups
are significantly different. Now we just need to test this hypothesis with a
one-way ANOVA:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod1 = aov(yield ~ nf, data=dat)
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
The code above uses the function <span class="CodeChar">aov</span> to perform an ANOVA; we specify a
one-way ANOVA simply by including a single factorial term after the tilde (~)
sign. We can print the ANOVA table with the function <span class="CodeChar">summary</span>:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod1)
Df Sum Sq Mean Sq F value Pr(>F)
nf 5 23987 4797 12.4 6.08e-12 ***
Residuals 3437 1330110 387
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
It is clear from this output that
nitrogen significantly affects yield, so we have tested our first hypothesis. To
test the significance of individual levels of nitrogen we can use the Tukey's
test:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > TukeyHSD(mod1, conf.level=0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = yield ~ nf, data = dat)
$nf
diff lwr upr p adj
N1-N0 3.6434635 0.3353282 6.951599 0.0210713
N2-N0 4.6774357 1.3606516 7.994220 0.0008383
N3-N0 5.3629638 2.0519632 8.673964 0.0000588
N4-N0 7.5901274 4.2747959 10.905459 0.0000000
N5-N0 7.8588595 4.5478589 11.169860 0.0000000
N2-N1 1.0339723 -2.2770686 4.345013 0.9489077
N3-N1 1.7195004 -1.5857469 5.024748 0.6750283
N4-N1 3.9466640 0.6370782 7.256250 0.0089057
N5-N1 4.2153960 0.9101487 7.520643 0.0038074
N3-N2 0.6855281 -2.6283756 3.999432 0.9917341
N4-N2 2.9126917 -0.4055391 6.230923 0.1234409
N5-N2 3.1814238 -0.1324799 6.495327 0.0683500
N4-N3 2.2271636 -1.0852863 5.539614 0.3916824
N5-N3 2.4958957 -0.8122196 5.804011 0.2613027
N5-N4 0.2687320 -3.0437179 3.581182 0.9999099
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
There are significant differences
between the control and all the other levels of nitrogen, plus other
differences between N4 and N5 compared to N1, but nothing else. If you look
back at the bar chart we produced before, and look carefully at the overlaps
between error bars, you will see that, for example, N1, N2, and N3 have
overlapping error bars, and thus they are not significantly different. On the
contrary, N1 has no overlap with either N4 or N5, which is what the Tukey's
test demonstrated.<o:p></o:p></div>
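<div class="MsoNormal" style="text-align: justify;">
A quick way to visualise these comparisons (a small addition, not strictly necessary) is to plot the object returned by <span class="CodeChar">TukeyHSD</span>: each pairwise difference is drawn as a confidence interval, and intervals that cross the vertical zero line correspond to non-significant differences:</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # Plot the 95% family-wise confidence intervals of all pairwise differences
 plot(TukeyHSD(mod1, conf.level=0.95), las=1)
</code></pre>
<br />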
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
The function <span class="CodeChar">model.tables</span> provides a quick way to print the table
of effects and the table of means:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > model.tables(mod1, type="effects")
Tables of effects
nf
N0 N1 N2 N3 N4 N5
-4.855 -1.212 -0.178 0.5075 2.735 3.003
rep 573.000 577.000 571.000 575.0000 572.000 575.000
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
<br />
These values all refer to
the grand mean, which we can simply calculate with the function <span class="CodeChar">mean(dat$yield)</span> and which is equal to 69.83. This means
that the mean for N0 is 69.83 - 4.855 = 64.97. We can verify that with
another call to the function <span class="CodeChar">model.tables</span>,
this time with the option <span class="CodeChar">type="means"</span>:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > model.tables(mod1, type="means")
Tables of means
Grand mean
69.82831
nf
N0 N1 N2 N3 N4 N5
64.97 68.62 69.65 70.34 72.56 72.83
rep 573.00 577.00 571.00 575.00 572.00 575.00
</code></pre>
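<div class="MsoNormal" style="text-align: justify;">
As a quick sanity check, we can reproduce this table of means by hand, adding each effect to the grand mean (a sketch that assumes <span class="CodeChar">mod1</span> and <span class="CodeChar">dat</span> are still in the workspace):</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> grand.mean = mean(dat$yield)                           # 69.83
 eff.nf = model.tables(mod1, type="effects")$tables$nf  # effect of each level
 round(grand.mean + eff.nf, 2)                          # e.g. N0: 69.83 - 4.855 = 64.97
</code></pre>
<br />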
<br />
<br />
<h4>
Update 05/02/2018</h4>
<h4>
Nonparametric One-Way ANOVA</h4>
<div>
For certain datasets the assumption of normality cannot be met. In such cases we may consider several options: a GLM, which is a good solution for data such as counts and proportions; transforming the data so that they meet the assumption of normality (although with transformations we need to be extremely careful, because the estimated coefficients are then on the transformed scale, so we always need to know how to back-transform our data); or a nonparametric test, which does not assume a normal distribution.</div>
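<div>
Before resorting to a nonparametric test, it is good practice to actually check the normality of the residuals. A quick sketch of how to do that in base R (my addition, using the model fitted above) is:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # Visual check: points should follow the straight line if residuals are normal
 qqnorm(residuals(mod1))
 qqline(residuals(mod1))
 # Formal check (shapiro.test accepts up to 5000 observations)
 shapiro.test(residuals(mod1))
</code></pre>
</div>
<div>
<br /></div>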
<div>
For the one-way ANOVA the nonparametric alternative is the Kruskal-Wallis test:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> kruskal.test(yield ~ nf, data=dat)
</code></pre>
</div>
<div>
<br /></div>
This function returns the following result:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > kruskal.test(yield ~ nf, data=dat)
Kruskal-Wallis rank sum test
data: yield by nf
Kruskal-Wallis chi-squared = 81.217, df = 5, p-value = 4.669e-16
</code></pre>
<br />
The p-value is very low, which means the nf treatments are significant.<br />
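The Kruskal-Wallis test only tells us that at least one group differs; it does not say which ones. A common nonparametric follow-up, analogous to the Tukey's test above (my addition, not part of the original analysis), is a set of pairwise Wilcoxon tests with a correction for multiple comparisons:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # Pairwise Wilcoxon rank sum tests with Holm-adjusted p-values
 pairwise.wilcox.test(dat$yield, dat$nf, p.adjust.method="holm")
</code></pre>
<br />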
<br />
<br />
<h3 style="text-align: justify;">
Linear Model with 1 factor</h3>
<h2 style="text-align: justify;">
<o:p></o:p></h2>
<div>
<div class="MsoNormal" style="text-align: justify;">
The same results can be obtained by
fitting a linear model with the function lm; only their interpretation is
different. The assumptions for fitting a linear model are again independence
(which is always violated with environmental data) and normality.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
Let’s look at the code:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod2 = lm(yield ~ nf, data=dat)
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
This line fits the same model, but
with the standard linear equation. This becomes clearer by looking at the
summary table:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod2)
Call:
lm(formula = yield ~ nf, data = dat)
Residuals:
Min 1Q Median 3Q Max
-52.313 -15.344 -3.126 13.563 45.337
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.9729 0.8218 79.060 < 2e-16 ***
nfN1 3.6435 1.1602 3.140 0.0017 **
nfN2 4.6774 1.1632 4.021 5.92e-05 ***
nfN3 5.3630 1.1612 4.618 4.01e-06 ***
nfN4 7.5901 1.1627 6.528 7.65e-11 ***
nfN5 7.8589 1.1612 6.768 1.53e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.67 on 3437 degrees of freedom
Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629
F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
There is a lot of information in
this table that we should clarify. First of all, it provides some
descriptive measures for the residuals, from which we can see that their
distribution is relatively symmetrical, and thus plausibly normal (the first and last quartiles have similar but
opposite values, and the same is true for the minimum and maximum). As you remember from when we talked about assumptions, one of them was that the error term is normally distributed. This first part of the output allows us to check whether we meet this assumption.<br />
<br />
Other important information we should look at are the R-squared and Adjusted R-squared (please look at the end of the page to know more about these two values). In essence, the R-squared tells us how much of the variance in the data can be explained by the model; in this case not much. However, this is an exploratory rather than a predictive model, so we know that there may be other factors that affect the variability of yield, but we are not interested in them. We are only interested in understanding the impact of the nitrogen level. Another important piece of information is the F-statistic at the end, with its p-value (which is very low). The F-statistic is the ratio between the variability between groups (meaning between different levels of N) and within groups (meaning the variability among samples with the same value of N). This ratio and the related p-value tell us that our model is significant (because the variability that we obtain by increasing N is higher than the normal variability we expect from random variation), which means that nitrogen has an effect on yield.<br />
<br />
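To make the F-statistic concrete, we can recompute it by hand from the sums of squares printed in the ANOVA table above (each mean square is the sum of squares divided by its degrees of freedom):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> MS.between = 23987 / 5      # 4797.4, mean square for nf
 MS.within = 1330110 / 3437  # 387.0, mean square of the residuals
 MS.between / MS.within      # 12.4, the F value reported in the summary
</code></pre>
<br />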
Then we have the
table of the coefficients, with the intercept and all the slopes, plus their standard errors. These can be used to build confidence intervals for the coefficients, which are used to assess the uncertainty around their estimation. We can actually compute the confidence intervals with the function confint (the option level is used to specify for example that we are looking at the 95% confidence interval):<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > confint(mod2, level = 0.95)
2.5 % 97.5 %
(Intercept) 63.361592 66.584202
nfN1 1.368687 5.918240
nfN2 2.396712 6.958160
nfN3 3.086217 7.639711
nfN4 5.310402 9.869853
nfN5 5.582112 10.135607
</code></pre>
<br />
<br />
Another important point about the coefficients is that their unit of measure is the same as that of the dependent variable, because each coefficient estimates the effect of a predictor on the dependent variable, i.e. yield.<br />
<br />
As you can see, the level N0 is not shown in the list; this is called the reference level,
which means that all the others are referenced back to it. In other words, the
value of the intercept is the mean of nitrogen level N0 (in fact, it is the same value we
calculated above: 64.97). To calculate the means of the other groups we need to
add the slopes to the value of the reference level. For example, N1 is 64.97 +
3.64 = 68.61 (the same value calculated from the ANOVA). The p-values and significance codes
are again relative to the reference level, meaning for example that N1 is
significantly different from N0 (the reference level) with a p-value of 0.0017.
This is similar to the Tukey's test we performed above, but it is only valid in
relation to N0. As you can see, each p-value is computed from a t-statistic; this is because R performs t-tests comparing each factor level to the reference level.<br />
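As a sketch of what happens behind the scenes, the t value for nfN1 is simply its estimate divided by its standard error, and the p-value comes from the t distribution with the residual degrees of freedom:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> t.value = 3.6435 / 1.1602       # 3.140, as in the summary table
 2 * pt(-abs(t.value), df=3437)  # two-sided p-value, approximately 0.0017
</code></pre>
<br />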
<br />
We need to change the reference level, and fit another model,
to get the same information for other nitrogen levels:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat$nf = relevel(dat$nf, ref="N1")
mod3 = lm(yield ~ nf, data=dat)
summary(mod3)
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
Now the reference level is N1, so
all the results will tell us the effects of nitrogen in relation to N1.</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod3)
Call:
lm(formula = yield ~ nf, data = dat)
Residuals:
Min 1Q Median 3Q Max
-52.313 -15.344 -3.126 13.563 45.337
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.616 0.819 83.784 < 2e-16 ***
nfN0 -3.643 1.160 -3.140 0.001702 **
nfN2 1.034 1.161 0.890 0.373308
nfN3 1.720 1.159 1.483 0.138073
nfN4 3.947 1.161 3.400 0.000681 ***
nfN5 4.215 1.159 3.636 0.000280 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.67 on 3437 degrees of freedom
Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629
F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
For example, we can see that N0
has a lower value compared to N1, and that only N0, N4 and N5 are significantly
different from N1, which is what we saw from the bar chart and what we found
from the Tukey’s test.</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
Interpreting the output of the
function <span class="CodeChar">aov</span> is much easier compared to <span class="CodeChar">lm</span>. However, in many cases we can only use the
function <span class="CodeChar">lm</span> (for example in an ANCOVA,
where alongside categorical variables we have continuous explanatory variables), so it is
important that we learn how to interpret its summary table.</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
We can obtain the ANOVA table
with the function <span class="CodeChar">anova</span>:</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(mod2)
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
nf 5 23987 4797.4 12.396 6.075e-12 ***
Residuals 3437 1330110 387.0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
This uses the type I sum of
squares (more info at: http://www.utstat.utoronto.ca/reid/sta442f/2009/typeSS.pdf),
which is the default but is not appropriate for unbalanced designs. The
function <span class="CodeChar">Anova</span> in the package <span class="CodeChar">car</span> (remember to load it with library(car)) has the option to select which type of sum of
squares to calculate, and we can specify <span class="CodeChar">type=c("III")</span>
to correct for the unbalanced design:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > Anova(mod2, type=c("III"))
Anova Table (Type III tests)
Response: yield
Sum Sq Df F value Pr(>F)
(Intercept) 2418907 1 6250.447 < 2.2e-16 ***
nf 23987 5 12.396 6.075e-12 ***
Residuals 1330110 3437
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
In this example the two results
are the same; probably the large sample size helps in this respect. However,
for smaller samples this distinction may become important. For this reason, if
your design is unbalanced, please remember not to use the function <span class="CodeChar">aov</span>, but always <span class="CodeChar">lm</span>
and <span class="CodeChar">Anova</span> with the option for type
III sum of squares.<o:p></o:p></div>
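<div class="MsoNormal" style="text-align: justify;">
An easy way to check whether your design is balanced, and therefore whether this distinction matters, is to count the observations per group:</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> table(dat$nf)            # slightly different counts per level -> unbalanced
 table(dat$nf, dat$topo)  # cross-tabulation for the two-way design
</code></pre>
<br />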
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<h3>
Two-way ANOVA</h3>
<h2>
<o:p></o:p></h2>
<div>
<div class="MsoNormal" style="text-align: justify;">
So far we have looked at the
effect of nitrogen on yield. However, in the dataset we also have a factorial
variable named topo, which stands for topographic factor and has 4 levels: W =
West slope, HT = Hilltop, E = East slope, LO = Low East. We already formulated
a hypothesis about nitrogen, so now we need to formulate a hypothesis about
topo as well. Once again we can do that by using the function <span class="CodeChar">tapply</span> and a simple bar chart with error bars. Look
at the code below:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> means.topo = tapply(dat$yield, INDEX=dat$topo, FUN=mean)
StdErr.topo = tapply(dat$yield, INDEX=dat$topo, FUN= std.error)
BP = barplot(means.topo, ylim=c(0,max(means.topo)+10))
segments(BP, means.topo - (2*StdErr.topo), BP,
means.topo + (2*StdErr.topo), lwd = 1.5)
arrows(BP, means.topo - (2*StdErr.topo), BP,
means.topo + (2*StdErr.topo), lwd = 1.5, angle = 90,
code = 3, length = 0.05)
</code></pre>
<br />
<div class="MsoNormal">
Here we are using the same exact approach we used before to
formulate an hypothesis about nitrogen. We first calculate mean and standard
error of yield for each level of topo, and then plot a bar chart with error
bars.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The result is the plot below:<o:p></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizNR9cOqhWl9wyb3ZAvh95iJZXcSxSKtNeRxsktMo867_5IrqaUbfPpAMMuj7Ox2lVHuCeHeBg_c5jb6HfJ_24NNAroS7izcPHAN3hFPB3h5t62TMwVns28uXo3vn8yYVCsFvZJz8czSR9/s1600/Fig4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizNR9cOqhWl9wyb3ZAvh95iJZXcSxSKtNeRxsktMo867_5IrqaUbfPpAMMuj7Ox2lVHuCeHeBg_c5jb6HfJ_24NNAroS7izcPHAN3hFPB3h5t62TMwVns28uXo3vn8yYVCsFvZJz8czSR9/s400/Fig4.png" width="400" /></a></div>
<div class="MsoNormal">
From this plot it is clear that the topographic factor has
an effect on yield. In particular, hilltop areas have low yield while the low
east corner of the field has high yield. From the error bars we can say with a
good level of confidence that probably all the differences will be significant,
at least at an alpha of 0.05 (the significance level, corresponding to a 95% confidence level).</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
We can test this hypothesis with a two way ANOVA, by simply
adding the term topo to the equation:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod1b = aov(yield ~ nf + topo, data=dat)
summary(mod1b)
Df Sum Sq Mean Sq F value Pr(>F)
nf 5 23987 4797 23.21 <2e-16 ***
topo 3 620389 206796 1000.59 <2e-16 ***
Residuals 3434 709721 207
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal">
From the summary table it is clear that both factors have a
significant effect on yield, but just by looking at this it is very difficult
to identify clearly which levels are the significant ones. To do that we need
the Tukey's test:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> TukeyHSD(mod1b, conf.level=0.95, which=c("topo"))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = yield ~ nf + topo, data = dat)
$topo
diff lwr upr p adj
HT-LO -36.240955 -38.052618 -34.429291 0
W-LO -18.168544 -19.857294 -16.479794 0
E-LO -6.206619 -8.054095 -4.359143 0
W-HT 18.072411 16.326440 19.818381 0
E-HT 30.034335 28.134414 31.934257 0
E-W 11.961925 10.178822 13.745028 0
</code></pre>
<br />
<div class="MsoNormal">
The zero p-values indicate a high significance for each
combination, as was clear from the plot. With the function <span class="CodeChar">model.tables</span> you can easily obtain a table of means or
effects, if you are interested.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<h3>
Two-Way ANOVA with Interactions</h3>
<h2>
<o:p></o:p></h2>
<div>
<div class="MsoNormal">
One step further we can take to get more insight into our
data is to add an interaction between nitrogen and topo, and see if this can
further narrow down the main sources of yield variation. Once again we need to
start our analysis by formulating a hypothesis. Since we are talking about an
interaction, we are now concerned with finding a way to plot yield responses for
varying nitrogen levels and topographic positions, so we need a 3d bar chart. We
can do that with the function <span class="CodeChar">bar3d.ade</span> from
the package <span class="CodeChar">epade</span>, so please install this
package and load it. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Then please look at the following R code:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dat$topo = factor(dat$topo,levels(dat$topo)[c(2,4,1,3)])
means.INT = tapply(dat$yield, INDEX=list(dat$nf, dat$topo), FUN=mean)
bar3d.ade(means.INT, col="grey")
</code></pre>
<br />
<div class="MsoNormal">
The first line is only used to reorder the levels of the
factorial variable topo. This is because from the previous plot we clearly saw
that HT is the level with the lowest yield, followed by W, E and LO. We are doing
this only to make the 3d bar chart more readable.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The next line applies once again the function tapply, this
time to calculate the mean of yield for subgroups divided by nitrogen and
topographic factors. The result is a matrix that looks like this:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > means.INT
LO HT W E
N0 81.03027 41.50652 62.08192 75.13902
N1 83.06276 48.33630 65.74627 78.12808
N2 85.06879 48.79830 66.70848 78.92632
N3 85.23255 50.18398 66.16531 78.99210
N4 87.14400 52.12039 70.10682 80.39213
N5 87.94122 51.03138 69.65933 80.55078
</code></pre>
<br />
<div class="MsoNormal">
This can be used directly within the function <span class="CodeChar">bar3d.ade</span> to create the 3d bar chart below:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUXCehrVXtJT5_FEkgnvKpxcReXS3MxUtX40iqrOYQGfh3GbqDDk8MiClFHm0cqx3c8OxnlglwGLL7kL-0suRpTZTAhCsatCx_fLagD7lKNqY1A39EEVarsyg8frwTQcxiwY_d005KtDn8/s1600/Fig5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUXCehrVXtJT5_FEkgnvKpxcReXS3MxUtX40iqrOYQGfh3GbqDDk8MiClFHm0cqx3c8OxnlglwGLL7kL-0suRpTZTAhCsatCx_fLagD7lKNqY1A39EEVarsyg8frwTQcxiwY_d005KtDn8/s400/Fig5.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="MsoNormal">
From this plot we can see two things very clearly: the first
is that there is an increase in yield from HT to LO in the topographic factor;
the second is that we again have an increase from N0 to N5 in the nitrogen
levels. These were all expected, since we already noticed them before. What we
do not see in this plot is any particular influence from the interaction
between topography and nitrogen. For example, if you look at HT, you have an
increase in yield from N0 to N5 (expected) and overall the yield is lower than
in the other bars (again expected). If there were an interaction we would expect
this general pattern to change, for example with relatively high yield on the
hilltop at high nitrogen levels, or very low yield in the low east side with N0.
This does not happen, and all the bars follow the expected pattern, so we can
hypothesise that the interaction will not be significant.<o:p></o:p><br />
<br />
We can further explore a possible interaction between nf and topo by creating an interaction plot:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> with(dat, interaction.plot(topo, nf, yield))
</code></pre>
<br />
This line applies the function interaction.plot within a call to the function with, which tells R that the variable names refer to the dataset named dat. The result is the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLrzCrvcyjoTggfWFFclJaev7kiHrL30jIVEdKqORNQE827da_Otff0N9v9brwaPki3zx9Bn85Fy2mWer0dSXUCv0ZRqdd-obac0mAYmVpeIXLRoPshP6EfSjaa3-w9pu9cmXZ63Q3hgN6/s1600/Interaction.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLrzCrvcyjoTggfWFFclJaev7kiHrL30jIVEdKqORNQE827da_Otff0N9v9brwaPki3zx9Bn85Fy2mWer0dSXUCv0ZRqdd-obac0mAYmVpeIXLRoPshP6EfSjaa3-w9pu9cmXZ63Q3hgN6/s400/Interaction.jpeg" width="400" /></a></div>
Again, all the lines increase with changes in topography, but there is no additional effect provided by changes in nf. In fact, the lines never cross, or only cross slightly: this is a good indication of a lack of interaction.<br />
<br /></div>
<div class="MsoNormal">
To formally test our hypothesis of lack of interaction, we need to run another ANOVA with an interaction
term:</div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod1c = aov(yield ~ nf * topo, data=dat)
</code></pre>
<br />
<div class="MsoNormal">
This formula tests for both main effects and their
interaction. To see the significance we can use the summary table:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod1c)
Df Sum Sq Mean Sq F value Pr(>F)
nf 5 23987 4797 23.176 <2e-16 ***
topo 3 620389 206796 999.025 <2e-16 ***
nf:topo 15 1993 133 0.642 0.842
Residuals 3419 707727 207
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal">
From this we can conclude that our hypothesis was correct
and that the interaction has no effect on yield.<br />
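Another way to reach the same conclusion (a sketch using the two models fitted above) is to formally compare the additive and the interaction model with an F test; a non-significant p-value indicates that the interaction term does not improve the fit:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> anova(mod1b, mod1c)  # compares mod1b (additive) with mod1c (interaction)
</code></pre>
<br />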
<br />
<br />
<h4>
Update 05/02/2018</h4>
<h4>
Nonparametric k-way ANOVA</h4>
<div>
Above we looked at the Kruskal-Wallis test for nonparametric one-way ANOVA. However, there may be cases when we have more complex factorial designs and still struggle to meet the assumption of normality.</div>
<div>
In such cases one of the possibilities we have is to use a nonparametric test, from the package Rfit:<br />
<br /></div>
<div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("Rfit")
library(Rfit)
mod.RAOV = raov(yield ~ nf * topo, data=dat)
mod.RAOV
> mod.RAOV
Robust ANOVA Table
DF RD Mean RD F p-value
nf 5 764.56053 152.91211 21.96030 0.00000
topo 3 17418.76333 5806.25444 833.85875 0.00000
nf:topo 15 59.15213 3.94348 0.56634 0.90215
</code></pre>
<br /></div>
<div>
As you can see, the function we need to use here is raov, which stands for Robust ANOVA (please refer to the book "Nonparametric Statistical Methods Using R" by Kloke and McKean).<br />
The syntax is the same as for the function aov, and the result table is also very similar. The only difference is that we do not have the stars to indicate significance, but we can easily work that out from the p-values.<br />
<br />
For other models we can use the function rfit, which is similar to lm in syntax and results.<br />
<br /></div>
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
We can get an even better idea of the interaction effect by using some functions in the package phia:</div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(phia)
plot(interactionMeans(mod1c))
</code></pre>
<br />
This function plots the effects of the interactions in a 2 by 2 plot, including the standard error of the coefficients, so that we can readily see which intervals overlap:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe0_ptGrXyS-RHSxuCLrdB7LEua4yM88Jm44tLGmHolD6fbO5RKeF20OoKgSjGdvx6RrVC8PNh5JwEg-bvU6ddYholC-Kb_6cvN2coH52H0IaMDil6Z0PAelaHyWEWhQlbkxWUZKqy7K7i/s1600/Fig6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="893" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe0_ptGrXyS-RHSxuCLrdB7LEua4yM88Jm44tLGmHolD6fbO5RKeF20OoKgSjGdvx6RrVC8PNh5JwEg-bvU6ddYholC-Kb_6cvN2coH52H0IaMDil6Z0PAelaHyWEWhQlbkxWUZKqy7K7i/s400/Fig6.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We already knew from the 3D plot that there is a general increase between N0 and N5 that mainly drives the changes we see in the data. However, from the top-right plot we can see that topo plays a small role between N0 and the other levels (in fact the black line only slightly overlaps with the others), but it has no effect from N1 to N5.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We can look at the numerical breakdown of what we see in the plot with another function:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > testInteractions(mod1c)
F Test:
P-value adjustment method: holm
Value Df Sum of Sq F Pr(>F)
N0-N1 : HT-W -3.1654 1 377 1.8230 1
N0-N2 : HT-W -2.6652 1 267 1.2879 1
N0-N3 : HT-W -4.5941 1 784 3.7880 1
N0-N4 : HT-W -2.5890 1 250 1.2072 1
N0-N5 : HT-W -1.9475 1 140 0.6767 1
N1-N2 : HT-W 0.5002 1 9 0.0458 1
N1-N3 : HT-W -1.4286 1 76 0.3694 1
N1-N4 : HT-W 0.5765 1 12 0.0604 1
N1-N5 : HT-W 1.2180 1 55 0.2669 1
N2-N3 : HT-W -1.9289 1 139 0.6711 1
N2-N4 : HT-W 0.0762 1 0 0.0011 1
N2-N5 : HT-W 0.7178 1 19 0.0924 1
N3-N4 : HT-W 2.0051 1 149 0.7204 1
</code></pre>
<br />
The table is very long so only the first lines are included. However, from this it is clear that the interaction has no effect (p-value of 1); if it were significant, this function could give us numerous details about the specific effects.<br />
<br />
Now we could try to compare the two models to see if they are different in the amount of variability they can explain in the data. This can be done with the function anova, and performing an F test:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(mod1b, mod1c, test="F")
Analysis of Variance Table
Model 1: yield ~ nf + topo
Model 2: yield ~ nf * topo
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   3434 709721
2   3419 707727 15    1993.2 0.6419 0.8421
</code></pre>
<br />
As we can see from this output, the p-value is not significant. This means that the two models do not differ in their explanatory power. This further supports the fact that including the interaction does not improve the accuracy of the model, and may even decrease it. We could test this last statement, for example, by looking at the AIC of both models; we will see how to do that later on in the tutorial.<br />
<br />
Please remember that the method we just used can be employed to compare most of the models we are going to fit in this tutorial, so it is a very powerful tool!<br />
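As a quick preview of the AIC comparison mentioned above, here is a minimal sketch on simulated data (all names below are illustrative, not from the tutorial dataset):

```r
# Compare an additive model and an interaction model with AIC:
# the lower value indicates the better fit/complexity trade-off.
set.seed(123)
sim <- data.frame(
  f1 = factor(rep(c("A", "B", "C"), each = 50)),
  x  = rnorm(150)
)
sim$y <- 2 + as.numeric(sim$f1) + 0.5 * sim$x + rnorm(150)

m_add <- lm(y ~ f1 + x, data = sim)  # additive model
m_int <- lm(y ~ f1 * x, data = sim)  # model with interaction

AIC(m_add, m_int)  # data frame with df and AIC for each model
```

Since the data were simulated without an interaction, we would typically expect the additive model to come out with the lower AIC here.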
<br />
<br />
<h3 style="text-align: justify;">
ANCOVA with lm</h3>
<div>
<div class="MsoNormal" style="text-align: justify;">
The analysis of covariance
(ANCOVA) fits a new model where the effects of the treatments (or factorial
variables) are corrected for the effect of continuous covariates, whose
effects on yield we can also inspect.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
The code is very similar to what
we saw before, and again we can perform an ANCOVA with the <span class="CodeChar">lm</span> function; the only difference is that here we are
including an additional continuous explanatory variable in the model:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > mod3 = lm(yield ~ nf + bv, data=dat)
> summary(mod3)
Call:
lm(formula = yield ~ nf + bv, data = dat)
Residuals:
Min 1Q Median 3Q Max
-78.345 -10.847 -3.314 10.739 56.835
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 271.55084 4.99308 54.385 < 2e-16 ***
nfN0 -3.52312 0.95075 -3.706 0.000214 ***
nfN2 1.54761 0.95167 1.626 0.103996
nfN3 2.08006 0.94996 2.190 0.028619 *
nfN4 3.82330 0.95117 4.020 5.96e-05 ***
nfN5 4.47993 0.94994 4.716 2.50e-06 ***
bv -1.16458 0.02839 -41.015 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.12 on 3436 degrees of freedom
Multiple R-squared: 0.3406, Adjusted R-squared: 0.3394
F-statistic: 295.8 on 6 and 3436 DF, p-value: < 2.2e-16
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
By printing the summary table we
can already see some differences compared to the model with only nitrogen as
explanatory variable. The first is related to the adjusted R-squared (which is
simply the R-squared corrected for the number of predictors, so that it is less
affected by overfitting), which in this case is around 0.34. If we look back at
the summary table of the model with only nitrogen, the R-squared was only 0.01.
This means that by adding the continuous variable bv we are able to massively
increase the explanatory power of the model; in fact, this new model is capable
of explaining 34% of the variation in yield. Moreover, we can also see that
other terms become significant, for example N3. This is because the inclusion
of bv changes the entire model, and its interpretation becomes less obvious
compared to the simple bar chart we plotted at the beginning.</div>
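The effect of adding an informative covariate on the adjusted R-squared can be sketched on simulated data (all names here are illustrative, not the tutorial dataset):

```r
# Adding an informative continuous covariate raises the adjusted R-squared.
set.seed(42)
n   <- 200
grp <- factor(rep(c("N0", "N1"), each = n / 2))
cv  <- rnorm(n)
y   <- 1 + 0.3 * (grp == "N1") + 2 * cv + rnorm(n)
d   <- data.frame(y, grp, cv)

m_factor <- lm(y ~ grp, data = d)       # factor only
m_ancova <- lm(y ~ grp + cv, data = d)  # factor plus covariate

summary(m_factor)$adj.r.squared  # low: the factor alone explains little
summary(m_ancova)$adj.r.squared  # much higher once cv is included
```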
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
The interpretation of the ANCOVA
model is more complex than that of the one-way ANOVA. In fact, the
intercept value has changed and it is no longer the mean of the reference level N1.
This is because the model now changes based on the covariate bv. The coefficients can
be used to assess the relative impact of each term; for example, N0 has a
negative impact on yield relative to its reference level. Therefore,
shifting from nitrogen level N1 to N0 decreases the yield by 3.52, if bv is
kept constant. </div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
Take a look at the following
code:</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > test.set = data.frame(nf="N1", bv=150)
> predict(mod3, test.set)
       1 
96.86350 
>
> test.set = data.frame(nf="N0", bv=150)
> predict(mod3, test.set)
1
93.34037
</code></pre>
<br />
<div class="MsoNormal">
Here we are using the model (mod3) to estimate new values of
yield based on set parameters. In the first example we set nf to N1 (reference
level) and bv constant at 150. With the function <span class="CodeChar">predict</span>
we can estimate these new values using mod3. For N1 and bv equal to 150 the
yield is 96.86. In the second example we did the same but for nitrogen level
N0. The result is a yield equal to 93.34, that is a difference of exactly 3.52,
which is the coefficient of N0 in the model.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
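The arithmetic above can be verified on any dataset: the difference between two predictions that differ only in the factor level equals minus that level's coefficient. A self-contained sketch on simulated data (illustrative names throughout):

```r
# The gap between two predictions differing only in the factor level
# equals -coef for that level (the reference level carries no coefficient).
set.seed(1)
d <- data.frame(
  nf = factor(rep(c("N1", "N0"), 50), levels = c("N1", "N0")),
  bv = runif(100, 100, 200)
)
d$y <- 50 - 3.5 * (d$nf == "N0") - 0.5 * d$bv + rnorm(100)
m   <- lm(y ~ nf + bv, data = d)

p1 <- predict(m, data.frame(nf = "N1", bv = 150))  # reference level
p0 <- predict(m, data.frame(nf = "N0", bv = 150))
p1 - p0          # equals -coef(m)["nfN0"]
coef(m)["nfN0"]
```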
<div class="MsoNormal" style="text-align: justify;">
For computing the ANOVA table, we
can again use either the function <span class="CodeChar">anova</span> (if the
design is balanced) or <span class="CodeChar">Anova</span> with type
III (for unbalanced designs).<br />
<br />
Let's now look at some diagnostic plots we can use to test whether our model meets all the assumptions for linear models. We can use the default plot function in R to do so:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> par(mfrow=c(2,2))
plot(mod3)
</code></pre>
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
Here I first used the function par to divide the plotting window into two rows and two columns, so that we can plot all four diagnostic plots in the same window. The result is the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYykxVLHu_htli2om_pmzcxZ8bOh68qKgS0C48ktZM0___AUHXeeMKYeVW2iwldDB4QHwZBPtBzQTb5CucxNjEe9AND1emE07qlQyZvdog2u_onhTcA7YzO_PiUpwqnUZG1WFaxVVuPbbu/s1600/DiagnosticPlots.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="816" data-original-width="863" height="604" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYykxVLHu_htli2om_pmzcxZ8bOh68qKgS0C48ktZM0___AUHXeeMKYeVW2iwldDB4QHwZBPtBzQTb5CucxNjEe9AND1emE07qlQyZvdog2u_onhTcA7YzO_PiUpwqnUZG1WFaxVVuPbbu/s640/DiagnosticPlots.jpeg" width="640" /></a></div>
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
The top-left plot represents the residuals against the fitted values (i.e. the estimates from the model). One of our assumptions was that the error term has mean equal to zero and constant variance. This means that we should see the residuals equally spread around zero, i.e. a more or less horizontal line centred on zero. In fact, we see that for low values of yield (x axis) we have a roughly equal spread around zero, but this changes as yield increases; this is clearly a violation of the assumption. The top-right plot is the QQ plot of the residuals, which should lie along the diagonal line because another assumption is that the errors should be normal. Again we have some values that do not fit with normality. A good thing is that R prints the IDs of these values so that we can evaluate whether they are outliers or whether we have another explanation for the violations of the assumptions.<br />
In the second row on the left we have a plot that again is used to check whether we meet the assumption of constant variance. We should again see a more or less horizontal line, but instead we have an increase in variance, which violates the assumption. Finally, we have the residuals vs. leverage plot with the Cook's distance contours. Leverage represents the influence of each point on the model; again we see that some points have a larger influence on the model. This should not be the case: we should see a more or less equal spread of points. Another piece of information we can get from this plot is whether the extreme observations may be outliers. If the extreme points lay outside the Cook's distance zone we would suspect them to be outliers. However, in this case the zone is not even plotted because it falls outside the plotting area, so we probably do not have outliers.<br />
<br />
For more info about diagnostic plots please take a look here:<br />
http://data.library.virginia.edu/diagnostic-plots/<br />
http://www.statmethods.net/stats/rdiagnostics.html<br />
<br />
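The visual checks above also have numeric counterparts. A hedged sketch on simulated heteroscedastic data (illustrative names throughout):

```r
# Numeric counterparts to the four diagnostic plots.
set.seed(7)
x <- runif(300)
y <- 1 + 2 * x + rnorm(300, sd = 0.5 + x)  # variance grows with x
m <- lm(y ~ x)

res <- residuals(m)
fit <- fitted(m)

cor(abs(res), fit)         # clearly positive: a heteroscedasticity warning
shapiro.test(res)$p.value  # formal normality test (valid for n <= 5000)
head(cooks.distance(m))    # influence measures behind the
head(hatvalues(m))         # residuals-vs-leverage plot
```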
We can further investigate why our model does not meet all assumptions by looking at the residuals vs. fitted values and colouring the points, for example by topo. We can do that easily with ggplot2:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> qplot(fitted(mod3), residuals(mod3), col=dat$topo, geom="point", xlab="Fitted Values", ylab="Residuals")
</code></pre>
<br />
This creates the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWEGbcoHbSZjcT5m4hP0dlvoEspGcL11PJpkQHz4c6Rovbkey64s8spfUIYzSROnUqVSWKgdcnCj88Tx_5qQYEMzRHobLgM0V-p3A7Sj3XA0Kg-c7wtg1qv8UPoXpswgbwxDmO0V4U4QNp/s1600/DiagnosticPlots2.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="763" data-original-width="835" height="365" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWEGbcoHbSZjcT5m4hP0dlvoEspGcL11PJpkQHz4c6Rovbkey64s8spfUIYzSROnUqVSWKgdcnCj88Tx_5qQYEMzRHobLgM0V-p3A7Sj3XA0Kg-c7wtg1qv8UPoXpswgbwxDmO0V4U4QNp/s400/DiagnosticPlots2.jpeg" width="400" /></a></div>
<br />
From this image it is clear that all the points that look like possible outliers come from a specific topographic category. This may mean different things depending on the data. In this case I think it only means that we should include topo in our model. However, for other data it may mean that we should exclude certain categories. The point I want to make is that it is always important not to look only at the summary table, but to explore the model in more detail in order to draw more meaningful conclusions.<br />
<br />
<h4>
Update 24/07/2017</h4>
<div>
In the package agricolae there are functions to compute Tukey's and LSD pairwise comparisons. The code is very simple:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("agricolae")
library(agricolae)
</code></pre>
<div>
<br />
Now we can perform the Tukey's test first:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > HSD.test(mod3, trt="nf", console=T, alpha=0.05)
Study: mod3 ~ "nf"
HSD Test for yield
Mean Square Error: 259.8756
nf, means
yield std r Min Max
N0 64.97290 20.94146 573 12.66 108.84
N1 68.61636 19.20452 577 27.44 110.54
N2 69.65033 19.30984 571 31.79 112.85
N3 70.33586 19.22650 575 19.41 110.12
N4 72.56302 19.14603 572 32.05 117.90
N5 72.83176 20.13865 575 31.79 117.19
alpha: 0.05 ; Df Error: 3436
Critical Value of Studentized Range: 4.032372
Harmonic Mean of Cell Sizes 573.8261
Honestly Significant Difference: 2.713646
Means with the same letter are not significantly different.
Groups, Treatments and means
a N5 72.83
a N4 72.56
ab N3 70.34
b N2 69.65
b N1 68.62
c N0 64.97
</code></pre>
<br /></div>
and then the LSD:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > LSD.test(mod3, trt="nf", console=T, alpha=0.05)
Study: mod3 ~ "nf"
LSD t Test for yield
Mean Square Error: 259.8756
nf, means and individual ( 95 %) CI
yield std r LCL UCL Min Max
N0 64.97290 20.94146 573 63.65249 66.29330 12.66 108.84
N1 68.61636 19.20452 577 67.30054 69.93218 27.44 110.54
N2 69.65033 19.30984 571 68.32762 70.97305 31.79 112.85
N3 70.33586 19.22650 575 69.01776 71.65397 19.41 110.12
N4 72.56302 19.14603 572 71.24147 73.88458 32.05 117.90
N5 72.83176 20.13865 575 71.51365 74.14986 31.79 117.19
alpha: 0.05 ; Df Error: 3436
Critical Value of t: 1.960655
t-Student: 1.960655
Alpha : 0.05
Minimum difference changes for each comparison
Means with the same letter are not significantly different
Groups, Treatments and means
a N5 72.83176
a N4 72.56302
b N3 70.33586
b N2 69.65033
b N1 68.61636
c N0 64.9729
</code></pre>
<br />
The results should be easy to interpret.<br />
<br /></div>
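The agricolae functions above are one option; base R also offers TukeyHSD() for models fitted with aov(). A minimal sketch on simulated data (illustrative names, not the tutorial dataset):

```r
# Base-R Tukey honest significant differences for a one-way design.
set.seed(11)
d <- data.frame(
  trt = factor(rep(c("N0", "N1", "N2"), each = 40)),
  y   = rnorm(120)
)
d$y <- d$y + c(0, 1, 2)[as.numeric(d$trt)]  # shift the group means

fit <- aov(y ~ trt, data = d)
TukeyHSD(fit, conf.level = 0.95)  # pairwise differences with adjusted p-values
```

Unlike HSD.test, this prints confidence intervals and adjusted p-values per pair rather than grouping letters.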
<h3>
Two-factors and one continuous explanatory variable</h3>
<div>
<div class="MsoNormal" style="text-align: justify;">
Let’s look now at another example
with a slightly more complex model, where we include two factorial variables and one
continuous variable. The factor we add to the model is the variable topo. We can
check its levels with the function levels:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
</div>
<div class="MsoNormal" style="text-align: justify;">
<o:p></o:p></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > levels(dat$topo)
[1] "E" "HT" "LO" "W"
</code></pre>
<br />
<div class="MsoNormal" style="text-align: justify;">
Please notice that E is the first
and therefore is the reference level for this factor. Now let’s fit the model
and look at the summary table:<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > mod4 = lm(yield ~ nf + bv + topo, data=dat)
>
> summary(mod4)
Call:
lm(formula = yield ~ nf + bv + topo, data = dat)
Residuals:
Min 1Q Median 3Q Max
-46.394 -10.927 -2.211 10.364 47.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 171.20921 5.28842 32.374 < 2e-16 ***
nfN0 -3.73225 0.81124 -4.601 4.36e-06 ***
nfN2 1.29704 0.81203 1.597 0.1103
nfN3 1.56499 0.81076 1.930 0.0537 .
nfN4 3.71277 0.81161 4.575 4.94e-06 ***
nfN5 3.88382 0.81091 4.789 1.74e-06 ***
bv -0.54206 0.03038 -17.845 < 2e-16 ***
topoHT -24.11882 0.78112 -30.877 < 2e-16 ***
topoLO 3.13643 0.70924 4.422 1.01e-05 ***
topoW -10.77758 0.66708 -16.156 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.75 on 3433 degrees of freedom
Multiple R-squared: 0.5204, Adjusted R-squared: 0.5191
F-statistic: 413.8 on 9 and 3433 DF, p-value: < 2.2e-16
</code></pre>
<br />
<div class="MsoNormal">
The adjusted R-squared increases again and now we are able
to explain around 52% of the variation in yield.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
The interpretation is similar to what we said before; the
only difference is that here both factors have a reference level. So, for
example, the effect of topoHT is relative to the reference level, which is the
one not shown, E. So if we change the topographic position from E to HT, while
keeping everything else in the model constant (meaning the same value of bv and the
same nitrogen level), we obtain a decrease in yield of 24.12.</div>
<div class="MsoNormal">
<br /></div>
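Since every factor coefficient is read relative to the reference level, it is worth knowing that relevel() lets us change that reference. A hedged sketch on simulated data (illustrative names, not the tutorial dataset):

```r
# Factor coefficients are relative to the reference level;
# relevel() changes which level plays that role.
set.seed(3)
topo <- factor(rep(c("E", "HT", "LO", "W"), each = 30))
y    <- rnorm(120) + c(0, -24, 3, -11)[as.numeric(topo)]

m_E   <- lm(y ~ topo)             # reference level E (first alphabetically)
topo2 <- relevel(topo, ref = "HT")
m_HT  <- lm(y ~ topo2)            # now HT is the reference

coef(m_E)["topoHT"]   # effect of moving from E to HT (about -24)
coef(m_HT)["topo2E"]  # the same contrast seen from HT (about +24)
```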
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
Another thing we can see from this table is that the p-values
change, and for example N3 becomes less significant. This is probably because,
when we consider more variables, the effect of N3 on yield is partly explained by other
variables, maybe partly bv and partly topo.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
One last thing we can check, and this is something we should
check every time we perform an ANOVA or fit a linear model, is the normality of the
residuals. We already saw that the summary table provides us with some data
about the residuals' distribution (minimum, first quartile, median, third
quartile and maximum) that gives us a good indication of normality, since the
distribution is centred around 0. However, we can also use other tools to check
this, for example a QQ plot:</div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> RES = residuals(mod4)
qqnorm(RES)
qqline(RES)
</code></pre>
<br />
<div class="MsoNormal">
The function residuals automatically extracts the residuals
from the model, which can then be used to create the following plot:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOaQVwd8r7zQfXMsrx9IzFGLbzdZ9moQLJka-1_kbX07pvZ2rgdlSWzdbkWGKgqopM4lKCGlkYvk2uiM1x8u0JXG4w3WxYkrXwigU2zEFBbK1o26B5Fam8ZVzw6SfItt7O0DQgpDFrtUrh/s1600/Fig7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOaQVwd8r7zQfXMsrx9IzFGLbzdZ9moQLJka-1_kbX07pvZ2rgdlSWzdbkWGKgqopM4lKCGlkYvk2uiM1x8u0JXG4w3WxYkrXwigU2zEFBbK1o26B5Fam8ZVzw6SfItt7O0DQgpDFrtUrh/s400/Fig7.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="MsoNormal">
It looks approximately normal, but to have further
confirmation we can again use the function <span class="CodeChar">skewness</span>,
which returns a value below 0.5, so we can consider this distribution approximately normal.<o:p></o:p><br />
<br />
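For reference, sample skewness is simple to compute by hand; packages such as moments or e1071 provide a skewness() function implementing the same idea. A hedged sketch (res here is simulated as a stand-in for residuals(mod4)):

```r
# Sample skewness: third central moment scaled by the variance^(3/2).
skewness_manual <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2)^(3 / 2))
}

set.seed(99)
res <- rnorm(1000)    # stand-in for residuals(mod4)
skewness_manual(res)  # close to 0 for symmetric data
```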
<h4>
Update 26/07/2017</h4>
<div>
The function lsmeans computes the predicted marginal means for combinations of factors and also allows pairwise comparisons. Let's look at the code:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> install.packages("lsmeans")
library(lsmeans)
</code></pre>
<div>
<br />
After installing lsmeans we can run the function lsmeans to compute the marginal means for nf and topo:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > lsmeans(mod4, specs=c("nf","topo"), adjust="tukey")
nf topo lsmean SE df lower.CL upper.CL
N0 E 72.93305 0.7320960 3433 71.49766 74.36844
N1 E 76.66530 0.7303482 3433 75.23334 78.09726
N2 E 77.96234 0.7309577 3433 76.52919 79.39550
N3 E 78.23029 0.7333588 3433 76.79243 79.66815
N4 E 80.37807 0.7334009 3433 78.94012 81.81602
N5 E 80.54912 0.7352323 3433 79.10758 81.99065
N0 HT 48.81423 0.7696117 3433 47.30529 50.32317
N1 HT 52.54648 0.7631689 3433 51.05017 54.04279
N2 HT 53.84352 0.7709208 3433 52.33201 55.35503
N3 HT 54.11147 0.7751608 3433 52.59164 55.63129
N4 HT 56.25925 0.7695510 3433 54.75042 57.76807
N5 HT 56.43029 0.7781928 3433 54.90453 57.95606
N0 LO 76.06948 0.7358242 3433 74.62678 77.51218
N1 LO 79.80173 0.7372548 3433 78.35622 81.24723
N2 LO 81.09877 0.7374276 3433 79.65293 82.54461
N3 LO 81.36672 0.7285789 3433 79.93823 82.79521
N4 LO 83.51450 0.7382983 3433 82.06695 84.96205
N5 LO 83.68554 0.7269067 3433 82.26033 85.11076
N0 W 62.15548 0.6765945 3433 60.82891 63.48204
N1 W 65.88772 0.6768164 3433 64.56072 67.21473
N2 W 67.18477 0.6778270 3433 65.85578 68.51375
N3 W 67.45271 0.6749090 3433 66.12945 68.77598
N4 W 69.60049 0.6749314 3433 68.27719 70.92380
N5 W 69.77154 0.6726693 3433 68.45267 71.09041
Confidence level used: 0.95
</code></pre>
<br />
Now we can use the function cld to obtain the letters specifying which combinations are significantly different:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > cld(lsmeans(mod4, specs=c("nf","topo"), adjust="tukey"),
+ alpha = 0.05,
+ Letters = letters,
+ adjust = "tukey")
nf topo lsmean SE df lower.CL upper.CL .group
N0 HT 48.81423 0.7696117 3433 46.44912 51.17934 a
N1 HT 52.54648 0.7631689 3433 50.20116 54.89179 b
N2 HT 53.84352 0.7709208 3433 51.47439 56.21266 bc
N3 HT 54.11147 0.7751608 3433 51.72930 56.49363 bc
N4 HT 56.25925 0.7695510 3433 53.89432 58.62417 c
N5 HT 56.43029 0.7781928 3433 54.03881 58.82178 c
N0 W 62.15548 0.6765945 3433 60.07622 64.23473 d
N1 W 65.88772 0.6768164 3433 63.80778 67.96766 e
N2 W 67.18477 0.6778270 3433 65.10172 69.26781 ef
N3 W 67.45271 0.6749090 3433 65.37863 69.52679 ef
N4 W 69.60049 0.6749314 3433 67.52635 71.67464 fg
N5 W 69.77154 0.6726693 3433 67.70434 71.83874 fg
N0 E 72.93305 0.7320960 3433 70.68323 75.18287 g
N0 LO 76.06948 0.7358242 3433 73.80820 78.33076 h
N1 E 76.66530 0.7303482 3433 74.42085 78.90975 h
N2 E 77.96234 0.7309577 3433 75.71602 80.20867 hij
N3 E 78.23029 0.7333588 3433 75.97659 80.48399 hi k
N1 LO 79.80173 0.7372548 3433 77.53605 82.06740 ijkl
N4 E 80.37807 0.7334009 3433 78.12424 82.63190 ijklm
N5 E 80.54912 0.7352323 3433 78.28966 82.80858 ijkl n
N2 LO 81.09877 0.7374276 3433 78.83257 83.36498 klmno
N3 LO 81.36672 0.7285789 3433 79.12770 83.60573 j lmno
N4 LO 83.51450 0.7382983 3433 81.24562 85.78338 no
N5 LO 83.68554 0.7269067 3433 81.45167 85.91942 m o
Confidence level used: 0.95
Conf-level adjustment: sidak method for 24 estimates
P value adjustment: tukey method for comparing a family of 24 estimates
significance level used: alpha = 0.05
</code></pre>
<br /></div>
</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<h3>
ANCOVA with interactions</h3>
<div>
<div class="MsoNormal">
Let’s now add a further layer of complexity by adding an
interaction term between bv and topo. Once again we need to formulate a
hypothesis before proceeding to test it. Since we are interested in an
interaction between a continuous variable (bv) and a factorial variable (topo)
on yield, we could try to create scatterplots of yield versus bv for the
different levels of topo. We can easily do that with the package ggplot2:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> qplot(yield, bv, data=dat, geom="point", xlab="Yield", ylab="bv") +
facet_wrap(~topo)+
geom_smooth(method = "lm", se = TRUE)
</code></pre>
<br />
<div class="MsoNormal">
Explaining every bit of the three lines of code above would
require some time and it is beyond the scope of this tutorial. In essence,
these lines create a scatterplot yield versus bv for each subgroup of topo and
then fit a linear regression line through the points. For more info about the
use of ggplot2 please start by looking here: <a href="http://www.statmethods.net/advgraphs/ggplot2.html">http://www.statmethods.net/advgraphs/ggplot2.html</a>
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This creates the plot below:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI_5ZpahSi-fxDETgTza343A12KlUxmqMno33XJjP6ktfXrQ8AEl9zgVPjTzfi-j_08nAVrVTBZ5CD-jN_eXQzt2bSjI2T3e8pZVEKDm0FUTu9PITFm6sqzSKBrnX5VB1K-2HZgh6SYWQl/s1600/Fig8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI_5ZpahSi-fxDETgTza343A12KlUxmqMno33XJjP6ktfXrQ8AEl9zgVPjTzfi-j_08nAVrVTBZ5CD-jN_eXQzt2bSjI2T3e8pZVEKDm0FUTu9PITFm6sqzSKBrnX5VB1K-2HZgh6SYWQl/s400/Fig8.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="MsoNormal">
From this plot it is clear that the four lines have
different slopes, so the interaction between bv and topo may well be
significant and help us further increase the explanatory power of our model. We
can test that by adding this interaction:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> mod5 = lm(yield ~ nf + bv * topo, data=dat)
</code></pre>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
We can use the function Anova to check the significance:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > Anova(mod5, type=c("III"))
Anova Table (Type III tests)
Response: yield
Sum Sq Df F value Pr(>F)
(Intercept) 20607 1 115.225 < 2.2e-16 ***
nf 23032 5 25.758 < 2.2e-16 ***
bv 5887 1 32.920 1.044e-08 ***
topo 40610 3 75.691 < 2.2e-16 ***
bv:topo 36059 3 67.209 < 2.2e-16 ***
Residuals 613419 3430
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code></pre>
<br />
<div class="MsoNormal">
As you can see this interaction is significant. To check the
details we can look at the summary table:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod5)
Call:
lm(formula = yield ~ nf + bv * topo, data = dat)
Residuals:
Min 1Q Median 3Q Max
-46.056 -10.328 -1.516 9.622 50.184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.45783 8.70646 10.734 < 2e-16 ***
nfN1 3.96637 0.78898 5.027 5.23e-07 ***
nfN2 5.24313 0.79103 6.628 3.93e-11 ***
nfN3 5.46036 0.79001 6.912 5.68e-12 ***
nfN4 7.52685 0.79048 9.522 < 2e-16 ***
nfN5 7.73646 0.79003 9.793 < 2e-16 ***
bv -0.27108 0.04725 -5.738 1.04e-08 ***
topoW 88.11105 12.07428 7.297 3.63e-13 ***
topoE 236.12311 17.12941 13.785 < 2e-16 ***
topoLO -15.76280 17.27191 -0.913 0.3615
bv:topoW -0.41393 0.06726 -6.154 8.41e-10 ***
bv:topoE -1.21024 0.09761 -12.399 < 2e-16 ***
bv:topoLO 0.28445 0.10104 2.815 0.0049 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.37 on 3430 degrees of freedom
Multiple R-squared: 0.547, Adjusted R-squared: 0.5454
F-statistic: 345.1 on 12 and 3430 DF, p-value: < 2.2e-16
</code></pre>
<br />
<div class="MsoNormal">
The R-squared is a bit higher, which means that by adding the
interaction we can explain more of the variability in yield. For the
interpretation, once again everything is related to the reference levels of the
factors, including the interaction. So for example, bv:topoW tells us that the
slope of bv becomes more negative (by about 0.41) when we move from the
reference level HT to W, keeping everything else constant.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For information about individual changes we would need to
use the model to estimate new data as we did for mod3.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
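Just to make this concrete, below is a minimal sketch of such an estimate with <code>predict</code> (the new data frame is hypothetical, and I assume <code>mod5</code> and <code>dat</code> are still loaded in the session):

```r
# Hypothetical new observations: nitrogen level N1, average bv, and the two
# topographic categories we want to compare (reference HT versus W).
new.data <- data.frame(nf = "N1",
                       bv = mean(dat$bv),
                       topo = c("HT", "W"))

# Fitted yields with confidence intervals; the difference between the two
# rows is the estimated effect of moving from HT to W at average bv under N1.
predict(mod5, newdata = new.data, interval = "confidence")
```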
<div class="MsoNormal">
<br /></div>
<h3>
GLS – For violations of independence</h3>
<div>
<div class="MsoNormal">
As we mentioned, there are certain assumptions we need to
check before starting an analysis with linear models. Assumptions about
normality and equality of variance can be relaxed, particularly if sample sizes
are large enough. However, other assumptions, for example balance in the design and
independence, tend to be stricter, and we need to be careful about violating them.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
We can check for independence by looking at the correlation
between the coefficients, printed directly in the summary table. We do that because we need to check the independence of the errors (i.e. the residuals): if the residuals are independent, these correlations will be low.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > summary(mod5, cor=T)
[…]
Correlation of Coefficients:
(Intercept) nfN1 nfN2 nfN3 nfN4 nfN5 bv topoW topoE topoLO
nfN1 -0.05
nfN2 -0.04 0.50
nfN3 -0.05 0.50 0.50
nfN4 -0.05 0.50 0.50 0.50
nfN5 -0.05 0.50 0.50 0.50 0.50
bv -1.00 0.01 -0.01 0.01 0.00 0.00
topoW -0.72 0.00 0.00 0.00 0.00 0.00 0.72
topoE -0.51 0.01 0.02 0.03 0.01 0.02 0.51 0.37
topoLO -0.50 -0.02 -0.01 0.02 -0.01 0.02 0.50 0.36 0.26
bv:topoW 0.70 0.00 0.00 0.00 0.00 0.00 -0.70 -1.00 -0.36 -0.35
bv:topoE 0.48 -0.01 -0.02 -0.03 -0.01 -0.02 -0.48 -0.35 -1.00 -0.24
bv:topoLO 0.47 0.02 0.01 -0.02 0.01 -0.02 -0.47 -0.34 -0.24 -1.00
bv:topoW bv:topoE
nfN1
nfN2
nfN3
nfN4
nfN5
bv
topoW
topoE
topoLO
bv:topoW
bv:topoE 0.34
bv:topoLO 0.33 0.23
</code></pre>
<br />
<div class="MsoNormal">
If we exclude the interaction, which would clearly be
correlated with the single covariates, the rest of the coefficients are not
much correlated. From this we may conclude that our assumption of independence
holds true for this dataset.<br />
<br />
We can also graphically check the independence of the errors by simply plotting the residuals against the fitted values and then fitting a line through the points:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(ggplot2)
 qplot(fitted(mod5), residuals(mod5), geom="point", xlab="Fitted Values", ylab="Residuals") +
geom_smooth(method = "lm", se = TRUE)
</code></pre>
<br />
The result is the image below, which is simply another way to obtain the same result we saw for the ANCOVA but this time in ggplot2:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT0xlU7R4FSqbWsP0w7VU6CuvS7aRLRvyNFH6Xeh1SCt0TlTd96v1pZwvMaxgD9rNWrq0gH65HNKky16gaW4a6HmNNpLJ3gtu3HAsykaazlckfD0Uj8-O_BDayyN6zR5SEgVzAGTT0piLg/s1600/Residuals_Independence.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="672" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT0xlU7R4FSqbWsP0w7VU6CuvS7aRLRvyNFH6Xeh1SCt0TlTd96v1pZwvMaxgD9rNWrq0gH65HNKky16gaW4a6HmNNpLJ3gtu3HAsykaazlckfD0Uj8-O_BDayyN6zR5SEgVzAGTT0piLg/s400/Residuals_Independence.jpeg" width="400" /></a></div>
<br />
<br />
As you can see the line is horizontal, which means the residuals have no trend. Moreover, their spread is more or less constant over the whole range of fitted values. As you remember, we assume the error term of the linear model to have zero mean and constant variance, so for both these reasons I think we can consider that the model meets both assumptions. However, if we color the points based on the variable topo (not shown here, but very easy to do with the option col in qplot) we can see that the 3-4 smaller clouds in the plot above are produced by particular topographic categories. This, coupled with the fact that our data are probably autocorrelated since they were sampled in space, may lead us to conclude that we should not assume independence, and that GLS would therefore be the best method.<br />
<br />
In cases where the assumption of independence is violated, we would need a more robust way
of computing the coefficients, based on maximum likelihood (ML) or restricted maximum likelihood (REML). This can be done with the function <span class="CodeChar">gls</span> in the package <span class="CodeChar">nlme</span>,
using the same syntax as for <span class="CodeChar">lm</span>:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(nlme)
 mod6 = gls(yield ~ nf + bv * topo, data=dat, method="REML")
Anova(mod6, type=c("III"))
summary(mod6)
</code></pre>
<br />
<div class="MsoNormal">
As you can see despite the different function (<span class="CodeChar">gls</span> instead of <span class="CodeChar">lm</span>),
the rest of the syntax is the same as before. We can still use the function <span class="CodeChar">Anova</span> to print the ANOVA table and summary to check
the individual coefficients.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Moreover, we can also use the function <span class="CodeChar">anova</span>
to compare the two models (the one from <span class="CodeChar">gls</span>
and <span class="CodeChar">lm</span>) and see which is the best
performer:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > anova(mod6, mod5)
Model df AIC BIC logLik
mod6 1 14 27651.21 27737.18 -13811.61
mod5 2 14 27651.21 27737.18 -13811.61
</code></pre>
<br />
<div class="MsoNormal">
AIC, BIC and logLik are all indexes used to compare the relative
quality of the models: AIC and BIC should be as low as possible, while the
log-likelihood should be as high as possible. For more info please
look at the appendix about assessing the accuracy of our model. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
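If you want to inspect these indexes individually, R provides dedicated extractor functions; a quick sketch, assuming <code>mod5</code> and <code>mod6</code> are fitted as above:

```r
AIC(mod5)        # Akaike Information Criterion: lower is better
BIC(mod5)        # Bayesian Information Criterion: lower is better
logLik(mod5)     # log-likelihood: higher is better

AIC(mod6, mod5)  # returns a data.frame comparing the two models
```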
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<h2>
References and Further Reading</h2>
Finch, W.H., Bolin, J.E. and Kelley, K., 2014. <i>Multilevel modeling using R</i>. CRC Press.<br />
<br />
Yan, X. and Su, X., 2009. <i>Linear regression analysis: theory and computing</i>. World Scientific.<br />
<br />
James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. <i>An introduction to statistical learning</i> (Vol. 6). New York: Springer. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf<br />
<br />
Long, J. Scott. 1997. <i>Regression Models for Categorical and Limited Dependent Variables</i>. Sage, pp. 104-106. [For pseudo R-squared equations, page available on Google Books]<br />
<br />
Webster, R. and Oliver, M.A., 2007. <i>Geostatistics for environmental scientists</i>. John Wiley & Sons.<br />
<br />
West, B.T., Galecki, A.T. and Welch, K.B., 2014. <i>Linear mixed models</i>. CRC Press.<br />
<br />
Gałecki, A. and Burzykowski, T., 2013. <i>Linear mixed-effects models using R: A step-by-step approach</i>. Springer Science & Business Media.<br />
<br />
Williams, R., 2004. <i>One-Way Analysis of Variance</i>. URL: https://www3.nd.edu/~rwilliam/stats1/x52.pdf<br />
<br />
Witte, R. and Witte, J. 2009. <i>Statistics. 9th ed. </i>Wiley.<br />
<div>
<br /></div>
<span style="font-size: large;">Spatio-Temporal Point Pattern Analysis in ArcGIS with R</span><br />
This post will probably be the last in my series about merging R and ArcGIS. In August I will unfortunately have to work for real, and I will not have time to play with <a href="https://r-arcgis.github.io/" target="_blank">R-Bridge</a> any more.<br />
In this post I would like to present a toolbox to perform some introductory point pattern analysis in R through ArcGIS. Basically, I developed a toolbox to perform the tests I presented in my previous post about <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-point-pattern-analysis-of.html" target="_blank">point pattern analysis</a>. In there, you can find some theoretical concepts that you need to know to understand what this toolbox can do.<br />
I will start by introducing the sample dataset we are going to use, and then simply show the packages available.<br />
<br />
<span style="font-size: large;">Dataset</span><br />
For presenting this toolbox I am using the same dataset I used for my <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-point-pattern-analysis-of.html" target="_blank">previous post</a>, namely the open crime data from the UK. For this post I downloaded crimes in the London area for the whole of 2015. As you can see from the image below, we are talking about more than 950'000 crimes of several categories, all across London.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyTcDmFfDWw9h1Y30xpSm_4E7SNxKUbrLXF829HyYCtCcTWPdZ-Fo0cjdZpkejt9qEBIL_mTp4YjI52VfsmpX_YBP77w3p7T2opORhPaSUHD6B7MFviiHmPCgkqyP3CvnFfxamMkMPlSY8/s1600/Fig1b.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyTcDmFfDWw9h1Y30xpSm_4E7SNxKUbrLXF829HyYCtCcTWPdZ-Fo0cjdZpkejt9qEBIL_mTp4YjI52VfsmpX_YBP77w3p7T2opORhPaSUHD6B7MFviiHmPCgkqyP3CvnFfxamMkMPlSY8/s640/Fig1b.jpg" width="640" /></a></div>
<br />
I also included a polygon shapefile with the area around London and all its boroughs, which should be visible as blue lines around the city. I included this because point pattern analysis requires the user to set the border of the study area, as I mentioned in my <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-point-pattern-analysis-of.html" target="_blank">previous post</a>.<br />
<br />
<br />
<span style="font-size: large;">Spatio-Temporal Subset</span><br />
The first package I would like to present is a simple spatio-temporal subsetting tool. This is completely based on R but it is basically a more flexible version of the selection tools available in ArcGIS.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGfkG3_junWEY-h6mZehhTzIMwyUZhGxX3vmkkmbAKfHqUmpmZmgyQoFhInC02Dn-zQ05r4PIXIs8Yo3GJz8tD7UOSRqbkaeOyW_FMx1TSIQDJXemtQhmKjU8gV4nDCpwjMZQXtnhmLGir/s1600/Fig4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGfkG3_junWEY-h6mZehhTzIMwyUZhGxX3vmkkmbAKfHqUmpmZmgyQoFhInC02Dn-zQ05r4PIXIs8Yo3GJz8tD7UOSRqbkaeOyW_FMx1TSIQDJXemtQhmKjU8gV4nDCpwjMZQXtnhmLGir/s640/Fig4.jpg" width="640" /></a></div>
<br />
Here users can select points based on various parameters at once. For example, they can subset the polygon shapefile (here I'm extracting the borough of Ealing) and extract points only for that area. Then they can subset by time, with the same strings I presented in my previous post about a <a href="http://r-video-tutorial.blogspot.ch/2016/07/time-series-analysis-in-arcgis.html" target="_blank">toolbox for time series analysis</a>. Optionally, they can also subset the dataset itself based on some categories. In this example I'm extracting only the drug-related crimes committed in Ealing in May 2015.<br />
It is important to point out that in this first version of the toolbox users can only select one element in the SQL statements. For example, here I have "name" = 'Ealing'. In ArcGIS users could also add an AND and then specify another condition; however, the R code does not handle multiple inputs and conditions (e.g. AND, OR), so only one option can be specified.<br />
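To give you an idea of the R logic behind this tool, here is a sketch of the three subsetting steps (the object and column names, e.g. <code>crimes</code>, <code>boroughs</code>, <code>Crime.type</code>, are assumptions and this is not the actual toolbox code):

```r
library(sp)

# 1. Attribute subset of the polygons, then spatial subset of the points:
#    over() returns NA for points falling outside the selected polygon.
ealing <- boroughs[boroughs$name == "Ealing", ]
crimes.sub <- crimes[!is.na(over(crimes, geometry(ealing))), ]

# 2. Temporal subset, here May 2015 (assuming a Date column in year-month-day):
dates <- as.Date(crimes.sub$Date)
crimes.sub <- crimes.sub[dates >= as.Date("2015-05-01") &
                         dates <= as.Date("2015-05-31"), ]

# 3. Category subset, e.g. only drug-related crimes:
crimes.sub <- crimes.sub[crimes.sub$Crime.type == "Drugs", ]
```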
The result is a new shapefile, plotted directly on the ArcGIS console with the required subset of the data, as shown below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2IMvQEvNm4siH4G_VnTlheuaiLJ95oYCd5w77vNGEm2dUQtfMkbvAEZxsXF01iUzS1blj57Mu5rf7JdlDJno3siD0XB5-74cMM7vcWrR1Y_2DLWOCaYUXcjyFcWekNSg_NfzPrKbsElCK/s1600/Fig5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2IMvQEvNm4siH4G_VnTlheuaiLJ95oYCd5w77vNGEm2dUQtfMkbvAEZxsXF01iUzS1blj57Mu5rf7JdlDJno3siD0XB5-74cMM7vcWrR1Y_2DLWOCaYUXcjyFcWekNSg_NfzPrKbsElCK/s640/Fig5.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Spatio-Temporal Centroid</span><br />
As you may already know, ArcGIS provides a function to calculate the centroid of a point pattern. However, if we wanted to test for changes in the centroid location over time, we would need to first subset our data and then compute the centroid. What I did in this package is merge these two actions into one. This package, presented in the image below, loops through the dataset, subsetting the point pattern by time (users can choose between daily, monthly and yearly subsets) and then calculates the centroid for each time unit. Moreover, I also added an option to select the statistic to use: mean, median or mode.<br />
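The looping logic can be sketched in a few lines of R (again, object and column names such as <code>crimes</code> and <code>Date</code> are assumptions, not the actual toolbox code):

```r
library(sp)

# Group the point coordinates by month and compute a centroid per group;
# replace mean with median for the alternative statistic.
xy <- coordinates(crimes)
months <- format(as.Date(crimes$Date), "%Y-%m")

centroids <- t(sapply(split(seq_len(nrow(xy)), months), function(i) {
  c(x = mean(xy[i, 1]), y = mean(xy[i, 2]))
}))
head(centroids)   # one centroid (x, y) per month
```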
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj24h0AZEPOiH29HEG75s9R2mGaxZj20-A_C8oPr7pG76YoPht5P7l5EBEruFnSv45bna-k__FCWWz4hz7GVSSmyFBUKswtnDzaKAvZsVLR3NlA0sK8UYpzSFYATdb_6mdfbBTTRPX16GO3/s1600/Fig2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj24h0AZEPOiH29HEG75s9R2mGaxZj20-A_C8oPr7pG76YoPht5P7l5EBEruFnSv45bna-k__FCWWz4hz7GVSSmyFBUKswtnDzaKAvZsVLR3NlA0sK8UYpzSFYATdb_6mdfbBTTRPX16GO3/s640/Fig2.jpg" width="640" /></a></div>
<br />
The results for the three statistics are presented below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKVeuXb9EXnSdgoISX-4rtSV11HbSkFBwN313NAhFf-cTDAaqYxZPXJLov_e50I-OTtxiZR8449Kp4Q0eLPtoUBrRhWIb_DoZGXGWhtFBtoILZTOJyX5ERpn7fkNu7bONfcVdLkkUGZNHu/s1600/Fig3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="452" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKVeuXb9EXnSdgoISX-4rtSV11HbSkFBwN313NAhFf-cTDAaqYxZPXJLov_e50I-OTtxiZR8449Kp4Q0eLPtoUBrRhWIb_DoZGXGWhtFBtoILZTOJyX5ERpn7fkNu7bONfcVdLkkUGZNHu/s640/Fig3.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Spatio-Temporal Density</span><br />
This tool calculates the point density for specific regions and time frames by subsetting your dataset. This is something that you may be able to obtain directly in ArcGIS, but users would need to first subset their data and then perform the density analysis; this tool groups those two steps into one. Moreover, the package <i>spatstat</i>, which is used in R for point pattern analysis, has some clear advantages over the tool available in ArcGIS. For example, as I mentioned in my post, it provides ways to calculate the best bandwidth for the density estimation. In the script this is achieved using the function <a href="http://www.inside-r.org/packages/cran/spatstat/docs/bw.ppl">bw.ppl</a>, but if you need a different method you just need to replace this function with another. Moreover, as pointed out in this <a href="https://www.e-education.psu.edu/geog586/book/export/html/1734">tutorial</a>, ArcGIS does not correct for edge effects.<br />
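The core of the density computation can be sketched as follows (object names are assumptions: <code>pts</code> is the spatio-temporal subset of points and <code>borough</code> a single polygon for the study area):

```r
library(spatstat)
library(maptools)   # provides as.owin() methods for sp objects

# Build the planar point pattern with the borough as the analysis window.
win <- as.owin(borough)
pp  <- ppp(x = coordinates(pts)[, 1],
           y = coordinates(pts)[, 2],
           window = win)

# Kernel density with the likelihood cross-validation bandwidth bw.ppl;
# swap bw.ppl for e.g. bw.diggle if you prefer a different method.
dens <- density(pp, sigma = bw.ppl(pp))
plot(dens)
```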
Working with this package is very similar to the others I presented before:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7qA17-ktZL6B9yUqEsY1UsWujjmwUjZB3hrR2IWhN0pUHzLrdKlQzPtek6CuoZKfNeGXWg38q2zbMCw5DRK_tdLjb4nN0CFAou6ltcW2YoJ7yTgZwRa8_XaK5KVWPiKkMlKl2yAn8T6Te/s1600/ST_Density1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7qA17-ktZL6B9yUqEsY1UsWujjmwUjZB3hrR2IWhN0pUHzLrdKlQzPtek6CuoZKfNeGXWg38q2zbMCw5DRK_tdLjb4nN0CFAou6ltcW2YoJ7yTgZwRa8_XaK5KVWPiKkMlKl2yAn8T6Te/s640/ST_Density1.jpg" width="640" /></a></div>
<br />
Users need to specify the input point pattern, then a polygon shapefile for the study area, which can be subset to reduce the area under investigation. Then users can include a temporal subsetting (here I used the string "2015-10/" which means from October to the end of the year, please refer to this <a href="http://r-video-tutorial.blogspot.ch/2016/07/time-series-analysis-in-arcgis.html">post </a>for more info) and subset their data extracting a certain category of crimes. Again here the SQL statements cannot include more than one category.<br />
Finally, users need to provide a raster dataset for saving the density result. This needs to be a .tif file, otherwise in my tests the result did not appear on screen. The output of this script is the image below, for the borough of Bromley and only for robberies:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI5apIeWPHJ8RwNG6ykCEBwPIDq-tc384PBkhzNRTTppfGa-CwKWiNV6pnfOju35F-VJk5MsILr6Sjs0ZZbVEMy1dPFVo6uonvAiCq2wcYfrIuxgHt0Bbmg0EC4vQxnZ490sQ2IvPad2U3/s1600/ST_Density2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI5apIeWPHJ8RwNG6ykCEBwPIDq-tc384PBkhzNRTTppfGa-CwKWiNV6pnfOju35F-VJk5MsILr6Sjs0ZZbVEMy1dPFVo6uonvAiCq2wcYfrIuxgHt0Bbmg0EC4vQxnZ490sQ2IvPad2U3/s640/ST_Density2.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Spatio-Temporal Randomness</span><br />
This is another tool to perform a test for spatial randomness, based on the G function I explained in my previous <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-point-pattern-analysis-of.html">post</a>, but on a subset of the main dataset. A similar test is available in ArcGIS under "Multi-Distance Spatial Cluster Analysis (Ripleys K Function)", but in this case we are again performing it on a particular subset of our data.<br />
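For reference, the underlying test takes only a couple of lines with <code>spatstat</code>, assuming a <code>ppp</code> object <code>pp</code> has already been built from the subset with <code>ppp()</code>:

```r
library(spatstat)

# G function with a simulation envelope under complete spatial randomness:
G <- envelope(pp, Gest, nsim = 99)
plot(G)   # an observed curve above the envelope suggests clustering
```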
The GUI is very similar to the other I presented before:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYO01PlmgDINe97quv3s_WkhrXxCn_SFvGRSWG6NRW6EjlK2hA2d3wTxc07ziPzA2tpVNhiAkasuAIuf1ZkKK47zlaYORaimANG6LnPpUoRqJiJYFoCjAPKwsTPjY2N8nuwdcF-V0kiRNe/s1600/ST_Random1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYO01PlmgDINe97quv3s_WkhrXxCn_SFvGRSWG6NRW6EjlK2hA2d3wTxc07ziPzA2tpVNhiAkasuAIuf1ZkKK47zlaYORaimANG6LnPpUoRqJiJYFoCjAPKwsTPjY2N8nuwdcF-V0kiRNe/s640/ST_Random1.jpg" width="640" /></a></div>
<br />
The only difference is that here users also need to provide an output folder, where the plot created by R will be saved in jpeg at 300 dpi. Moreover, this tool also provides users with the point shapefile created by subsetting the main dataset.<br />
The output for the borough of Tower Hamlets and only for drug related crimes in March 2015 is the plot below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYAOargifGK6AD9yjLKSd0rhyKN0yRwgA86vkUTqVz9-Dqx_d0D4pwqulSzdv5q8gw_SThr-H-YIu-kltozrcWTih1cbGbgwOJdQxu3gZViMl2tg8ZwhG9Kh9a-fj1yi3mkJUZ4zOxPRkP/s1600/ST_Random2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYAOargifGK6AD9yjLKSd0rhyKN0yRwgA86vkUTqVz9-Dqx_d0D4pwqulSzdv5q8gw_SThr-H-YIu-kltozrcWTih1cbGbgwOJdQxu3gZViMl2tg8ZwhG9Kh9a-fj1yi3mkJUZ4zOxPRkP/s640/ST_Random2.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Spatio-Temporal Correlogram</span><br />
As the name suggests, I developed this tool to calculate and plot a correlogram on a spatio-temporal subset of my data. For this example I could not use the crime dataset, since it does not contain a continuous variable. Therefore I loaded the dataset of ozone measurements from sensors installed on trams here in Zurich, which I used for my post about <a href="http://r-video-tutorial.blogspot.hu/2015/08/spatio-temporal-kriging-in-r.html">spatio-temporal kriging</a>. This tool uses the function <a href="http://www.inside-r.org/packages/cran/ncf/docs/correlog">correlog</a> from the package <i>ncf</i> to calculate the correlogram. This function takes several arguments, among which an increment, the number of permutations and a TRUE/FALSE flag indicating whether the data are unprojected. Users will need to input all of these once they use the tool; they are additional options in the GUI, which otherwise is more or less identical to what I presented before, except for the selection of the variable of interest:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj1XyQKrpCvCMNJjwzNCzcSfB0EHpwyWdtl_BF3eCM_a3MFeONacbMVz0lVhtVP8CQPz5d51T93U2IwENf-ygg-xk1kDx87FPLHQEMnzgeDBCJQCZQZpXql9Z5nof-ciURXeDQ875sXVbe/s1600/CorrelogramGUI.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="486" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj1XyQKrpCvCMNJjwzNCzcSfB0EHpwyWdtl_BF3eCM_a3MFeONacbMVz0lVhtVP8CQPz5d51T93U2IwENf-ygg-xk1kDx87FPLHQEMnzgeDBCJQCZQZpXql9Z5nof-ciURXeDQ875sXVbe/s640/CorrelogramGUI.jpg" width="640" /></a></div>
<br />
The result is the image below, which is again saved in jpeg at 300 dpi. As for the spatio-temporal randomness tool, a shapefile with the spatio-temporal subset used to calculate the correlogram is also saved and opened in ArcGIS directly.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyGblD4YVl0mt1DqfGfFjr8ABZ2rYVMr6FzRNyVIQfGm4OCKve5AmJXpDbEogYprrxvQkFJCQQTfIIfRHwGxdhZyGUsJxi9QNHerrF3NUdnfTjessrl0VWn3jws8wMbgO2V7DRIqkYLVPf/s1600/Correlogram.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyGblD4YVl0mt1DqfGfFjr8ABZ2rYVMr6FzRNyVIQfGm4OCKve5AmJXpDbEogYprrxvQkFJCQQTfIIfRHwGxdhZyGUsJxi9QNHerrF3NUdnfTjessrl0VWn3jws8wMbgO2V7DRIqkYLVPf/s640/Correlogram.jpg" width="640" /></a></div>
<br />
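The call at the heart of this tool looks roughly like this (the <code>ozone</code> column names below are assumptions):

```r
library(ncf)

co <- correlog(x = ozone$x, y = ozone$y, z = ozone$Ozone,
               increment = 500,   # width of the distance classes, in map units
               resamp = 100,      # permutations for the significance test
               latlon = FALSE)    # set TRUE for unprojected lat/lon coordinates
plot(co)
```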
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Download </span><br />
The tool is available, along with the sample data, from my GitHub archive:<br />
<a href="https://github.com/fveronesi/Spatio-Temporal-Point-Pattern-Analysis">https://github.com/fveronesi/Spatio-Temporal-Point-Pattern-Analysis</a><br />
<br />
<span style="font-size: large;">Time Series Analysis in ArcGIS</span><br />
In this post I will introduce another toolbox I created to show the functions that can be added to ArcGIS by using R and the <a href="https://r-arcgis.github.io/" target="_blank">R-Bridge</a> technology.<br />
In this toolbox I basically implemented the functions I showed in the previous post about <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-time-series-analysis-of-us.html" target="_blank">time series analysis in R</a>.<br />
Once again I prepared a sample dataset that I included in the GitHub archive so that you can reproduce the experiment I'm presenting here. I will start my description from there.<br />
<br />
<span style="font-size: large;">Dataset</span><br />
As for my previous post, here I'm also including open data in shapefile format from the EPA, which I downloaded for free using the custom R function I presented <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-time-series-analysis-of-us.html" target="_blank">here</a>.<br />
I downloaded only temperature data (in F) from 2013, but I kept two categorical variables: State and Address.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7vBWZClceA58C0hwxPtVJ3cn3ewVBSOj28VEU4xqJyEc0sxyIBpPM93_s1WV9htnNpx2OU1HHlbLDA20-K0etBdrv7FW3nhbODZOGCeXY3qGKU1i6bDw1dYmxCizGmE08AZ42UzYEXIEM/s1600/Fig1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7vBWZClceA58C0hwxPtVJ3cn3ewVBSOj28VEU4xqJyEc0sxyIBpPM93_s1WV9htnNpx2OU1HHlbLDA20-K0etBdrv7FW3nhbODZOGCeXY3qGKU1i6bDw1dYmxCizGmE08AZ42UzYEXIEM/s640/Fig1.jpg" width="640" /></a></div>
<br />
As you can see from the image above the time variable is in the format year-month-day. As I mentioned in the post about the <a href="http://r-video-tutorial.blogspot.ch/2016/07/the-power-of-ggplot2-in-arcgis-plotting.html" target="_blank">plotting toolbox</a>, it is important to set this format correctly so that R can recognize it. Please refer to <a href="http://www.inside-r.org/r-doc/base/strptime" target="_blank">this page</a> for more information about the formats that R recognizes.<br />
<br />
<br />
<span style="font-size: large;">Time Series Plot</span><br />
This type of plot is available in several packages, including <code>ggplot2</code>, which I used to create the <a href="http://r-video-tutorial.blogspot.ch/2016/07/the-power-of-ggplot2-in-arcgis-plotting.html" target="_blank">plotting toolbox</a>. However, in my post about time series analysis I presented the package <code>xts</code>, which is very powerful for handling and plotting time-series data. For this toolbox I decided to maintain the same package and refer everything to <code>xts</code>, for several reasons that I will explain along the way.<br />
The first reason is related to the plotting capabilities of this package. Let's take a look for example at the first script in the toolbox, specific for plotting time series.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ6aYIxs8ujoyxVrigi2Cf5AK0LygDkKG57Nib3Uwa7efBpBx9UVwgyYj4L8mFc6dnuIbERJul3bUMXBHBdNFiVsHImWkYbZUxX3p9JZA0BGf8L9oObrg-Mi6OJMhnWn8Uw9lS5C06Z4mg/s1600/Fig2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ6aYIxs8ujoyxVrigi2Cf5AK0LygDkKG57Nib3Uwa7efBpBx9UVwgyYj4L8mFc6dnuIbERJul3bUMXBHBdNFiVsHImWkYbZUxX3p9JZA0BGf8L9oObrg-Mi6OJMhnWn8Uw9lS5C06Z4mg/s640/Fig2.jpg" width="640" /></a></div>
<br />
Similarly to the script for time series in the plotting toolbox, here users need to select the dataset (which can be a shapefile or a CSV, or any other table format that can be accessed in ArcGIS). Then they need to select the variable of interest; in the sample dataset that is Temp, which clearly stands for temperature. Another important piece of information for R is the date/time column and its format; again, please refer to my <a href="http://r-video-tutorial.blogspot.ch/2016/07/the-power-of-ggplot2-in-arcgis-plotting.html" target="_blank">previous post</a> for more information. Finally, I inserted an SQL call to subset the dataset. In this case I'm subsetting a particular station.<br />
The result is the plot below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUmEZZ18QgoyBHbRxXVuHZa2u0fZ8eosg66bPadQYVL1tHRNRnWy1PtI0BmpO7FxusE5x2dS6dYniDxyk5GdEJs2dD-oewE9pKQkzS9_RdiJQJPg_3W_9o-agwOaz3eP11Hi9PcVCcQbDf/s1600/Fig3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUmEZZ18QgoyBHbRxXVuHZa2u0fZ8eosg66bPadQYVL1tHRNRnWy1PtI0BmpO7FxusE5x2dS6dYniDxyk5GdEJs2dD-oewE9pKQkzS9_RdiJQJPg_3W_9o-agwOaz3eP11Hi9PcVCcQbDf/s640/Fig3.jpg" width="640" /></a></div>
<br />
As you can see there are quite a few missing values in the dataset related to the station I subset. The very nice thing about the package <code>xts</code> is that with this plot it is perfectly clear where the missing data are, since along the X axis they are evident from the lack of grey tick marks.<br />
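The xts workflow behind a plot like this is minimal; a sketch, with assumed column names for a subset data frame <code>station</code>:

```r
library(xts)

# Build a time series object from the temperature column, indexed by date.
temp <- xts(station$Temp,
            order.by = as.Date(station$Date, format = "%Y-%m-%d"))

plot(temp, main = "Temperature 2013")   # gaps appear as missing tick marks
```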
<br />
<br />
<span style="font-size: large;">Time Histogram</span><br />
This is a simple bar chart that basically plots time against frequency of samples. The idea behind this plot is to allow users to explore the number of samples for specific time intervals in the dataset.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9JdWtzRc3aWS-y8D9It9PG49r6yZvem3posKGgnM6IE2MAcqq0hccyAajTy62rApWR2UDQCEQcSYIPRn7PD8mxoDvBpTkDVy2AYR6OLg7mavaGIrqe7WlQImAgpvt1f1TgDd93q8OINjZ/s1600/Fig4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9JdWtzRc3aWS-y8D9It9PG49r6yZvem3posKGgnM6IE2MAcqq0hccyAajTy62rApWR2UDQCEQcSYIPRn7PD8mxoDvBpTkDVy2AYR6OLg7mavaGIrqe7WlQImAgpvt1f1TgDd93q8OINjZ/s640/Fig4.jpg" width="640" /></a></div>
<br />
The user interface is similar to the previous scripts. Users need to select the dataset, then the variable and then the time column, specifying its format. I also included an option to select a subset of the dataset with an SQL selection. At this point I included a list to select the averaging period, where users can choose between day, month or year. In this case I selected month, which means that R will loop through the months and subset the dataset for each of these. Then it will basically count the number of data points sampled in each month and plot this information against the month itself. The result is the plot below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0DK9C-FGbWUodMfdGRL5vlA0Xi9ykRhYttUby6PCu0xbQtvSjp70q8qvJEJic0KDhMvA0OyxOmF2edLAAEdk_i3t2uiplxx2FUrxU9ergNyZlT9QM3GHNjgbYLmJHsEnZh5GQ7wF-sOBP/s1600/Fig5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0DK9C-FGbWUodMfdGRL5vlA0Xi9ykRhYttUby6PCu0xbQtvSjp70q8qvJEJic0KDhMvA0OyxOmF2edLAAEdk_i3t2uiplxx2FUrxU9ergNyZlT9QM3GHNjgbYLmJHsEnZh5GQ7wF-sOBP/s640/Fig5.jpg" width="640" /></a></div>
<br />
As you can see, we can definitely gather some useful information from this plot; for example, we can determine that in the year 2013 this station did not have any particular problems.<br />
<br />
<br />
<span style="font-size: large;">Time Subset</span><br />
In some cases we may need to subset our dataset according to a particular time period. This can be done in ArcGIS with the "Select by Attribute" tool and by using an SQL string similar to what you see in the image below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEYOy5nT1lgX_qRAMeuaRKeqvj4XcjrUKM0ymXp1wtgqAu4gPS16hxy2xB5jetJUXez3GFZFQazqPimpMnE918MjR1rvMaMF1YCEfmEyxqs-8AMCRKsJKj6dhQOMAKACLQph1WFA_lX7im/s1600/Fig7.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="394" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEYOy5nT1lgX_qRAMeuaRKeqvj4XcjrUKM0ymXp1wtgqAu4gPS16hxy2xB5jetJUXez3GFZFQazqPimpMnE918MjR1rvMaMF1YCEfmEyxqs-8AMCRKsJKj6dhQOMAKACLQph1WFA_lX7im/s640/Fig7.jpg" width="640" /></a></div>
<br />
The package <code>xts</code>, however, provides much more powerful and probably faster ways to subset by time. For example, in ArcGIS if we wanted to subset the whole month of June we would need to specify an SQL string like this:<br />
"Time" >= '2013-06-01' AND "Time" < '2013-07-01'<br />
<br />
In R, on the contrary, with the package <code>xts</code> we would just need the string <code>'2013-06'</code>, and R would know to keep only the month of June. Below are some other examples of successful time subsetting with the package <code>xts</code> (from <a href="http://www.inside-r.org/packages/cran/xts/docs/xts" target="_blank">http://www.inside-r.org/packages/cran/xts/docs/xts</a>):<br />
<br />
<code>sample.xts['2013'] # all of 2013<br />
sample.xts['2013-03'] # just March 2013<br />
sample.xts['2013-03/'] # March 2013 to the end of the data set<br />
sample.xts['2013-03/2013'] # March 2013 to the end of 2013<br />
sample.xts['/'] # the whole data set<br />
sample.xts['/2013'] # the beginning of the data through 2013<br />
sample.xts['2013-01-03'] # just the 3rd of January 2013</code>
<br />
<br />
With this in mind I created the following script:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0S7a268WDgscEjM-CDoXJsNUBxRioPkftlRMgzC3cnxJeKkF_4wiRqLbvE-NMssbolQ4YSuZVKGTfbdIxsHwkHv3QJKpJaC1ZBE2AYEPOzbJ8gNbbZKHSyeDxd-2Gspiza6GdZi7k-Iwi/s1600/Fig6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0S7a268WDgscEjM-CDoXJsNUBxRioPkftlRMgzC3cnxJeKkF_4wiRqLbvE-NMssbolQ4YSuZVKGTfbdIxsHwkHv3QJKpJaC1ZBE2AYEPOzbJ8gNbbZKHSyeDxd-2Gspiza6GdZi7k-Iwi/s640/Fig6.jpg" width="640" /></a></div>
<br />
As you can see from the image above, there is an option named "Subset" where users can insert one of the strings from the examples above (just the text within square brackets) and select time intervals with the same flexibility allowed in R and the package xts.<br />
The result of this script is a new shapefile containing only the time included in the Subset call.<br />
<br />
<br />
<br />
<span style="font-size: large;">Time Average Plots</span><br />
As I showed in my previous post about <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-time-series-analysis-of-us.html" target="_blank">time series analysis</a>, with the package <code>xts</code> it is possible to apply custom functions over specific time intervals with the following commands: <code>apply.daily</code>, <code>apply.weekly</code>, <code>apply.monthly</code> and <code>apply.yearly</code>.<br />
In this toolbox I used these functions to compute the average, 25th and 75th percentiles for specific time intervals, which the user may choose. This is the toolbox:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdW_-aUxn9B-y8wKGQeGun5A0GB4V6i3Wae_0_ZAs8GdpCtv_MQlNts66pF9cdOBAIaKaCxKrvcSzBR_4d4UlSHBl_fwCnfqDztUJQrrlup-rZeH7S9lkyUAGk38kAIPD9EQjHTAhaszCj/s1600/Fig8.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdW_-aUxn9B-y8wKGQeGun5A0GB4V6i3Wae_0_ZAs8GdpCtv_MQlNts66pF9cdOBAIaKaCxKrvcSzBR_4d4UlSHBl_fwCnfqDztUJQrrlup-rZeH7S9lkyUAGk38kAIPD9EQjHTAhaszCj/s640/Fig8.jpg" width="640" /></a></div>
<br />
The only differences from the other scripts are the option "Average by", with which the user can select between day, week, month or year (each triggers the appropriate apply function), and the position of the plot legend: topright, topleft, bottomright or bottomleft. Finally, users can select the output folder where the plot below will be saved, along with a CSV containing the numerical values for mean, q25 and q75.<br />
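The computation behind this tool can be sketched as follows; <code>Temp.xts</code> is again a hypothetical daily xts object and the output file name is just an example:<br />

```r
# Sketch of the monthly mean and 25th/75th percentiles, assuming a
# daily xts object named Temp.xts
library(xts)

monthly.stats <- apply.monthly(Temp.xts, function(x)
  c(mean = mean(x, na.rm = TRUE),
    q25  = quantile(x, 0.25, na.rm = TRUE),
    q75  = quantile(x, 0.75, na.rm = TRUE)))

# save the numerical values, one row per month
write.csv(data.frame(Time = index(monthly.stats), coredata(monthly.stats)),
          "Monthly_Averages.csv", row.names = FALSE)
```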
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh45IIv3Y1Kp_2VpGgWFNpdGleih3t18PY2viBMXijVflxs4_Dv_ZKzahwf95XOS2KdrGtgBmo3zr7IvgSjhFuAWpzN7PZeMyXkYi0k7bZIegKHdd1-NqbsaqD8x91tcsPFb-lrpwejCTQT/s1600/Fig9.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh45IIv3Y1Kp_2VpGgWFNpdGleih3t18PY2viBMXijVflxs4_Dv_ZKzahwf95XOS2KdrGtgBmo3zr7IvgSjhFuAWpzN7PZeMyXkYi0k7bZIegKHdd1-NqbsaqD8x91tcsPFb-lrpwejCTQT/s640/Fig9.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Time Function</span><br />
This is another script that provides direct access to the apply functions I presented before. Here the output is not a plot but a CSV with the results of the function, and users can input their own function directly in the GUI. Let's take a look:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsY-haDd_GocIPf-rRtkYUUaIfEyZ80oTrjoT7hRMxcWL-_O7bLJsl6hb5WBv2XYyMrMKn7AyPjXOBBMuv8v1Ehqb-sdux6DEkNDxYSG5Wh_jchynnAdfEDannUyQbKaeYq2ZpMh8lt3em/s1600/Fig10.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsY-haDd_GocIPf-rRtkYUUaIfEyZ80oTrjoT7hRMxcWL-_O7bLJsl6hb5WBv2XYyMrMKn7AyPjXOBBMuv8v1Ehqb-sdux6DEkNDxYSG5Wh_jchynnAdfEDannUyQbKaeYq2ZpMh8lt3em/s640/Fig10.jpg" width="640" /></a></div>
<br />
As you can see there is a field named "Function". Here users can insert their own custom function, written in the R language. This function takes a vector (<i>x</i>) and returns a single value, and it is in the form:<br />
<br />
<code>function(x){sum(x>70)}</code><br />
<br />
Only the string within curly brackets needs to be written in the GUI. This will then be passed to the script and applied to the values grouped by day, week, month or year; users select this in the field "Average by". Here, for example, I am calculating the number of days in each month with a temperature above 70 degrees Fahrenheit (21 degrees Celsius) in Alaska. The results are saved as a CSV in the output folder and printed on screen, as you can see from the image below.<br />
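In plain R the same computation looks roughly like this, assuming a daily xts object <code>Temp.xts</code> with temperatures in Fahrenheit:<br />

```r
# Sketch: number of days per month above 70 F, assuming a daily
# xts object named Temp.xts
library(xts)

days.above.70 <- apply.monthly(Temp.xts,
                               function(x) sum(x > 70, na.rm = TRUE))
```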
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsMi9aCY84eiFLYDO-ojT6AaFdI7MMMiVSI4ObgcixlADPacooomlrlbVjffzzPaU1NWEJ3rfFu5UKkqKObfVmUhg-90ev7QjET-7qDa3nNOnA7FuAURRZpgldJgsL09yqe2VD61p1l-Ch/s1600/Fig11.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="488" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsMi9aCY84eiFLYDO-ojT6AaFdI7MMMiVSI4ObgcixlADPacooomlrlbVjffzzPaU1NWEJ3rfFu5UKkqKObfVmUhg-90ev7QjET-7qDa3nNOnA7FuAURRZpgldJgsL09yqe2VD61p1l-Ch/s640/Fig11.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Trend Analysis</span><br />
In this last script I included access to the function decompose, which I briefly described in my <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-time-series-analysis-of-us.html" target="_blank">previous post</a>. This function does not work with <code>xts</code> time series, so the time series needs to be loaded with the standard R method, <code>ts</code>, which requires the user to specify the frequency of the time series. For this reason I added an option for it in the GUI.<br />
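A minimal sketch of this step, assuming a numeric vector <code>values</code> of daily observations spanning at least two years (decompose needs at least two full seasonal cycles):<br />

```r
# Sketch of the decomposition, assuming a numeric vector of daily
# values covering two or more years
TS <- ts(values, frequency = 365)  # frequency = samples per seasonal cycle
plot(decompose(TS))                # observed, trend, seasonal and random
```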
Unfortunately, the dataset I created for this experiment covers only one full year, so a decomposition does not make much sense; but you are invited to try with your own data, and the script should work fine and provide you with results similar to the image below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7r7LesTtSLZZxhriGYdmqjjb_MEmnSLOiynS0swB1MdZXDiu6QwsKMqmiltIS1_rgPSlc0KpUOAuNQswz14NHfGB_7F8mGPfdwmFb0Dz_biI4NyVrI_3BNfqx4FqL-XzhQ9EeAwCibq2X/s1600/Decompose.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="377" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7r7LesTtSLZZxhriGYdmqjjb_MEmnSLOiynS0swB1MdZXDiu6QwsKMqmiltIS1_rgPSlc0KpUOAuNQswz14NHfGB_7F8mGPfdwmFb0Dz_biI4NyVrI_3BNfqx4FqL-XzhQ9EeAwCibq2X/s400/Decompose.jpeg" width="400" /></a></div>
<br />
<br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Download</span><br />
Once again the time-series toolbox is available for free from my GitHub page at:<br />
<a href="https://github.com/fveronesi/TimeSeries_Toolbox/" target="_blank">https://github.com/fveronesi/TimeSeries_Toolbox/</a><br />
<br />
<span style="font-size: large;">The Power of ggplot2 in ArcGIS - The Plotting Toolbox</span> (2016-07-11)<br />
In this post I present my third experiment with <a href="https://r-arcgis.github.io/" target="_blank">R-Bridge</a>. The Plotting Toolbox is a plug-in for ArcGIS 10.3.x that allows the creation of beautiful and informative plots, with ggplot2, directly from the ESRI ArcGIS console.<br />
As always I not only provide the toolbox but also a dataset to try it out. Let's start from here...<br />
<br />
<span style="font-size: large;">Data</span><br />
For testing the plotting tool, I downloaded some air pollution data from EPA (US Environmental Protection Agency), which provides open access to its database. I created a custom function to download data from EPA that you can find in <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-time-series-analysis-of-us.html" target="_blank">this post</a>.<br />
Since I wanted to provide a relatively small dataset, I extracted values from only four states: California, New York, Iowa and Ohio. For each of these, I included time series for Temperature, CO, NO2, SO2 and Barometric Pressure. Finally, the coordinates of the points are the centroid for each of these four states. The image below depicts the location and the first lines of the attribute table. This dataset is provided in shapefile and CSV, both can be used with the plotting toolbox.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdLaixRpw0ueQ523xDaOY52eN9wI6qKCIbfYJG7izS2cMdMtrkXYFLdZg_1EAokonx_0IyxgSMfxWzdPTXzdTmFXG4atZPeAYg8FgeMrNESUGQBNUQpDZoGp5zvYDq-mS5bseWPepjCzva/s1600/Fig1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdLaixRpw0ueQ523xDaOY52eN9wI6qKCIbfYJG7izS2cMdMtrkXYFLdZg_1EAokonx_0IyxgSMfxWzdPTXzdTmFXG4atZPeAYg8FgeMrNESUGQBNUQpDZoGp5zvYDq-mS5bseWPepjCzva/s640/Fig1.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Toolbox</span><br />
Now that we have seen the sample dataset, we can take a look at the toolbox. I included 5 packages to help summarize and visualize spatial data directly from the ArcGIS console.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoAQGbPP-ZmmArh0uL1K0IQDNSpmn5t8dNRNwcGBsgyPiG18YFyMyoeUjAJodeWF5Ekr7J7zQoMDX4xbBD9bWDDgtbM5CrhVlr6Ue1RH2toAptLvh59DVLJFj6ovB9Zi2gtu-1O6aAxBwO/s1600/Fig2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoAQGbPP-ZmmArh0uL1K0IQDNSpmn5t8dNRNwcGBsgyPiG18YFyMyoeUjAJodeWF5Ekr7J7zQoMDX4xbBD9bWDDgtbM5CrhVlr6Ue1RH2toAptLvh59DVLJFj6ovB9Zi2gtu-1O6aAxBwO/s320/Fig2.jpg" width="320" /></a></div>
<br />
I first included a package for summarizing our variables, which creates a table with some descriptive statistics. Then I included all the major plot types presented in my book "<a href="http://r-video-tutorial.blogspot.ch/2016/04/learning-r-for-data-visualization-video.html" target="_blank">Learning R for Data Visualization [VIDEO]</a>", edited by <a href="https://www.packtpub.com/big-data-and-business-intelligence/learning-r-data-visualization-video" target="_blank">Packt Publishing</a>, with some modifications to the scripts to adapt them to the <a href="https://r-arcgis.github.io/" target="_blank">R-Bridge</a> technology from <a href="https://blogs.esri.com/esri/esri-insider/2015/07/20/building-a-bridge-to-the-r-community/" target="_blank">ESRI</a>. For more information, practical and theoretical, about each of these plots please refer to the book.<br />
<br />
<b>The tool can be downloaded from my GitHub page at:</b><br />
<a href="https://github.com/fveronesi/PlottingToolbox" target="_blank">https://github.com/fveronesi/PlottingToolbox</a><br />
<br />
<br />
<br />
<span style="font-size: large;">Summary </span><br />
This is the first package I would like to describe, simply because a data analysis should always start with a look at the variables through some descriptive statistics. As for all the tools presented here, its use is simple and straightforward with the GUI presented below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwDJw0wD66xSyZ50zFRdqzs5MrgY4wPNIKrFT2xHQCgiIMJ0LaLqQMDlPDeaszbXYJ5Tk4e4ko5tljIT1R0wmCT6P0GrNdDPVZOPrG6G61YlMvF8CKDW1gcjN4phCz8E57u8vOzbOsCG8C/s1600/Fig3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwDJw0wD66xSyZ50zFRdqzs5MrgY4wPNIKrFT2xHQCgiIMJ0LaLqQMDlPDeaszbXYJ5Tk4e4ko5tljIT1R0wmCT6P0GrNdDPVZOPrG6G61YlMvF8CKDW1gcjN4phCz8E57u8vOzbOsCG8C/s640/Fig3.jpg" width="640" /></a></div>
<br />
Here the user has to point to the dataset s/he wants to analyze in "Input Table". This can be a shapefile, either loaded already in ArcGIS or not, but it can also be a table, for example a CSV. That is the reason why I included a CSV version of the EPA dataset as well.<br />
At this point the area in "Variable" will fill up with the column names of the input file, from which the user can select the variable s/he is interested in summarizing. The final step is the selection of the "Output Folder". <b>Important</b>: users need to create the folder first and then select it, because this parameter is set as an input, so the folder needs to exist. I decided to do it this way because otherwise a new folder would need to be created for each new plot; this way all the summaries and plots can go into the same directory.<br />
Let's take a look at the results:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0F7Jmmh-X1wKdFcxNQxRbLIJDBFRUgbeXot0E1KLltNiO1uTI97O39VOUmK2oVo0ArLjaVzOmQ-I0-yVFCn_rVAyAlk8JY_ruEdng0_HqGe0JHeNo42QeZ6mYhLYun_KLvAYm3fuOPPxe/s1600/Fig4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0F7Jmmh-X1wKdFcxNQxRbLIJDBFRUgbeXot0E1KLltNiO1uTI97O39VOUmK2oVo0ArLjaVzOmQ-I0-yVFCn_rVAyAlk8JY_ruEdng0_HqGe0JHeNo42QeZ6mYhLYun_KLvAYm3fuOPPxe/s640/Fig4.jpg" width="640" /></a></div>
<br />
The Summary package presents two tables, with all the variables we wanted to summarize, arranged one above the other. This is the output users will see on screen once the toolbox has completed its run. In addition, the R script will save this exact table in the output folder in PDF format, plus a CSV with the raw data.<br />
<br />
<span style="font-size: large;">Histogram</span><br />
As the name suggests, this tool provides access to the histogram plot in ggplot2. ArcGIS provides a couple of ways to represent data in histograms. The first is by right-clicking on one of the columns in the attribute table; a drop-down menu will appear, from which users can click on "Statistics" to access some descriptive statistics and a histogram. Another way is through the "Geostatistical Analyst", an add-on that requires an additional license; its "Explore Data" package can also create histograms. The problem with both these methods is that the results are not, in my opinion at least, suitable for publication. You can maybe share them with your colleagues, but I would not suggest using them in an article. This implies that ArcGIS users need to open another software, maybe Excel, to create the plots they require, and we all know how painful it is to produce any decent plot in Excel; histograms are probably the most painful of all.<br />
This changes now!!<br />
<br />
By combining the power of R and ggplot2 with ArcGIS, we can provide users with an easy-to-use way to produce beautiful visualizations directly within the ArcGIS environment and have them saved as jpeg at 300 dpi. This is what this set of packages does.<br />
Let's now take a look at the interface for histograms:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOtsJf3U2JEPBFnT5EGkeftl-IGBfq57Wufy190ml9rC5yNE12zFcLnSYAVMbSwQkzslFdGR6sGSPFLTC-t-rQX1uR1-bJOleDchtJHjagd0jaGectNaGz38h_t5yac4Hx6lB6nBe0vg1b/s1600/Fig5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOtsJf3U2JEPBFnT5EGkeftl-IGBfq57Wufy190ml9rC5yNE12zFcLnSYAVMbSwQkzslFdGR6sGSPFLTC-t-rQX1uR1-bJOleDchtJHjagd0jaGectNaGz38h_t5yac4Hx6lB6nBe0vg1b/s640/Fig5.jpg" width="640" /></a></div>
<br />
As for Summary, we first need to insert the input dataset and then select the variable(s) we want to plot. If two or more variables are selected, several plots will be created and saved in jpeg.<br />
I also added some optional values to further customize the plots. The first is the <a href="http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/" target="_blank">faceting</a> variable, which is a categorical variable. If this is selected, the plot will have a histogram for each category; for example, the sample dataset has the categorical variable "state", with the names of the four states I included. If I select it for faceting, the result will be the figure below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYrhXDZqrpjS5awJqv1L9fF6jfXRBkfDxI2DdXOf2cNnBr_LRNOdEVfiwgh-pgCuXkPuaLPUyAjlnBIl5VMNCw_VQ1kTkOQglLJaTDnaBduapw2voiksSDWizs5cnukjA-PtKyuJUOTH1U/s1600/Fig6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYrhXDZqrpjS5awJqv1L9fF6jfXRBkfDxI2DdXOf2cNnBr_LRNOdEVfiwgh-pgCuXkPuaLPUyAjlnBIl5VMNCw_VQ1kTkOQglLJaTDnaBduapw2voiksSDWizs5cnukjA-PtKyuJUOTH1U/s640/Fig6.jpg" width="640" /></a></div>
<br />
Another option available here is the binwidth. The package ggplot2 usually sets this to the range of the variable divided by 30, but users can customize it with this option. Finally, users need to specify an output folder where R will save a jpeg of the plots shown on screen.<br />
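The core of this script can be sketched with ggplot2 as follows; <code>dat</code>, <code>Temp</code> and <code>state</code> are hypothetical names matching the sample dataset:<br />

```r
# Sketch of a faceted histogram with a custom binwidth, assuming a
# data frame dat with columns Temp and state
library(ggplot2)

ggplot(dat, aes(x = Temp)) +
  geom_histogram(binwidth = 2, colour = "white") +  # binwidth is optional
  facet_wrap(~ state)
ggsave("Histogram.jpeg", dpi = 300)  # example output name
```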
<br />
<br />
<span style="font-size: large;">Box-Plot</span><br />
This is another great way to compare variables' distributions and as far as I know it cannot be done directly from ArcGIS. Let's take a look at this package:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOWXf9kzynT4GhYfPPuwTMSbb5VQhyBzm78YBv6aCuor8p9Ps2g-0oFQ70um1jMplUVoc1UFmbNXvQ0obd2_CkyKaYMr5WoLW0GCENVp_Y_zAwiqzIq0xGUhUG2oDe07BDDsy4qgYmh0mR/s1600/Fig7.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOWXf9kzynT4GhYfPPuwTMSbb5VQhyBzm78YBv6aCuor8p9Ps2g-0oFQ70um1jMplUVoc1UFmbNXvQ0obd2_CkyKaYMr5WoLW0GCENVp_Y_zAwiqzIq0xGUhUG2oDe07BDDsy4qgYmh0mR/s640/Fig7.jpg" width="640" /></a></div>
<br />
This is again very easy to use. Users just need to set the input file, then the variable of interest and the categorical variable for the grouping. Here I am again using the states, so that I compare the distribution of NO2 across the four US states in my dataset.<br />
The result is ordered by median values and is shown below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi76BtynCzkBQkTVoQcTpG1McBjAadY9gO9vQRUVJ20_iaJG2Ere5voVadOxhNifqcDvQ23YUDhobV5kGUrm-lIcz56-dJ08ZJdeRo40iejx7XM2frgiDMhyphenhyphenGZD2xwpDb6F7xJtJPoKYz94/s1600/Fig8.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi76BtynCzkBQkTVoQcTpG1McBjAadY9gO9vQRUVJ20_iaJG2Ere5voVadOxhNifqcDvQ23YUDhobV5kGUrm-lIcz56-dJ08ZJdeRo40iejx7XM2frgiDMhyphenhyphenGZD2xwpDb6F7xJtJPoKYz94/s640/Fig8.jpg" width="640" /></a></div>
<br />
As you can see, I decided to plot the categories vertically, to accommodate long names. This can of course be changed by tweaking the R script. As for each package, this plot is saved in the output folder.<br />
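A rough ggplot2 sketch of this box-plot, with the categories ordered by median and flipped to run vertically; the column names are again hypothetical:<br />

```r
# Sketch, assuming a data frame dat with columns NO2 and state
library(ggplot2)

ggplot(dat, aes(x = reorder(state, NO2, FUN = median), y = NO2)) +
  geom_boxplot() +
  coord_flip() +   # horizontal boxes accommodate long category names
  xlab("State")
```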
<br />
<br />
<span style="font-size: large;">Bar Chart</span><br />
This tool is generally used to compare values across several categories, usually with one value per category. However, a dataset may contain multiple measurements for some categories, and I implemented a way to deal with that here. The GUI of this package is presented below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzrhDTYr3DaN6d2TZPjmen6oPbnko5tzhW38yVsRgggtEzqysUHrO6NxviAORAGYOqGpB1nm9ulp8KQLhIcUS2wxteNFoOqa4Jr9huBUX0djPckQdbegEGD6mTwz2GenqnhMXbHSJmRNQa/s1600/Fig9.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="504" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzrhDTYr3DaN6d2TZPjmen6oPbnko5tzhW38yVsRgggtEzqysUHrO6NxviAORAGYOqGpB1nm9ulp8KQLhIcUS2wxteNFoOqa4Jr9huBUX0djPckQdbegEGD6mTwz2GenqnhMXbHSJmRNQa/s640/Fig9.jpg" width="640" /></a></div>
<br />
<br />
The inputs are basically the same as for box-plots; the only difference is the option "Average Values". If this is set, R will average the values of the variable for each unique category in the dataset. The results are again ordered, and are saved as jpeg in the output folder:<br />
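The averaging option can be sketched as an aggregation before plotting; the column names here are hypothetical:<br />

```r
# Sketch: average one value per category, then draw an ordered bar
# chart, assuming a data frame dat with columns SO2 and state
library(ggplot2)

avg <- aggregate(SO2 ~ state, data = dat, FUN = mean)
ggplot(avg, aes(x = reorder(state, SO2), y = SO2)) +
  geom_bar(stat = "identity") +
  xlab("State")
```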
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgV_6TT_KlrzqD2RltOnIi6ISsieSF1d-bXMlz410yIj9G6YPkOsRGWmN39Miq7dEOZItOSelAcquUuDq64Mx58rAvZJytd8ksbuIhgq53DQr9muPWx564YWPgfJ0xrsKJ5dNxgPyHKaDw2/s1600/Fig10.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgV_6TT_KlrzqD2RltOnIi6ISsieSF1d-bXMlz410yIj9G6YPkOsRGWmN39Miq7dEOZItOSelAcquUuDq64Mx58rAvZJytd8ksbuIhgq53DQr9muPWx564YWPgfJ0xrsKJ5dNxgPyHKaDw2/s640/Fig10.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Scatterplot</span><br />
This is another great way to visually analyze our data. The package ggplot2 allows the creation of highly customized plots and I tried to implement as much as I could in terms of customization in this toolbox. Let's take a look:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY7wwOTSWQiW3Y8YrPWM6696PFpPkNU3_hVtJxaPwE__GW3VU4UgcnoICgjeTkdyFVI_8IngUhU8hjV0264vAblSXgLaRqAp025hc2CR0GsJmpRRvdPmxZYc7-AeDgwvx0u_LkRXZIMgo_/s1600/Fig11.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY7wwOTSWQiW3Y8YrPWM6696PFpPkNU3_hVtJxaPwE__GW3VU4UgcnoICgjeTkdyFVI_8IngUhU8hjV0264vAblSXgLaRqAp025hc2CR0GsJmpRRvdPmxZYc7-AeDgwvx0u_LkRXZIMgo_/s640/Fig11.jpg" width="640" /></a></div>
<br />
After selecting the input dataset, the user can select what to plot: either just two variables, one on the X axis and one on the Y axis, or further increase the amount of information presented in the plot by including a variable that changes the color of the points and one that changes their size. There is also the possibility to include a regression line. Color, size and regression line are optional, but I wanted to include them to present the full range of customizations that this package allows.<br />
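A sketch of such a fully decorated scatterplot, with hypothetical column names:<br />

```r
# Sketch, assuming a data frame dat with columns Temp, NO2, CO and state
library(ggplot2)

ggplot(dat, aes(x = Temp, y = NO2)) +
  geom_point(aes(colour = state, size = CO)) +  # colour and size optional
  geom_smooth(method = "lm")                    # optional regression line
```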
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv_ZVr_nHRxeWFpCy_LiLcxPAO3naCj8d9L27ingdIcnceWg__wiDRyyt_170AyvfKIii8Uu3aFZthvLpYHOqUqNcSDhtY-mtKXUSocqStVLDAGlT1mr4i1hoRIoiHKrRZ06a3pqF_nGow/s1600/Fgi12.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv_ZVr_nHRxeWFpCy_LiLcxPAO3naCj8d9L27ingdIcnceWg__wiDRyyt_170AyvfKIii8Uu3aFZthvLpYHOqUqNcSDhtY-mtKXUSocqStVLDAGlT1mr4i1hoRIoiHKrRZ06a3pqF_nGow/s640/Fgi12.jpg" width="640" /></a></div>
<br />
Once again this plot is saved in the output folder.<br />
<br />
<br />
<span style="font-size: large;">Time Series</span><br />
The final type of plot I included is the time series, which is also the one with the highest number of user inputs. Many spatial datasets include a temporal component, but often this is not standardized: in some cases the time variable has only a date, in others it includes a time; moreover, the format changes from dataset to dataset. For this reason it is difficult to create an R script that works with most datasets, so for time-series plots users need to do some pre-processing. For example, <i>if date and time are in separate columns, these need to be merged into one for this R script to work.</i><br />
At this point the TimeSeries package can be started:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFLWn_cTGYnPyxFLDs9HDcDc5flXIhsqAa1BkIjfCrEnf8KakTH9k7syeUwLiKQT8XvZjsrk3BftSsK71O6L2Igh402RMrpma5dt8mUvMzVeglAJESHQNiW7RnYKBNknaULgNvrV6h6LOe/s1600/Fig12.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="502" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFLWn_cTGYnPyxFLDs9HDcDc5flXIhsqAa1BkIjfCrEnf8KakTH9k7syeUwLiKQT8XvZjsrk3BftSsK71O6L2Igh402RMrpma5dt8mUvMzVeglAJESHQNiW7RnYKBNknaULgNvrV6h6LOe/s640/Fig12.jpg" width="640" /></a></div>
<br />
The first two columns are self-explanatory. Then users need to select the column with the temporal information and manually input its format.<br />
In this case the format in the sample dataset is the following: 2014-01-01<br />
Therefore I have the year with century, a minus sign, the month, another minus sign and the day. I need to use the symbol for each of these elements so that R can recognize the temporal format of the file.<br />
Common symbols are:<br />
%Y - Year with century<br />
%y - Year without century<br />
%m - Month<br />
%d - Day<br />
%H - Hour as decimal number (00-23)<br />
%M - Minute as decimal number (00-59)<br />
%S - Second as decimal number<br />
<br />
More symbols can be found at this page: <a href="http://www.inside-r.org/r-doc/base/strptime" target="_blank">http://www.inside-r.org/r-doc/base/strptime</a><br />
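For example, the sample format above can be tested directly in base R:<br />

```r
# Parsing the sample format 2014-01-01 with the symbols listed above
as.Date("2014-01-01", format = "%Y-%m-%d")

# if the column also contained a time, as.POSIXct would be used instead
as.POSIXct("2014-01-01 12:30:00", format = "%Y-%m-%d %H:%M:%S")
```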
<br />
The remaining two inputs are optional, but if one is selected the other needs to be provided as well. By "Subsetting Column" I mean a column with categorical information. For example, in my dataset I can generate a time-series for each US state, therefore my subsetting column is state. In the option "Subset" users need to manually write the category they want to use to subset their data. Here I just want the time-series for California, so I write California. You need to be careful to write exactly the name you see in the attribute table, because <b>R is case sensitive</b>: if you write california with a lower-case c, R will be unable to produce the plot.<br />
The result, again saved in jpeg automatically, is presented below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-6m47fj1j8HWrcNJ81Is3Q5Z_CrUG91S4jo4VEXnxml1RRQVV8WSbf0ievOLxfBtWnM7fPrYxJyVYKpApimh0WjfwrqZd7riJwR1QwyMyozlWAn6is9nXtFaniyFMvBnRf1BCdTLBHhz9/s1600/Fig13.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-6m47fj1j8HWrcNJ81Is3Q5Z_CrUG91S4jo4VEXnxml1RRQVV8WSbf0ievOLxfBtWnM7fPrYxJyVYKpApimh0WjfwrqZd7riJwR1QwyMyozlWAn6is9nXtFaniyFMvBnRf1BCdTLBHhz9/s640/Fig13.jpg" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;">Time Averages of NetCDF files from ECMWF in ArcGIS with R-Bridge</span> (2016-07-08)<br />
With this post I would like to talk again about R-Bridge, which allows direct communication between ArcGIS and R.<br />
In the <a href="http://r-video-tutorial.blogspot.ch/2016/07/combine-arcgis-and-r-clustering-toolbox.html" target="_blank">previous post</a>, I presented a very simple application of R-Bridge where I built a toolbox to perform k-means clustering on point shapefiles. The purpose of that post was to explain the basics of the technology, but its scientific scope was limited. However, I would like to translate, step by step, more of my R scripts into ArcGIS, so that more people can use them even if they are not experts in R.<br />
In this post I will start presenting a toolbox to handle NetCDF files downloaded from the European Centre for Medium-Range Weather Forecasts (ECMWF).<br />
<br />
<span style="font-size: large;">ECMWF</span><br />
The European Centre for Medium-Range Weather Forecasts provides free access to numerous weather data through their website. You can go directly to this page to take a look at the data available: <a href="http://www.ecmwf.int/en/research/climate-reanalysis/browse-reanalysis-datasets" target="_blank">http://www.ecmwf.int/en/research/climate-reanalysis/browse-reanalysis-datasets</a><br />
The data are freely accessible and downloadable (for research purposes!), but you need to register on the website to be able to do so.<br />
<br />
<br />
<span style="font-size: large;">My Research</span><br />
For research I am doing right now, I downloaded the ERA Interim dataset from 2010 to 2015 from this access page: <a href="http://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/" target="_blank">http://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/</a><br />
<br />
The data are provided in large NetCDF files, which include all the weather variables I selected for the entire time frame. In R, NetCDF files can easily be imported as a raster brick using the packages <code>raster</code> and <code>ncdf4</code>. A brick has X and Y dimensions, plus a time dimension for each of the variables I decided to download. Since I wanted 5 years and the ERA Interim dataset includes four observations per day, I had quite a lot of data.<br />
I decided that for the research I was planning I did not need each and every one of these rasters, but rather some time averages, for example the monthly average of each variable. Therefore I created an R script to do the job.<br />
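The core idea of averaging a brick's layers by time class can be sketched on a toy example with the <code>raster</code> function <code>stackApply</code>. The data here are synthetic, standing in for the ECMWF layers; the real script is shown further down:

```r
library(raster)

# Toy brick: 6 layers of 2x2 cells, two layers per "month"
b <- brick(array(1:24, dim = c(2, 2, 6)))
month_index <- c(1, 1, 2, 2, 3, 3)

# stackApply() averages all layers that share the same index
monthly <- stackApply(b, indices = month_index, fun = mean)
nlayers(monthly)  # 3: one average raster per month
```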
<br />
I then decided to use R-Bridge to implement in ArcGIS the R script I had developed. This should allow people not familiar with R to easily create time averages of the weather reanalysis data provided by ECMWF.<br />
<br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Toolbox</span><br />
I already covered the installation of R-Bridge in the previous post, where I also explained how to create a new toolbox with a script, so if you do not know how to do these things please refer to <a href="http://r-video-tutorial.blogspot.ch/2016/07/combine-arcgis-and-r-clustering-toolbox.html" target="_blank">this post</a>.<br />
For this script I created a simple GUI with two inputs and one output:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjslVUFX79Cof41yiZrCsnkRp4upYZEG_hWtwtFcnUv9vRCTIVqqE4dldgv8Xbr6ajbFj-i_ErK6IrBkhaDQfaHxACU39LVllvmGdCG6OZQW1x6sCyBzL0cWJqhORzik-7m-4D8XOuw8s3/s1600/Fig1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjslVUFX79Cof41yiZrCsnkRp4upYZEG_hWtwtFcnUv9vRCTIVqqE4dldgv8Xbr6ajbFj-i_ErK6IrBkhaDQfaHxACU39LVllvmGdCG6OZQW1x6sCyBzL0cWJqhORzik-7m-4D8XOuw8s3/s640/Fig1.jpg" width="640" /></a></div>
<br />
The first is used to select, on the user's computer, the NetCDF file downloaded from the ECMWF website. The second is a list of values from which the user can select the type of time average to perform:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihSx9a-ZlLy4UodLfC5lZf9q1q7ASO-MRxXJ6Agm2pi8hqsktRRPwataNPTDj1FooZVKLQeNg3VHfhj-48SgGnP4oYI83IdC38mY6-8vROuU_0fNA1EYvkb6hBUx-bBO9WaFMHN-KgPdC2/s1600/Fig2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="390" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihSx9a-ZlLy4UodLfC5lZf9q1q7ASO-MRxXJ6Agm2pi8hqsktRRPwataNPTDj1FooZVKLQeNg3VHfhj-48SgGnP4oYI83IdC38mY6-8vROuU_0fNA1EYvkb6hBUx-bBO9WaFMHN-KgPdC2/s640/Fig2.jpg" width="640" /></a></div>
<br />
Users can select between four types: hourly, daily, monthly and yearly averages. In each case, R will find the unique values of the chosen category and create one average raster per value. Please remember that ECMWF does not provide truly hourly data, but only observations at specific times of day, every 6 hours or so; therefore, do not expect the script to generate 24 hourly rasters, but only averages for these time slots.<br />
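This grouping works by formatting the layer time stamps and taking the unique values. A small sketch with illustrative time stamps (the real ones come from the NetCDF layer names) shows why only the observed time slots appear:

```r
# Illustrative time stamps: two days, observations at 00 and 06 UTC
TIME <- as.POSIXct(c("2010-01-01 00:00", "2010-01-01 06:00",
                     "2010-01-02 00:00", "2010-01-02 06:00"), tz = "UTC")

# One average raster is produced per unique value of the chosen class
unique(format(TIME, "%H"))  # "00" "06" -> two "hourly" rasters, not 24
unique(format(TIME, "%d"))  # "01" "02" -> two daily rasters
```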
<br />
The final thing users need to provide is the output folder, where R will save all the rasters. This needs to be a new folder! R will first create it on disk and then save the rasters in there.<br />
For the time being I do not have a way to plot these rasters in ArcGIS after the script completes. In theory, the function <code>writeRaster</code> in the <code>raster</code> package can export a raster directly to ArcGIS, but users would need to provide the name of the output raster in the Toolbox GUI, which is not possible here because many rasters are created at once. I also tried to create another toolbox in Model Builder, where the R script was followed by an iterator that should have opened the rasters directly from the output folder, but it does not work. If you have any suggestions for doing this, I would like to hear them. In any case this is not a big issue; the important thing is being able to produce average rasters from NetCDF files.<br />
<br />
<br />
<span style="font-size: large;">R Script</span><br />
In the final part of the post I will present the R script I used for this Toolbox. Here is the code:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid blue; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">### NetCDF Time Average Toolbox
## Author: Fabio Veronesi
tool_exec <- function(in_params, out_params)
{
  if (!requireNamespace("ncdf4", quietly = TRUE))
    install.packages("ncdf4")
  require(ncdf4)

  if (!requireNamespace("reshape2", quietly = TRUE))
    install.packages("reshape2")
  require(reshape2)

  if (!requireNamespace("sp", quietly = TRUE))
    install.packages("sp")
  require(sp)

  if (!requireNamespace("raster", quietly = TRUE))
    install.packages("raster")
  require(raster)

  if (!requireNamespace("rgdal", quietly = TRUE))
    install.packages("rgdal")
  require(rgdal)

  print("Time Averages of ECMWF Datasets")
  print("Author: Fabio Veronesi")

  source_nc <- in_params[[1]]
  time_average <- in_params[[2]]
  out_folder <- out_params[[1]]

  dir.create(out_folder)

  ### Read Data
  arc.progress_label("Reading the NetCDF Dataset...")
  print("Opening NC...")

  nc <- nc_open(source_nc)
  var <- names(nc$var)
  print(paste("NetCDF Variable: ", var))

  print("Creating Average Rasters ...")
  print("Please note that this process can be time-consuming.")

  for(VAR1 in var){
    print(paste("Executing Script for Variable: ", VAR1))
    var.nc <- brick(source_nc, varname=VAR1, layer="time")

    #Divide by Month
    TIME <- as.POSIXct(substr(var.nc@data@names, start=2, stop=20), format="%Y.%m.%d.%H.%M.%S")
    df <- data.frame(INDEX = 1:length(TIME), TIME=TIME)

    if(time_average=="Daily Averages"){
      days <- unique(format(TIME, "%d"))

      #LOOP
      for(DAY in days){
        subset <- df[format(df$TIME, "%d") == DAY,]
        sub.var <- var.nc[[subset$INDEX]]

        print(paste("Executing Average for Day: ", DAY))
        av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Day",DAY,".tif"))
        print(paste("Raster for Day ", DAY, " Ready in the Output Folder"))
      }

    } else if(time_average=="Monthly Averages") {
      months <- unique(format(TIME, "%m"))

      #LOOP
      for(MONTH in months){
        subset <- df[format(df$TIME, "%m") == MONTH,]
        sub.var <- var.nc[[subset$INDEX]]

        print(paste("Executing Average for Month: ", MONTH))
        av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Month",MONTH,".tif"))
        print(paste("Raster for Month ", MONTH, " Ready in the Output Folder"))
      }

    } else if(time_average=="Yearly Averages") {
      years <- unique(format(TIME, "%Y"))

      #LOOP
      for(YEAR in years){
        subset <- df[format(df$TIME, "%Y") == YEAR,]
        sub.var <- var.nc[[subset$INDEX]]

        print(paste("Executing Average for Year: ", YEAR))
        av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Year",YEAR,".tif"))
        print(paste("Raster for Year ", YEAR, " Ready in the Output Folder"))
      }

    } else {
      hours <- unique(format(TIME, "%H"))

      #LOOP
      for(HOUR in hours){
        subset <- df[format(df$TIME, "%H") == HOUR,]
        sub.var <- var.nc[[subset$INDEX]]

        print(paste("Executing Average for Hour: ", HOUR))
        av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Hour",HOUR,".tif"))
        print(paste("Raster for Hour ", HOUR, " Ready in the Output Folder"))
      }
    }
  }
}
</pre>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
As I described in my <a href="http://r-video-tutorial.blogspot.ch/2016/07/combine-arcgis-and-r-clustering-toolbox.html" target="_blank">previous post</a>, the R code for an ArcGIS Toolbox needs to be included in a function that takes inputs and outputs from the ArcGIS GUI.<br />
In this function the very first thing we need to do is load the required packages, with an option to install them if necessary. Then we need to assign object names to the input and output parameters.<br />
As you can see I included quite a few <code>print</code> calls so that ArcGIS users can easily follow the process.<br />
<br />
At this point we can move to the NetCDF file. As I mentioned, I will be using the <code>raster</code> package to import the NetCDF file, but first I need to open it directly with the package <code>ncdf4</code> and the function <code>nc_open</code>. This is necessary to obtain the list of variables included in the file. In my tests I downloaded the temperature at 2m above ground and the albedo, therefore the variables' names were d2m and alb. Since these names are generally not known by end users, we need a way to extract them from the NetCDF file, which is provided by these lines.<br />
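This step can be sketched as follows. It is a minimal example, and "era_interim.nc" is a hypothetical file name I am using for illustration:

```r
# Sketch: list the variables stored in a NetCDF file with ncdf4.
# "era_interim.nc" is a hypothetical file name.
library(ncdf4)

nc <- nc_open("era_interim.nc")
var_names <- names(nc$var)  # e.g. "d2m" and "alb" for the test file described above
nc_close(nc)
print(var_names)
```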
<br />
Once we have that, we can start a <code>for</code> loop in which we iterate through the variables. As you can see, the first line within the loop imports the nc file with the function <code>brick</code> in the package <code>raster</code>. Within this function we need to specify the name of the variable to use. The raster layer names include the temporal information we need to create the time averages. For this reason I created an object called <code>TIME</code> with <code>POSIXct</code> values built from these names, and then I collected these values into a <code>data.frame</code>. This will be used later on to extract only the indexes of the rasters with the correct date/time.<br />
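As a sketch of this step (the layer-name format varies between NetCDF files, so the format string below is an assumption, and "era_interim.nc" is again a hypothetical file name):

```r
# Sketch: build a date/time index from the layer names of a raster brick.
# The layer-name format "X2016.07.01.12.00.00" is an assumption; check
# names(var.nc) on your own file and adapt the format string accordingly.
library(raster)

var.nc <- brick("era_interim.nc", varname = "d2m")
TIME <- as.POSIXct(names(var.nc), format = "X%Y.%m.%d.%H.%M.%S")
df <- data.frame(TIME = TIME, INDEX = 1:nlayers(var.nc))
```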
<br />
Now I set up a series of <code>if</code> statements that trigger certain actions depending on what the user selected in the Time Average list on the GUI. Let us assume that the user selected "Daily Averages".<br />
At this point R first uses the function <code>format</code> to extract the days from the <code>data.frame</code> with date/time, named <code>df</code>, and then extracts the <code>unique</code> values from this list. The next step involves iterating through these days and creating an average raster for each of them. This can be done with the function <code>calc</code> in <code>raster</code>, which takes a series of rasters and a function (in this case <code>mean</code>), and can also save the resulting raster to disk. For the output file path I simply used the function <code>paste</code> to name the file according to the variable and day. The exact same process is performed for the other time averages.<br />
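The daily-average branch can be sketched as follows, mirroring the hourly code shown above (<code>df</code>, <code>var.nc</code>, <code>VAR1</code> and <code>out_folder</code> are the objects created earlier in the script):

```r
# Sketch: average all rasters belonging to the same day with raster::calc().
days <- unique(format(df$TIME, "%Y-%m-%d"))

for (DAY in days) {
  day.df  <- df[format(df$TIME, "%Y-%m-%d") == DAY, ]  # rows for this day
  sub.var <- var.nc[[day.df$INDEX]]                    # matching raster layers
  av.var  <- calc(sub.var, fun = mean,
                  filename = paste0(out_folder, "/", VAR1, "_", DAY, ".tif"))
  print(paste("Raster for Day", DAY, "ready in the output folder"))
}
```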
<br />
<span style="font-size: large;">Download</span><br />
The R script and Toolbox to perform the time average on ECMWF datasets is available on my GitHub page at:<br />
<a href="https://github.com/fveronesi/ECMWF_Toolbox/tree/v0.1" target="_blank">https://github.com/fveronesi/ECMWF_Toolbox/tree/v0.1</a><br />
<br />
As I said, I have other ideas for further enhancing this toolbox, which is why I created a branch named v0.1. I hope to find the time to write these additional scripts.<br />
<br />
<br />
<span style="font-size: large;">Combining ArcGIS and R - Clustering Toolbox</span> (2016-07-02)<br />
Last year at the ESRI User Conference in San Diego, there was an announcement of an initiative to bridge ArcGIS and R. This became a reality, I think early this year, with<a href="https://r-arcgis.github.io/" target="_blank"> R-Bridge</a>.<br />
Basically, ESRI has created an R library that is able to communicate and exchange data between ArcGIS and R, so that we can create ArcGIS toolboxes using R scripts.<br />
<br />
I am particularly interested in this application because R has become quite powerful for spatial data analysis in the last few years. However, I have the impression that within the geography community, R is still considered a bit of an outsider. This is because the main GIS application, i.e. ArcGIS, is based on Python and therefore courses in departments of geography and geomatics tend to focus on teaching Python, neglecting R. This I think is a mistake, since R in my opinion is easier to learn for people without a background in computer science, and has very powerful libraries for spatio-temporal data analysis.<br />
For these reasons, the creation of R-Bridge is particularly welcome from my side, because it will allow me to teach students how to create powerful new Toolboxes for ArcGIS based on scripts written in R. For example, this autumn semester we will add a module about geo-sensors to the GIS III course, in which I will teach spatio-temporal data analysis using R within ArcGIS. This way students will learn the power of R starting from the familiar environment and user interface of ArcGIS.<br />
Since I had never worked with R-Bridge before, today I started doing some testing, and I decided that the best way to learn it was to create a simple Toolbox to perform K-Means clustering on point shapefiles, which I believe is a function not available in ArcGIS. In this post I will describe in detail how to create the Toolbox and the R script to perform the analysis.<br />
<br />
<span style="font-size: large;">R-Bridge Installation</span><br />
Installing R-Bridge is extremely simple. You only need a recent version of R (I have 3.0.0) installed on your PC (32-bit or 64-bit, consistent with the version of ArcGIS you have installed) and ArcGIS 10.3.1 or newer.<br />
At this point you can download the installation files from the R-Bridge GitHub page: <a href="https://github.com/R-ArcGIS/r-bridge-install" target="_blank">https://github.com/R-ArcGIS/r-bridge-install</a><br />
You can unzip its content anywhere on your PC. At this point you need to run ArcGIS as administrator (this is very important!!), and then in ArcCatalog navigate to the folder where you unzipped the download.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgddE7yivvRWYhjR0onm9RNEfXAf7N1BH4RTUOBKSxGaNE1KBz9AvNOy7QhIncKovM6x_gL0IRaKZNmJxzQ79utEOmOZ9ctl6sjUwagbhx3f_lWEUussX4w6cAV8htsr-2gQr4F7NkE1MHb/s1600/Install_Fig1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgddE7yivvRWYhjR0onm9RNEfXAf7N1BH4RTUOBKSxGaNE1KBz9AvNOy7QhIncKovM6x_gL0IRaKZNmJxzQ79utEOmOZ9ctl6sjUwagbhx3f_lWEUussX4w6cAV8htsr-2gQr4F7NkE1MHb/s1600/Install_Fig1.jpg" /></a></div>
<br />
Now you just need to run the script "Install R Bindings" and ArcGIS will take care of the rest. I found the process extremely easy!!<br />
<br />
<span style="font-size: large;">Getting Started</span><br />
ESRI created two examples to help us get started with the development of packages for ArcGIS written in the R language. You can find them here: <a href="https://github.com/R-ArcGIS/r-sample-tools" target="_blank">https://github.com/R-ArcGIS/r-sample-tools</a><br />
When you unzip this archive you will find a folder named "Scripts" containing R scripts optimized for use in ArcGIS. I started from these to learn how to create scripts that work.<br />
<br />
<br />
<span style="font-size: large;">Clustering Example - R Script</span><br />
As I said, ESRI created a specific R library able to communicate back and forth with ArcGIS: it is called "arcgisbinding" and it is installed during the installation process we completed before. This library has a series of functions that allow the R script to be run from the ArcGIS console and its GUI. For this reason the R script is a bit different from the one you would write to achieve the same result outside of ArcGIS. It is probably better if I just start including some code so that you can better understand.<br />
Below is the full R script I used for this example:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;">### KMeans Clustering Toolbox</span>
<span style="color: #666666; font-style: italic;">##Author: Fabio Veronesi</span>
tool_exec <- <a href="http://inside-r.org/r-doc/base/function"><span style="color: #003399; font-weight: bold;">function</span></a><span style="color: #009900;">(</span>in_params<span style="color: #339933;">,</span> out_params<span style="color: #009900;">)</span>
<span style="color: #009900;">{</span>
<span style="color: black; font-weight: bold;">if</span> <span style="color: #009900;">(</span>!requireNamespace<span style="color: #009900;">(</span><span style="color: blue;">"sp"</span><span style="color: #339933;">,</span> quietly = <span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/install.packages"><span style="color: #003399; font-weight: bold;">install.packages</span></a><span style="color: #009900;">(</span><span style="color: blue;">"sp"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/require"><span style="color: #003399; font-weight: bold;">require</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/sp">sp</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/print"><span style="color: #003399; font-weight: bold;">print</span></a><span style="color: #009900;">(</span><span style="color: blue;">"K-Means Clustering of Shapefiles"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/print"><span style="color: #003399; font-weight: bold;">print</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Author: Fabio Veronesi"</span><span style="color: #009900;">)</span>
source_dataset = in_params<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>
nclust = in_params<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>
variable = in_params<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>
out_shape = out_params<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>
<span style="color: #666666; font-style: italic;">### Read Data</span>
arc.progress_label<span style="color: #009900;">(</span><span style="color: blue;">"Loading Dataset"</span><span style="color: #009900;">)</span>
d <- arc.open<span style="color: #009900;">(</span>source_dataset<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">### Create a Data.Frame with the variables to cluster</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a> <- arc.select<span style="color: #009900;">(</span>d<span style="color: #339933;">,</span> variable<span style="color: #009900;">)</span>
data_clust <- <a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">[</span><span style="color: #339933;">,</span>variable<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span>
<span style="color: black; font-weight: bold;">if</span><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>variable<span style="color: #009900;">)</span>><span style="color: #cc66cc;">1</span><span style="color: #009900;">)</span><span style="color: #009900;">{</span>
<span style="color: black; font-weight: bold;">for</span><span style="color: #009900;">(</span>i <span style="color: black; font-weight: bold;">in</span> <span style="color: #cc66cc;">2</span>:<a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>variable<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">{</span>
data_clust <- <a href="http://inside-r.org/r-doc/base/cbind"><span style="color: #003399; font-weight: bold;">cbind</span></a><span style="color: #009900;">(</span>data_clust<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">[</span><span style="color: #339933;">,</span>variable<span style="color: #009900;">[</span>i<span style="color: #009900;">]</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span>
<span style="color: #009900;">}</span>
<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>data_clust<span style="color: #009900;">)</span> <- variable
<span style="color: black; font-weight: bold;">for</span><span style="color: #009900;">(</span>i <span style="color: black; font-weight: bold;">in</span> <span style="color: #cc66cc;">1</span>:<a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>variable<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">{</span>
<a href="http://inside-r.org/r-doc/grDevices/dev.new"><span style="color: #003399; font-weight: bold;">dev.new</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/graphics/hist"><span style="color: #003399; font-weight: bold;">hist</span></a><span style="color: #009900;">(</span>data_clust<span style="color: #009900;">[</span><span style="color: #339933;">,</span>i<span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>main=paste0<span style="color: #009900;">(</span><span style="color: blue;">"Histogram of "</span><span style="color: #339933;">,</span>variable<span style="color: #009900;">[</span>i<span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>xlab=variable<span style="color: #009900;">[</span>i<span style="color: #009900;">]</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span>
clusters <- <a href="http://inside-r.org/r-doc/stats/kmeans"><span style="color: #003399; font-weight: bold;">kmeans</span></a><span style="color: #009900;">(</span>data_clust<span style="color: #339933;">,</span> nclust<span style="color: #009900;">)</span>
result <- <a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/survival/cluster">cluster</a>=clusters$cluster<span style="color: #009900;">)</span>
arc.write<span style="color: #009900;">(</span>out_shape<span style="color: #339933;">,</span> result<span style="color: #339933;">,</span> coords = arc.shape<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/print"><span style="color: #003399; font-weight: bold;">print</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Done!!"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/return"><span style="color: #003399; font-weight: bold;">return</span></a><span style="color: #009900;">(</span>out_params<span style="color: #009900;">)</span>
<span style="color: #009900;">}</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
As you can see, the whole script is wrapped in a function called <code>tool_exec</code> with two arguments, <code>in_params</code> and <code>out_params</code>. These are the lists of input and output parameters that will be passed to R from ArcGIS.<br />
The next three lines are taken directly from the script that ESRI provides. Basically, if the user does not have the package <code>sp</code> installed, R will download, install and load it. You can copy and paste these lines if you need other packages installed on the user's machine to perform your analysis. In this case I am only using the function <code>kmeans</code>, available in the package <code>stats</code>, which is loaded by default in R.<br />
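If your tool needs several packages, the same install-and-load pattern can be generalized. A possible sketch, where the package names are placeholders to be replaced with whatever your analysis requires:

```r
# Sketch: install-and-load several packages, following the same logic
# the ESRI template uses for "sp". Replace the names with what you need.
pkgs <- c("sp", "rgdal")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE))
    install.packages(p)
  library(p, character.only = TRUE)
}
```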
At this point I inserted two <code>print</code> calls with the title and my name on them. This has no real purpose except to let you know that you can <code>print</code> information from R directly onto the dialog in ArcGIS with simple <code>print</code> calls. We will see at the end how they look.<br />
Now we need to create an object for each input and output parameter. We will need to specify these in ArcGIS once we create the Toolbox. Since I want to cluster a shapefile, the first input parameter will be this object. Then I want the user to select the number of clusters, so I will create another option for that. I would also like the user to be able to select the variables s/he wants to use for clustering, so I will need to create an option for this in ArcGIS and then collect it into the object <code>variable</code>. Finally, ArcGIS will save another shapefile with the points plus their cluster. This will be the only output parameter, and I collect it into the object <code>out_shape</code>.<br />
<br />
Now I can start the real computation. The function <code>arc.open</code> allows us to import into R the shapefile selected in the Toolbox in ArcGIS. If you want, you can take a look at the structure of this object by simply inserting <code>print(str(d))</code> right after it. This will print the structure of the object <code>d</code> in the dialog created by ArcGIS.<br />
Next we have the function <code>arc.select</code>, which allows us to extract from <code>d</code> only the variables we need, i.e. those selected by the user in the Toolbox GUI.<br />
At this point we need to create a <code>data.frame</code> that we are going to fill with only the variables the user selected in the Toolbox. The object <code>variable</code> is a list of strings, therefore we can use its elements to extract single columns from the object <code>data</code>, with the syntax <code>data[,variable[1]]</code>.<br />
Since we do not know how many variables users will select, and we do not want to limit them, I created an <code>if</code> statement with a loop that attaches additional columns to the object <code>data_clust</code>. Then I replaced the column names in <code>data_clust</code> with the names of the variables, which will help me in the next phase.<br />
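As a side note, the same result can probably be obtained without the loop, because a character vector of column names selects several columns at once. A sketch, assuming the object returned by <code>arc.select</code> behaves like a regular <code>data.frame</code> for subsetting:

```r
# Sketch: select all user-chosen columns in one step instead of a cbind loop.
# drop = FALSE keeps the result a data.frame even when only one variable
# is selected.
data_clust <- data[, variable, drop = FALSE]
```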
In fact, I now want to produce histograms of the variables the user selected. This allows me to check whether what I am about to do makes sense, and it is one of those things at which R excels. For this I can simply call the function <code>plot</code> and R will show the plot even when called from ArcGIS, as simple as that!! We only need to remember to insert <code>dev.new()</code> so that each plot is created in a separate window and the user can see/save them all.<br />
After this step we can call the function <code>kmeans</code> to cluster our data. Then we can collect the results in a new <code>data.frame</code> and finally use the function <code>arc.write</code> to write the object <code>out_shape</code> with the results. As you can see, we also need to specify the coordinates of each point, which can be done by calling the function <code>arc.shape</code>.<br />
Then we print the string "Done!!" and return the output parameters, which will be taken by ArcGIS and shown to the user.<br />
<br />
<br />
<span style="font-size: large;">Toolbox</span><br />
Now that we've seen how to create the R script, we can take a look at the Toolbox, since both need to be developed in parallel.<br />
Creating a new Toolbox in ArcGIS is very simple: we just need to open ArcCatalog, right-click where we want to create it and select New->Toolbox.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7P9is__a5Ej9e3g3lr558KCT7bih65LR7XVsPW5AaF_RFeIr1cU7A0ThYEMRKeY7Kn-vA11Nl_sg67svDNskWJfpqt6LPWMtzQg20PDrj8QEp-yFRQhHAb8AGAVsT15bWEdC-6mp9lx4L/s1600/Fig1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7P9is__a5Ej9e3g3lr558KCT7bih65LR7XVsPW5AaF_RFeIr1cU7A0ThYEMRKeY7Kn-vA11Nl_sg67svDNskWJfpqt6LPWMtzQg20PDrj8QEp-yFRQhHAb8AGAVsT15bWEdC-6mp9lx4L/s320/Fig1.jpg" width="262" /></a></div>
<br />
Once this is done we need to add a script to this Toolbox. To do this we can again right-click on the Toolbox we just created and select Add->Script...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfN-JLtY3ANSOJchKaWfIX5roK953LkXik2xS1c8BaaNnXyHJwubAtC0xmb4CLFOgJpXe-aeEELjaWSRZqozKvW7-1TRqnebNiON6UMtXFAagqP6jOMpVKNHArZxVcQaOQqdx6b4zr_JtD/s1600/Fig2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfN-JLtY3ANSOJchKaWfIX5roK953LkXik2xS1c8BaaNnXyHJwubAtC0xmb4CLFOgJpXe-aeEELjaWSRZqozKvW7-1TRqnebNiON6UMtXFAagqP6jOMpVKNHArZxVcQaOQqdx6b4zr_JtD/s320/Fig2.jpg" width="307" /></a></div>
<br />
At this point a dialog will appear where we can set the parameters of this script. First we add a title and a label and click proceed (my PC runs with Italian as the local language, sorry!!)<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2casdHRf2fONOXN5dDFRUv_gCeSptjAYDnfbvq2HmCEeE_8JTh6GkxVvM7PhhlN101YmosYJg68_EUQXTY9cCt3uJLD-0PEjfIcFIbG0ZbXKB57Q_jlCkaiTfynS8HcgpJr2Po3Yn12Ud/s1600/Fig3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2casdHRf2fONOXN5dDFRUv_gCeSptjAYDnfbvq2HmCEeE_8JTh6GkxVvM7PhhlN101YmosYJg68_EUQXTY9cCt3uJLD-0PEjfIcFIbG0ZbXKB57Q_jlCkaiTfynS8HcgpJr2Po3Yn12Ud/s320/Fig3.jpg" width="257" /></a></div>
<br />
Then we need to select the R script we want to run. Since the Toolbox can be created before the script is finished, we can select an incomplete R script here and ArcGIS will not have any problem with it. This is what I did to create this example, so that I could debug the R script using <code>print</code> calls and looking at the results in the ArcGIS dialog.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyZ6piqEOH1XeccMt3vnW4EM2rAFiA6UK53u1XhaYViHldJuSOv0sHGQQoYSWameVS8FMGrVhjDZ23FrNbPy5DDF0mrY6_rEQ5xDBhu2aoqcjKYrG2XIPLO3NMucT_0aO0obYKdltWTM5B/s1600/Fig4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyZ6piqEOH1XeccMt3vnW4EM2rAFiA6UK53u1XhaYViHldJuSOv0sHGQQoYSWameVS8FMGrVhjDZ23FrNbPy5DDF0mrY6_rEQ5xDBhu2aoqcjKYrG2XIPLO3NMucT_0aO0obYKdltWTM5B/s320/Fig4.jpg" width="255" /></a></div>
<br />
The next window is very important, because it allows us to set the input and output parameters that will be passed to R. As we saw in the R script, here I set 4 parameters: 3 inputs and 1 output. It is important that the order matches what we have in the R script, so for example the number of clusters is the second input.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4hbGFBqNoDFjr-p3ir68touHDnFm0wthhfzqbExREBa-7Iy_k2rQLVlaASKF06_JR1o4fIx1quzJOQy1U1DROg6iMi0ILyQZq7f3f8jzinmXjexZFqAkfBlTjGs6J5ui0kRme9VjpgRrY/s1600/Fig5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4hbGFBqNoDFjr-p3ir68touHDnFm0wthhfzqbExREBa-7Iy_k2rQLVlaASKF06_JR1o4fIx1quzJOQy1U1DROg6iMi0ILyQZq7f3f8jzinmXjexZFqAkfBlTjGs6J5ui0kRme9VjpgRrY/s320/Fig5.jpg" width="303" /></a></div>
<br />
The first parameter is the input data. For this I used the type "Table View", which allows the user to select a dataset s/he already imported into ArcGIS. I selected this because usually I first load data into ArcGIS, check them, and then perform some analysis. However, if you prefer, I think you could also select the type Shapefile, to allow users to select a shp file directly from their PC.<br />
The next parameter is the number of clusters, which is a simple number. Then we have the field variables. This is very important, because we need to set it in a way that allows users to select variables directly from the dataset we are importing.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNbZhEUEyQTwORBztA53k739jdJkQkqvWMSErqrLxHeYpOt0uw7m37xi-unjMAAosU00-Ux10OgO4NZgd1qyuUFsx09cmSwk0q1JnEry_fmRXsoX4SAqp8LBu5__QBlMLM048ayWt3PBr3/s1600/Fig6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNbZhEUEyQTwORBztA53k739jdJkQkqvWMSErqrLxHeYpOt0uw7m37xi-unjMAAosU00-Ux10OgO4NZgd1qyuUFsx09cmSwk0q1JnEry_fmRXsoX4SAqp8LBu5__QBlMLM048ayWt3PBr3/s320/Fig6.jpg" width="299" /></a></div>
<br />
We can do that by setting the options "Filter" and "Obtained from" that you see in the image above. It is important that we set "Obtained from" to the name of our input data.<br />
At this point we can set the output file, which is a shapefile.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3TYt9Teeejczic497OkfY_ARIU_uSBsjA6xfAfh1Uqttu-pBJqSCJHOYCPslM16x52IBQvpBlj373plsPJ8FW6CKCJNm-hWZQ4s8ViOTmH9uHYDHv7F4Wvs78QbqJ9tOCnFEq4N-0Ck8M/s1600/Fig7.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3TYt9Teeejczic497OkfY_ARIU_uSBsjA6xfAfh1Uqttu-pBJqSCJHOYCPslM16x52IBQvpBlj373plsPJ8FW6CKCJNm-hWZQ4s8ViOTmH9uHYDHv7F4Wvs78QbqJ9tOCnFEq4N-0Ck8M/s320/Fig7.jpg" width="309" /></a></div>
<br />
One thing we could do is set the symbology for the shapefile that will be created at the end of the script. To do so we need to create and set a layer file. I did it by changing the symbology of another shapefile and then exporting it. The only problem is that this technique is not really flexible, meaning that if the layer is set for 5 clusters and users select 10, the symbology will still have only 5 colors. I am not sure whether this can be changed or adapted somehow. If the symbology file is not provided, the R script will still run correctly and produce a result, but the result will not have any colors and users will need to set these afterwards, which probably is not a big deal.<br />
Once this final step is done we can finish the creation of the tool and take a look at the resulting GUI:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfC-Bzf8Y-XA-Mi6Ur8KHZGMPz_DrVVKwDPku-nqp-5D1UnOmH5PsoGMkGjPjcfZCqWH98A80hSbG2MCiSi9iNdWVTpQeJ5vH-3ayOPlnD4Y4LqAojWgMx2erGcY-e9pTTlV0d9VSCov11/s1600/Fig8.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfC-Bzf8Y-XA-Mi6Ur8KHZGMPz_DrVVKwDPku-nqp-5D1UnOmH5PsoGMkGjPjcfZCqWH98A80hSbG2MCiSi9iNdWVTpQeJ5vH-3ayOPlnD4Y4LqAojWgMx2erGcY-e9pTTlV0d9VSCov11/s320/Fig8.jpg" width="320" /></a></div>
<br />
<br />
<span style="font-size: large;">Run the Toolbox</span><br />
Now that we have created both the script and the Toolbox to run it, we can test it. I included a shapefile with the locations of earthquakes that I downloaded from the USGS website yesterday (01 July 2016), so that you can test the tool with real data. As variables you can select: depth, magnitude, and distance from volcanoes, faults and tectonic plates. For more info on this dataset please look at one of my previous posts: <a href="http://r-video-tutorial.blogspot.ch/2015/06/cluster-analysis-on-earthquake-data.html" target="_blank">http://r-video-tutorial.blogspot.ch/2015/06/cluster-analysis-on-earthquake-data.html</a><br />
We only need to fill in the values in the GUI and then click OK. You can see the result in the image below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig9tAdQO_GBJFWe_ZUaYOiDlQ8cDtXct6QPJljBmRUeOTrBhGJZfAvTDd4kQpO6VLa8_7rDXr_lTIdtLz2QtXEobCUuOavu66d9ms7emFgwA10ZdIlubUUX-L4j5y-NsEcgVB2OrfGfxuu/s1600/Fig9.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="339" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig9tAdQO_GBJFWe_ZUaYOiDlQ8cDtXct6QPJljBmRUeOTrBhGJZfAvTDd4kQpO6VLa8_7rDXr_lTIdtLz2QtXEobCUuOavu66d9ms7emFgwA10ZdIlubUUX-L4j5y-NsEcgVB2OrfGfxuu/s640/Fig9.jpg" width="640" /></a></div>
<br />
As you can see, R first produces a histogram of the variable(s) the user selected, which can be saved. Then it creates a shapefile that is automatically imported into ArcGIS. Moreover, as you can see from the dialog box, we can use the function <code>print</code> to provide messages to the user. Here I printed only some simple text, but it could just as well be numerical results.<br />
<br />
<span style="font-size: large;">Source Code</span><br />
The source code for this Toolbox is provided in my GitHub at this link:<br />
<a href="https://github.com/fveronesi/Clustering_Toolbox" target="_blank">https://github.com/fveronesi/Clustering_Toolbox</a><br />
<br />
<br />Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com8tag:blogger.com,1999:blog-1442302563171663500.post-10421538889177206672016-04-25T13:50:00.000+02:002016-04-25T13:51:24.840+02:00Learning R for Data Visualization [Video]Last year Packt asked me to develop a video course to teach various techniques of data visualization in R. Since I love the idea of video courses and tutorials, and I also enjoy plotting data, I readily agreed.<br />
The result is this course, published last March, which I will briefly present below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.packtpub.com/sites/default/files/bookretailers/9781785882890.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://www.packtpub.com/sites/default/files/bookretailers/9781785882890.jpg" width="259" /></a></div>
<br />
The course is available here:<br />
<a href="https://www.packtpub.com/big-data-and-business-intelligence/learning-r-data-visualization-video" target="_blank">https://www.packtpub.com/big-data-and-business-intelligence/learning-r-data-visualization-video</a><br />
<br />
I wanted to create a course that was easy to follow, and at the same time could provide a good basis even for the most advanced forms of data visualization available today in R.<br />
Packt was interested in presenting ggplot2, which is definitely the most advanced way of creating static plots. Since I regularly use ggplot2 and I find it a tremendous tool, I was glad to be able to present its functionality in more detail. Three chapters are dedicated to this package. Here I present all the most important types of plots: histograms, box-plots, scatterplots, bar-charts and time-series. Moreover, a whole chapter is dedicated to embellishing the default plots by adding elements such as text labels and much more.<br />
<br />
However, I am also very interested in interactive plotting, which I believe is now rapidly becoming commonplace for lots of applications. For this reason two chapters are completely dedicated to interactive plots. In the first I present the package rCharts, which is extremely powerful but also a bit tricky to use at times. In many cases there is little documentation to work with, and while developing the course I often found myself wandering through Stack Overflow searching for answers. Luckily for all of us, Prof. Ramnath Vaidyanathan, the creator of rCharts, is always available to answer all the users' questions quickly and clearly. In chapter 5 the viewer will be able to start from zero and quickly create nice interactive versions of all the plots I covered with ggplot2. <br />
<br />
The last chapter is dedicated to Shiny and it is aimed at the creation of a full website for importing and plotting data. Here the reader will first learn the basics of Shiny and then will write the code to create the website and add lots of interesting functionalities.<br />
<br />
I hope this video course will help R users become familiar with data visualization.<br />
I would also like to take this opportunity to stress that I am happy to support viewers throughout the learning process, meaning that if you have any questions about the material in the course you should not hesitate to contact me at <a href="mailto:%20info@fabioveronesi.net" target="_blank">info@fabioveronesi.net</a><br />
<br />
<br />
<br />Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com1tag:blogger.com,1999:blog-1442302563171663500.post-72542429895195853892015-12-31T17:35:00.001+01:002015-12-31T17:36:18.139+01:00Wind Resource Assessment This is an article we recently published in "Renewable and Sustainable Energy Reviews". It starts with a thorough review of the methods used for wind resource assessment: from algorithms based on physical laws to others based on statistics, plus mixed methods.<br />
In the second part of the manuscript we present a method for wind resource assessment based on the application of Random Forest, coded completely in R.<br />
<br />
Elsevier allows you to download the full paper for FREE until the 12th of February, so if you are interested please download a copy.<br />
This is the link: <a href="http://authors.elsevier.com/a/1SG5a4s9HvhNZ6" target="_blank">http://authors.elsevier.com/a/1SG5a4s9HvhNZ6</a><br />
<br />
Below is the abstract.<br />
<br />
<h4>
Abstract</h4>
Wind resource assessment is fundamental when selecting a site for wind energy projects. Wind is influenced by several environmental factors and understanding its spatial variability is key in determining the economic viability of a site. Numerical wind flow models, which solve physical equations that govern air flows, are the industry standard for wind resource assessment. These methods have been proven over the years to be able to estimate the wind resource with a relatively high accuracy. However, measuring stations, which provide the starting data for every wind estimation, are often located at some distance from each other, in some cases tens of kilometres or more. This adds an unavoidable amount of uncertainty to the estimations, which can be difficult and time consuming to calculate with numerical wind flow models. For this reason, even though there are ways of computing the overall error of the estimations, methods based on physics fail to provide planners with detailed spatial representations of the uncertainty pattern. In this paper we introduce a statistical method for estimating the wind resource, based on statistical learning. In particular, we present an approach based on ensembles of regression trees, to estimate the wind speed and direction distributions continuously over the United Kingdom (UK), and provide planners with a detailed account of the spatial pattern of the wind map uncertainty.Fabio Veronesihttp://www.blogger.com/profile/07827549157455488947noreply@blogger.com2tag:blogger.com,1999:blog-1442302563171663500.post-5375202377847668292015-08-27T11:46:00.002+02:002015-08-28T14:51:49.525+02:00Spatio-Temporal Kriging in R<h2>
Preface</h2>
<div style="text-align: justify;">
I am writing this post mainly as a reminder to myself of the theoretical background and the steps needed to perform spatio-temporal kriging in <b>gstat</b>. </div>
<div style="text-align: justify;">
This month I had some free time to spend on small projects not specifically related to my primary occupation. I decided to spend some time trying to learn this technique since it may become useful in the future. However, I have never used it before so I had to first try to understand its basics both in terms of theoretical background and programming.</div>
<div style="text-align: justify;">
Since I used several resources to get a handle on it, I decided to share my experience and thoughts in this blog post, because they may be useful for other people trying the same method. However, this post cannot be considered a full review of spatio-temporal kriging and its theoretical basis. I mention only some important details to guide myself and the reader through the topic, but these are clearly not exhaustive. At the end of the post I included some references to additional material you may want to browse for more details.
</div>
<br />
<h2>
Introduction</h2>
<div style="text-align: justify;">
This is the first time I considered spatio-temporal interpolation. Even though many datasets are indexed in both space and time, in the majority of cases time is not really taken into account for the interpolation. As an example we can consider temperature observations measured hourly from various stations in a determined study area. There are several different things we can do with such a dataset. We could for instance create a series of maps with the average daily or monthly temperatures. Time is clearly considered in these studies, but not explicitly during the interpolation phase. If we want to compute daily averages we first perform the averaging and then kriging. However, the temporal interactions are not considered in the kriging model.
An example of this type of analysis is provided by Gräler (2012) in the following image, which depicts monthly averages for some environmental parameter in Germany:</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgELv7VVJiCahyphenhyphenMTJ_yjrNtHfLxMpeBm00cm5WIRNBF9O-lI50OzVA2eRkzlVPz81zYa_JNlL5kWfJXkM8F9n5Bry4gn92GVXbZxwD1b00u1wWcQsa9QUSAQGqRYcDXMwLZusxDVzarHUHM/s1600/Fig1.tif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="303" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgELv7VVJiCahyphenhyphenMTJ_yjrNtHfLxMpeBm00cm5WIRNBF9O-lI50OzVA2eRkzlVPz81zYa_JNlL5kWfJXkM8F9n5Bry4gn92GVXbZxwD1b00u1wWcQsa9QUSAQGqRYcDXMwLZusxDVzarHUHM/s400/Fig1.tif" width="400" /></a></div>
<br />
<div style="text-align: justify;">
There are cases and datasets in which performing 2D kriging on “temporal slices” may be appropriate. However, there are other instances where this is not possible, and therefore the only solution is to take time into account during kriging. To do so, two approaches are suggested in the literature: using time as a third dimension, or fitting a covariance model with both spatial and temporal components (Gräler et al., 2013).
<br />
<br /></div>
<h3>
Time as the third dimension</h3>
<div style="text-align: justify;">
The idea behind this technique is extremely easy to grasp. To better understand it we can simply take a look at the equation to calculate the sample semivariogram, from Sherman (2011):
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS_-RynDmoZxUiAfSzToFghFV8oxnEKRO_E4WxfAlDXxcpAuk0n2QDnaAyRU-Omj0aCU1sJDUhNyVt2sqxMIWlzsUovFqaXawYOLjGR5BnafsrHkV0_WIPnXk-hEzadWj-EJpQ7B8Lo6W3/s1600/Eq1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="67" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS_-RynDmoZxUiAfSzToFghFV8oxnEKRO_E4WxfAlDXxcpAuk0n2QDnaAyRU-Omj0aCU1sJDUhNyVt2sqxMIWlzsUovFqaXawYOLjGR5BnafsrHkV0_WIPnXk-hEzadWj-EJpQ7B8Lo6W3/s640/Eq1.jpg" width="640" /></a></div>
<br />
<div style="text-align: justify;">
Under Matheron’s Intrinsic Hypothesis (Oliver et al., 1989) we can assume that the variance between two points, <i>s<sub>i</sub></i> and <i>s<sub>j</sub></i>, depends only on their separation, which we indicate with the vector <i>h</i> in Eq.1. If we imagine a 2D example (i.e. purely spatial), the vector <i>h</i> is simply the one that connects two points, <i>i</i> and <i>j</i>, with a line, and its value can be calculated with the Euclidean distance:
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghaacxbPrYVDIqmQs2_AS9owEx8UebWopl5jVE1G-3iQCLgIItHnKdpgvong-gGm7Sovi4I_eJep5QLDhpOuwV_c0aS6r79cGPJUo-sLkvCmAygu7mSxaISXt5xXLuX9DQVcMPDPz1dpP3/s1600/Eq2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghaacxbPrYVDIqmQs2_AS9owEx8UebWopl5jVE1G-3iQCLgIItHnKdpgvong-gGm7Sovi4I_eJep5QLDhpOuwV_c0aS6r79cGPJUo-sLkvCmAygu7mSxaISXt5xXLuX9DQVcMPDPz1dpP3/s640/Eq2.jpg" /></a></div>
<br />
<div style="text-align: justify;">
If we consider a third dimension, which can be depth, elevation or time, it is easy to imagine Eq.2 being adapted to accommodate the additional coordinate.
The only problem with this method is that, in order for it to work properly, the temporal dimension needs to have a range similar to the spatial dimension. For this reason time needs to be scaled to align it with the spatial dimension. Gräler et al. (2013) suggest several ways to optimize the scaling and achieve meaningful results; please refer to that article for more information.
<br />
<br /></div>
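To make the idea concrete, the separation with scaled time as a third coordinate can be sketched in a few lines of base R. The coordinates and the scaling factor <code>k</code> below are made-up values for illustration only, not the optimized scalings discussed by Gräler et al. (2013):

```r
# Two observations: coordinates in metres, time in hours
p1 <- c(x=0, y=0);     t1 <- 0
p2 <- c(x=300, y=400); t2 <- 6

# Hypothetical scaling factor: metres of spatial separation
# considered equivalent to one hour of temporal separation
k <- 100

# Treat scaled time as a third coordinate in the Euclidean distance of Eq.2
h <- sqrt(sum((p2 - p1)^2) + (k * (t2 - t1))^2)
h
# sqrt(300^2 + 400^2 + 600^2) = 781.0250
```

In practice the choice of <code>k</code> strongly affects the variogram, which is why the scaling needs to be optimized rather than guessed.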
<h3>
Spatio-Temporal Variogram </h3>
<div style="text-align: justify;">
The second way of taking time into account is to adapt the covariance function to the time component. In this case for each point <i>s<sub>i</sub></i> there will be a time <i>t<sub>i</sub></i> associated with it, and to calculate the variance between this point and another we would need to calculate their spatial separation <i>h</i> and their temporal separation <i>u</i>. Thus, the spatio-temporal variogram can be computed as follows, from Sherman (2011):
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijaW2ucwQznn17ODBpanH5JD2M1ZsXR3oR9GJDzH2SBUR2jXchbcgGHEnSHAmuW3eW0GsX7n0bV9kwy0oxN_mMLHikTHqYRUwqHjZ1CFYSY2m6SrAxeT5DV8y3KkhSANO376DazfVGZdrd/s1600/Eq3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijaW2ucwQznn17ODBpanH5JD2M1ZsXR3oR9GJDzH2SBUR2jXchbcgGHEnSHAmuW3eW0GsX7n0bV9kwy0oxN_mMLHikTHqYRUwqHjZ1CFYSY2m6SrAxeT5DV8y3KkhSANO376DazfVGZdrd/s640/Eq3.jpg" /></a></div>
<br />
<div style="text-align: justify;">
With this equation we can compute a variogram taking into account every pair of points separated by distance <i>h</i> and time <i>u</i>.<br />
<br />
<br /></div>
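In <b>gstat</b> the sample spatio-temporal variogram of Eq.3 is computed by the function <code>variogramST</code>. The snippet below is only a sketch of its typical use; the locations, times and ozone values are synthetic random numbers, purely to illustrate the call, and the lag settings are arbitrary. The preparation of the real dataset is covered in the next sections:

```r
library(sp)
library(spacetime)
library(gstat)

set.seed(1)
# Made-up monitoring locations and a short sequence of hourly time stamps
pts <- SpatialPoints(cbind(lon=runif(10, 8.4, 8.6), lat=runif(10, 47.3, 47.5)))
times <- as.POSIXct("2011-10-14 11:00:00") + 3600 * (0:5)

# STFDF: a full space-time grid, one value per location-time combination
stfdf <- STFDF(pts, times, data.frame(Ozone=rnorm(10 * 6)))

# Sample variogram binned over spatial lags h and temporal lags u
var.st <- variogramST(Ozone~1, data=stfdf, tlags=0:3)
plot(var.st)
```

With such a tiny synthetic dataset the result is meaningless, of course; the point is only the shape of the call and of the returned object.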
<h2>
Spatio-Temporal Kriging in R</h2>
<div style="text-align: justify;">
In R we can perform spatio-temporal kriging directly from <b>gstat</b> with a set of functions very similar to what we are used to in standard 2D kriging. The package <b>spacetime</b> provides ways of creating objects where the time component is taken into account, and <b>gstat</b> uses these formats for its space-time analysis. Here I will present an example of spatio-temporal kriging using sensor data.
</div>
<h3>
Data</h3>
<div style="text-align: justify;">
In 2011, as part of the OpenSense project, several wireless sensors to measure air pollution (O3, NO2, NO, SO2, VOC, and fine particles) were installed on top of trams in the city of Zurich. The project now is in its second phase and more information about it can be found here: <a href="http://www.opensense.ethz.ch/trac/wiki/WikiStart">http://www.opensense.ethz.ch/trac/wiki/WikiStart</a> </div>
<div style="text-align: justify;">
On this page some example data about ozone and ultrafine particles are also distributed in CSV format. These data have the following characteristics: time is in UNIX format, while position is in degrees (WGS 84). I will use these data to test spatio-temporal kriging in R.
</div>
<h3>
Packages</h3>
<div style="text-align: justify;">
To complete this exercise we need to load several packages. First of all <b>sp</b>, for handling spatial objects, and <b>gstat</b>, which has all the functions needed to actually perform spatio-temporal kriging. Then <b>spacetime</b>, which we need to create the spatio-temporal object. These are the three crucial packages. However, I also loaded some others that I used to complete smaller tasks. I loaded the <b>raster</b> package because I use the functions <code>coordinates</code> and <code>projection</code> to create spatial data. There is no need to load it, since the same functions are available under different names in <b>sp</b>; however, I prefer these two because they are easier to remember. The last packages are <b>rgdal</b> and <b>rgeos</b>, for performing various operations on geodata.</div>
<div style="text-align: justify;">
The script therefore starts as follows: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/gstat">gstat</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/sp">sp</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>spacetime<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>raster<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/rgdal">rgdal</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>rgeos<span style="color: #009900;">)</span> </pre>
</div>
</div>
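As a small aside, the equivalence between the <b>raster</b> and <b>sp</b> functions mentioned above can be sketched as follows; the point and its attribute are made-up values:

```r
library(sp)
library(raster)

# A made-up observation, purely to illustrate the two equivalent idioms
pt <- data.frame(x=8.55, y=47.40, ozone=50)

# Promote the data.frame to a SpatialPointsDataFrame (sp)
coordinates(pt) <- ~x+y

# raster's projection<- ...
projection(pt) <- CRS("+proj=longlat +datum=WGS84")
# ...does the same job as sp's proj4string<- :
# proj4string(pt) <- CRS("+proj=longlat +datum=WGS84")
```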
<br />
<h3>
Data Preparation</h3>
<div style="text-align: justify;">
There are a couple of issues to solve before we can dive into kriging. The first thing we need to do is translate the time from UNIX format to <code>POSIXlt</code> or <code>POSIXct</code>, which are standard ways of representing time in R. Before that, of course, we have to set the working directory and load the csv file:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/setwd"><span style="color: #003399; font-weight: bold;">setwd</span></a><span style="color: #009900;">(</span><span style="color: blue;">"..."</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a> <- <a href="http://inside-r.org/r-doc/utils/read.table"><span style="color: #003399; font-weight: bold;">read.table</span></a><span style="color: #009900;">(</span><span style="color: blue;">"ozon_tram1_14102011_14012012.csv"</span><span style="color: #339933;">,</span> sep=<span style="color: blue;">","</span><span style="color: #339933;">,</span> header=T<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Now we need to address the UNIX time. So what is UNIX time anyway? </div>
<div style="text-align: justify;">
It is a way of tracking time as the number of seconds between a particular time and the UNIX epoch, which is January the 1st 1970 GMT. Basically, I am writing the first draft of this post on August the 18th at 16:01:00 CET. If I count the number of seconds from the UNIX epoch to this exact moment (there is an app for that!!) I find the UNIX time, which is equal to: 1439910060 </div>
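As a quick check, we can reproduce this conversion directly in base R; printing in GMT avoids any dependence on the local time zone:

```r
# Convert a 10-digit UNIX time (seconds since the epoch) to a Date/Time object
as.POSIXct(1439910060, origin="1970-01-01", tz="GMT")
# "2015-08-18 15:01:00 GMT", i.e. 16:01 in CET
```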
<div style="text-align: justify;">
Now let's take a look at one entry in the column “<i>generation_time</i>” of our dataset: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> <a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$generation_time<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: blue;">"1318583686494"</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
As you may notice, here the UNIX time is represented by 13 digits, while in the example above we had just 10. The reason is that here the UNIX time also includes milliseconds, which is something we cannot easily represent in R (as far as I know). So we cannot just convert each numerical value into <code>POSIXlt</code>; we first need to extract only the first 10 digits, and then convert them. This can be done in one line of code but with multiple functions:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$TIME <- <a href="http://inside-r.org/r-doc/base/as.POSIXlt"><span style="color: #003399; font-weight: bold;">as.POSIXlt</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/substr"><span style="color: #003399; font-weight: bold;">substr</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$generation_time<span style="color: #009900;">)</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> origin=<span style="color: blue;">"1970-01-01"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<div style="text-align: justify;">
We first need to transform the UNIX time from numerical to character format, using the function <code>paste(data$generation_time)</code>. This creates the character string shown above, which we can then subset using the function <code>substr</code>. This function is used to extract characters from a string and takes three arguments: a string, a starting position and a stopping position. In this case we want to drop the last 3 digits from our string, so we set the start at the first character (<code>start=1</code>) and the stop at the tenth (<code>stop=10</code>). Then we need to convert the character string back to a numeric format, using the function <code>as.numeric</code>. Now we just need one last function to tell R that this particular number is a Date/Time object. We can do this using the function <code>as.POSIXlt</code>, which takes the number we just created plus an origin. Since we are using UNIX time, we need to set the starting point at "<i>1970-01-01</i>". We can apply this to the first element of the vector <i>data$generation_time</i> to test its output: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> <a href="http://inside-r.org/r-doc/base/as.POSIXlt"><span style="color: #003399; font-weight: bold;">as.POSIXlt</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/substr"><span style="color: #003399; font-weight: bold;">substr</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$generation_time<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/stats/start"><span style="color: #003399; font-weight: bold;">start</span></a>=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/stop"><span style="color: #003399; font-weight: bold;">stop</span></a>=<span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> origin=<span style="color: blue;">"1970-01-01"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: blue;">"2011-10-14 11:14:46 CEST"</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Now the <code>data.frame</code> data has a new column named <i>TIME</i> where the Date/Time information is stored. </div>
<div style="text-align: justify;">
Another issue with this dataset is in the formats of latitude and longitude. In the csv files these are represented in the format below: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> <a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$longitude<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">832.88198</span>
<span style="color: #cc66cc;">76918</span> Levels: <span style="color: #cc66cc;">829.4379</span> <span style="color: #cc66cc;">829.43822</span> <span style="color: #cc66cc;">829.44016</span> <span style="color: #cc66cc;">829.44019</span> <span style="color: #cc66cc;">829.4404</span> ... <span style="color: black; font-weight: bold;">NULL</span>
> <a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$latitude<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">4724.22833</span>
<span style="color: #cc66cc;">74463</span> Levels: <span style="color: #cc66cc;">4721.02182</span> <span style="color: #cc66cc;">4721.02242</span> <span style="color: #cc66cc;">4721.02249</span> <span style="color: #cc66cc;">4721.02276</span> ... <span style="color: black; font-weight: bold;">NULL</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Basically, geographical coordinates are represented in degrees and minutes, but without any separator. For example, for this point the longitude is 8°32.88’, while the latitude is 47°24.22’. To obtain coordinates in a more manageable format we again need to manipulate strings.</div>
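The conversion itself is just degrees plus minutes divided by 60. As a quick check by hand on the two values shown above:

```r
# 832.88198 means 8 degrees and 32.88198 minutes
8 + 32.88198/60    # longitude: 8.548033

# 4724.22833 means 47 degrees and 24.22833 minutes
47 + 24.22833/60   # latitude:  47.40381
```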
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$LAT <- <a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/substr"><span style="color: #003399; font-weight: bold;">substr</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$latitude<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>+<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/substr"><span style="color: #003399; font-weight: bold;">substr</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$latitude<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>/<span style="color: #cc66cc;">60</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$LON <- <a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/substr"><span style="color: #003399; font-weight: bold;">substr</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$longitude<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>+<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/substr"><span style="color: #003399; font-weight: bold;">substr</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$longitude<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>/<span style="color: #cc66cc;">60</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
We again use a combination of <code>paste</code> and <code>substr</code> to extract only the numbers we need. To convert this format into decimal degrees, we need to sum the degrees and the minutes divided by 60. So in the first part of the equation we extract the digits corresponding to the degrees (the first two for latitude, the first one for longitude) and transform them back to numeric format. In the second part we extract the remainder of the string, transform it into a number and then divide it by 60. This operation creates some <code>NA</code>s in the dataset, for which you will get a warning message. We do not need to worry about it, as we can simply exclude them with the following line: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a> <- <a href="http://inside-r.org/r-doc/stats/na.omit"><span style="color: #003399; font-weight: bold;">na.omit</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">)</span> </pre>
</div>
</div>
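<br />
<div style="text-align: justify;">
As a quick illustration, the conversion can be tested on a single hypothetical value, in which the first character encodes the degrees and the rest the minutes (the exact number of degree digits depends on the dataset):</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">#Hypothetical value: "832.5" encodes 8 degrees and 32.5 minutes
lon <- "832.5"
deg <- as.numeric(substr(lon, 1, 1))
min <- as.numeric(substr(lon, 2, 10))
deg + min/60
#[1] 8.541667 </pre>
</div>
</div>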
<br />
<br />
<h3>
Subset</h3>
<div style="text-align: justify;">
The ozone dataset by OpenSense provides ozone readings roughly every minute, from October the 14th 2011 at around 11 a.m. until January the 14th 2012 at around 2 p.m.</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> <a href="http://inside-r.org/r-doc/base/min"><span style="color: #003399; font-weight: bold;">min</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$TIME<span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: blue;">"2011-10-14 11:14:46 CEST"</span>
> <a href="http://inside-r.org/r-doc/base/max"><span style="color: #003399; font-weight: bold;">max</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$TIME<span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: blue;">"2012-01-14 13:40:43 CET"</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This dataset has 200183 rows, which makes it rather big for performing kriging without a very powerful machine. For this reason, before we can proceed with this example, we have to subset our data to make them more manageable. To do so we can use the standard subsetting method for <code>data.frame</code> objects, based on Date/Time:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> <a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a> <- <a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>$TIME>=<a href="http://inside-r.org/r-doc/base/as.POSIXct"><span style="color: #003399; font-weight: bold;">as.POSIXct</span></a><span style="color: #009900;">(</span><span style="color: blue;">'2011-12-12 00:00 CET'</span><span style="color: #009900;">)</span>&data$TIME<=<a href="http://inside-r.org/r-doc/base/as.POSIXct"><span style="color: #003399; font-weight: bold;">as.POSIXct</span></a><span style="color: #009900;">(</span><span style="color: blue;">'2011-12-14 23:00 CET'</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>
> <a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">6734</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Here I created an object named <i>sub</i>, in which I kept only the readings from midnight on December the 12th to 11 p.m. on the 14th. This creates a subset of 6734 observations, for which I was able to perform the whole experiment using around 11 GB of RAM. </div>
<div style="text-align: justify;">
After this step we need to transform the object <i>sub</i> into a spatial object and then reproject it into a metric coordinate system (here World Mercator, EPSG:3395), so that the variogram will be calculated in metres and not degrees. These are the steps required to achieve all this: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;">#Create a SpatialPointsDataFrame</span>
coordinates<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #009900;">)</span>=~LON+LAT
projection<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #009900;">)</span>=CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:4326"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">#Transform into Mercator Projection</span>
ozone.UTM <- spTransform<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Now we have the object <i>ozone.UTM</i>, which is a <code>SpatialPointsDataFrame</code> with coordinates in metres.
</div>
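<br />
<div style="text-align: justify;">
A couple of quick checks can confirm that the transformation worked as expected (optional, but useful before moving on):</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">class(ozone.UTM)        #"SpatialPointsDataFrame"
proj4string(ozone.UTM)  #should refer to epsg:3395, with units in metres
bbox(ozone.UTM)         #extent of the points, now in metres </pre>
</div>
</div>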
<h3>
Spacetime Package</h3>
<div style="text-align: justify;">
<b>Gstat</b> is able to perform spatio-temporal kriging by exploiting the functionalities of the package <b>spacetime</b>, which was developed by the same team as <b>gstat</b>. In <b>spacetime</b> we have two ways of representing spatio-temporal data: the <code>STFDF</code> and <code>STIDF</code> formats. The first represents objects on a complete space-time grid; this category includes, for example, the grid of weather stations presented in Fig.1. The spatio-temporal object is created from the <i>n</i> locations of the weather stations and the <i>m</i> time intervals of their observations, so the resulting space-time grid has size <i>n</i>x<i>m</i>.</div>
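<br />
<div style="text-align: justify;">
Just to give you an idea of this format, below is a small sketch that builds a toy <code>STFDF</code> with made-up station locations and random values (none of this is part of the ozone example):</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">library(sp)
library(spacetime)
#Toy example: 3 stations x 4 hourly time steps = full 12-cell grid
stations <- SpatialPoints(cbind(x=c(0,1,2), y=c(0,1,2)))
times <- as.POSIXct("2011-10-14 11:00", tz="CET") + 3600*(0:3)
obs <- data.frame(value=rnorm(3*4))  #spatial index varies fastest
toyGrid <- STFDF(stations, times, data=obs) </pre>
</div>
</div>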
<div style="text-align: justify;">
<code>STIDF</code> objects are the ones we are going to use for this example. These are unstructured spatio-temporal objects, where both space and time change dynamically. In this case, for example, we have data collected on top of trams moving around the city of Zurich, which means that the locations of the sensors are not consistent throughout the sampling window. </div>
<div style="text-align: justify;">
Creating <code>STIDF</code> objects is fairly simple: we just need to disassemble our <code>data.frame</code> into its spatial, temporal and data components, and then merge them together to create the <code>STIDF</code> object.</div>
<div style="text-align: justify;">
The first thing to do is create the <code>SpatialPoints</code> object, with the locations of the sensors at any given time:
</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">ozoneSP <- SpatialPoints<span style="color: #009900;">(</span>ozone.UTM@coords<span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This is simple to do with the function <code>SpatialPoints</code> in the package <b>sp</b>. This function takes two arguments: the first is a <code>matrix</code> or a <code>data.frame</code> with the coordinates of each point. In this case I used the coordinates of the <code>SpatialPointsDataFrame</code> we created before, which are provided in <code>matrix</code> format. Then I set the projection to the same metric system used above (EPSG:3395).<br />
At this point we need to perform a very important check for kriging: whether we have any duplicated points. It may sometimes happen that there are points with identical coordinates. Kriging cannot handle this and fails with an error, generally in the form of a “singular matrix” message; most of the time this problem is indeed related to duplicated locations. So we now have to check whether we have duplicates here, using the function <code>zerodist</code>:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">dupl <- zerodist<span style="color: #009900;">(</span>ozoneSP<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
It turns out that we have a couple of duplicates, which we need to remove. We can do that directly in the two lines of code needed to create the data and temporal components for the <code>STIDF</code> object:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">ozoneDF <- <a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span>PPB=ozone.UTM$ozone_ppb<span style="color: #009900;">[</span>-dupl<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
In this line I created a <code>data.frame</code> with only one column, named <i>PPB</i>, containing the ozone observations in parts per billion. As you can see, I removed the duplicated points by excluding from the object <i>ozone.UTM</i> the rows whose indexes are stored in the second column of the object <i>dupl</i>. We can use the same trick while creating the temporal part:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">ozoneTM <- <a href="http://inside-r.org/r-doc/base/as.POSIXct"><span style="color: #003399; font-weight: bold;">as.POSIXct</span></a><span style="color: #009900;">(</span>ozone.UTM$TIME<span style="color: #009900;">[</span>-dupl<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span>tz=<span style="color: blue;">"CET"</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Now all we need to do is combine the objects <i>ozoneSP</i>, <i>ozoneDF</i> and <i>ozoneTM</i> into a <code>STIDF</code>:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">timeDF <- STIDF<span style="color: #009900;">(</span>ozoneSP<span style="color: #339933;">,</span>ozoneTM<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>=ozoneDF<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This is the file we are going to use to compute the variogram and perform the spatio-temporal interpolation. We can check the raw data contained in the <code>STIDF</code> object by using the spatio-temporal version of the function <code>spplot</code>, which is <code>stplot</code>:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">stplot<span style="color: #009900;">(</span>timeDF<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhd7r1TyYsgwIR_fG52hzBVQMWJ2LHHpzwurjydxp11WrYZXDzsj0M4Gn0W3Cv2H9UVUD0WQDQ3eTCw1oE7_KDEQiIL0uzSgtGf6jdvxAtojVsv-rrvVEteE2ezgf-gNSjP-1r-gVnjWcGy/s1600/Fig2.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhd7r1TyYsgwIR_fG52hzBVQMWJ2LHHpzwurjydxp11WrYZXDzsj0M4Gn0W3Cv2H9UVUD0WQDQ3eTCw1oE7_KDEQiIL0uzSgtGf6jdvxAtojVsv-rrvVEteE2ezgf-gNSjP-1r-gVnjWcGy/s640/Fig2.tiff" /></a></div>
<br />
<h3>
Variogram</h3>
<div style="text-align: justify;">
The actual computation of the variogram is at this point pretty simple: we just need to use the appropriate function, <code>variogramST</code>. Its use is similar to the standard function for spatial kriging, even though there are some settings for the temporal component that need to be included.
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a> <- variogramST<span style="color: #009900;">(</span>PPB~<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>=timeDF<span style="color: #339933;">,</span>tunit=<span style="color: blue;">"hours"</span><span style="color: #339933;">,</span>assumeRegular=F<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/stats/na.omit"><span style="color: #003399; font-weight: bold;">na.omit</span></a>=T<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
As you can see, the first part of the call to the function <code>variogramST</code> is identical to a normal call to the function <code>variogram</code>: we first have the formula and then the data source. However, we then have to specify the time unit (<code>tunit</code>) or the time lags (<code>tlags</code>). I found the documentation around this point a bit confusing, to be honest. I tested various combinations of parameters and the line of code I presented is the only one that gives me what appear to be good results. I presume that what I am telling the function is to aggregate the data to the hour, but I am not completely sure. I hope some of the readers can shed some light on this!!<br />
I must warn you that this operation takes quite a long time, so please be aware of that. I personally ran it overnight.</div>
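<br />
<div style="text-align: justify;">
For reference, an alternative I have not tested extensively is to set the temporal lags explicitly via <code>tlags</code>; the values below are purely illustrative:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">#Untested alternative: explicit temporal lags (0 to 6 hours, illustrative)
var2 <- variogramST(PPB~1,data=timeDF,tlags=0:6,tunit="hours",assumeRegular=F,na.omit=T) </pre>
</div>
</div>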
<br />
<br />
<h3>
Plotting the Variogram</h3>
<div style="text-align: justify;">
Basically the spatio-temporal version of the variogram includes different temporal lags. Thus what we end up with is not a single variogram but a series, which we can plot using the following line:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span>map=F<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
which returns the following image: </div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOVRczz-YddaAQ41WnNFTkMA0Tg2pwn4a7uXgw8hefBF0sJ9n5aEE1B3BfrSiNcQom_bkd6W_yQQK4cC-2GAWaVUuOg7CPZAgqKLpcAnVYNuqXG871PozgoFH5IBNbcDkTCvM8vZqwFhSh/s1600/Fig3.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOVRczz-YddaAQ41WnNFTkMA0Tg2pwn4a7uXgw8hefBF0sJ9n5aEE1B3BfrSiNcQom_bkd6W_yQQK4cC-2GAWaVUuOg7CPZAgqKLpcAnVYNuqXG871PozgoFH5IBNbcDkTCvM8vZqwFhSh/s640/Fig3.tiff" /></a></div>
<br />
<div style="text-align: justify;">
Among all the possible types of visualization for a spatio-temporal variogram, this is for me the easiest to understand, probably because I am used to seeing variogram models. However, there are also other ways to visualize it, such as the variogram map: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span>map=T<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj93tvy0BRAJJyeEsd_r2ShRZHfY9gSvJxEJbUIm2WOz1LZlYjbm13y_819KQDphUkxTqOLm_EKHDpAat1rETn9CeXDy-09pqZNnG51jMK08KDTyQLE-QDqvYlAQboHNeg6mrnppKj4rW9M/s1600/Fig5.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj93tvy0BRAJJyeEsd_r2ShRZHfY9gSvJxEJbUIm2WOz1LZlYjbm13y_819KQDphUkxTqOLm_EKHDpAat1rETn9CeXDy-09pqZNnG51jMK08KDTyQLE-QDqvYlAQboHNeg6mrnppKj4rW9M/s640/Fig5.tiff" /></a></div>
<br />
<div style="text-align: justify;">
And the 3D wireframe: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/lattice/wireframe"><span style="color: #003399; font-weight: bold;">wireframe</span></a>=T<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7Y5UfnE13SJJimSpwZ8gnyrLhKfv5zR32EpvcH1q-thcNK3gxE8l2EjYkKaMws9L0NQh1EfQu1oQXfkdFNA5zff-lUsD1QFQvlUUqAOFfFL21IdgYoavn9JfsXEPKHJW5TiLXY-6bSyOP/s1600/Fig4.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7Y5UfnE13SJJimSpwZ8gnyrLhKfv5zR32EpvcH1q-thcNK3gxE8l2EjYkKaMws9L0NQh1EfQu1oQXfkdFNA5zff-lUsD1QFQvlUUqAOFfFL21IdgYoavn9JfsXEPKHJW5TiLXY-6bSyOP/s640/Fig4.tiff" /></a></div>
<br />
<h3>
Variogram Modelling</h3>
<div style="text-align: justify;">
As in a normal 2D kriging experiment, at this point we need to fit a model to our variogram. To do so we will use the functions <code>vgmST</code> and <code>fit.StVariogram</code>, which are the spatio-temporal counterparts of <code>vgm</code> and <code>fit.variogram</code>.<br />
Below I present the code I used to fit all the models. For the automatic fitting I used most of the settings suggested in the following demo: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/demo"><span style="color: #003399; font-weight: bold;">demo</span></a><span style="color: #009900;">(</span>stkrige<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Regarding the variogram models, in <b>gstat</b> we have 5 options: separable, product sum, metric, sum metric, and simple sum metric. You can find more information about fitting these models, including all the equations presented below, in (Gräler et al., 2015), which is available as a pdf (I put the link in the "More Information" section).
</div>
<h4>
Separable</h4>
<div style="text-align: justify;">
This covariance model assumes separability between the spatial and the temporal component, meaning that the covariance function is given by: </div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLlKHdiNFgDWwYjfCusTUZh_p65MU2dnvrFif8HIpvPzAsSKvXeoh69RT4MC2taLQkPpYILJdufnokgRfCYWU4jkfQktcxVp3QNgcIkm6fIOBXj8oS_Q9UdOYaIJO9wZBGEQFnrhZnZieI/s1600/Eq4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLlKHdiNFgDWwYjfCusTUZh_p65MU2dnvrFif8HIpvPzAsSKvXeoh69RT4MC2taLQkPpYILJdufnokgRfCYWU4jkfQktcxVp3QNgcIkm6fIOBXj8oS_Q9UdOYaIJO9wZBGEQFnrhZnZieI/s640/Eq4.jpg" /></a></div>
<br />
<div style="text-align: justify;">
According to (Sherman, 2011): “While this model is relatively parsimonious and is nicely interpretable, there are many physical phenomena which do not satisfy the separability”. Many environmental processes for example do not satisfy the assumption of separability. This means that this model needs to be used carefully.<br />
The first thing to set is the upper and lower limits for all the variogram parameters, which are used during the automatic fitting: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># lower and upper bounds</span>
pars.l <- <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span>sill.s = <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> range.s = <span style="color: #cc66cc;">10</span><span style="color: #339933;">,</span> nugget.s = <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>sill.t = <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> range.t = <span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> nugget.t = <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>sill.st = <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> range.st = <span style="color: #cc66cc;">10</span><span style="color: #339933;">,</span> nugget.st = <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> anis = <span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span>
pars.u <- <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span>sill.s = <span style="color: #cc66cc;">200</span><span style="color: #339933;">,</span> range.s = <span style="color: #cc66cc;">1000</span><span style="color: #339933;">,</span> nugget.s = <span style="color: #cc66cc;">100</span><span style="color: #339933;">,</span>sill.t = <span style="color: #cc66cc;">200</span><span style="color: #339933;">,</span> range.t = <span style="color: #cc66cc;">60</span><span style="color: #339933;">,</span> nugget.t = <span style="color: #cc66cc;">100</span><span style="color: #339933;">,</span>sill.st = <span style="color: #cc66cc;">200</span><span style="color: #339933;">,</span> range.st = <span style="color: #cc66cc;">1000</span><span style="color: #339933;">,</span> nugget.st = <span style="color: #cc66cc;">100</span><span style="color: #339933;">,</span>anis = <span style="color: #cc66cc;">700</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
To create a separable variogram model we need to provide a model for the spatial component, one for the temporal component, plus the overall sill:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">separable <- vgmST<span style="color: #009900;">(</span><span style="color: blue;">"separable"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/packages/cran/space">space</a> = vgm<span style="color: #009900;">(</span>-<span style="color: #cc66cc;">60</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/stats/time"><span style="color: #003399; font-weight: bold;">time</span></a> = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">35</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> sill=<span style="color: #cc66cc;">0.56</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This line creates a basic variogram model, and we can check how it fits our data using the following line:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span>separable<span style="color: #339933;">,</span>map=F<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU-Ux6Fem20J2GCWjRLBQbuJ6LJqhzaY7t-QFlosj8AGHbQOwrIz0AmuwXP1MktZU7RDvcerzk1xQNpyTthqTyEF6PqeVSzVM1KlrXASP1Ef8FVRiUEmUCqpBoknfrFoYEt_EQdcrWTtsN/s1600/Fig6.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU-Ux6Fem20J2GCWjRLBQbuJ6LJqhzaY7t-QFlosj8AGHbQOwrIz0AmuwXP1MktZU7RDvcerzk1xQNpyTthqTyEF6PqeVSzVM1KlrXASP1Ef8FVRiUEmUCqpBoknfrFoYEt_EQdcrWTtsN/s640/Fig6.tiff" /></a></div>
<br />
<div style="text-align: justify;">
One thing you may notice is that the variogram parameters do not seem to have anything in common with the image shown above. I mean, in order to create this variogram model I had to set the sill of the spatial component at -60, which is total nonsense. However, I decided to fit this model by eye as best as I could, just to show you how to perform this type of fitting and calculate its error; but in this case it cannot be taken seriously. I also found that the starting parameters passed to <code>vgmST</code> do not make much difference for the automatic fit, so you probably do not have to worry too much about them.<br />
We can check how this model fits our data by using the function <code>fit.StVariogram</code> with the option <code>fit.method=0</code>, which keeps this model but calculates its Mean Squared Error (MSE), compared to the actual data: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> separable_Vgm <- fit.StVariogram<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span> separable<span style="color: #339933;">,</span> fit.method=<span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span>
> <a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>separable_Vgm<span style="color: #339933;">,</span><span style="color: blue;">"MSE"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">54.96278</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This is basically the error of the eye fit. However, we can also use the same function to automatically fit the separable model to our data (here I used the settings suggested in the demo):</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> separable_Vgm <- fit.StVariogram<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span> separable<span style="color: #339933;">,</span> fit.method=<span style="color: #cc66cc;">11</span><span style="color: #339933;">,</span>method=<span style="color: blue;">"L-BFGS-B"</span><span style="color: #339933;">,</span> stAni=<span style="color: #cc66cc;">5</span><span style="color: #339933;">,</span> lower=pars.l<span style="color: #339933;">,</span>upper=pars.u<span style="color: #009900;">)</span>
> <a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>separable_Vgm<span style="color: #339933;">,</span> <span style="color: blue;">"MSE"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">451.0745</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
As you can see, the error increases. This probably demonstrates that this model is not suitable for our data, even though with some magic we can create a pattern similar to what we see in the observations. In fact, if we check the fit by plotting the model, it becomes clear that this variogram cannot properly describe our data:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span>separable_Vgm<span style="color: #339933;">,</span>map=F<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghxPPmNKVBPt0AaT636mRb8t_nVEdrbK_OgKhGrC36ORHIHPVB1Hj8rLHa1HpcWq2ymDDZycnZhhXtdCad9HKsdTIW_lMNEPtsssXsABncRwtKXOYms84dyswUxRoNAw85Bh242gojAzkm/s1600/Fig7.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghxPPmNKVBPt0AaT636mRb8t_nVEdrbK_OgKhGrC36ORHIHPVB1Hj8rLHa1HpcWq2ymDDZycnZhhXtdCad9HKsdTIW_lMNEPtsssXsABncRwtKXOYms84dyswUxRoNAw85Bh242gojAzkm/s640/Fig7.tiff" /></a></div>
<br />
<div style="text-align: justify;">
To check the parameters of the model we can use the function <code>extractPar</code>:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> extractPar<span style="color: #009900;">(</span>separable_Vgm<span style="color: #009900;">)</span>
range.s nugget.s range.t nugget.t sill
<span style="color: #cc66cc;">199.999323</span> <span style="color: #cc66cc;">10.000000</span> <span style="color: #cc66cc;">99.999714</span> <span style="color: #cc66cc;">1.119817</span> <span style="color: #cc66cc;">17.236256</span> </pre>
</div>
</div>
<br />
<h4>
Product Sum</h4>
<div style="text-align: justify;">
A more flexible variogram model for spatio-temporal data is the product sum, which does not assume separability. The equation of its covariance model is given by:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPheZU-sUrv0XabB0AR8gjBHh9aAGlKqIqWH-vg9pZ-8u0jczsY5vr-FVO9QHZD95KJb3lOrLu1yn6XXZfKpmnCIivpxyACmc_56dflo7DWjRULfPurDKjK026DhFuGjqgaKDhyphenhyphen-_v3uMb/s1600/Eq5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPheZU-sUrv0XabB0AR8gjBHh9aAGlKqIqWH-vg9pZ-8u0jczsY5vr-FVO9QHZD95KJb3lOrLu1yn6XXZfKpmnCIivpxyACmc_56dflo7DWjRULfPurDKjK026DhFuGjqgaKDhyphenhyphen-_v3uMb/s640/Eq5.jpg" /></a></div>
<br />
<div style="text-align: justify;">
with <i>k</i> > 0.</div>
<div style="text-align: justify;">
In this case the function <code>vgmST</code> requires both the spatial and the temporal components, plus the value of the parameter <code>k</code> (which must be positive): </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">prodSumModel <- vgmST<span style="color: #009900;">(</span><span style="color: blue;">"productSum"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/packages/cran/space">space</a> = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: blue;">"Exp"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">150</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0.5</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/stats/time"><span style="color: #003399; font-weight: bold;">time</span></a> = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: blue;">"Exp"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">5</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0.5</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>k = <span style="color: #cc66cc;">50</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
I first tried to set <code>k = 5</code>, but R returned an error message saying that it needed to be positive, which I did not fully understand. However, with 50 it worked; as I mentioned, the automatic fit does not depend much on these initial values, and the upper and lower bounds we set before are probably more important.<br />
We can then proceed with the fitting process and we can check the MSE with the following two lines: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> prodSumModel_Vgm <- fit.StVariogram<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span> prodSumModel<span style="color: #339933;">,</span>method = <span style="color: blue;">"L-BFGS-B"</span><span style="color: #339933;">,</span>lower=pars.l<span style="color: #009900;">)</span>
> <a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>prodSumModel_Vgm<span style="color: #339933;">,</span> <span style="color: blue;">"MSE"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">215.6392</span> </pre>
</div>
</div>
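<br />
<div style="text-align: justify;">
For reference, the lower and upper bounds passed to <code>fit.StVariogram</code> (<code>pars.l</code> and <code>pars.u</code>, defined earlier in the post) look something like the sketch below. The exact values here are an assumption for illustration; what matters is that the vectors cover sill, range and nugget for the spatial, temporal and joint components, plus the anisotropy:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"># Lower and upper bounds for the L-BFGS-B optimizer (illustrative values)
pars.l <- c(sill.s = 0, range.s = 10, nugget.s = 0,
            sill.t = 0, range.t = 1, nugget.t = 0,
            sill.st = 0, range.st = 10, nugget.st = 0, anis = 0)
pars.u <- c(sill.s = 200, range.s = 1000, nugget.s = 100,
            sill.t = 200, range.t = 60, nugget.t = 100,
            sill.st = 200, range.st = 1000, nugget.st = 100, anis = 700) </pre>
</div>
</div>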
<br />
<div style="text-align: justify;">
This process returns the following model: </div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4edPbwTchcq28YVu7KvKvG7XEkWXhB0bFRj_VMFpke1Is22kGBSE0y32IHNFAnBWgI0w3Q5Uf5E_B-QkioRu7FwpkF5KztN1vY6UlgAmHb95nAFotvKhDhM-FQCquG4ulO7jCWz-MvZ8a/s1600/Fig8.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4edPbwTchcq28YVu7KvKvG7XEkWXhB0bFRj_VMFpke1Is22kGBSE0y32IHNFAnBWgI0w3Q5Uf5E_B-QkioRu7FwpkF5KztN1vY6UlgAmHb95nAFotvKhDhM-FQCquG4ulO7jCWz-MvZ8a/s640/Fig8.tiff" /></a></div>
<br />
<h4>
Metric</h4>
<div style="text-align: justify;">
This model assumes identical covariance functions for both the spatial and the temporal components, but includes a spatio-temporal anisotropy (<i>k</i>) that allows some flexibility.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr96BG0mgmKnfeB_4OCl3Rvtq0M56yyO29HSrqaeGQj2QU6V_wXm_sTKQ8jDaTpPaBbusgWQI0ZFEaLis4dN9cr9j2s9DYCiV5-AEaxA2ecKh1dqCCWTJo0F-aLbjLrww47d7CL4IrGaQ8/s1600/Eq6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr96BG0mgmKnfeB_4OCl3Rvtq0M56yyO29HSrqaeGQj2QU6V_wXm_sTKQ8jDaTpPaBbusgWQI0ZFEaLis4dN9cr9j2s9DYCiV5-AEaxA2ecKh1dqCCWTJo0F-aLbjLrww47d7CL4IrGaQ8/s640/Eq6.jpg" /></a></div>
<br />
<div style="text-align: justify;">
In this model all the distances (spatial, temporal and spatio-temporal) are treated equally, meaning that we only need to fit a joint variogram to all three. The only parameter we have to modify is the anisotropy <i>k</i>. In R <i>k</i> is named <code>stAni</code> and creating a metric model in <code>vgmST</code> can be done as follows:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">metric <- vgmST<span style="color: #009900;">(</span><span style="color: blue;">"metric"</span><span style="color: #339933;">,</span> joint = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">50</span><span style="color: #339933;">,</span><span style="color: blue;">"Mat"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> stAni=<span style="color: #cc66cc;">200</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
The automatic fit produces the following MSE:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> metric_Vgm <- fit.StVariogram<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span> metric<span style="color: #339933;">,</span> method=<span style="color: blue;">"L-BFGS-B"</span><span style="color: #339933;">,</span>lower=pars.l<span style="color: #009900;">)</span>
> <a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>metric_Vgm<span style="color: #339933;">,</span> <span style="color: blue;">"MSE"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">79.30172</span> </pre>
</div>
</div>
<br />
We can plot this model to visually check its accuracy: <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7vDeqz-cjMCWwEAj6PwBzts9dHNSpBgAw6aiCgmq8mwMHiyu2Ripmv1Q-BleKmyrJ1bknqRan_2IOzFijQi1N348Awu5rzj5IStJ4qru0M9jm1aji2nhWXik8hZWOpwKlck9wX_NteMw-/s1600/Fig10.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7vDeqz-cjMCWwEAj6PwBzts9dHNSpBgAw6aiCgmq8mwMHiyu2Ripmv1Q-BleKmyrJ1bknqRan_2IOzFijQi1N348Awu5rzj5IStJ4qru0M9jm1aji2nhWXik8hZWOpwKlck9wX_NteMw-/s640/Fig10.tiff" /></a></div>
<br />
<h4>
Sum Metric</h4>
<div style="text-align: justify;">
A more complex version of this model is the sum metric, which includes separate spatial and temporal covariance models, plus a joint component with the anisotropy: </div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBGcxA-Da9GgG3KPtU8FfxTe_8wV0FDwJsosEDTR0JBmcpF3BEINln3nivixQ6CWR5jqOkICvbdyLQMCg0kOAtZKAj5cEbj7Yax46UOlFFaw6g8SytpOy5DKkam3hj2VE5xIKei7wFHb1c/s1600/Eq7.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBGcxA-Da9GgG3KPtU8FfxTe_8wV0FDwJsosEDTR0JBmcpF3BEINln3nivixQ6CWR5jqOkICvbdyLQMCg0kOAtZKAj5cEbj7Yax46UOlFFaw6g8SytpOy5DKkam3hj2VE5xIKei7wFHb1c/s640/Eq7.jpg" /></a></div>
<br />
<div style="text-align: justify;">
This model allows maximum flexibility, since all the components can be set independently. In R this is achieved with the following line:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">sumMetric <- vgmST<span style="color: #009900;">(</span><span style="color: blue;">"sumMetric"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/packages/cran/space">space</a> = vgm<span style="color: #009900;">(</span>psill=<span style="color: #cc66cc;">5</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/range"><span style="color: #003399; font-weight: bold;">range</span></a>=<span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> nugget=<span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/stats/time"><span style="color: #003399; font-weight: bold;">time</span></a> = vgm<span style="color: #009900;">(</span>psill=<span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/range"><span style="color: #003399; font-weight: bold;">range</span></a>=<span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> nugget=<span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> joint = vgm<span style="color: #009900;">(</span>psill=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/range"><span style="color: #003399; font-weight: bold;">range</span></a>=<span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> nugget=<span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> stAni=<span style="color: #cc66cc;">500</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
The automatic fit can be done like so:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> sumMetric_Vgm <- fit.StVariogram<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span> sumMetric<span style="color: #339933;">,</span> method=<span style="color: blue;">"L-BFGS-B"</span><span style="color: #339933;">,</span>lower=pars.l<span style="color: #339933;">,</span>upper=pars.u<span style="color: #339933;">,</span>tunit=<span style="color: blue;">"hours"</span><span style="color: #009900;">)</span>
> <a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>sumMetric_Vgm<span style="color: #339933;">,</span> <span style="color: blue;">"MSE"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">58.98891</span> </pre>
</div>
</div>
<br />
Which creates the following model: <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJPUtQUrEzLi9SUb1JQuWxTChTkAcrBviS-cOamZwMNe-SjHiJgLLZInU5FiuGgQTPcLOuxrBuS75XicEck4Ajv0_8ujLTyAWm3_NjQKqKcw60K_02u_et-1YFcgFSm39aIGPPY4s18BFg/s1600/Fig9.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJPUtQUrEzLi9SUb1JQuWxTChTkAcrBviS-cOamZwMNe-SjHiJgLLZInU5FiuGgQTPcLOuxrBuS75XicEck4Ajv0_8ujLTyAWm3_NjQKqKcw60K_02u_et-1YFcgFSm39aIGPPY4s18BFg/s640/Fig9.tiff" /></a></div>
<br />
<h4>
Simple Sum Metric</h4>
<div style="text-align: justify;">
As the name suggests, this is a simpler version of the sum metric model. Instead of giving each component full flexibility, we restrict them to a single nugget. We still have to set all the other parameters, but the nugget of each individual component no longer matters: one nugget effect is set for all three:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">SimplesumMetric <- vgmST<span style="color: #009900;">(</span><span style="color: blue;">"simpleSumMetric"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/packages/cran/space">space</a> = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">5</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/stats/time"><span style="color: #003399; font-weight: bold;">time</span></a> = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> joint = vgm<span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: blue;">"Sph"</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> nugget=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> stAni=<span style="color: #cc66cc;">500</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
This returns a model similar to the sum metric:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">> SimplesumMetric_Vgm <- fit.StVariogram<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span> SimplesumMetric<span style="color: #339933;">,</span>method = <span style="color: blue;">"L-BFGS-B"</span><span style="color: #339933;">,</span>lower=pars.l<span style="color: #009900;">)</span>
> <a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>SimplesumMetric_Vgm<span style="color: #339933;">,</span> <span style="color: blue;">"MSE"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span> <span style="color: #cc66cc;">59.36172</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiHHb7cWgaov0L_jrlwpvY1CL4u7CX3UgJw3TWDXcT-fnrtJnVmAKJilkwb0-fXyu5NneOMFYFDU_a9hTkUQKgBAGUIzmH3uN-ymTjS3aYrYA5GHYSw_No04YbNzNM8Af5JCT7NLEXV7bq/s1600/Fig11.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiHHb7cWgaov0L_jrlwpvY1CL4u7CX3UgJw3TWDXcT-fnrtJnVmAKJilkwb0-fXyu5NneOMFYFDU_a9hTkUQKgBAGUIzmH3uN-ymTjS3aYrYA5GHYSw_No04YbNzNM8Af5JCT7NLEXV7bq/s640/Fig11.tiff" /></a></div>
<br />
<h4>
Choosing the Best Model</h4>
<div style="text-align: justify;">
We can visually compare all the models we fitted using wireframes in the following way:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span>separable_Vgm<span style="color: #339933;">,</span> prodSumModel_Vgm<span style="color: #339933;">,</span> metric_Vgm<span style="color: #339933;">,</span> sumMetric_Vgm<span style="color: #339933;">,</span> SimplesumMetric_Vgm<span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/all"><span style="color: #003399; font-weight: bold;">all</span></a>=T<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/lattice/wireframe"><span style="color: #003399; font-weight: bold;">wireframe</span></a>=T<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWucFwPyKVPrADvSEBmMLrOgBnnwieVDVXP0sZYXOdcCUCY5HXYn5Be5ZmVpm0c5xQBcnOzLbLdJ58IQt_gLFIjfgdjsBmQllIsZMVFPWFCpjcj1qc3b6VmPw6JojX1SQI_RGqZGS8Nuew/s1600/Fig12.tif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWucFwPyKVPrADvSEBmMLrOgBnnwieVDVXP0sZYXOdcCUCY5HXYn5Be5ZmVpm0c5xQBcnOzLbLdJ58IQt_gLFIjfgdjsBmQllIsZMVFPWFCpjcj1qc3b6VmPw6JojX1SQI_RGqZGS8Nuew/s400/Fig12.tif" /></a></div>
<br />
<div style="text-align: justify;">
The most important criterion for selecting the best model is certainly the MSE. Looking at these values, it is clear that the best model is the sum metric, with an error of around 59, so I will use it for kriging. </div>
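<br />
<div style="text-align: justify;">
The MSE comparison can also be done programmatically. The sketch below simply collects the fitted models from the previous sections in a named list and extracts the <code>MSE</code> attribute from each:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"># Collect the fitted models in a named list and extract their MSEs
models <- list(separable = separable_Vgm,
               productSum = prodSumModel_Vgm,
               metric = metric_Vgm,
               sumMetric = sumMetric_Vgm,
               simpleSumMetric = SimplesumMetric_Vgm)
sapply(models, function(m) attr(m, "MSE")) </pre>
</div>
</div>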
<br />
<h3>
Prediction Grid</h3>
<div style="text-align: justify;">
Since we are performing spatio-temporal interpolation, it is clear that we are interested in estimating new values in both space and time. For this reason we need to create a spatio-temporal prediction grid. In this case I first downloaded the road network for the area around Zurich, then I cropped it to match the extent of my study area, and then I created the spatial grid:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">roads <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"VEC25_str_l_Clip/VEC25_str_l.shp"</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This is the shapefile with the road network extracted from the Vector25 map of Switzerland. Unfortunately, for copyright reasons I cannot share it. This file is projected in CH93, the Swiss national projection. Since I wanted to perform a basic experiment, I decided not to include the whole network, but only the major roads, which in Switzerland are called Klass1. So the first thing I did was to extract from the <i>roads</i> object only the lines belonging to Klass1 streets:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Klass1 <- roads<span style="color: #009900;">[</span>roads$objectval==<span style="color: blue;">"1_Klass"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Then I changed the projection of this object from CH93 to UTM, so that it is comparable with what I used so far:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Klass1.UTM <- spTransform<span style="color: #009900;">(</span>Klass1<span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
Now I can crop this file so that I obtain only the roads within my study area. I can use the function <code>crop</code> from the package <b>raster</b>, with the object <i>ozone.UTM</i> that I created before:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Klass1.cropped <- crop<span style="color: #009900;">(</span>Klass1.UTM<span style="color: #339933;">,</span>ozone.UTM<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This gives me the road network around the locations where the data were collected. I can show you the results with the following two lines:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>Klass1.cropped<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>ozone.UTM<span style="color: #339933;">,</span>add=T<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"red"</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgodEOJFDai8qsCV2Z-Lk5A_gccswXp-etBeb1sovRhr6qozLoUgovj0MjDdQ3rwR9csYJswZpd_Ne4efp8BjaFx_hQuVQ3Yh775k3PZy2_uWOrEXwmGP8Y4Zcka1HUeI132ze-6cw0mNTY/s1600/Fig13.tif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgodEOJFDai8qsCV2Z-Lk5A_gccswXp-etBeb1sovRhr6qozLoUgovj0MjDdQ3rwR9csYJswZpd_Ne4efp8BjaFx_hQuVQ3Yh775k3PZy2_uWOrEXwmGP8Y4Zcka1HUeI132ze-6cw0mNTY/s400/Fig13.tif" /></a></div>
<br />
<div style="text-align: justify;">
Here the Klass1 roads are in black and the data points are shown in red. With this selection I can now use the function <code>spsample</code> to create a random grid of points along the road lines:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">sp.grid.UTM <- spsample<span style="color: #009900;">(</span>Klass1.cropped<span style="color: #339933;">,</span>n=<span style="color: #cc66cc;">1500</span><span style="color: #339933;">,</span>type=<span style="color: blue;">"random"</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This generates the following grid, which I think I can share with you in <code>RData</code> format (<a href="http://www.fabioveronesi.net/Blog/gridST.RData">gridST.RData</a>): </div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC17ggpqJ3VEqGbqzZPoBFm7EIXbIro6ZaVGH29yQzGJ4AKW1yGnd6NQpWia6ptWewwScyhE0HWCeUrmR1NNlDj92kkoZ1SF51EgFMBIVyegTmBornvJ636TpxRkPbRwS3FgXP2G4VFzni/s1600/Fig14.tif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC17ggpqJ3VEqGbqzZPoBFm7EIXbIro6ZaVGH29yQzGJ4AKW1yGnd6NQpWia6ptWewwScyhE0HWCeUrmR1NNlDj92kkoZ1SF51EgFMBIVyegTmBornvJ636TpxRkPbRwS3FgXP2G4VFzni/s400/Fig14.tif" /></a></div>
<br />
<div style="text-align: justify;">
As I mentioned, now we need to add a temporal component to this grid. We can do that again using the package <b>spacetime</b>. We first need to create a vector of Date/Times using the function <code>seq</code>:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">tm.grid <- <a href="http://inside-r.org/r-doc/base/seq"><span style="color: #003399; font-weight: bold;">seq</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/as.POSIXct"><span style="color: #003399; font-weight: bold;">as.POSIXct</span></a><span style="color: #009900;">(</span><span style="color: blue;">'2011-12-12 06:00 CET'</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/as.POSIXct"><span style="color: #003399; font-weight: bold;">as.POSIXct</span></a><span style="color: #009900;">(</span><span style="color: blue;">'2011-12-14 09:00 CET'</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>length.out=<span style="color: #cc66cc;">5</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
This creates a vector with 5 elements (<code>length.out=5</code>), with <code>POSIXct</code> values between the two Date/Times provided. Since we do not yet have any data values for these locations, we only need the spatio-temporal structure. We can therefore use the function <code>STF</code> to merge the spatial and temporal components into a spatio-temporal grid: </div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">grid.ST <- STF<span style="color: #009900;">(</span>sp.grid.UTM<span style="color: #339933;">,</span>tm.grid<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
This can be used as new data in the kriging function. <br />
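<br />
<div style="text-align: justify;">
As a quick sanity check, the number of space-time prediction locations is the number of spatial points (approximately the 1500 requested, since <code>spsample</code> does not guarantee an exact count) times the 5 time instants:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"># Number of spatial locations and time instants in the prediction grid
length(sp.grid.UTM)                     # ~1500 points along the roads
length(tm.grid)                         # 5 time instants
length(sp.grid.UTM) * length(tm.grid)   # total space-time locations</pre>
</div>
</div>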
<br />
<h3>
Kriging</h3>
<div style="text-align: justify;">
This is probably the easiest step in the whole process. We have now created the spatio-temporal data frame, computed the best variogram model and created the spatio-temporal prediction grid. All we need to do now is a simple call to the function <code>krigeST</code> to perform the interpolation:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">pred <- krigeST<span style="color: #009900;">(</span>PPB~<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>=timeDF<span style="color: #339933;">,</span> modelList=sumMetric_Vgm<span style="color: #339933;">,</span> newdata=grid.ST<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
We can plot the results again using the function <code>stplot</code>:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">stplot<span style="color: #009900;">(</span>pred<span style="color: #009900;">)</span> </pre>
</div>
</div>
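<br />
<div style="text-align: justify;">
If you need the predicted values themselves, for instance to export them or to compute summary statistics, the spatio-temporal object returned by <code>krigeST</code> can be converted into a plain data frame. This is a sketch; with the formula used above the prediction column is typically named <code>var1.pred</code>:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"># Convert the kriging result to a data frame with coordinates, times and predictions
pred.df <- as.data.frame(pred)
head(pred.df)

# Summary of the predicted ozone concentrations
summary(pred.df$var1.pred) </pre>
</div>
</div>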
<br />
<br />
<br />
<h2>
More information</h2>
<div style="text-align: justify;">
There are various tutorials available that offer examples and guidance on performing spatio-temporal kriging. For example, we can simply write:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/vignette"><span style="color: #003399; font-weight: bold;">vignette</span></a><span style="color: #009900;">(</span><span style="color: blue;">"st"</span><span style="color: #339933;">,</span> package = <span style="color: blue;">"gstat"</span><span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
and a pdf will open with some of the instructions I showed here. There is also a demo available:</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/demo"><span style="color: #003399; font-weight: bold;">demo</span></a><span style="color: #009900;">(</span>stkrige<span style="color: #009900;">)</span> </pre>
</div>
</div>
<br />
<div style="text-align: justify;">
In the article “Spatio-Temporal Interpolation using gstat”, Gräler et al. explain in detail the theory behind spatio-temporal kriging. The pdf of this article can be found here: <a href="https://cran.r-project.org/web/packages/gstat/vignettes/spatio-temporal-kriging.pdf">https://cran.r-project.org/web/packages/gstat/vignettes/spatio-temporal-kriging.pdf</a>. There are also some books and articles that I found useful for better understanding the topic; the references are listed at the end of the post. </div>
<br />
<br />
<h2>
References</h2>
Gräler, B., 2012. Different concepts of spatio-temporal kriging [WWW Document]. URL geostat-course.org/system/files/part01.pdf (accessed 8.18.15).<br />
<br />
Gräler, B., Pebesma, Edzer, Heuvelink, G., 2015. Spatio-Temporal Interpolation using gstat.<br />
<br />
Gräler, B., Rehr, M., Gerharz, L., Pebesma, E., 2013. Spatio-temporal analysis and interpolation of PM10 measurements in Europe for 2009.<br />
<br />
Oliver, M., Webster, R., Gerrard, J., 1989. Geostatistics in Physical Geography. Part I: Theory. Trans. Inst. Br. Geogr., New Series 14, 259–269. doi:10.2307/622687<br />
<br />
Sherman, M., 2011. Spatial statistics and spatio-temporal data: covariance functions and directional properties. John Wiley & Sons. <br />
<br />
<br />
<br />
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">All the code snippets were created by Pretty R at inside-R.org</a><br />
<br />
<h2>
Organize a walk around London with R</h2>
<i>Posted on 2015-06-21</i><br />
<br />
The subtitle of this post can be "<b>How to plot multiple elements on interactive web maps in R</b>".<br />
In this experiment I will show how to include multiple elements in interactive maps created using both plotGoogleMaps and leafletR. To complete the work presented here you would need the following packages: <b>sp</b>, <b>raster</b>, <b>plotGoogleMaps </b>and <b>leafletR</b>.<br />
<br />
I am going to use data from OpenStreetMap, which can be downloaded for free from this website: <a href="http://market.weogeo.com/datasets/osm-openstreetmap-planet.html" target="_blank">weogeo.com</a><br />
In particular, I downloaded the shapefile with the stores, the one with the tourist attractions, and the polyline shapefile with all the roads in London. I will assume that you want to spend a day or two walking around London, and for this you would need the location of some hotels, plus the locations of all the Greggs in the area, for lunch. The goal is to create a customized web map, containing all these elements, that you can take with you as you walk around the city. Here is how to build it.<br />
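Before running any of the snippets below, the four packages need to be loaded; a minimal setup sketch (install them first with install.packages() if they are missing):

```r
# Packages used throughout this post
library(sp)             # spatial classes (SpatialPointsDataFrame, SpatialLinesDataFrame)
library(raster)         # shapefile() and projection()
library(plotGoogleMaps) # interactive Google Maps from Spatial objects
library(leafletR)       # interactive Leaflet maps from GeoJSON files
```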
<br />
Once you have downloaded the shapefiles from weogeo.com you can open them and assign the correct projection with the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">stores <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"weogeo_j117529/data/shop_point.shp"</span><span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span>stores<span style="color: #009900;">)</span>=CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3857"</span><span style="color: #009900;">)</span>
roads <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"weogeo_j117529/data/route_line.shp"</span><span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span>roads<span style="color: #009900;">)</span>=CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3857"</span><span style="color: #009900;">)</span>
tourism <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"weogeo_j117529/data/tourism_point.shp"</span><span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span>tourism<span style="color: #009900;">)</span>=CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3857"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
To extract only the data we need for the map, we can use these lines:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Greggs <- stores<span style="color: #009900;">[</span>stores$NAME %in% <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Gregg's"</span><span style="color: #339933;">,</span><span style="color: blue;">"greggs"</span><span style="color: #339933;">,</span><span style="color: blue;">"Greggs"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>
Hotel <- tourism<span style="color: #009900;">[</span>tourism$TOURISM==<span style="color: blue;">"hotel"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>
Hotel <- Hotel<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/sample"><span style="color: #003399; font-weight: bold;">sample</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span>:<a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>Hotel<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>
Footpaths <- roads<span style="color: #009900;">[</span>roads$ROUTE==<span style="color: blue;">"foot"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span></pre>
</div>
</div>
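The %in% subsetting used above works on any attribute table; a quick base-R sketch with a made-up NAME column (toy data, not the OSM shapefile) illustrates the idea:

```r
# Hypothetical attribute table mimicking stores@data
stores_df <- data.frame(NAME = c("Greggs", "Tesco", "greggs", "Gregg's", "Boots"),
                        stringsAsFactors = FALSE)

# Keep only the rows whose NAME matches one of the spellings of Greggs
greggs_df <- stores_df[stores_df$NAME %in% c("Gregg's", "greggs", "Greggs"), , drop = FALSE]
nrow(greggs_df)  # 3
```

With a Spatial object the same expression subsets the geometries together with the attribute table, which is what happens in the code above.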
<br />
<br />
<b><span style="font-size: large;">plotGoogleMaps</span></b><br />
I created three objects: two contain points (Greggs and Hotel), while the last is of class <i>SpatialLinesDataFrame</i>. We already saw how to plot Spatial objects with plotGoogleMaps; here the only difference is that we need to create several maps and then link them together.<br />
Let's take a look at the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Greggs.google <- plotGoogleMaps<span style="color: #009900;">(</span>Greggs<span style="color: #339933;">,</span>iconMarker=<a href="http://inside-r.org/r-doc/base/rep"><span style="color: #003399; font-weight: bold;">rep</span></a><span style="color: #009900;">(</span><span style="color: blue;">"http://local-insiders.com/wp-content/themes/localinsiders/includes/img/tag_icon_food.png"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>Greggs<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>mapTypeId=<span style="color: blue;">"ROADMAP"</span><span style="color: #339933;">,</span>add=T<span style="color: #339933;">,</span>flat=T<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a>=F<span style="color: #339933;">,</span>layerName=<span style="color: blue;">"Gregg's"</span><span style="color: #339933;">,</span>fitBounds=F<span style="color: #339933;">,</span>zoom=<span style="color: #cc66cc;">13</span><span style="color: #009900;">)</span>
Hotel.google <- plotGoogleMaps<span style="color: #009900;">(</span>Hotel<span style="color: #339933;">,</span>iconMarker=<a href="http://inside-r.org/r-doc/base/rep"><span style="color: #003399; font-weight: bold;">rep</span></a><span style="color: #009900;">(</span><span style="color: blue;">"http://www.linguistics.ucsb.edu/projects/weal/images/hotel.png"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>Hotel<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>mapTypeId=<span style="color: blue;">"ROADMAP"</span><span style="color: #339933;">,</span>add=T<span style="color: #339933;">,</span>flat=T<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a>=F<span style="color: #339933;">,</span>layerName=<span style="color: blue;">"Hotels"</span><span style="color: #339933;">,</span>previousMap=Greggs.google<span style="color: #009900;">)</span>
plotGoogleMaps<span style="color: #009900;">(</span>Footpaths<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"dark green"</span><span style="color: #339933;">,</span>mapTypeId=<span style="color: blue;">"ROADMAP"</span><span style="color: #339933;">,</span>filename=<span style="color: blue;">"Multiple_Objects_GoogleMaps.html"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a>=F<span style="color: #339933;">,</span>previousMap=Hotel.google<span style="color: #339933;">,</span>layerName=<span style="color: blue;">"Footpaths"</span><span style="color: #339933;">,</span>strokeWeight=<span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
As you can see, I first create two objects using the same function, and then I call the same function again to draw and save the map. I can link the three maps together using the options <i>add=T</i> and <i>previousMap</i>.<br />
We need to be <u>careful</u> here though, because the use of the option <i>add</i> differs from the standard plot function. With <b>plot</b> I call the function for the first element and then, if I want to add a second, I call it again with the option <i>add=T</i>. <u>Here this option needs to go in the first and second calls, not in the last.</u> Basically, we are telling R not to close the plot because later on we are going to add elements to it. In the last line we do not put <i>add=T</i>, thus telling R to go ahead and close the plot.<br />
<br />
Another important option is <i>previousMap</i>, which is used from the second plot onwards to link the various elements. This option references the previous object: in <i>Hotel.google</i> I reference the map in <i>Greggs.google</i>, while in the last call I reference the previous <i>Hotel.google</i>, not the very first map.<br />
<br />
The zoom level, if you want to set it, goes only in the first plot.<br />
<br />
Another thing I changed compared to the last example is the addition of custom icons to the plot, using the option <i>iconMarker</i>. This takes a vector of icons, not just one, with the same length as the <i>SpatialObject</i> to be plotted. That is why I use the function <b>rep</b>: to create a vector in which the same URL is repeated as many times as there are features in the object.<br />
The icon can be whatever image you like. You can find a collection of free icons on this website: <a href="http://kml4earth.appspot.com/icons.html" target="_blank">http://kml4earth.appspot.com/icons.html</a> <br />
<br />
The result is the map below, available here: <a href="http://www.fabioveronesi.net/Blog/Multiple_Objects_GoogleMaps.html" target="_blank">Multiple_Objects_GoogleMaps.html</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUK45Czlv35fyNLziUOHnRXj7AHPZF5GA-GVDg54cLVJg4kJPmSptRwX2EC2cWOwBTOu6etlTWRQ2LsSkv2l7V5wIxOVLy7ZEdHy5GwWLw2vaRfRzi83zsxk8lbDCZ1ffHvbfXt1cQeHUl/s1600/Immagine.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUK45Czlv35fyNLziUOHnRXj7AHPZF5GA-GVDg54cLVJg4kJPmSptRwX2EC2cWOwBTOu6etlTWRQ2LsSkv2l7V5wIxOVLy7ZEdHy5GwWLw2vaRfRzi83zsxk8lbDCZ1ffHvbfXt1cQeHUl/s640/Immagine.jpg" width="640" /></a></div>
<br />
<br />
<br />
<b><span style="font-size: large;">leafletR</span></b><br />
We can do the same thing using leafletR. We first need to create <i>GeoJSON </i>files for each element of the map using the following lines:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Greggs.geojson <- toGeoJSON<span style="color: #009900;">(</span>Greggs<span style="color: #009900;">)</span>
Hotel.geojson <- toGeoJSON<span style="color: #009900;">(</span>Hotel<span style="color: #009900;">)</span>
Footpaths.geojson <- toGeoJSON<span style="color: #009900;">(</span>Footpaths<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
Now we need to set the style for each element. For this task we are going to use the function <b>styleSingle</b>, which defines a single style for all the elements of the <i>GeoJSON</i>. This differs from the map in a previous post, in which we used the function <b>styleGrad </b>to create graduated colors depending on certain features of the dataset.<br />
We can change the icons of the elements in <b>leafletR </b>using the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Greggs.style <- styleSingle<span style="color: #009900;">(</span>marker=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"fast-food"</span><span style="color: #339933;">,</span> <span style="color: blue;">"red"</span><span style="color: #339933;">,</span> <span style="color: blue;">"s"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
Hotel.style <- styleSingle<span style="color: #009900;">(</span>marker=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"lodging"</span><span style="color: #339933;">,</span> <span style="color: blue;">"blue"</span><span style="color: #339933;">,</span> <span style="color: blue;">"s"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
Footpaths.style <- styleSingle<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"darkred"</span><span style="color: #339933;">,</span>lwd=<span style="color: #cc66cc;">4</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
As you can see, the option <i>marker</i> takes a vector with the name of the icon, its color and its size ("s" for small, "m" for medium or "l" for large). The names of the icons can be found here: <a href="https://www.mapbox.com/maki/" target="_blank">https://www.mapbox.com/maki/</a>; if you hover the mouse over an icon, the last piece of information shown is the name to use here. The style of the lines is set using the two options <i>col</i> and <i>lwd</i> (line width).<br />
<br />
Then we can simply use the function <b>leaflet </b>to set the various elements and styles of the map:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">leaflet<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span>Greggs.geojson<span style="color: #339933;">,</span>Hotel.geojson<span style="color: #339933;">,</span>Footpaths.geojson<span style="color: #009900;">)</span><span style="color: #339933;">,</span>style=<a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span>Greggs.style<span style="color: #339933;">,</span>Hotel.style<span style="color: #339933;">,</span>Footpaths.style<span style="color: #009900;">)</span><span style="color: #339933;">,</span>popup=<a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"NAME"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"NAME"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"OPERATOR"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>base.map=<span style="color: blue;">"osm"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
The result is the image below and the map available here: <a href="http://www.fabioveronesi.net/Blog/map.html" target="_blank">http://www.fabioveronesi.net/Blog/map.html</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTBpyL0L7y0KSwGoxmjkr1KbA75GRd0JQyxQmhpYV3lsCnvXb2U-cEXTxqSQpraQ_P9PlxLVuDIjhzOj_YnzbN9aVqQnV4_DBtbXnRS1YXEdpEhzZZCnBNZ3k1TCKuD4TEOeRyRwiLwNIG/s1600/Immagine2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTBpyL0L7y0KSwGoxmjkr1KbA75GRd0JQyxQmhpYV3lsCnvXb2U-cEXTxqSQpraQ_P9PlxLVuDIjhzOj_YnzbN9aVqQnV4_DBtbXnRS1YXEdpEhzZZCnBNZ3k1TCKuD4TEOeRyRwiLwNIG/s640/Immagine2.jpg" width="640" /></a></div>
<br />
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">R code snippets created by Pretty R at inside-R.org</a><br />
Fabio Veronesi<br />
<br />
<h2>
Cluster analysis on earthquake data from USGS</h2>
Published 2015-06-01.<br />
<b><span style="font-size: large;">Theoretical Background</span></b><br />
<div class="MsoNormal">
In some cases we would like to classify the events in our dataset based on their spatial location or on some other variable. As an example, we can return to the epidemiological scenario in which we want to determine whether the spread of a certain disease is affected by the presence of a particular source of pollution. With the G function we are able to determine quantitatively that our dataset is clustered, meaning that the events are not driven by chance but by some external factor. Now we need to verify that there is indeed a cluster of points located around the source of pollution; to do so, we need a way of classifying the points.</div>
<div class="MsoNormal">
Cluster analysis refers to a series of techniques that allow the subdivision of a dataset into subgroups, based on their similarities (James et al., 2013). There are various clustering methods, but probably the most common is k-means clustering. This technique aims at partitioning the data into a specific number of clusters, defined <i style="mso-bidi-font-style: normal;">a priori</i> by the user, by minimizing the within-cluster variation. The within-cluster variation measures how much each event in a cluster k differs from the others in the same cluster. The most common way to compute the differences is using the squared Euclidean distance (James et al., 2013), calculated as follows:</div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOrsEZZs47PyilGRErebw0S8503lrxVlRAtJ5O59nyWtVZGiGsGu1v68xgiAMc5Eubdb12JRRBo1ZWS5FIbWeyNtNLa_qd__oL3LhV1u95hQqD8Kc8pbsTpcY9FzgEebgrth5XZV37g4lT/s1600/Eq1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOrsEZZs47PyilGRErebw0S8503lrxVlRAtJ5O59nyWtVZGiGsGu1v68xgiAMc5Eubdb12JRRBo1ZWS5FIbWeyNtNLa_qd__oL3LhV1u95hQqD8Kc8pbsTpcY9FzgEebgrth5XZV37g4lT/s640/Eq1.png" width="640" /></a></div>
<div class="MsoNormal">
Where W_k (I use the underscore to indicate subscripts) is the within-cluster variation for the cluster k, n_k is the total number of elements in the cluster k, p is the total number of variables we are considering for clustering, and x_ij is the value of variable j for event i in cluster k. This equation seems complex, but it is actually quite easy to understand. To better understand what it means in practice, we can take a look at the figure below. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiktmmXiExEBfKH-g7WBof7W-6bSzgRlUodjmAJQqVI-MpjDWpsqJ5QA5WkB38XwYQXVxofgS3Mz0aFoKP0_rhhWyFZC2fN3XW9phRVAXArsa5A2zxhbSHRG0sWq2pvrU6DqVDnap449oEC/s1600/Fig8.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="568" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiktmmXiExEBfKH-g7WBof7W-6bSzgRlUodjmAJQqVI-MpjDWpsqJ5QA5WkB38XwYQXVxofgS3Mz0aFoKP0_rhhWyFZC2fN3XW9phRVAXArsa5A2zxhbSHRG0sWq2pvrU6DqVDnap449oEC/s640/Fig8.tiff" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For the sake of the argument, we can assume that all the events in this point pattern are located in one single cluster k; therefore n_k is 15. Since we are clustering events based on their geographical location, we are working with two variables, i.e. latitude and longitude, so p is equal to two. To calculate the variance for one single pair of points in cluster k, we simply compute the difference between the first point’s value of the first variable, i.e. its latitude, and the second point’s value of the same variable; we do the same for the second variable, square both differences and sum them. So the variance between points 1 and 2 is calculated as follows:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
(x_11 − x_21)² + (x_12 − x_22)², i.e. the squared difference between the two latitudes plus the squared difference between the two longitudes.</div>
<br />
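These quantities are easy to verify numerically. Below is a small base-R sketch (toy coordinates, not the earthquake data) that computes the squared Euclidean distance for the pair of points 1 and 2, the full within-cluster variation W_k, and checks the result against a standard identity and against <b>kmeans</b>:

```r
set.seed(1)
# Toy cluster k: n_k = 15 events, p = 2 variables (longitude and latitude)
xy <- cbind(lon = runif(15), lat = runif(15))

# Squared Euclidean distance between events 1 and 2:
# the squared differences of the two variables, summed
d12 <- sum((xy[1, ] - xy[2, ])^2)

# Within-cluster variation W_k: the sum of squared distances over all
# ordered pairs (i, i'), divided by the number of events n_k
D2  <- as.matrix(dist(xy))^2   # matrix of all pairwise squared distances
W_k <- sum(D2) / nrow(xy)

# Standard identity: W_k equals twice the sum of squared deviations
# from the cluster centroid
ss <- sum(sweep(xy, 2, colMeans(xy))^2)
all.equal(W_k, 2 * ss)          # TRUE

# k-means minimizes exactly this quantity: with a single cluster,
# tot.withinss is the sum of squared deviations from the centroid
km <- kmeans(xy, centers = 1)
all.equal(km$tot.withinss, ss)  # TRUE
```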
<div class="MsoNormal">
<!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="false"
DefSemiHidden="false" DefQFormat="false" DefPriority="99"
LatentStyleCount="371">
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Colorful 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Colorful 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Colorful 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Columns 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Columns 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Columns 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Columns 4"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Columns 5"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 4"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 5"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 6"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 7"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Grid 8"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 4"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 5"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 6"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 7"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table List 8"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table 3D effects 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table 3D effects 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table 3D effects 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Contemporary"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Elegant"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Professional"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Subtle 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Subtle 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Web 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Web 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Web 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Balloon Text"/>
<w:LsdException Locked="false" Priority="39" Name="Table Grid"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Table Theme"/>
<w:LsdException Locked="false" SemiHidden="true" Name="Placeholder Text"/>
<w:LsdException Locked="false" Priority="1" QFormat="true" Name="No Spacing"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading"/>
<w:LsdException Locked="false" Priority="61" Name="Light List"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading Accent 1"/>
<w:LsdException Locked="false" Priority="61" Name="Light List Accent 1"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid Accent 1"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1 Accent 1"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2 Accent 1"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1 Accent 1"/>
<w:LsdException Locked="false" SemiHidden="true" Name="Revision"/>
<w:LsdException Locked="false" Priority="34" QFormat="true"
Name="List Paragraph"/>
<w:LsdException Locked="false" Priority="29" QFormat="true" Name="Quote"/>
<w:LsdException Locked="false" Priority="30" QFormat="true"
Name="Intense Quote"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2 Accent 1"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1 Accent 1"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2 Accent 1"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3 Accent 1"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List Accent 1"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading Accent 1"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List Accent 1"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid Accent 1"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading Accent 2"/>
<w:LsdException Locked="false" Priority="61" Name="Light List Accent 2"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid Accent 2"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1 Accent 2"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2 Accent 2"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1 Accent 2"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2 Accent 2"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1 Accent 2"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2 Accent 2"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3 Accent 2"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List Accent 2"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading Accent 2"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List Accent 2"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid Accent 2"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading Accent 3"/>
<w:LsdException Locked="false" Priority="61" Name="Light List Accent 3"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid Accent 3"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1 Accent 3"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2 Accent 3"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1 Accent 3"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2 Accent 3"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1 Accent 3"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2 Accent 3"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3 Accent 3"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List Accent 3"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading Accent 3"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List Accent 3"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid Accent 3"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading Accent 4"/>
<w:LsdException Locked="false" Priority="61" Name="Light List Accent 4"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid Accent 4"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1 Accent 4"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2 Accent 4"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1 Accent 4"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2 Accent 4"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1 Accent 4"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2 Accent 4"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3 Accent 4"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List Accent 4"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading Accent 4"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List Accent 4"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid Accent 4"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading Accent 5"/>
<w:LsdException Locked="false" Priority="61" Name="Light List Accent 5"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid Accent 5"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1 Accent 5"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2 Accent 5"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1 Accent 5"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2 Accent 5"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1 Accent 5"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2 Accent 5"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3 Accent 5"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List Accent 5"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading Accent 5"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List Accent 5"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid Accent 5"/>
<w:LsdException Locked="false" Priority="60" Name="Light Shading Accent 6"/>
<w:LsdException Locked="false" Priority="61" Name="Light List Accent 6"/>
<w:LsdException Locked="false" Priority="62" Name="Light Grid Accent 6"/>
<w:LsdException Locked="false" Priority="63" Name="Medium Shading 1 Accent 6"/>
<w:LsdException Locked="false" Priority="64" Name="Medium Shading 2 Accent 6"/>
<w:LsdException Locked="false" Priority="65" Name="Medium List 1 Accent 6"/>
<w:LsdException Locked="false" Priority="66" Name="Medium List 2 Accent 6"/>
<w:LsdException Locked="false" Priority="67" Name="Medium Grid 1 Accent 6"/>
<w:LsdException Locked="false" Priority="68" Name="Medium Grid 2 Accent 6"/>
<w:LsdException Locked="false" Priority="69" Name="Medium Grid 3 Accent 6"/>
<w:LsdException Locked="false" Priority="70" Name="Dark List Accent 6"/>
<w:LsdException Locked="false" Priority="71" Name="Colorful Shading Accent 6"/>
<w:LsdException Locked="false" Priority="72" Name="Colorful List Accent 6"/>
<w:LsdException Locked="false" Priority="73" Name="Colorful Grid Accent 6"/>
<w:LsdException Locked="false" Priority="19" QFormat="true"
Name="Subtle Emphasis"/>
<w:LsdException Locked="false" Priority="21" QFormat="true"
Name="Intense Emphasis"/>
<w:LsdException Locked="false" Priority="31" QFormat="true"
Name="Subtle Reference"/>
<w:LsdException Locked="false" Priority="32" QFormat="true"
Name="Intense Reference"/>
<w:LsdException Locked="false" Priority="33" QFormat="true" Name="Book Title"/>
<w:LsdException Locked="false" Priority="37" SemiHidden="true"
UnhideWhenUsed="true" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="TOC Heading"/>
<w:LsdException Locked="false" Priority="41" Name="Plain Table 1"/>
<w:LsdException Locked="false" Priority="42" Name="Plain Table 2"/>
<w:LsdException Locked="false" Priority="43" Name="Plain Table 3"/>
<w:LsdException Locked="false" Priority="44" Name="Plain Table 4"/>
<w:LsdException Locked="false" Priority="45" Name="Plain Table 5"/>
<w:LsdException Locked="false" Priority="40" Name="Grid Table Light"/>
<w:LsdException Locked="false" Priority="46" Name="Grid Table 1 Light"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark"/>
<w:LsdException Locked="false" Priority="51" Name="Grid Table 6 Colorful"/>
<w:LsdException Locked="false" Priority="52" Name="Grid Table 7 Colorful"/>
<w:LsdException Locked="false" Priority="46"
Name="Grid Table 1 Light Accent 1"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2 Accent 1"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3 Accent 1"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4 Accent 1"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark Accent 1"/>
<w:LsdException Locked="false" Priority="51"
Name="Grid Table 6 Colorful Accent 1"/>
<w:LsdException Locked="false" Priority="52"
Name="Grid Table 7 Colorful Accent 1"/>
<w:LsdException Locked="false" Priority="46"
Name="Grid Table 1 Light Accent 2"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2 Accent 2"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3 Accent 2"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4 Accent 2"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark Accent 2"/>
<w:LsdException Locked="false" Priority="51"
Name="Grid Table 6 Colorful Accent 2"/>
<w:LsdException Locked="false" Priority="52"
Name="Grid Table 7 Colorful Accent 2"/>
<w:LsdException Locked="false" Priority="46"
Name="Grid Table 1 Light Accent 3"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2 Accent 3"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3 Accent 3"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4 Accent 3"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark Accent 3"/>
<w:LsdException Locked="false" Priority="51"
Name="Grid Table 6 Colorful Accent 3"/>
<w:LsdException Locked="false" Priority="52"
Name="Grid Table 7 Colorful Accent 3"/>
<w:LsdException Locked="false" Priority="46"
Name="Grid Table 1 Light Accent 4"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2 Accent 4"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3 Accent 4"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4 Accent 4"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark Accent 4"/>
<w:LsdException Locked="false" Priority="51"
Name="Grid Table 6 Colorful Accent 4"/>
<w:LsdException Locked="false" Priority="52"
Name="Grid Table 7 Colorful Accent 4"/>
<w:LsdException Locked="false" Priority="46"
Name="Grid Table 1 Light Accent 5"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2 Accent 5"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3 Accent 5"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4 Accent 5"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark Accent 5"/>
<w:LsdException Locked="false" Priority="51"
Name="Grid Table 6 Colorful Accent 5"/>
<w:LsdException Locked="false" Priority="52"
Name="Grid Table 7 Colorful Accent 5"/>
<w:LsdException Locked="false" Priority="46"
Name="Grid Table 1 Light Accent 6"/>
<w:LsdException Locked="false" Priority="47" Name="Grid Table 2 Accent 6"/>
<w:LsdException Locked="false" Priority="48" Name="Grid Table 3 Accent 6"/>
<w:LsdException Locked="false" Priority="49" Name="Grid Table 4 Accent 6"/>
<w:LsdException Locked="false" Priority="50" Name="Grid Table 5 Dark Accent 6"/>
<w:LsdException Locked="false" Priority="51"
Name="Grid Table 6 Colorful Accent 6"/>
<w:LsdException Locked="false" Priority="52"
Name="Grid Table 7 Colorful Accent 6"/>
<w:LsdException Locked="false" Priority="46" Name="List Table 1 Light"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark"/>
<w:LsdException Locked="false" Priority="51" Name="List Table 6 Colorful"/>
<w:LsdException Locked="false" Priority="52" Name="List Table 7 Colorful"/>
<w:LsdException Locked="false" Priority="46"
Name="List Table 1 Light Accent 1"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2 Accent 1"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3 Accent 1"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4 Accent 1"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark Accent 1"/>
<w:LsdException Locked="false" Priority="51"
Name="List Table 6 Colorful Accent 1"/>
<w:LsdException Locked="false" Priority="52"
Name="List Table 7 Colorful Accent 1"/>
<w:LsdException Locked="false" Priority="46"
Name="List Table 1 Light Accent 2"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2 Accent 2"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3 Accent 2"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4 Accent 2"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark Accent 2"/>
<w:LsdException Locked="false" Priority="51"
Name="List Table 6 Colorful Accent 2"/>
<w:LsdException Locked="false" Priority="52"
Name="List Table 7 Colorful Accent 2"/>
<w:LsdException Locked="false" Priority="46"
Name="List Table 1 Light Accent 3"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2 Accent 3"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3 Accent 3"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4 Accent 3"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark Accent 3"/>
<w:LsdException Locked="false" Priority="51"
Name="List Table 6 Colorful Accent 3"/>
<w:LsdException Locked="false" Priority="52"
Name="List Table 7 Colorful Accent 3"/>
<w:LsdException Locked="false" Priority="46"
Name="List Table 1 Light Accent 4"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2 Accent 4"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3 Accent 4"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4 Accent 4"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark Accent 4"/>
<w:LsdException Locked="false" Priority="51"
Name="List Table 6 Colorful Accent 4"/>
<w:LsdException Locked="false" Priority="52"
Name="List Table 7 Colorful Accent 4"/>
<w:LsdException Locked="false" Priority="46"
Name="List Table 1 Light Accent 5"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2 Accent 5"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3 Accent 5"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4 Accent 5"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark Accent 5"/>
<w:LsdException Locked="false" Priority="51"
Name="List Table 6 Colorful Accent 5"/>
<w:LsdException Locked="false" Priority="52"
Name="List Table 7 Colorful Accent 5"/>
<w:LsdException Locked="false" Priority="46"
Name="List Table 1 Light Accent 6"/>
<w:LsdException Locked="false" Priority="47" Name="List Table 2 Accent 6"/>
<w:LsdException Locked="false" Priority="48" Name="List Table 3 Accent 6"/>
<w:LsdException Locked="false" Priority="49" Name="List Table 4 Accent 6"/>
<w:LsdException Locked="false" Priority="50" Name="List Table 5 Dark Accent 6"/>
<w:LsdException Locked="false" Priority="51"
Name="List Table 6 Colorful Accent 6"/>
<w:LsdException Locked="false" Priority="52"
Name="List Table 7 Colorful Accent 6"/>
</w:LatentStyles>
</xml><![endif]-->
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL-_kKFY2KpUZMKZENdg6MJeKIsbUe_akaQyf6TKtlU7gmPFBdOaRaTP4c-pydBPlRqXQ-L09LdUG55S4H6iKaKOXqZW66EtZu4YAg2HERstdJMhzl0Fq-WQ_E3h0YmH3HnoHyBp3mI5Df/s1600/eq2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="30" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL-_kKFY2KpUZMKZENdg6MJeKIsbUe_akaQyf6TKtlU7gmPFBdOaRaTP4c-pydBPlRqXQ-L09LdUG55S4H6iKaKOXqZW66EtZu4YAg2HERstdJMhzl0Fq-WQ_E3h0YmH3HnoHyBp3mI5Df/s640/eq2.png" width="640" /></a></div>
<div class="MsoNormal">
where V_(1:2) is the variance between the two points. Clearly, geographical position is not the only variable we can use to partition events in a point pattern; for example, we can divide earthquakes based on their magnitude. The two equations can therefore be adapted to include additional variables: the only difference is the length of the linear equation that needs to be solved to compute the variation between two points. A potential issue is the number of equations that must be solved to reach a solution; this, however, is something the k-means algorithm handles very efficiently.</div>
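<div class="MsoNormal">
As a minimal sketch of this idea, the snippet below runs k-means on simulated event locations plus a magnitude column (the data are made up for illustration, not a real catalogue). Scaling the variables first keeps magnitude and position on comparable footing in the within-cluster variance:</div>
<br />

```r
# Hypothetical illustration: partitioning a point pattern with k-means
# using coordinates plus an extra variable (e.g. earthquake magnitude).
# The data below are simulated, not a real catalogue.
set.seed(42)
quakes.df <- data.frame(
  lon = runif(200, -120, -115),
  lat = runif(200, 32, 36),
  mag = rgamma(200, shape = 2, scale = 1.5)
)

# Scale the variables so that magnitude and position
# contribute comparably to the within-cluster variance,
# then partition the events into 4 groups.
clusters <- kmeans(scale(quakes.df), centers = 4, nstart = 25)

# Each event is now assigned to one of the 4 clusters.
table(clusters$cluster)
```

<div class="MsoNormal">
The same call works with any number of extra columns; adding a variable simply lengthens the vector over which k-means minimises the within-cluster variance.</div>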
<br />
<div class="MsoNormal">
<!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="false"
DefSemiHidden="false" DefQFormat="false" DefPriority="99"
LatentStyleCount="371">
<w:LsdException Locked="false" Priority="0" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 3"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 4"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 5"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 6"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 7"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 8"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 9"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 1"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 2"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 3"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 4"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 5"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 6"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 7"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 8"/>
<w:LsdException Locked="false" Priority="39" SemiHidden="true"
UnhideWhenUsed="true" Name="toc 9"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="Normal Indent"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="footnote text"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="annotation text"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="header"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="footer"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index heading"/>
<w:LsdException Locked="false" Priority="35" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="caption"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="table of figures"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="envelope address"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="envelope return"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="footnote reference"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="annotation reference"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="line number"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="page number"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="endnote reference"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="endnote text"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="table of authorities"/>
</w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<![endif]-->
</div>
<div class="MsoNormal">
The algorithm starts by randomly assigning each event to a cluster; it then calculates the mean centre of each cluster (we looked at the mean centre in the post: <a href="http://r-video-tutorial.blogspot.ch/2015/05/introductory-point-pattern-analysis-of.html" target="_blank">Introductory Point Pattern Analysis of Open Crime Data in London</a>). Next it calculates the Euclidean distance between each event and each mean centre, and reassigns every event to the cluster with the closest mean centre; the mean centres are then recalculated, and the process repeats until the cluster memberships stop changing. As an example we can look at the figure below, assuming we want to divide the events into two clusters.</div>
<div class="MsoNormal">
<span style="mso-fareast-font-family: "Times New Roman"; mso-fareast-theme-font: minor-fareast;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhg1_CWmKEWZ6uW7UKgvheVfMyT-kzsgfjuSDQodA3gkxWMLxbRnDmULwfY1dJ3j_8TC_QdHngqAnqNaKuPBQ0arTsmuMJg-0s44X-frLGa7O8wJjIwYGb58Btmx2sS4dhTM80Cl4EzKkLL/s1600/Fig11.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="630" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhg1_CWmKEWZ6uW7UKgvheVfMyT-kzsgfjuSDQodA3gkxWMLxbRnDmULwfY1dJ3j_8TC_QdHngqAnqNaKuPBQ0arTsmuMJg-0s44X-frLGa7O8wJjIwYGb58Btmx2sS4dhTM80Cl4EzKkLL/s640/Fig11.tiff" width="640" /></a></div>
<div class="MsoNormal">
<span style="mso-fareast-font-family: "Times New Roman"; mso-fareast-theme-font: minor-fareast;"><br /></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In Step 1 the algorithm assigns each event to a cluster at random. It then computes the mean centres of the two clusters (Step 2), shown as the large black and red circles. Next the algorithm calculates the Euclidean distance between each event and the two mean centres and reassigns the events to new clusters based on the closest mean centre (Step 3): if a point started in cluster one but is closer to the mean centre of cluster two, it is reassigned to the latter. Subsequently the mean centres are computed again for the new clusters (Step 4). This process repeats until the cluster memberships stop changing.</div>
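The iteration described above can be sketched in a few lines of R. This is a toy example on random coordinates, purely for illustration (the point matrix and the number of clusters are made up, and no check is made for clusters that become empty); it is not the code used later in the post:

```r
# Toy sketch of the k-means iteration described above
set.seed(123)
pts <- cbind(x=runif(50), y=runif(50))           # 50 random events

k <- 2
cluster <- sample(1:k, nrow(pts), replace=TRUE)  # Step 1: random assignment

repeat{
  # Steps 2 and 4: mean centre of each cluster
  centres <- apply(pts, 2, function(col) tapply(col, cluster, mean))
  # Step 3: Euclidean distance of every event from each mean centre,
  # then reassignment to the cluster with the closest centre
  D <- sapply(1:k, function(i)
    sqrt(rowSums((pts - matrix(centres[i,], nrow(pts), 2, byrow=TRUE))^2)))
  new.cluster <- apply(D, 1, which.min)
  if(all(new.cluster == cluster)) break          # memberships stopped changing
  cluster <- new.cluster
}

# In practice the built-in function does all of this:
# kmeans(pts, centers=2)
```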
<br />
<div class="MsoNormal">
<span style="font-size: large;"><b>Practical Example</b></span></div>
<div class="MsoNormal">
In this experiment we will look at a very simple exercise of cluster analysis of seismic events downloaded from the USGS website. To complete this exercise you will need the following packages: <b>sp</b>, <b>raster</b>, <b>plotrix</b>, <b>rgeos</b>, <b>rgdal</b> and <b>scatterplot3d</b>.</div>
<div class="MsoNormal">
I already mentioned in the post <a href="http://r-video-tutorial.blogspot.ch/2015/04/downloading-and-visualizing-seismic_28.html" target="_blank">Downloading and Visualizing Seismic Events from USGS </a>how to download the open data from the United States Geological Survey, so I will not repeat the process. The code for that is the following.</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">URL <- <span style="color: blue;">"http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"</span>
Earthquake_30Days <- <a href="http://inside-r.org/r-doc/utils/read.table"><span style="color: #003399; font-weight: bold;">read.table</span></a><span style="color: #009900;">(</span>URL<span style="color: #339933;">,</span> sep = <span style="color: blue;">","</span><span style="color: #339933;">,</span> header = T<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">#Download, unzip and load the polygon shapefile with the countries' borders</span>
<a href="http://inside-r.org/r-doc/utils/download.file"><span style="color: #003399; font-weight: bold;">download.file</span></a><span style="color: #009900;">(</span><span style="color: blue;">"http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip"</span><span style="color: #339933;">,</span>destfile=<span style="color: blue;">"TM_WORLD_BORDERS_SIMPL-0.3.zip"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/unzip"><span style="color: #003399; font-weight: bold;">unzip</span></a><span style="color: #009900;">(</span><span style="color: blue;">"TM_WORLD_BORDERS_SIMPL-0.3.zip"</span><span style="color: #339933;">,</span>exdir=<a href="http://inside-r.org/r-doc/base/getwd"><span style="color: #003399; font-weight: bold;">getwd</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
polygons <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"TM_WORLD_BORDERS_SIMPL-0.3.shp"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
I also included the code to download the shapefile with the borders of all countries.<br />
<br />
For the cluster analysis I would like to try to divide the seismic events by origin. In other words, I would like to see if there is a way to distinguish between events close to plates, volcanoes or other faults. In many cases the distinction is hard to make, since many volcanoes originate from subduction (e.g. in the Andes), where plates and volcanoes are close to one another and the algorithm may find it difficult to distinguish the origins. In any case, I would like to explore the use of cluster analysis to see what the algorithm is able to do.<br />
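As a quick preview of this idea, a plain k-means run on the raw event coordinates looks like the sketch below. It assumes the <code>Earthquake_30Days</code> table loaded with the download code above, with its standard <code>longitude</code> and <code>latitude</code> columns; the choice of three clusters is arbitrary, purely for illustration:

```r
# Hypothetical preview: k-means on the raw event coordinates
# (assumes Earthquake_30Days from the USGS download code above)
coords <- na.omit(Earthquake_30Days[, c("longitude", "latitude")])
clusters <- kmeans(coords, centers=3)

# Plot the events coloured by cluster membership
plot(coords, col=clusters$cluster, pch=16, cex=0.5)
```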
<br />
Clearly the first thing we need to do is download data regarding the location of plates, faults and volcanoes. We can find shapefiles with these information at the following website: <a href="http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/" target="_blank">http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/</a><br />
<br />
The data are provided in zip files, so we need to extract them and load them into R. There are some legal restrictions on the use of these data: they are distributed by ESRI and can be used in conjunction with the book "Mapping Our World: GIS Lessons for Educators". Details of the license and other information may be found here: <a href="http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Earthquakes/plat_lin.htm#getacopy" target="_blank">http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Earthquakes/plat_lin.htm#getacopy</a><br />
<br />
If you have the rights to download and use these data for your studies, you can download them directly from the web with the following code. We already looked at code to do this in previous posts, so I will not go into detail here:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/dir.create"><span style="color: #003399; font-weight: bold;">dir.create</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/getwd"><span style="color: #003399; font-weight: bold;">getwd</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: blue;">"/GeologicalData"</span><span style="color: #339933;">,</span>sep=<span style="color: blue;">""</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">#Faults</span>
<a href="http://inside-r.org/r-doc/utils/download.file"><span style="color: #003399; font-weight: bold;">download.file</span></a><span style="color: #009900;">(</span><span style="color: blue;">"http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/FAULTS.zip"</span><span style="color: #339933;">,</span>destfile=<span style="color: blue;">"GeologicalData/FAULTS.zip"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/unzip"><span style="color: #003399; font-weight: bold;">unzip</span></a><span style="color: #009900;">(</span><span style="color: blue;">"GeologicalData/FAULTS.zip"</span><span style="color: #339933;">,</span>exdir=<span style="color: blue;">"GeologicalData"</span><span style="color: #009900;">)</span>
faults <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"GeologicalData/FAULTS.SHP"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">#Plates</span>
<a href="http://inside-r.org/r-doc/utils/download.file"><span style="color: #003399; font-weight: bold;">download.file</span></a><span style="color: #009900;">(</span><span style="color: blue;">"http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/PLAT_LIN.zip"</span><span style="color: #339933;">,</span>destfile=<span style="color: blue;">"GeologicalData/plates.zip"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/unzip"><span style="color: #003399; font-weight: bold;">unzip</span></a><span style="color: #009900;">(</span><span style="color: blue;">"GeologicalData/plates.zip"</span><span style="color: #339933;">,</span>exdir=<span style="color: blue;">"GeologicalData"</span><span style="color: #009900;">)</span>
plates <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"GeologicalData/PLAT_LIN.SHP"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">#Volcano</span>
<a href="http://inside-r.org/r-doc/utils/download.file"><span style="color: #003399; font-weight: bold;">download.file</span></a><span style="color: #009900;">(</span><span style="color: blue;">"http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/VOLCANO.zip"</span><span style="color: #339933;">,</span>destfile=<span style="color: blue;">"GeologicalData/VOLCANO.zip"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/unzip"><span style="color: #003399; font-weight: bold;">unzip</span></a><span style="color: #009900;">(</span><span style="color: blue;">"GeologicalData/VOLCANO.zip"</span><span style="color: #339933;">,</span>exdir=<span style="color: blue;">"GeologicalData"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/datasets/volcano"><span style="color: #003399; font-weight: bold;">volcano</span></a> <- shapefile<span style="color: #009900;">(</span><span style="color: blue;">"GeologicalData/VOLCANO.SHP"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
The only piece of code I have not presented before is the first line, which creates a new folder. It is pretty self-explanatory: we just need to build a string with the name of the folder and R will create it. The rest of the code downloads the data from the address above, unzips the archives and loads them into R.<br />
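As a side note, <i>dir.create</i> throws a warning if the folder already exists. A minimal sketch of a safer pattern, using the same folder name as above, is the following:<br />

```r
# Create the folder only if it does not exist yet;
# file.path() builds the path without pasting "/" by hand
out.dir <- file.path(getwd(), "GeologicalData")
if(!dir.exists(out.dir)){
  dir.create(out.dir)
}
```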
<br />
We have not yet transformed the object <i>Earthquake_30Days</i>, which is currently a <i>data.frame</i>, into a <i>SpatialPointsDataFrame</i>. The data from USGS contain seismic events that are not only earthquakes, but also events related to mining and other causes. For this analysis we want to keep only the events classified as earthquakes, which we can do with the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Earthquakes <- Earthquake_30Days<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>Earthquake_30Days$type<span style="color: #009900;">)</span>==<span style="color: blue;">"earthquake"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>
coordinates<span style="color: #009900;">(</span>Earthquakes<span style="color: #009900;">)</span>=~longitude+latitude</pre>
</div>
</div>
<br />
This extracts only the earthquakes and transforms the object into a <i>SpatialObject</i>.<br />
<br />
<br />
We can create a map that shows the earthquakes alongside all the other geological elements we downloaded with the following code, which saves the image directly as a JPEG:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/grDevices/jpeg"><span style="color: #003399; font-weight: bold;">jpeg</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Earthquake_Origin.jpg"</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4000</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2000</span><span style="color: #339933;">,</span>res=<span style="color: #cc66cc;">300</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>plates<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"red"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>polygons<span style="color: #339933;">,</span>add=T<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/title"><span style="color: #003399; font-weight: bold;">title</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Earthquakes in the last 30 days"</span><span style="color: #339933;">,</span>cex.main=<span style="color: #cc66cc;">3</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/lines"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">(</span>faults<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"dark grey"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/points"><span style="color: #003399; font-weight: bold;">points</span></a><span style="color: #009900;">(</span>Earthquakes<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"blue"</span><span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.5</span><span style="color: #339933;">,</span>pch=<span style="color: blue;">"+"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/points"><span style="color: #003399; font-weight: bold;">points</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/volcano"><span style="color: #003399; font-weight: bold;">volcano</span></a><span style="color: #339933;">,</span>pch=<span style="color: blue;">"*"</span><span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.7</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"dark red"</span><span style="color: #009900;">)</span>
legend.pos <- <a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span>x=<span style="color: #cc66cc;">20.97727</span><span style="color: #339933;">,</span>y=-<span style="color: #cc66cc;">57.86364</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a><span style="color: #009900;">(</span>legend.pos<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Plates"</span><span style="color: #339933;">,</span><span style="color: blue;">"Faults"</span><span style="color: #339933;">,</span><span style="color: blue;">"Volcanoes"</span><span style="color: #339933;">,</span><span style="color: blue;">"Earthquakes"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>pch=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"-"</span><span style="color: #339933;">,</span><span style="color: blue;">"-"</span><span style="color: #339933;">,</span><span style="color: blue;">"*"</span><span style="color: #339933;">,</span><span style="color: blue;">"+"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"red"</span><span style="color: #339933;">,</span><span style="color: blue;">"dark grey"</span><span style="color: #339933;">,</span><span style="color: blue;">"dark red"</span><span style="color: #339933;">,</span><span style="color: blue;">"blue"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>bty=<span style="color: blue;">"n"</span><span style="color: #339933;">,</span>bg=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"white"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>y.intersp=<span style="color: #cc66cc;">0.75</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/title"><span style="color: #003399; font-weight: bold;">title</span></a>=<span style="color: blue;">"Days from Today"</span><span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.8</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/text"><span style="color: #003399; font-weight: bold;">text</span></a><span style="color: #009900;">(</span>legend.pos$x<span style="color: #339933;">,</span>legend.pos$y+<span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: blue;">"Legend:"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/grDevices/dev.off"><span style="color: #003399; font-weight: bold;">dev.off</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
This code is very similar to what I used <a href="http://r-video-tutorial.blogspot.ch/2015/04/downloading-and-visualizing-seismic_28.html" target="_blank">here</a>, so I will not explain it in detail. We just added more elements to the plot, and therefore we need to remember that R plots in layers, one on top of the other, depending on the order in which they appear in the code. For example, as you can see from the code, the first thing we plot is the plates, which will therefore sit below everything else, even the borders of the polygons, which come second. You can change this simply by changing the order of the lines; just remember to use the option <i>add=T</i> correctly.<br />
The result is the image below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2oGnDd_q_wroJEJ7NFe8hRGHTw4Gxv491WzFXDbQk2LdWUjQF54VmfWDUiFsHpz2Cm4SKhTD98HaFbZR3fEMHaLR62GaoOWuHHEV1ooRUXSY-5kDeUbVJttne3viuYEHoKEph2wxAoTXQ/s1600/Earthquake_Origin.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2oGnDd_q_wroJEJ7NFe8hRGHTw4Gxv491WzFXDbQk2LdWUjQF54VmfWDUiFsHpz2Cm4SKhTD98HaFbZR3fEMHaLR62GaoOWuHHEV1ooRUXSY-5kDeUbVJttne3viuYEHoKEph2wxAoTXQ/s640/Earthquake_Origin.jpg" width="640" /></a></div>
<br />
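To see this layering behaviour in isolation, here is a minimal self-contained sketch with synthetic data, where each call draws on top of the previous ones:<br />

```r
# Base layer: the first plot() call opens the device and draws first
set.seed(1)
plot(runif(50), runif(50), pch=16, cex=2, col="grey")
# Each subsequent call is drawn on top of the existing layers
abline(h=0.5, col="red", lwd=3)
points(0.5, 0.5, pch="+", cex=3, col="blue")
```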
Before proceeding with the cluster analysis we first need to fix the projections of the <i>SpatialObjects</i>. Luckily, the object <i>polygons</i> was created from a shapefile with projection data attached, so we can use it to tell R that the other objects have the same projection:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">projection<span style="color: #009900;">(</span>faults<span style="color: #009900;">)</span>=projection<span style="color: #009900;">(</span>polygons<span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/volcano"><span style="color: #003399; font-weight: bold;">volcano</span></a><span style="color: #009900;">)</span>=projection<span style="color: #009900;">(</span>polygons<span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span>Earthquakes<span style="color: #009900;">)</span>=projection<span style="color: #009900;">(</span>polygons<span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span>plates<span style="color: #009900;">)</span>=projection<span style="color: #009900;">(</span>polygons<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
Now we can proceed with the cluster analysis. As I said, I would like to try to classify earthquakes based on their distance from the various geological features. To calculate this distance we can use the function <b>gDistance</b> in the package <b>rgeos</b>. <br />
These shapefiles are all unprojected and their coordinates are in degrees. We cannot use them directly with the function <b>gDistance</b>, because it works only with projected data, so we need to transform them using the function <b>spTransform</b> (in the package <b>rgdal</b>). This function takes two arguments: the first is the <i>SpatialObject</i>, which needs to have projection information attached, and the second is a <i>CRS</i> object describing the projection to transform the object into. The code for doing that is the following:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">volcanoUTM <- spTransform<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/volcano"><span style="color: #003399; font-weight: bold;">volcano</span></a><span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
faultsUTM <- spTransform<span style="color: #009900;">(</span>faults<span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
EarthquakesUTM <- spTransform<span style="color: #009900;">(</span>Earthquakes<span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
platesUTM <- spTransform<span style="color: #009900;">(</span>plates<span style="color: #339933;">,</span>CRS<span style="color: #009900;">(</span><span style="color: blue;">"+init=epsg:3395"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
The projection we are going to use is the standard World Mercator (EPSG:3395), details here: <a href="http://spatialreference.org/ref/epsg/wgs-84-world-mercator/" target="_blank">http://spatialreference.org/ref/epsg/wgs-84-world-mercator/</a><br />
<br />
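A quick sanity check after the transformation (this sketch assumes the objects created above are still in the workspace): the transformed coordinates should be in metres, i.e. far outside the [-180, 180] degree range of the originals:<br />

```r
# Coordinates before (degrees) and after (metres, EPSG:3395)
head(coordinates(Earthquakes))
head(coordinates(EarthquakesUTM))
```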
<b>NOTE</b>:<br />
the plates object also contains lines along the borders of the image above. This is something R cannot deal with, so I had to remove them manually in ArcGIS. If you want to replicate this experiment you will have to do the same. I do not know of any method to do this quickly in R; if you know one, please let me know in the comment section.<br />
<br />
<br />
We are going to create a matrix of distances between each earthquake and the geological features with the following loop:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">distance.matrix <- <a href="http://inside-r.org/r-doc/base/matrix"><span style="color: #003399; font-weight: bold;">matrix</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>Earthquakes<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">7</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/dimnames"><span style="color: #003399; font-weight: bold;">dimnames</span></a>=<a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Lat"</span><span style="color: #339933;">,</span><span style="color: blue;">"Lon"</span><span style="color: #339933;">,</span><span style="color: blue;">"Mag"</span><span style="color: #339933;">,</span><span style="color: blue;">"Depth"</span><span style="color: #339933;">,</span><span style="color: blue;">"DistV"</span><span style="color: #339933;">,</span><span style="color: blue;">"DistF"</span><span style="color: #339933;">,</span><span style="color: blue;">"DistP"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<span style="color: black; font-weight: bold;">for</span><span style="color: #009900;">(</span>i <span style="color: black; font-weight: bold;">in</span> <span style="color: #cc66cc;">1</span>:<a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">{</span>
<a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a> <- EarthquakesUTM<span style="color: #009900;">[</span>i<span style="color: #339933;">,</span><span style="color: #009900;">]</span>
dist.v <- gDistance<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #339933;">,</span>volcanoUTM<span style="color: #009900;">)</span>
dist.f <- gDistance<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #339933;">,</span>faultsUTM<span style="color: #009900;">)</span>
dist.p <- gDistance<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a><span style="color: #339933;">,</span>platesUTM<span style="color: #009900;">)</span>
distance.matrix<span style="color: #009900;">[</span>i<span style="color: #339933;">,</span><span style="color: #009900;">]</span> <- <a href="http://inside-r.org/r-doc/base/matrix"><span style="color: #003399; font-weight: bold;">matrix</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a>@coords<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a>$mag<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/sub"><span style="color: #003399; font-weight: bold;">sub</span></a>$depth<span style="color: #339933;">,</span>dist.v<span style="color: #339933;">,</span>dist.f<span style="color: #339933;">,</span>dist.p<span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/ncol"><span style="color: #003399; font-weight: bold;">ncol</span></a>=<span style="color: #cc66cc;">7</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span>
distDF <- <a href="http://inside-r.org/r-doc/base/as.data.frame"><span style="color: #003399; font-weight: bold;">as.data.frame</span></a><span style="color: #009900;">(</span>distance.matrix<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<br />
In this code we first create an empty matrix. This is usually wise, since R allocates all the RAM it needs upfront, and filling a pre-allocated matrix is also faster than growing a new one from inside the loop. In the loop we iterate through the earthquakes and for each one we calculate its distance to the geological features. Finally, we convert the <i>matrix</i> into a <i>data.frame</i>.<br />
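The difference between pre-allocating and growing an object can be seen with a small self-contained sketch:<br />

```r
n <- 5000
# Pre-allocated: the matrix is created once and filled in place
t1 <- system.time({
  pre <- matrix(0, n, 3)
  for(i in 1:n) pre[i,] <- c(i, i^2, sqrt(i))
})
# Grown: rbind() copies the whole matrix at every iteration
t2 <- system.time({
  grown <- NULL
  for(i in 1:n) grown <- rbind(grown, c(i, i^2, sqrt(i)))
})
t1; t2  # the pre-allocated version is substantially faster
```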
<br />
The next step is finding the correct number of clusters. To do that we can follow the approach suggested by Matthew Peeples here: <a href="http://www.mattpeeples.net/kmeans.html" target="_blank">http://www.mattpeeples.net/kmeans.html</a> and also discussed in this stackoverflow post: <a href="http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters" target="_blank">http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters</a><br />
<br />
The code for that is the following:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">mydata <- <a href="http://inside-r.org/r-doc/base/scale"><span style="color: #003399; font-weight: bold;">scale</span></a><span style="color: #009900;">(</span>distDF<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span>:<span style="color: #cc66cc;">7</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span>
wss <- <span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>mydata<span style="color: #009900;">)</span>-<span style="color: #cc66cc;">1</span><span style="color: #009900;">)</span>*<a href="http://inside-r.org/r-doc/base/sum"><span style="color: #003399; font-weight: bold;">sum</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/apply"><span style="color: #003399; font-weight: bold;">apply</span></a><span style="color: #009900;">(</span>mydata<span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/stats/var"><span style="color: #003399; font-weight: bold;">var</span></a><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<span style="color: black; font-weight: bold;">for</span> <span style="color: #009900;">(</span>i <span style="color: black; font-weight: bold;">in</span> <span style="color: #cc66cc;">2</span>:<span style="color: #cc66cc;">15</span><span style="color: #009900;">)</span> wss<span style="color: #009900;">[</span>i<span style="color: #009900;">]</span> <- <a href="http://inside-r.org/r-doc/base/sum"><span style="color: #003399; font-weight: bold;">sum</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/kmeans"><span style="color: #003399; font-weight: bold;">kmeans</span></a><span style="color: #009900;">(</span>mydata<span style="color: #339933;">,</span>
centers=i<span style="color: #009900;">)</span>$withinss<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span>:<span style="color: #cc66cc;">15</span><span style="color: #339933;">,</span> wss<span style="color: #339933;">,</span> type=<span style="color: blue;">"b"</span><span style="color: #339933;">,</span> xlab=<span style="color: blue;">"Number of Clusters"</span><span style="color: #339933;">,</span>
ylab=<span style="color: blue;">"Within groups sum of squares"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
We basically compute the clustering for between 2 and 15 clusters and plot the number of clusters against the "within groups sum of squares", which is the quantity minimized during the clustering process. Generally this quantity decreases very quickly up to a point, and then basically stops decreasing. We can see this behaviour in the plot below, generated from the earthquake data:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxyk4oE35lmKvnWx7wEncueo59sihOSbFwC135GuLvVzZFG1IQghjeLsG4K8_Avyg2FfQY76YyTLD1H3ksD9baMjOeo8Q2yKFwi9fqsVMcvNAJZ5FhMPJlfFkzXUlZ4retfWypnUn534S_/s1600/Cluster_Selection.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxyk4oE35lmKvnWx7wEncueo59sihOSbFwC135GuLvVzZFG1IQghjeLsG4K8_Avyg2FfQY76YyTLD1H3ksD9baMjOeo8Q2yKFwi9fqsVMcvNAJZ5FhMPJlfFkzXUlZ4retfWypnUn534S_/s640/Cluster_Selection.jpeg" width="640" /></a></div>
<br />
As you can see, for 1 and 2 clusters the sum of squares is high and decreases quickly; it keeps decreasing between 3 and 5, and then becomes erratic. So probably the best number of clusters is 5, but clearly this is an empirical method, so we should check other numbers and test whether they make more sense.<br />
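To convince ourselves that the elbow really appears at the true number of groups, we can run the same procedure on synthetic data with three well-separated clusters:<br />

```r
# Three synthetic 2D clusters centred at 0, 5 and 10
set.seed(123)
synth <- rbind(matrix(rnorm(100, mean=0), ncol=2),
               matrix(rnorm(100, mean=5), ncol=2),
               matrix(rnorm(100, mean=10), ncol=2))
wss.synth <- (nrow(synth)-1)*sum(apply(synth, 2, var))
for(i in 2:10) wss.synth[i] <- sum(kmeans(synth, centers=i)$withinss)
# The curve should drop steeply up to 3 clusters and flatten afterwards
plot(1:10, wss.synth, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
```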
<br />
To create the clusters we can simply use the function <b>kmeans</b>, which takes two arguments: the data and the number of clusters:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">clust <- <a href="http://inside-r.org/r-doc/stats/kmeans"><span style="color: #003399; font-weight: bold;">kmeans</span></a><span style="color: #009900;">(</span>mydata<span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span><span style="color: #009900;">)</span>
distDF$Clusters <- clust$cluster</pre>
</div>
</div>
<br />
We can check the physical meaning of the clusters by plotting them against the distance from the geological features using the function <b>scatterplot3d</b>, in the package <b>scatterplot3d</b>:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/packages/cran/scatterplot3d">scatterplot3d</a><span style="color: #009900;">(</span>distDF$DistV<span style="color: #339933;">,</span>xlab=<span style="color: blue;">"Distance to Volcano"</span><span style="color: #339933;">,</span>distDF$DistF<span style="color: #339933;">,</span>ylab=<span style="color: blue;">"Distance to Fault"</span><span style="color: #339933;">,</span>distDF$DistP<span style="color: #339933;">,</span>zlab=<span style="color: blue;">"Distance to Plate"</span><span style="color: #339933;">,</span> color = clust$cluster<span style="color: #339933;">,</span>pch=<span style="color: #cc66cc;">16</span><span style="color: #339933;">,</span>angle=<span style="color: #cc66cc;">120</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/scale"><span style="color: #003399; font-weight: bold;">scale</span></a>=<span style="color: #cc66cc;">0.5</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/grid"><span style="color: #003399; font-weight: bold;">grid</span></a>=T<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/box"><span style="color: #003399; font-weight: bold;">box</span></a>=F<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
This function is very similar to the standard <b>plot </b>function, but it takes three arguments instead of just two. I wrote the line of code distinguishing between the three axis to better understand it. So we have the variable for x, and the corresponding axis label, and so on for each axis. Then we set the colours based on clusters, and the symbol with <i>pch</i>, as we would do in <b>plot</b>. The last options are only available here: we have the <span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: 'Times New Roman'; font-size: xx-small; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><i>angle </i>between x and y axis, the <i>scale </i>of the z axis compared to the other two, then we plot a <i>grid </i>on the xy plane and we do not plot a <i>box </i>all around the plot. The result is the following image:</span><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOraGx0th7IXMtJGC76Im8N9qAR_NlC1zO04Kmo1VG7LWQOvKCL-6jlyhx0cZukg3-I_dSp8L7ygK4LVEh804gTKBEHmTRTKczmP5rse0yvAvNLN2vljKRsoVzt-jGdGtEogX9NPDhyphenhyphen1Or/s1600/Scatterplot3D.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOraGx0th7IXMtJGC76Im8N9qAR_NlC1zO04Kmo1VG7LWQOvKCL-6jlyhx0cZukg3-I_dSp8L7ygK4LVEh804gTKBEHmTRTKczmP5rse0yvAvNLN2vljKRsoVzt-jGdGtEogX9NPDhyphenhyphen1Or/s640/Scatterplot3D.jpeg" width="640" /></a></div>
<br />
It seems that the red and green clusters are very similar; they differ only in that the red cluster is closer to volcanoes than to faults, and vice-versa for the green. The black cluster mainly sits farther away from volcanoes. Finally, the blue and light-blue clusters seem to be close to volcanoes and far away from the other two features.<br />
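This visual reading can be checked numerically against the cluster centres, i.e. the mean distance to each feature per cluster. A minimal sketch, again on simulated distances in place of the real <i>distDF</i> and <i>clust</i>:

```r
# Simulated stand-ins for distDF and clust from earlier in the post
set.seed(1)
distDF <- data.frame(DistV=runif(500, 0, 500),
                     DistF=runif(500, 0, 300),
                     DistP=runif(500, 0, 200))
clust <- kmeans(distDF, centers=5)

# Mean distance to volcanoes, faults and plates for each cluster;
# with the real data these means should mirror the 3D scatterplot
round(aggregate(distDF, by=list(Cluster=clust$cluster), FUN=mean), 1)
```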
<br />
We can create an image with the clusters using the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">clustSP <- SpatialPointsDataFrame<span style="color: #009900;">(</span>coords=Earthquakes@coords<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>=<a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span>Clusters=clust$cluster<span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/grDevices/jpeg"><span style="color: #003399; font-weight: bold;">jpeg</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Earthquake_Clusters.jpg"</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4000</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2000</span><span style="color: #339933;">,</span>res=<span style="color: #cc66cc;">300</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>plates<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"red"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>polygons<span style="color: #339933;">,</span>add=T<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/title"><span style="color: #003399; font-weight: bold;">title</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Earthquakes in the last 30 days"</span><span style="color: #339933;">,</span>cex.main=<span style="color: #cc66cc;">3</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/lines"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">(</span>faults<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"dark grey"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/points"><span style="color: #003399; font-weight: bold;">points</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/volcano"><span style="color: #003399; font-weight: bold;">volcano</span></a><span style="color: #339933;">,</span>pch=<span style="color: blue;">"x"</span><span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.5</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"yellow"</span><span style="color: #009900;">)</span>
legend.pos <- <a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span>x=<span style="color: #cc66cc;">20.97727</span><span style="color: #339933;">,</span>y=-<span style="color: #cc66cc;">57.86364</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/points"><span style="color: #003399; font-weight: bold;">points</span></a><span style="color: #009900;">(</span>clustSP<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=clustSP$Clusters<span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.5</span><span style="color: #339933;">,</span>pch=<span style="color: blue;">"+"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a><span style="color: #009900;">(</span>legend.pos<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Plates"</span><span style="color: #339933;">,</span><span style="color: blue;">"Faults"</span><span style="color: #339933;">,</span><span style="color: blue;">"Volcanoes"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>pch=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"-"</span><span style="color: #339933;">,</span><span style="color: blue;">"-"</span><span style="color: #339933;">,</span><span style="color: blue;">"x"</span><span style="color: #339933;">,</span><span style="color: blue;">"+"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"red"</span><span style="color: #339933;">,</span><span style="color: blue;">"dark grey"</span><span style="color: #339933;">,</span><span style="color: blue;">"yellow"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>bty=<span style="color: blue;">"n"</span><span style="color: #339933;">,</span>bg=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span 
style="color: blue;">"white"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>y.intersp=<span style="color: #cc66cc;">0.75</span><span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.6</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/text"><span style="color: #003399; font-weight: bold;">text</span></a><span style="color: #009900;">(</span>legend.pos$x<span style="color: #339933;">,</span>legend.pos$y+<span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: blue;">"Legend:"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/grDevices/dev.off"><span style="color: #003399; font-weight: bold;">dev.off</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
I created the object <i>clustSP </i>based on the coordinates in WGS84 so that I can plot everything as before. I also plotted the volcanoes in yellow, so that they differ from the red cluster. The result is the following image:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUgslqFNaJqlWp_eLDNLdyGOVNMiCEHN7Me7JUtFQ0H_a2Jwjxt5m42k9lHgw0BhIzTLJWQPWOUcbV0-ovtjdtGrRXyDuiXonCbj0kBfARGjOdBU9rnFpdvRoZhFjzfkC3RcwCrAThPRSJ/s1600/Earthquake_Clusters.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUgslqFNaJqlWp_eLDNLdyGOVNMiCEHN7Me7JUtFQ0H_a2Jwjxt5m42k9lHgw0BhIzTLJWQPWOUcbV0-ovtjdtGrRXyDuiXonCbj0kBfARGjOdBU9rnFpdvRoZhFjzfkC3RcwCrAThPRSJ/s640/Earthquake_Clusters.jpg" width="640" /></a></div>
<br />
<br />
To conclude this experiment I would also like to explore the relationship between the distance to the geological features and the magnitude of the earthquakes. To do that we need to identify the events that lie within a certain distance of each geological feature. We can use the function <b>gBuffer</b>, again available from the package <b>rgeos</b>, for this job.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">volcano.buffer <- gBuffer<span style="color: #009900;">(</span>volcanoUTM<span style="color: #339933;">,</span>width=<span style="color: #cc66cc;">1000</span><span style="color: #009900;">)</span>
volcano.over <- <a href="http://inside-r.org/r-doc/grDevices/over"><span style="color: #003399; font-weight: bold;">over</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #339933;">,</span>volcano.buffer<span style="color: #009900;">)</span>
plates.buffer <- gBuffer<span style="color: #009900;">(</span>platesUTM<span style="color: #339933;">,</span>width=<span style="color: #cc66cc;">1000</span><span style="color: #009900;">)</span>
plates.over <- <a href="http://inside-r.org/r-doc/grDevices/over"><span style="color: #003399; font-weight: bold;">over</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #339933;">,</span>plates.buffer<span style="color: #009900;">)</span>
faults.buffer <- gBuffer<span style="color: #009900;">(</span>faultsUTM<span style="color: #339933;">,</span>width=<span style="color: #cc66cc;">1000</span><span style="color: #009900;">)</span>
faults.over <- <a href="http://inside-r.org/r-doc/grDevices/over"><span style="color: #003399; font-weight: bold;">over</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #339933;">,</span>faults.buffer<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
This function takes a minimum of two arguments: the <i>SpatialObject </i>and the maximum distance the buffer reaches, set with the option <i>width </i>(in metres, because it requires projected data). The result is a <i>SpatialPolygons </i>object that includes a buffer around the starting features; for example, if we start with a point we end up with a circle of radius equal to <i>width</i>. In the code above we first created these buffer areas and then overlaid <i>EarthquakesUTM </i>on them to find the events located within their borders. The overlay function returns one of two values for each event: NA if the object is outside the buffer area and 1 if it is inside. We can use this information to subset <i>EarthquakesUTM </i>later on.<br />
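The buffer-and-overlay logic can be seen on a toy projected dataset. This is a self-contained sketch (note that <b>rgeos</b> has since been superseded by <b>sf</b>/<b>terra</b>, but it matches the packages used in this post):

```r
library(sp)
library(rgeos)  # provides gBuffer; over() comes from sp

# Toy projected data: three "volcanoes" and ten random "events" (metres)
set.seed(1)
volc <- SpatialPoints(cbind(c(0, 5000, 10000), c(0, 0, 0)))
evts <- SpatialPoints(cbind(runif(10, -2000, 12000), runif(10, -2000, 2000)))

buf <- gBuffer(volc, width=1000)  # 1 km circles around each point
ov  <- over(evts, buf)            # NA outside the buffer, 1 inside

sum(!is.na(ov))  # number of events within 1 km of a volcano
```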
<br />
Now we can include the overlays in EarthquakesUTM as follows:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">EarthquakesUTM$volcano <- <a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span>volcano.over<span style="color: #009900;">)</span>
EarthquakesUTM$plates <- <a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span>plates.over<span style="color: #009900;">)</span>
EarthquakesUTM$faults <- <a href="http://inside-r.org/r-doc/base/as.numeric"><span style="color: #003399; font-weight: bold;">as.numeric</span></a><span style="color: #009900;">(</span>faults.over<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
To determine whether there is a relationship between the distance from each feature and the magnitude of the earthquakes, we can simply plot the magnitude distribution of the events included in each of the buffer areas we created before, with the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/density"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>EarthquakesUTM$volcano<span style="color: #009900;">)</span>==<span style="color: blue;">"1"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>$mag<span style="color: #009900;">)</span><span style="color: #339933;">,</span>ylim=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>xlim=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>main=<span style="color: blue;">"Earthquakes by Origin"</span><span style="color: #339933;">,</span>xlab=<span style="color: blue;">"Magnitude"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/lines"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/density"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>EarthquakesUTM$faults<span style="color: #009900;">)</span>==<span style="color: blue;">"1"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>$mag<span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"red"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/lines"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/density"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>EarthquakesUTM$plates<span style="color: #009900;">)</span>==<span style="color: blue;">"1"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>$mag<span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: blue;">"blue"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">0.6</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/title"><span style="color: #003399; font-weight: bold;">title</span></a>=<span style="color: blue;">"Mean magnitude per origin"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/legend"><span style="color: #003399; font-weight: bold;">legend</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Volcanic"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/mean"><span style="color: #003399; font-weight: bold;">mean</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>EarthquakesUTM$volcano<span style="color: #009900;">)</span>==<span style="color: blue;">"1"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>$mag<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: 
#003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Faults"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/mean"><span style="color: #003399; font-weight: bold;">mean</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>EarthquakesUTM$faults<span style="color: #009900;">)</span>==<span style="color: blue;">"1"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>$mag<span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Plates"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/mean"><span style="color: #003399; font-weight: bold;">mean</span></a><span style="color: #009900;">(</span>EarthquakesUTM<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span>EarthquakesUTM$plates<span style="color: #009900;">)</span>==<span style="color: blue;">"1"</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>$mag<span style="color: 
#009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>pch=<span style="color: blue;">"-"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"black"</span><span style="color: #339933;">,</span><span style="color: blue;">"red"</span><span style="color: #339933;">,</span><span style="color: blue;">"blue"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>cex=<span style="color: #cc66cc;">0.8</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
which creates the following plot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr6NZplT_5GhoAutpcc2JRMBpwVdy1LUdd3v7iODkYIsRu3yhln6Je9CHKjojtS0K-cv9HbIDM9yjoDuyWkRpxVH7H0XyHSpZMZOcCLF92EpUM4Ji79hGUbwVehZW5iz8aJ4NhMZEc3FsR/s1600/Magnitude_Distribution.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr6NZplT_5GhoAutpcc2JRMBpwVdy1LUdd3v7iODkYIsRu3yhln6Je9CHKjojtS0K-cv9HbIDM9yjoDuyWkRpxVH7H0XyHSpZMZOcCLF92EpUM4Ji79hGUbwVehZW5iz8aJ4NhMZEc3FsR/s640/Magnitude_Distribution.jpeg" width="640" /></a></div>
<br />
It seems that earthquakes close to plate boundaries have a higher magnitude on average.<br />
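The density plot is suggestive, but the difference can also be tested formally, for instance with a Wilcoxon rank-sum test on the magnitudes near plates versus near faults. A sketch on simulated magnitudes; with the real data the two vectors would be the subsets of <i>EarthquakesUTM$mag</i> used in the plot above:

```r
# Simulated stand-ins for the two magnitude subsets
set.seed(1)
mag.plates <- rnorm(200, mean=4.5, sd=1)
mag.faults <- rnorm(200, mean=4.0, sd=1)

# One-sided test: are magnitudes near plates stochastically larger?
wilcox.test(mag.plates, mag.faults, alternative="greater")
```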
<br />
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">R code snippets created by Pretty R at inside-R.org</a><br />
<br />
<b>Live Earthquake Map with Shiny and Google Map API</b> (2015-05-28)<br />
<br />
In the post <a href="http://r-video-tutorial.blogspot.ch/2015/05/exchange-data-between-r-and-google-maps.html" target="_blank">Exchange data between R and the Google Maps API using Shiny</a> I presented a very simple way to allow communication between R and javascript using <b>Shiny</b>.<br />
<br />
This is a practical example of how the same system can be used. Here I created a tool to visualize seismic events, collected from the USGS, in the Google Maps API, using R for some basic data preparation. The procedure is pretty much identical to the one presented in the post mentioned above, so I will not repeat it here.<br />
<br />
<br />
The final map looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcv2sRm0nIfALySwqBbTlJ4RzEflUGVBP72uVygwkztv9SZ2CPzdG2DdIrRKZqiotuDs9g660N6CpXJuBjVAm1L05ykwlpE_mnpUZorVZEvbkdlK5jj2-T1lG4v8IaZWOqhkbVUno1qRjd/s1600/Map.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcv2sRm0nIfALySwqBbTlJ4RzEflUGVBP72uVygwkztv9SZ2CPzdG2DdIrRKZqiotuDs9g660N6CpXJuBjVAm1L05ykwlpE_mnpUZorVZEvbkdlK5jj2-T1lG4v8IaZWOqhkbVUno1qRjd/s640/Map.jpg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
and it is accessible from this site, hosted on the Amazon Cloud: <a href="http://52.28.106.115:3838/Earthquake/" target="_blank">Earthquake</a><br />
<br />
The colour of each marker depends on magnitude and is set in R: for magnitudes of 2 or below the marker is green, between 2 and 4 yellow, between 4 and 6 orange, and above 6 red.<br />
I also set R to export additional information about each event to the JSON file, which I then use to populate the infowindow of each marker.<br />
<br />
The code for creating this map consists of two pieces: an index.html file (which needs to go in a folder named www) and the file server.R, both available below:<br />
<br />
Server.r<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhm3vZw_Dw_J0fDP5kaKTaFuosuxcJVkt-iRFiRKtR1d2laXFO5fiLyAL7c-oAIZeJrHum49C0VXVly0-mBatVCQw6SF19cwRpWXXJRPmsO-Sh5hWLFiSd3v0h47CyKgU-c0PqZ4c69D0db/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # server.R
#Title: Earthquake Visualization in Shiny
#Copyright: Fabio Veronesi
library(sp)
library(rjson)
library(RJSONIO)
shinyServer(function(input, output) {
  output$json <- reactive ({
    if(length(input$Earth)>0){
      if(input$Earth==1){
        hour <- read.table("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv", sep = ",", header = T)
        if(nrow(hour)>0){
          lis <- list()
          for(i in 1:nrow(hour)){
            if(hour$mag[i]<=2){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_green.png"}
            else if(hour$mag[i]>2&hour$mag[i]<=4){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_yellow.png"}
            else if(hour$mag[i]>4&hour$mag[i]<=6){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_orange.png"}
            else {icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_red.png"}
            Date.hour <- substring(hour$time[i],1,10)
            Time.hour <- substring(hour$time[i],12,23)
            lis[[i]] <- list(i,hour$longitude[i],hour$latitude[i],icon,hour$place[i],hour$depth[i],hour$mag[i],Date.hour,Time.hour)
          }
          #This code creates the variable test directly in javascript, to export the data to the Google Maps API
          #I have taken this part from:http://stackoverflow.com/questions/26719334/passing-json-data-to-a-javascript-object-with-shiny
          paste('<script>test=',
                RJSONIO::toJSON(lis),
                ';setAllMap();Cities_Markers();',
                '</script>')
        }
      }
      else if(input$Earth==4){
        month <- read.table("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv", sep = ",", header = T)
        if(nrow(month)>0){
          lis <- list()
          for(i in 1:nrow(month)){
            if(month$mag[i]<=2){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_green.png"}
            else if(month$mag[i]>2&month$mag[i]<=4){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_yellow.png"}
            else if(month$mag[i]>4&month$mag[i]<=6){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_orange.png"}
            else {icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_red.png"}
            Date.month <- substring(month$time[i],1,10)
            Time.month <- substring(month$time[i],12,23)
            lis[[i]] <- list(i,month$longitude[i],month$latitude[i],icon,month$place[i],month$depth[i],month$mag[i],Date.month,Time.month)
          }
          #This code creates the variable test directly in javascript, to export the data to the Google Maps API
          #I have taken this part from:http://stackoverflow.com/questions/26719334/passing-json-data-to-a-javascript-object-with-shiny
          paste('<script>test=',
                RJSONIO::toJSON(lis),
                ';setAllMap();Cities_Markers();',
                '</script>')
        }
      }
      else if(input$Earth==3){
        week <- read.table("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv", sep = ",", header = T)
        if(nrow(week)>0){
          lis <- list()
          for(i in 1:nrow(week)){
            if(week$mag[i]<=2){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_green.png"}
            else if(week$mag[i]>2&week$mag[i]<=4){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_yellow.png"}
            else if(week$mag[i]>4&week$mag[i]<=6){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_orange.png"}
            else {icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_red.png"}
            Date.week <- substring(week$time[i],1,10)
            Time.week <- substring(week$time[i],12,23)
            lis[[i]] <- list(i,week$longitude[i],week$latitude[i],icon,week$place[i],week$depth[i],week$mag[i],Date.week,Time.week)
          }
          #This code creates the variable test directly in javascript, to export the data to the Google Maps API
          #I have taken this part from:http://stackoverflow.com/questions/26719334/passing-json-data-to-a-javascript-object-with-shiny
          paste('<script>test=',
                RJSONIO::toJSON(lis),
                ';setAllMap();Cities_Markers();',
                '</script>')
        }
      }
      else {
        day <- read.table("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv", sep = ",", header = T)
        if(nrow(day)>0){
          lis <- list()
          for(i in 1:nrow(day)){
            if(day$mag[i]<=2){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_green.png"}
            else if(day$mag[i]>2&day$mag[i]<=4){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_yellow.png"}
            else if(day$mag[i]>4&day$mag[i]<=6){icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_orange.png"}
            else {icon="http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_red.png"}
            Date.day <- substring(day$time[i],1,10)
            Time.day <- substring(day$time[i],12,23)
            lis[[i]] <- list(i,day$longitude[i],day$latitude[i],icon,day$place[i],day$depth[i],day$mag[i],Date.day,Time.day)
          }
          #This code creates the variable test directly in javascript, to export the data to the Google Maps API
          #I have taken this part from:http://stackoverflow.com/questions/26719334/passing-json-data-to-a-javascript-object-with-shiny
          paste('<script>test=',
                RJSONIO::toJSON(lis),
                ';setAllMap();Cities_Markers();',
                '</script>')
        }
      }
    }
  })
})
</code></pre>
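The four branches in server.R differ only in the feed URL; in particular, the icon selection repeated in each branch could be factored into a small helper. A sketch of that refactoring (not part of the original app):

```r
# Map a magnitude to the Google marker icon URL used in server.R:
# <=2 green, (2,4] yellow, (4,6] orange, >6 red
magnitude_icon <- function(mag){
  colour <- if(mag <= 2) "green" else if(mag <= 4) "yellow" else if(mag <= 6) "orange" else "red"
  paste0("http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_", colour, ".png")
}

magnitude_icon(5.1)  # "http://maps.gstatic.com/mapfiles/ridefinder-images/mm_20_orange.png"
```
<br />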
<br />
<br />
Index.html<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhm3vZw_Dw_J0fDP5kaKTaFuosuxcJVkt-iRFiRKtR1d2laXFO5fiLyAL7c-oAIZeJrHum49C0VXVly0-mBatVCQw6SF19cwRpWXXJRPmsO-Sh5hWLFiSd3v0h47CyKgU-c0PqZ4c69D0db/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <!DOCTYPE html>
<html>
<head>
<title>Earthquake Visualization in Shiny</title>
<!--METADATA-->
<meta name="author" content="Fabio Veronesi">
<meta name="copyright" content="©Fabio Veronesi">
<meta http-equiv="Content-Language" content="en-gb">
<meta charset="utf-8"/>
<style type="text/css">
html { height: 100% }
body { height: 100%; margin: 0; padding: 0 }
#map-canvas { height: 100%; width:100% }
.btn {
background: #dde6d8;
background-image: -webkit-linear-gradient(top, #dde6d8, #859ead);
background-image: -moz-linear-gradient(top, #dde6d8, #859ead);
background-image: -ms-linear-gradient(top, #dde6d8, #859ead);
background-image: -o-linear-gradient(top, #dde6d8, #859ead);
background-image: linear-gradient(to bottom, #dde6d8, #859ead);
-webkit-border-radius: 7;
-moz-border-radius: 7;
border-radius: 7px;
font-family: Arial;
color: #000000;
font-size: 20px;
padding: 9px 20px 10px 20px;
text-decoration: none;
}
.btn:hover {
background: #f29f9f;
background-image: -webkit-linear-gradient(top, #f29f9f, #ab1111);
background-image: -moz-linear-gradient(top, #f29f9f, #ab1111);
background-image: -ms-linear-gradient(top, #f29f9f, #ab1111);
background-image: -o-linear-gradient(top, #f29f9f, #ab1111);
background-image: linear-gradient(to bottom, #f29f9f, #ab1111);
text-decoration: none;
}
</style>
<script type="text/javascript" src="http://google-maps-utility-library-v3.googlecode.com/svn/tags/markerclusterer/1.0/src/markerclusterer.js"></script>
<script src="https://maps.googleapis.com/maps/api/js?v=3.exp&signed_in=true&libraries=drawing"></script>
<script type="application/shiny-singletons"></script>
<script type="application/html-dependencies">json2[2014.02.04];jquery[1.11.0];shiny[0.11.1];ionrangeslider[2.0.2];bootstrap[3.3.1]</script>
<script src="shared/json2-min.js"></script>
<script src="shared/jquery.min.js"></script>
<link href="shared/shiny.css" rel="stylesheet" />
<script src="shared/shiny.min.js"></script>
<link href="shared/ionrangeslider/css/normalize.css" rel="stylesheet" />
<link href="shared/ionrangeslider/css/ion.rangeSlider.css" rel="stylesheet" />
<link href="shared/ionrangeslider/css/ion.rangeSlider.skinShiny.css" rel="stylesheet" />
<script src="shared/ionrangeslider/js/ion.rangeSlider.min.js"></script>
<link href="shared/bootstrap/css/bootstrap.min.css" rel="stylesheet" />
<script src="shared/bootstrap/js/bootstrap.min.js"></script>
<script src="shared/bootstrap/shim/html5shiv.min.js"></script>
<script src="shared/bootstrap/shim/respond.min.js"></script>
<script type="text/javascript">
var map = null;
var Gmarkers = [];
function Cities_Markers() {
var infowindow = new google.maps.InfoWindow({ maxWidth: 500,maxHeight:500 });
//Loop to add markers to the map based on the JSON exported from R, which is within the variable test
for (var i = 0; i < test.length; i++) {
var lat = test[i][2]
var lng = test[i][1]
var marker = new google.maps.Marker({
position: new google.maps.LatLng(lat, lng),
title: 'test',
map: map,
icon:test[i][3]
});
//This sets up the infowindow
google.maps.event.addListener(marker, 'click', (function(marker, i) {
return function() {
infowindow.setContent('<div id="content"><p><b>Location</b> = '+
test[i][4]+'<p>'+
'<b>Depth</b> = '+test[i][5]+'Km <p>'+
'<b>Magnitude</b> = '+test[i][6]+ '<p>'+
'<b>Date</b> = '+test[i][7]+'<p>'+
'<b>Time</b> = '+test[i][8]+'</div>');
infowindow.open(map, marker);
}
})(marker, i));
Gmarkers.push(marker);
};
};
//Function to remove all the markers from the map
function setAllMap() {
for (var i = 0; i < Gmarkers.length; i++) {
Gmarkers[i].setMap(null);
}
}
//Initialize the map
function initialize() {
var mapOptions = {
center: new google.maps.LatLng(31.6, 0),
zoom: 3,
mapTypeId: google.maps.MapTypeId.TERRAIN
};
map = new google.maps.Map(document.getElementById('map-canvas'),mapOptions);
}
google.maps.event.addDomListener(window, 'load', initialize);
</script>
</head>
<body>
<div id="json" class="shiny-html-output"></div>
<button type="button" class="btn" id="hour" onClick="Shiny.onInputChange('Earth', 1)" style="position:absolute;top:1%;left:1%;width:100px;z-index:999">Last Hour</button>
<button type="button" class="btn" id="day" onClick="Shiny.onInputChange('Earth', 2)" style="position:absolute;top:1%;left:10%;width:100px;z-index:999">Last Day</button>
<button type="button" class="btn" id="week" onClick="Shiny.onInputChange('Earth', 3)" style="position:absolute;top:1%;left:20%;width:100px;z-index:999">Last Week</button>
<button type="button" class="btn" id="month" onClick="Shiny.onInputChange('Earth', 4)" style="position:absolute;top:1%;left:30%;width:100px;z-index:999">Last Month</button>
<div id="map-canvas" style="top:0%;right:0%;width:100%;height:100%;z-index:1"></div>
</body>
</html>
</code></pre>
<a href="http://codeformatter.blogspot.ch/">Created with CodeFormatter</a><br />
<br />
<b><span style="font-size: large;">Interactive maps of Crime data in Greater London</span></b> (Fabio Veronesi, 2015-05-25)<br />
In the previous post we looked at ways to perform some introductory point pattern analysis of open data downloaded from Police.uk. As you may remember, we subset the dataset of crimes in the Greater London area, extracting only the drug-related ones. Subsequently, we looked at ways to use those data with the package <b>spatstat </b>and perform basic statistics.<br />
In this post I will briefly discuss ways to create interactive plots of the results of the point pattern analysis using the Google Maps API and Leaflet from R.<br />
<br />
<b><span style="font-size: large;">Number of Crimes by Borough</span></b><br />
In the previous post we looped through the <i>GreaterLondonUTM</i> shapefile to extract the area of each borough and then counted the number of crimes within its border. To show the results we used a simple barplot. Here I would like to use the same method I presented in my post <a href="http://r-video-tutorial.blogspot.ch/2015/05/interactive-maps-for-web-in-r.html" target="_blank">Interactive Maps for the Web</a> to plot these results on Google Maps.<br />
<br />
This post is intended as a continuation of the previous one, so I will not present again the methods and objects we used there. To make this code work, you can simply copy and paste it below the code you created before and it should work just fine.<br />
<br />
First of all, let's create a new object including only the names of the boroughs from the <i>GreaterLondonUTM</i> shapefile. We need to do this because otherwise, when we click on a polygon on the map, it would show us a long list of useless data.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">GreaterLondon.Google <- GreaterLondonUTM<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: blue;">"name"</span><span style="color: #009900;">]</span></pre>
</div>
</div>
<br />
The new object has only one column with the name of each borough.<br />
Now we can create a loop to iterate through these names and calculate the intensity of the crimes:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Borough <- GreaterLondonUTM<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: blue;">"name"</span><span style="color: #009900;">]</span>
<span style="color: black; font-weight: bold;">for</span><span style="color: #009900;">(</span>i <span style="color: black; font-weight: bold;">in</span> <a href="http://inside-r.org/r-doc/base/unique"><span style="color: #003399; font-weight: bold;">unique</span></a><span style="color: #009900;">(</span>GreaterLondonUTM$name<span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">{</span>
sub.name <- Local.Intensity<span style="color: #009900;">[</span>Local.Intensity<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span>==i<span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">]</span>
Borough<span style="color: #009900;">[</span>Borough$name==i<span style="color: #339933;">,</span><span style="color: blue;">"Intensity"</span><span style="color: #009900;">]</span> <- sub.name
Borough<span style="color: #009900;">[</span>Borough$name==i<span style="color: #339933;">,</span><span style="color: blue;">"Intensity.Area"</span><span style="color: #009900;">]</span> <- <a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span>sub.name/<span style="color: #009900;">(</span>GreaterLondonUTM<span style="color: #009900;">[</span>GreaterLondonUTM$name==i<span style="color: #339933;">,</span><span style="color: #009900;">]</span>@polygons<span style="color: #009900;">[</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">]</span>@area/<span style="color: #cc66cc;">10000</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">4</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span></pre>
</div>
</div>
<br />
As you can see, this loop selects one name at a time, then subsets the object <i>Local.Intensity</i> (which we created in the previous post) to extract the number of crimes for each borough. The next line attaches this intensity to the object <i>Borough</i> as a new column named <i>Intensity</i>. The code does not stop here, however. We also create another column named <i>Intensity.Area</i>, in which we calculate the number of crimes per unit area. Since the area from the shapefile is in square metres and the resulting numbers were very small, I divided the area by 10'000, so that this column shows the number of crimes per hectare (10'000 square metres) in each borough. This should correct for the fact that certain boroughs have a relatively high number of crimes only because their area is larger than others'.<br />
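Just to make the arithmetic explicit, here is a tiny standalone sketch of the same normalization; the counts and areas below are made up for illustration:<br />

```r
# Made-up crime counts and borough areas (in square metres)
crimes  <- c(Camden = 120, Hackney = 85)
area.m2 <- c(Camden = 21800000, Hackney = 19060000)

# Dividing the area by 10,000 expresses it in hectares (10,000 m^2),
# so the ratio below is the number of crimes per hectare
Intensity.Area <- round(crimes / (area.m2 / 10000), 4)
Intensity.Area
```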
<br />
Now we can use again the package <b>plotGoogleMaps</b> to create a beautiful visualization of our results and save it in HTML so that we can upload it to our website or blog.<br />
The code for doing that is very simple and it is presented below:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">plotGoogleMaps<span style="color: #009900;">(</span>Borough<span style="color: #339933;">,</span>zcol=<span style="color: blue;">"Intensity"</span><span style="color: #339933;">,</span>filename=<span style="color: blue;">"Crimes_Boroughs.html"</span><span style="color: #339933;">,</span>layerName=<span style="color: blue;">"Number of Crimes"</span><span style="color: #339933;">,</span> fillOpacity=<span style="color: #cc66cc;">0.4</span><span style="color: #339933;">,</span>strokeWeight=<span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>mapTypeId=<span style="color: blue;">"ROADMAP"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
I decided to plot the polygons on top of the roadmap and not on top of the satellite image, which is the default for the function. For this reason I added the option <i>mapTypeId="ROADMAP"</i>.<br />
The result is the map shown below and at this link: <a href="http://www.fabioveronesi.net/Blog/Crimes_Boroughs.html" target="_blank">Crimes on GoogleMaps</a><br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoUoJQiE62gP6q4924iaGVsNeMGyikn-hQ4Z8Mdwlpo-ZWcj61s1eRw_7Fd9oncDVKNnnwNVXnm92w1lijoYCHfMak94zr-89L6IHFdFWKSBU8TqeceT2CzWvAwutk0acLNRdHaBFH9KI5/s1600/Areas_GoogleMaps.jpg" imageanchor="1"><img border="0" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoUoJQiE62gP6q4924iaGVsNeMGyikn-hQ4Z8Mdwlpo-ZWcj61s1eRw_7Fd9oncDVKNnnwNVXnm92w1lijoYCHfMak94zr-89L6IHFdFWKSBU8TqeceT2CzWvAwutk0acLNRdHaBFH9KI5/s640/Areas_GoogleMaps.jpg" width="640" /></a>
<br />
<br />
In the post Interactive Maps for the Web in R I received a comment from Gerardo Celis, whom I thank for it, telling me that the package <b>leafletR</b> is now also available in R, which allows us to create interactive maps based on <a href="http://leafletjs.com/" target="_blank">Leaflet</a>. So for this new experiment I decided to try it out!<br />
<br />
I started from the sample code presented here: <a href="https://github.com/chgrl/leafletR" target="_blank">https://github.com/chgrl/leafletR</a> and adapted it to my data with very few changes.<br />
The function <b>leaflet </b>does not work directly with Spatial data; we first need to transform them into <i>GeoJSON</i> with another function in <b>leafletR</b>:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Borough.Leaflet <- toGeoJSON<span style="color: #009900;">(</span>Borough<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
Extremely simple!!<br />
<br />
Now we need to set the style to use for plotting the polygons using the function <b>styleGrad</b>, which is used to create a list of colors based on a particular attribute:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">map.style <- styleGrad<span style="color: #009900;">(</span>pro=<span style="color: blue;">"Intensity"</span><span style="color: #339933;">,</span>breaks=<a href="http://inside-r.org/r-doc/base/seq"><span style="color: #003399; font-weight: bold;">seq</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/min"><span style="color: #003399; font-weight: bold;">min</span></a><span style="color: #009900;">(</span>Borough$Intensity<span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/max"><span style="color: #003399; font-weight: bold;">max</span></a><span style="color: #009900;">(</span>Borough$Intensity<span style="color: #009900;">)</span>+<span style="color: #cc66cc;">15</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/by"><span style="color: #003399; font-weight: bold;">by</span></a>=<span style="color: #cc66cc;">20</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>style.val=<a href="http://inside-r.org/r-doc/grDevices/cm.colors"><span style="color: #003399; font-weight: bold;">cm.colors</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>leg=<span style="color: blue;">"Number of Crimes"</span><span style="color: #339933;">,</span> fill.alpha=<span style="color: #cc66cc;">0.4</span><span style="color: #339933;">,</span> lwd=<span style="color: #cc66cc;">0</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
In this function we need to set several options:<br />
<i>pro</i> = the name of the attribute (i.e. the column name) to use for setting the colors<br />
<i>breaks</i> = this option creates the ranges of values for each color. In this case, as in the example, I just created a sequence of values from the minimum to the maximum. As you can see from the code, I added 15 to the maximum value. This is because the breaks vector needs one more element than the number of colors: for example, if we set 10 breaks we would need to set 9 colors. If the sequence of breaks ended before the maximum, the polygons with the maximum number of crimes would be presented in grey.<br />
This is important!!<br />
<br />
<i>style.val</i> = this option takes the color scale to be used to present the polygons. We can select one of the default scales, or create a new one with the function <b>color.scale</b> in the package <b>plotrix</b>, which I already discussed here: <a href="http://r-video-tutorial.blogspot.ch/2015/04/downloading-and-visualizing-seismic_28.html" target="_blank">Downloading and Visualizing Seismic Events from USGS </a><br />
<br />
<i>leg</i> = this is simply the title of the legend<br />
<i>fill.alpha </i>= is the opacity of the colors in the map (ranges from 0 to 1, where 1 is the maximum)<br />
<i>lwd </i>= is the width of the line between polygons<br />
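To double-check the relationship between breaks and colors described above, here is a quick standalone sketch; the intensity values are invented:<br />

```r
# The breaks vector must have one more element than the colour vector
intensity <- c(3, 18, 41, 77, 95)                       # made-up crime counts
breaks  <- seq(min(intensity), max(intensity) + 15, by = 20)
colours <- cm.colors(length(breaks) - 1)

# TRUE: every interval between two consecutive breaks gets one colour,
# and the padded maximum guarantees the top value falls inside an interval
length(breaks) == length(colours) + 1
```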
<br />
After we set the style we can simply call the function <b>leaflet </b>to create the map:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">leaflet<span style="color: #009900;">(</span>Borough.Leaflet<span style="color: #339933;">,</span>popup=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"name"</span><span style="color: #339933;">,</span><span style="color: blue;">"Intensity"</span><span style="color: #339933;">,</span><span style="color: blue;">"Intensity.Area"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>style=map.style<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
In this function we need to input the name of the <i>GeoJSON</i> object we created before, the style of the map and the names of the columns to use for the popups.<br />
The result is the map shown below and available at this link: <a href="http://www.fabioveronesi.net/Blog/Borough.html" target="_blank">Leaflet Map</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVKUdzWHq2ixIeMkq5ByvQOKiSX9F-AHkN2mRGrKf9LSH2GW4Wcpt-0dZOWhfYvNt9HhtiWU3HdLEpTeib32mKjjtXHCUb2T_4UWHRLpUKbmO1eWD87FhaQXoGr1PAG_1shu6RZy0emDLH/s1600/Areas_Leaflet.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVKUdzWHq2ixIeMkq5ByvQOKiSX9F-AHkN2mRGrKf9LSH2GW4Wcpt-0dZOWhfYvNt9HhtiWU3HdLEpTeib32mKjjtXHCUb2T_4UWHRLpUKbmO1eWD87FhaQXoGr1PAG_1shu6RZy0emDLH/s640/Areas_Leaflet.jpg" width="640" /></a></div>
<br />
<br />
I must say this function is very neat. First of all, if you do not set the name of the HTML file, the function <b>plotGoogleMaps</b> creates a series of temporary files stored in your temp folder, which is not great. Moreover, even if you set the file name, the legend is saved into different image files every time you call the function, which you may do many times until you are fully satisfied with the result.<br />
The package <b>leafletR</b>, on the other hand, creates a new folder inside the working directory where it stores both the <i>GeoJSON</i> and the HTML file, and every time you modify the visualization the function overwrites the same files.<br />
That said, I noticed that I cannot see the map if I open the HTML files locally on my PC: I had to upload the files to my website every time I changed them to actually see how the changes affected the plot. This may be something specific to my machine, however.<br />
<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Density of Crimes in raster format</span></b><br />
As you may remember from the previous post, one of the steps included in a point pattern analysis is the computation of the spatial density of the events. One of the techniques to do that is the kernel density, which basically calculates the density continuously across the study area, thus creating a raster.<br />
We already looked at the kernel density in the previous post, so I will not go into details here; the code for computing the density and transforming it into a raster is the following:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Density <- density.ppp<span style="color: #009900;">(</span>Drugs.ppp<span style="color: #339933;">,</span> sigma = <span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span>edge=T<span style="color: #339933;">,</span>W=as.mask<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/window"><span style="color: #003399; font-weight: bold;">window</span></a><span style="color: #339933;">,</span>eps=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">100</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">100</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
Density.raster <- raster<span style="color: #009900;">(</span>Density<span style="color: #009900;">)</span>
projection<span style="color: #009900;">(</span>Density.raster<span style="color: #009900;">)</span>=projection<span style="color: #009900;">(</span>GreaterLondonUTM<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
The first line is basically the same as the one we used in the previous post. The only difference is that here I added the option <i>W</i> to set the resolution of the map, with <i>eps </i>at 100x100 m.<br />
Then I simply transformed the first object into a raster and assigned to it the same UTM projection as the object <i>GreaterLondonUTM</i>.<br />
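If you do not have the objects from the previous post at hand, a minimal self-contained sketch of the same pipeline, with a simulated point pattern instead of the crime data and assuming the packages <b>spatstat</b> and <b>raster</b> are installed, could look like this (the calls mirror the ones in the post):<br />

```r
library(spatstat)
library(raster)

# Simulate 200 random points in a 5 km x 5 km window (coordinates in metres)
win <- owin(c(0, 5000), c(0, 5000))
pts <- runifpoint(200, win = win)

# Kernel density with a 500 m bandwidth, on a 100x100 m grid,
# then converted to a raster object as in the post
dens   <- density.ppp(pts, sigma = 500, edge = TRUE,
                      W = as.mask(win, eps = c(100, 100)))
dens.r <- raster(dens)
```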
Now we can create the map. As far as I know (and from what I tested), <b>leafletR </b>is not yet able to plot raster objects, so the only way we have of doing it is again to use the function <b>plotGoogleMaps</b>:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">plotGoogleMaps<span style="color: #009900;">(</span>Density.raster<span style="color: #339933;">,</span>filename=<span style="color: blue;">"Crimes_Density.html"</span><span style="color: #339933;">,</span>layerName=<span style="color: blue;">"Number of Crimes"</span><span style="color: #339933;">,</span> fillOpacity=<span style="color: #cc66cc;">0.4</span><span style="color: #339933;">,</span>strokeWeight=<span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span>colPalette=<a href="http://inside-r.org/r-doc/base/rev"><span style="color: #003399; font-weight: bold;">rev</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/grDevices/heat.colors"><span style="color: #003399; font-weight: bold;">heat.colors</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
When we use this function to plot a raster we clearly do not need to specify the <i>zcol </i>option. Moreover, here I changed the default color scale through the option <i>colPalette</i>, setting it to a reversed <i>heat.colors</i> palette, which I think is more appropriate for such a map. The result is the map below and at this link: <a href="http://www.fabioveronesi.net/Blog/Crimes_Density.html" target="_blank">Crime Density</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpwLmC3Ytqf4M_d_Qf3UYyA6gPMQRRbbIFnT-ENlpxh8s2wq73slgwldh2rv9BFhq6f2lKtJKWcoANGYImQXbni810D13NKfqifCsc6_zpwxW8dmn2KGXXwppon5DIXYd0ydJAbcdDDYky/s1600/Raster_GoogleMaps.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpwLmC3Ytqf4M_d_Qf3UYyA6gPMQRRbbIFnT-ENlpxh8s2wq73slgwldh2rv9BFhq6f2lKtJKWcoANGYImQXbni810D13NKfqifCsc6_zpwxW8dmn2KGXXwppon5DIXYd0ydJAbcdDDYky/s640/Raster_GoogleMaps.jpg" width="640" /></a></div>
<br />
<br />
<br />
<span style="font-size: large;"><b>Density of Crimes as contour lines</b></span><br />
The raster presented above can also be represented as contour lines. The advantage of this type of visualization is that it is less intrusive than a full raster overlay, and it can be better suited to pinpointing problematic locations.<br />
Doing this in R is extremely simple, since there is a dedicated function in the package <b>raster</b>:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Contour <- rasterToContour<span style="color: #009900;">(</span>Density.raster<span style="color: #339933;">,</span>maxpixels=<span style="color: #cc66cc;">100000</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/nlevels"><span style="color: #003399; font-weight: bold;">nlevels</span></a>=<span style="color: #cc66cc;">10</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
This function transforms the raster above into a series of 10 contour lines (we can change the number of lines by changing the option <i>nlevels</i>).<br />
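To get a feeling for what <b>rasterToContour </b>returns, here is a small standalone sketch using R's built-in <i>volcano</i> elevation matrix in place of the density map (assuming the <b>raster</b> package is installed):<br />

```r
library(raster)

# Build a small raster from the built-in volcano matrix
r <- raster(volcano)

# Extract roughly 10 contour lines and inspect the result
cont <- rasterToContour(r, nlevels = 10)
class(cont)  # a SpatialLinesDataFrame, one line per contour level
cont$level   # the contour values, stored in the "level" attribute
```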
<br />
Now we can plot these lines on an interactive web map. I first tested <b>plotGoogleMaps </b>again, but I was surprised to see that it does not seem to do a good job with contour lines. I do not fully know the reason, but if I use the object <i>Contour </i>with this function it does not plot all the lines on the Google map, and the visualization is therefore useless.<br />
For this reason I present below the code to plot the contour lines using <b>leafletR</b>:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">Contour.Leaflet <- toGeoJSON<span style="color: #009900;">(</span>Contour<span style="color: #009900;">)</span>
colour.scale <- color.scale<span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span>:<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>Contour$level<span style="color: #009900;">)</span>-<span style="color: #cc66cc;">1</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>color.spec=<span style="color: blue;">"rgb"</span><span style="color: #339933;">,</span>extremes=<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"red"</span><span style="color: #339933;">,</span><span style="color: blue;">"blue"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
map.style <- styleGrad<span style="color: #009900;">(</span>pro=<span style="color: blue;">"level"</span><span style="color: #339933;">,</span>breaks=Contour$level<span style="color: #339933;">,</span>style.val=colour.scale<span style="color: #339933;">,</span>leg=<span style="color: blue;">"Number of Crimes"</span><span style="color: #339933;">,</span> lwd=<span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span>
leaflet<span style="color: #009900;">(</span>Contour.Leaflet<span style="color: #339933;">,</span>style=map.style<span style="color: #339933;">,</span>base.map=<span style="color: blue;">"tls"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
As mentioned, the first thing to do to use <b>leafletR </b>is to transform our Spatial object into a GeoJSON; the object Contour belongs to the class <i>SpatialLinesDataFrame</i>, so it is supported by the function <b>toGeoJSON</b>.<br />
The next step is again to set the style of the map and then plot it. In this code I changed a few things, just to show some more options. The first is the custom color scale I created using the function <b>color.scale</b> in the package <b>plotrix</b>. The only thing the function <b>styleGrad </b>needs to set the colors in the option <i>style.val</i> is a vector of colors, which must be one element shorter than the vector used for the breaks. In this case the object Contour has only one property, namely "<i>level</i>", which is a vector of class factor. The function <b>styleGrad </b>can use it to create the breaks, but the function color.scale cannot use it to create the list of colors. We can work around this problem by setting the length of the color.scale vector with another vector, <i>1:(length(Contour$level)-1)</i>, which creates a vector of integers from 1 to the length of Contour$level minus one. The result of this function is a vector of colors ranging from red to blue, which we can plug into the following function.<br />
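As a side note, if you prefer to avoid the <b>plotrix</b> dependency, base R offers <b>colorRampPalette</b>, which achieves the same result; this is my own variation, not part of the original code:<br />

```r
# colorRampPalette() returns a function that generates any number of
# colours interpolated between the given extremes
red.to.blue <- colorRampPalette(c("red", "blue"))

# One colour per contour interval: one fewer than the number of breaks
n.levels <- 10                      # e.g. length(Contour$level)
line.colours <- red.to.blue(n.levels - 1)
length(line.colours)  # 9
```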
In the function leaflet the only thing I changed is the <i>base.map</i> option, in which I use "<i>tls</i>". From the help page of the function we can see that the following options are available:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><i>"<span style="background-color: white;">One or a list of </span><code style="background-color: white;">"osm"</code><span style="background-color: white;"> (OpenStreetMap standard map), </span><code style="background-color: white;">"tls"</code><span style="background-color: white;"> (Thunderforest Landscape), </span><code style="background-color: white;">"mqosm"</code><span style="background-color: white;"> (MapQuest OSM), </span><code style="background-color: white;">"mqsat"</code><span style="background-color: white;"> (MapQuest Open Aerial),</span><code style="background-color: white;">"water"</code><span style="background-color: white;"> (Stamen Watercolor), </span><code style="background-color: white;">"toner"</code><span style="background-color: white;"> (Stamen Toner), </span><code style="background-color: white;">"tonerbg"</code><span style="background-color: white;"> (Stamen Toner background), </span><code style="background-color: white;">"tonerlite"</code><span style="background-color: white;"> (Stamen Toner lite), </span><code style="background-color: white;">"positron"</code><span style="background-color: white;"> (CartoDB Positron) or </span><code style="background-color: white;">"darkmatter"</code><span style="background-color: white;"> (CartoDB Dark matter). "</span></i></span><br />
<br />
These lines create the following image, available as a webpage here: <a href="http://www.fabioveronesi.net/Blog/Contour.html" target="_blank">Contour</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo_hxk8Ep9C3EPeRausHPtyRNY7UauvYTDcc_RNUQByzlUmN68P8f4nZOjSdz76N6Q-qFCKkG6elaQADKWvG6FjI-qUnCDY2XdsFBPXt6XNMuki4aBlIGqRikwMZeothZSxeB1vg4TDpnS/s1600/Contour_Leaflet.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo_hxk8Ep9C3EPeRausHPtyRNY7UauvYTDcc_RNUQByzlUmN68P8f4nZOjSdz76N6Q-qFCKkG6elaQADKWvG6FjI-qUnCDY2XdsFBPXt6XNMuki4aBlIGqRikwMZeothZSxeB1vg4TDpnS/s640/Contour_Leaflet.jpg" width="640" /></a></div>
<br />
<br />
<br />
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">R code snippets created by Pretty R at inside-R.org</a><br />
<br />Fabio Veronesi