Friday, 15 July 2016

Time Series Analysis in ArcGIS

In this post I will introduce another toolbox I created to show the functions that can be added to ArcGIS by using R and the R-Bridge technology.
In this toolbox I basically implemented the functions I showed in the previous post about time series analysis in R.
Once again I prepared a sample dataset that I included in the GitHub archive so that you can reproduce the experiment I'm presenting here. I will start my description from there.

Dataset
As for my previous post, here I'm also including open data in shapefile from the EPA, which I downloaded for free using the custom R function I presented here.
I downloaded only temperature data (in F) from 2013, but I kept two categorical variables: State and Address.


As you can see from the image above the time variable is in the format year-month-day. As I mentioned in the post about the plotting toolbox, it is important to set this format correctly so that R can recognize it. Please refer to this page for more information about the formats that R recognizes.


Time Series Plot
This type of plot is available in several packages, including ggplot2, which I used to create the plotting toolbox.  However, in my post about time series analysis I presented the package xts, which is very powerful for handling and plotting time-series data. For this toolbox I decided to maintain the same package and refer everything to xts for several reasons that I would explain along the text.
The first reason is related to the plotting capabilities of this package. Let's take a look for example at the first script in the toolbox, specific for plotting time series.


Similarly to the script for time series in the plotting toolbox, here users need to select the dataset (which can be a shapefile or a CSV, or any other table format that can be accessed in ArcGIS). Then they need to select the variable of interest, in the sample dataset that is Temp, which clearly stands for temperature. Another important information for R is the data/time column and its format, again please refer to my previous post for more information. Finally, I inserted an SQL call to subset the dataset. In this case I'm subsetting a particular station.
The result is the plot below:


As you can see there are quite a few missing values in the dataset related to the station I subset. The very nice thing about the package xts is that with this plot it is perfectly clear where are the missing data, since along the X axis these are evident by the lack of grey tick marks.


Time Histogram
This is a simple bar chart that basically plots time against frequency of samples. The idea behind this plot is to allow users to explore the number of samples for specific time intervals in the dataset.


The user interface is similar to the previous scrips. Users need to select the dataset, then the variable and then the time column and specify its format. I also included an option to select a subset of the dataset with a SQL selection. At this point I included a list to select the averaging period, and users can select between day, month or year. In this case I selected month, which means that R will loop through the months and subset the dataset for each of these. Then it will basically count the number of data sampled in each month and plot this information against the month itself. The result is the plot below:


As you can see we can definitely gather some useful information from this plot; for example we can determine that basically this station, in the year 2013, did not have any problem.


Time Subset
In some cases we may need to subset our dataset according to a particular time period. This can be done in ArcGIS with the "Select by Attribute" tool and by using an SQL string similar to what you see in the image below:


The package xts however, provides much more powerful and probably faster ways to subset by time. For example, in ArcGIS if we want to subset the whole month of June we would need to specify an SQL string like this:
"Time" >= '2013-06-01' AND "Time" < '2013-07-01'

On the contrary, in R and with the package xts if we wanted to do the same we would just need to use the string '2013-06', and R would know to keep only the month of June. Below are some other examples of successful time subset with the package xts (from http://www.inside-r.org/packages/cran/xts/docs/xts):

sample.xts['2013']  # all of 2013
sample.xts['2013-03']  # just March 2013
sample.xts['2013-03/']  # March 2013 to the end of the data set
sample.xts['2013-03/2013']  # March 2013 to the end of 2013
sample.xts['/'] # the whole data set
sample.xts['/2013'] # the beginning of the data through 2013
sample.xts['2013-01-03'] # just the 3rd of January 2013


With this in mind I created the following script:


As you can see from the image above, here there is an option named "Subset" where users can insert one of the strings from the examples above (just the text within square brackets) and select time intervals with the same flexibility allowed in R and the package xts.
The result of this script is a new shapefile containing only the time included in the Subset call.



Time Average Plots
As I showed in my previous post about time series analysis, with the package xts is possible to perform custom functions on specific time intervals with the following commands: apply.daily, apply.weekly, apply.monthly and apply.yearly.
In this toolbox I used these functions to compute the average, 25th and 75th percentiles for specific time intervals, which the user may choose. This is the toolbox:


The only differences from the other scripts are the "Average by", with which the user can select between day, week, month or year. Each of these will trigger the appropriate apply function. Then there is also the possibility to select the position for the plot legend: between topright, topleft, bottomright and bottomleft. Finally, users can select the output folder where the plot below will be saved, along with a CSV with the numerical values for mean, q25 and q75.



Time Function
This is another script that provides direct access to the apply functions I presented before. Here the output is not a plot but a CSV with the results of the function, and users can input their own function directly in the GUI. Let's take a look:


As you can see there is a field named "Function". Here users can insert their own custom function, written in the R language. This function takes a vector (x) and returns a vector and it is in the form:

function(x){sum(x>70}

Only the string within curly brackets needs to be written in the GUI. This will then be passed to the script and applied to the values averaged by day, week, month or year. Users can select this last aspect in the field "Average by". Here for example I am calculating the number of days, for each month, with a temperature above 70 degrees Fahrenheit (21 degrees celsius) in Alaska. The results are saved in CSV in the output folder and printed on screen, as you can see from the image below.



Trend Analysis
In this last script I included access to the function decompose, which I briefly described in my previous post. This function does not work with xts time series, so the time series needs to be loaded with the standard method, ts, in R. This method requires the user to include the frequency of the time series. For this reason I had to add an option for this in the GUI.
Unfortunately, the dataset I created for this experiment only has one full year and thus making a decomposition does not make much sense, but you are invited to try with your data and it should work fine and provide you with results similar to the image below:




Download
Once again the time-series toolbox is available for free from my GitHub page at:
https://github.com/fveronesi/TimeSeries_Toolbox/

2 comments:

  1. Replies
    1. Thank you very much!!
      If you find any bug, please do not hesitate to point them out so that we can make sure the toolbox works perfectly.

      Delete

Note: only a member of this blog may post a comment.