R tutorial for Spatial Statistics: ESRI

Showing posts with label ESRI. Show all posts

Monday, 11 July 2016

The Power of ggplot2 in ArcGIS - The Plotting Toolbox

In this post I present my third experiment with R-Bridge. The plotting toolbox is a plug-in for ArcGIS 10.3.x that allows the creation of beautiful and informative plot, with ggplot2, directly from the ESRI ArcGIS console.
As always I not only provide the toolbox but also a dataset to try it out. Let's start from here...

Data
For testing the plotting tool, I downloaded some air pollution data from EPA (US Environmental Protection Agency), which provides open access to its database. I created a custom function to download data from EPA that you can find in this post.
Since I wanted to provide a relatively small dataset, I extracted values from only four states: California, New York, Iowa and Ohio. For each of these, I included time series for Temperature, CO, NO2, SO2 and Barometric Pressure. Finally, the coordinates of the points are the centroid for each of these four states. The image below depicts the location and the first lines of the attribute table. This dataset is provided in shapefile and CSV, both can be used with the plotting toolbox.

Toolbox
Now that we have seen the sample dataset, we can take a look at the toolbox. I included 5 packages to help summarize and visualize spatial data directly from the ArcGIS console.

I first included a package for summarizing our variables, this creates a table with some descriptive statistics. Then I included all the major plotting types I presented in my book "Learning R for Data Visualization [VIDEO]" edited by Packt Publishing, with some modifications to the scripts for adapting them to the R-Bridge Technology from ESRI. For more information, practical and theoretical, about each of these plots please refer to the book.

The tool can be downloaded from my GitHub page at:
https://github.com/fveronesi/PlottingToolbox

Summary
This is the first package I would like to describe simply because a data analysis should always start with a look at our variables with some descriptive statistics. As for all the tools presented here its use is simple and straightforward with the GUI presented below:

Here the user has to point to the dataset s/he wants to analyze in "Input Table". This can be a shapefile, either loaded already in ArcGIS or not, but it can also be a table, for example a CSV. That is the reason why I included a CSV version of the EPA dataset as well.
At this point the area in "Variable" will fill up with the column names of the input file, from here the user can select the variable s/he is interested in summarizing. The final step is the selection of the "Output Folder". Important: users need to first create the folder and then select it. This is because this parameter is set as input, so the folder needs to exist. I decided to do it this way because otherwise for each new plot a new folder would have need to be created. This way all the summaries and plots can go into the same directory.
Let's take a look at the results:

The Summary package presents two tables, with all the variables we wanted to summarize, arranged one above the other. This is the output users will see on screen once the toolbox has competed its run. Then in the output folder, the R script will save this exact table in PDF format, plus a CSV with the raw data.

Histogram
As the name suggest this tool provides access to the histogram plot in ggplot2. ArcGIS provides a couple of ways to represent data in histograms. The first way is by clicking with the right mouse button on one of the column in the attribute table; a drop-down menu will appear and from there users can click on "Statistics" to access some descriptive statistics and a histogram. Another way is through the "Geostatistical Analyst", which is an add-on for which users need an additional license. This has an "Explore Data" package from which it is possible to create histograms. The problem with both these methods is that the results are not, in my opinion at least, suitable for any publication. You can maybe share them with your colleagues, but I would not suggest using them for any article. This implies that ArcGIS users need to open another software, maybe Excel, to create the plots they require, and we all know how painful it is in Excel to produce any decent plot, and histogram are probably the most painful to produce.
This changes now!!

By combining the power of R and ggplot2 with ArcGIS we can provide users with an easy to use way to produce beautiful visualizations directly within the ArcGIS environment and have them saved in jpeg at 300 dpi. This is what this set of packages does.
Let's now take a look at the interface for histograms:

As for Summary we first need to insert the input dataset, and then select the variable/s we want to plot. If two or more variables are selected, several plots will be created and saved in jpeg.
I also added some optional values to further customize the plots. The first is the faceting variable, which is a categorical variable. If this is selected the plot will have a histogram for each category; for example, in the sample dataset I have the categorical variable "state", with the name of the four states I included. If I select this for faceting the result will be the figure below:

Another option available here is the binwidth. The package ggplot2 usually sets this number as the range of the variable divided by 30. However, this can be customized by the user and this is what you can do with this option. Finally users need to specify an output folder where R will save a jpeg of the plots shown on screen.

Box-Plot
This is another great way to compare variables' distributions and as far as I know it cannot be done directly from ArcGIS. Let's take a look at this package:

This is again very easy to use. Users just need to set the input file, then the variable of interest and the categorical variable for the grouping. Here I am again using the states, so that I compare the distribution of NO2 across the four US states in my dataset.
The results is ordered by median values and it is shown below:

As you can see I decided to plot the categories vertically, as to accomodate long names. This of course can be changed by tweaking the R script. As for each package, this plot is saved in the output folder.

Bar Chart
This tool is generally used to compare different values for several categories, but generally we have one value for each category. However, it may happen that the dataset contains multiple measurements for some categories, and I implemented a way to deal with that here. Below is presented the GUI to this package:

The inputs are basically the same as for box-plots, the only difference is in the option "Average Values". If this is set, R will average the values of the variable for each unique category in the dataset. The results are again ordered, and are saved in jpeg in the output folder:

Scatterplot
This is another great way to visually analyze our data. The package ggplot2 allows the creation of highly customized plots and I tried to implement as much as I could in terms of customization in this toolbox. Let's take a look:

After selecting the input dataset the user can select what to plot. S/he can choose to only plot two variables, one on the X axis and one on the Y axis, or further increase the amount of information presented in the plot by including a variable that changes the color of the points and one for their size. Moreover, there is also the possibility to include a regression line. Color, size and regression line are optional, but I wanted to include them to present the full range of customizations that this package allows.

Once again this plot is saved in the output folder.

Time Series
The final type of plots I included here is the time series, which is also the one with the highest number of inputs from the user side. In fact, many spatial datasets include a temporal component but often time this is not standardized. By that I mean that in some cases the time variable has only a date, and in some cases it includes a time; in other cases the format changes from dataset to dataset. For this reason it is difficult to create an R script that works with most datasets, therefore for time-series plots users need to do some pre-processing. For example, if date and time are in separate columns, these need to be merged into one for this R script to work.
At this point the TimeSeries package can be started:

The first two columns are self explanatory. Then users need to select the column with the temporal information, and then input manually the format of this column.
In this case the format I have in the sample dataset is the following: 2014-01-01
Therefore I have the year with century, a minus sign, the month, another minus sign and the day. I need to use the symbols for each of these to allow R to recognize the temporal format of the file.
Common symbol are:
%Y - Year with century
%y - Year without century
%m - Month
%d - Day
%H - Hour as decimal number (00-23)
%M - Minute as decimal number (00-59)
%S - Second as decimal number

More symbols can be found at this page: http://www.inside-r.org/r-doc/base/strptime

The remaining two inputs are optional, but if one is selected the other needs to be provided. For "Subsetting Column" I intend a column with categorical information. For example, in my dataset I can generate a time-series for each US state, therefore my subsetting column is state. In the option "Subset" users need to write manually the category they want to use to subset their data. Here I just want to have the time-series for California, so I write California. You need to be careful here to write exactly the name you see in the attribute table because R is case sensitive, thus if you write california with a lower case c R will be unable to produce the plot.
The results, again saved in jpeg automatically, is presented below:

Friday, 8 July 2016

Time Averages of NetCDF files from ECMWF in ArcGIS with R-Bridge

With this post I would like to talk again about R-Bridge, which allows a direct communication between ArcGIS and R.
In the previous post, I presented a very simple application of R-Bridge where I built a toolbox to perform k-means clustering on point shapefiles. The purpose of that post was to explain the basics of the technology, but its scientific scope was limited. However, I would like to try step by step to translate more of my R scripts in ArcGIS, so that more people can use them even if they are not experts in R.
In this post I will start presenting a toolbox to handle NetCDF files downloaded from the European Centre for Medium-Range Weather Forecasts (ECMWF).

ECMWF
The European Centre for Medium-Range Weather Forecasts provides free access to numerous weather data through their website. You can go directly to this page to take a look at the data available: http://www.ecmwf.int/en/research/climate-reanalysis/browse-reanalysis-datasets
The data are freely accessible and downloadable (for research purposes!!) but you need to register to the website to be able to do so.

My Research
For a research I'm doing right now I downloaded the ERA Interim dataset from 2010 to 2015 from this access page: http://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/

The data are provided in large NetCDF files, which include all the weather variables I selected and for the entire time frame. In R, NetCDF files can be easily imported, using the packages raster and ncdf4, as a raster brick. This will have a X and Y dimensions, plus a time dimension for each of the variables I decided to download. Since I wanted 5 years and the ERA dataset include four datasets per day, I had quite a lot of data.
I decided that for the research I was planning to do I did not need each and every one of these rasters, but rather some time averages, for example the monthly averages for each variable. Therefore I created an R script to do the job.

Now I decided to use R-Bridge to implement the R script I developed in ArcGIS. This should allow people not familiar with R to easily create time averages of weather reanalysis data provided by ECMWF.

Toolbox
I already covered in the previous post the installation process of R-Bridge and explained how to create a new toolbox with a script, so if you do not know how to do these thing please refer to this post.
For this script I created a simple GUI with two inputs and one output:

The first is used to select the NetCDF file, downloaded from the ECMWF website, on the users' computer. The second is a list of values from which the user can select what type of time average to perform:

Users can select between four types: hourly, daily, monthly and yearly averages. In each case, R will select the unique values for each of these categories and create average rasters. Please remember that ECMWF does not provide hourly data, but only observations at specific time of day, every 6 hours or so; therefore, do not expect the script to generate hourly rasters, but only averages for these time slots.

The final thing that users need to provide is the output folder, where R will save all the rasters. This needs to be a new folder!! R will then create the new folder on disk and then save the rasters in there.
For the time being I do not have a way to plot these rasters onto ArcGIS after the script is completed. In theory with the function writeRaster in the raster package it is possible to export a raster to ArcGIS directly, but users would need to provide the name of this output raster in the Toolbox GUI and in this case this is not possible because many rasters are created at once. I also tried to create another toolbox in Model Builder where the R script was followed by an iterator that should have opened the rasters directly from the output folder, but it does not work. If you have any suggestion for doing this I would like to hear them. In any case this is not an issue, the important thing is being able to produce average rasters from NetCDF files.

R Script
In the final part of the post I will present the R script I used for this Toolbox. Here is the code:

### NetCDF Time Average Toolbox
##Author: Fabio Veronesi
tool_exec <- function(in_params, out_params)
{
 if (!requireNamespace("ncdf4", quietly = TRUE))
  install.packages("ncdf4")
 require(ncdf4)
 
 if (!requireNamespace("reshape2", quietly = TRUE))
  install.packages("reshape2")
 require(reshape2)
 
 if (!requireNamespace("sp", quietly = TRUE))
  install.packages("sp")
 require(sp)
 
 if (!requireNamespace("raster", quietly = TRUE))
  install.packages("raster")
 require(raster)
 
 if (!requireNamespace("rgdal", quietly = TRUE))
  install.packages("rgdal")
 require(rgdal)
 
 
 print("Time Averages of ECMWF Datasets")
 print("Author: Fabio Veronesi")
 
 source_nc = in_params[[1]]
 time_average = in_params[[2]]
 
 out_folder = out_params[[1]]
        
 dir.create(out_folder)
 
 ### Read Data
 arc.progress_label("Reading the NetCDF Dataset...")
 print("Opening NC...")
 nc <- nc_open(source_nc)
 var <- names(nc$var)
 
 print(paste("NetCDF Variable: ",var))
 
 
 print("Creating Average Rasters ...")
 print("Please note that this process can be time-consuming.")
 
 
 for(VAR1 in var){
 print(paste("Executing Script for Variable: ", VAR1))
 var.nc <- brick(source_nc, varname=VAR1, layer="time")
 
 #Divide by Month
 TIME <- as.POSIXct(substr(var.nc@data@names, start=2, stop=20), format="%Y.%m.%d.%H.%M.%S")
 df <- data.frame(INDEX = 1:length(TIME), TIME=TIME)
 
 if(time_average=="Daily Averages"){
  days <- unique(format(TIME, "%d"))
 
  #LOOP
  for(DAY in days){
   subset <- df[format(df$TIME, "%d") == DAY,]
   sub.var <- var.nc[[subset$INDEX]]
 
   print(paste("Executing Average for Day: ",DAY))
   av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Day",DAY,".tif"))
   print(paste("Raster for Day ",DAY," Ready in the Output Folder"))
  }
 } else if(time_average=="Monthly Averages") {
  months <- unique(format(TIME, "%m"))
 
  #LOOP
  for(MONTH in months){
   subset <- df[format(df$TIME, "%m") == MONTH,]
   sub.var <- var.nc[[subset$INDEX]]
 
   print(paste("Executing Average for Month: ",MONTH))
   av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Month",MONTH,".tif"))
   print(paste("Raster for Month ",MONTH," Ready in the Output Folder"))
  }
 } else if(time_average=="Yearly Averages") {
  years <- unique(format(TIME, "%Y"))
 
  #LOOP
  for(YEAR in years){
   subset <- df[format(df$TIME, "%Y") == YEAR,]
   sub.var <- var.nc[[subset$INDEX]]
 
   print(paste("Executing Average for Year: ",YEAR))
   av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Year",YEAR,".tif"))
   print(paste("Raster for Year ",YEAR," Ready in the Output Folder"))
  } 
 } else {
  hours <- unique(format(TIME, "%H"))
 
  #LOOP
  for(HOUR in hours){
   subset <- df[format(df$TIME, "%H") == HOUR,]
   sub.var <- var.nc[[subset$INDEX]]
 
   print(paste("Executing Average for Hour: ",HOUR))
   av.var <- calc(sub.var, fun=mean, filename=paste0(out_folder,"/",VAR1,"_Hour",HOUR,".tif"))
   print(paste("Raster for Hour ",HOUR," Ready in the Output Folder"))
  } 
 }
 
 }
}

Created by Pretty R at inside-R.org

As I described in my previous post, the R code for an ArcGIS Toolbox needs to be included in a function that takes inputs and outputs from the ArcGIS GUI.
In this function the very first thing we need to do is load the required packages, with an option to install them if necessary. Then we need to assign object names to the inputs and output parameters.
As you can see I included quite a few print calls so that ArcGIS users can easily follow the process.

At this point we can move to the NetCDF file. As I mentioned, I will be using the raster package to import the NetCDF file, but first I need to open it directly with the package ncdf4 and the function nc_open. This is necessary to obtain the list of variables included in the file. In my tests I downloaded the temperature at 2m above ground and the albedo, therefore the variables' names were d2m and alb. Since these names are generally not known by end users, we need a way to extract them from the NetCDF file, which is provided by these lines.

Once we have that, we can start a for loop when we iterate through the variables. As you can see the first line within the loop needs to import the nc file with the function brick in the package raster. Within this function we need to specify the name of the variable to use. The raster names include the temporal information we are going to need to create the time averages. For this reason I created an object called TIME with POSIXct values created starting from these names, and then I collected these value into a data.frame. This will be used later on to extract only the indexes of the rasters with the correct date/time.

Now I set up a series of if statements that trigger certain actions depending on what the user selected in the Time Average list on the GUI. Let us assume that the user selected "Daily Averages".
At this point R first uses the function format to extract the days from the data.frame with date/time, named df. Then extracts the unique values from this list. The next step involves iterating through these days and create an average raster for each of these. This can be done with the function calc in raster. This function takes a series of rasters, a function (in this case mean) and can also save a raster object on disk. For the correct address for the output file I simply used the function paste to name the file according to the variable and day. The exact same process is performed for the other time averages.

Download
The R script and Toolbox to perform the time average on ECMWF datasets is available on my GitHub page at:
https://github.com/fveronesi/ECMWF_Toolbox/tree/v0.1

As I said I have other ideas to further enhance this toolbox, thus I created a branch named v0.1. I hope to have time to do these other scripts.

Saturday, 2 July 2016

Combining ArcGIS and R - Clustering Toolbox

Last year at the ESRI User Conference in San Diego, there was an announcement of an initiative to bridge ArcGIS and R. This became reality I think early this year with R-Bridge.
Basically, ESRI has created an R library that is able to communicate and exchange data between ArcGIS and R, so that we can create ArcGIS toolboxes using R scripts.

I am particularly interested in this application because R has become quite powerful for spatial data analysis in the last few years. However, I have the impression that within the geography community, R is still considered a bit of an outsider. This is because the main GIS application, i.e. ArcGIS, is based on Python and therefore courses in departments of geography and geomatics tend to focus on teaching Python, neglecting R. This I think is a mistake, since R in my opinion is easier to learn for people without a background in computer science, and has very powerful libraries for spatio-temporal data analysis.
For these reasons, the creation of the R-Bridge is particularly welcomed from my side because it will allow me to teach students how to create powerful new Toolboxes for ArcGIS based on scripts created in R. For example, this Autumn semester we will implement in the course of GIS III a module about geo-sensors, and then I will teach spatio-temporal data analysis using R within ArcGIS. This way students will learn the power of R starting from the familiar environment and user interface of ArcGIS.
Since I never worked with R-Bridge before, Today I started doing some testing and I decided that the best way to learn it was to create a simple Toolbox to do K-Means clustering on point shapefiles, which I think is a function not available in ArcGIS. In this post I will describe in details how to create the Toolbox and the R Script to perform the analysis.

R-Bridge Installation
Installing R-Bridge is extremely simple. TYou only need a recent version of R (I have the 3.0.0) installed on your PC (32-bit or 64-bit, consistent with the version of ArcGIS you have installed) and ArcGIS 10.3.1 or more recent.
At this point you can download the installation files from the R-Bridge GitHub page: https://github.com/R-ArcGIS/r-bridge-install
You can unzip its content anywhere on your PC. At this point you need to run ArcGIS as aministrator (this is very important!!), and then in the ArcCatalog navigate to the folder where you unzip the zip you just downloaded.

Now you just need to run the script "Install R Bindings" and ArcGIS will take care of the rest. I found the process extremely easy!!

Getting Started
ESRI created two examples to help us get started with the development of packages for ArcGIS written in the R language. You can find it here: https://github.com/R-ArcGIS/r-sample-tools
When you unzip this you will find a folder named "Scripts" where you can find R scripts optimized for the use in ArcGIS. I started from these to learn how to create scripts that work.

Clustering Example - R Script
As I said, ESRI created a specific library for R to be able to communicate back and forth with ArcGIS: it is called "arcbinding" and it is installed during the installation process we completed before. This library has a series of functions that allow the R script to be run starting from the ArcGIS console and its GUI. For this reason the R script is a bit different compared to the one you would do to reach the same result outside of ArcGIS. Probably it is better if I just start including some code so that you can better understand.
Below is the full R script I used for this example:

### KMeans Clustering Toolbox
##Author: Fabio Veronesi
tool_exec <- function(in_params, out_params)
{
  if (!requireNamespace("sp", quietly = TRUE))
    install.packages("sp")
  require(sp)
 
  print("K-Means Clustering of Shapefiles")
  print("Author: Fabio Veronesi")
 
  source_dataset = in_params[[1]]
  nclust = in_params[[2]]
  variable = in_params[[3]]
 
  out_shape = out_params[[1]]
 
  ### Read Data
  arc.progress_label("Loading Dataset")
  d <- arc.open(source_dataset)
 
 
  ### Create a Data.Frame with the variables to cluster
  data <- arc.select(d, variable)
  data_clust <- data.frame(data[,variable[1]])
 
 
  if(length(variable)>1){
 for(i in 2:length(variable)){
  data_clust <- cbind(data_clust,data[,variable[i]])
 }
  }
 
  names(data_clust) <- variable
 
  for(i in 1:length(variable)){
 dev.new()
 plot(hist(data_clust[,i]),main=paste0("Histogram of ",variable[i]),xlab=variable[i])
  }
 
  clusters <- kmeans(data_clust, nclust)
 
  result <- data.frame(cluster=clusters$cluster)
 
  arc.write(out_shape, result, coords = arc.shape(data))
 
  print("Done!!")
  return(out_params)
}

Created by Pretty R at inside-R.org

As you can see, the whole script is wrapped in a function called tool_exec with two arguments in_params and out_params. These are the list of input and output parameters that will be passed to R from ArcGIS.
The next three lines are taken directly from the script that ESRI provides. Basically, if the user does not have the package sp installed, R will download, install and load it. You can copy and paste these lines if you need other packages installed on the user's machine to perform your analysis. In this case I am only using the function kmeans, available in the package stats, which is loaded by default in R.
At this point I inserted two print calls with the title and my name on them. This has no real purpose except to let you know that you can print information from R directly onto the dialog in ArcGIS with simple print calls. We will see at the end how they look.
Now we need to create an object for each input and output parameter you need. We will need to specify these in ArcGIS once we create the Toolbox. Since I want to cluster a shapefile, the first input parameter will be this object. Then I want the user to select the number of clusters, so I will create another option for this. Then I would like the user to be able to select the variables s/he wants to use for clustering, so I will need to create an option for this in ArcGIS and then collect it into the object variable. Finally, ArcGIS will save another shapefile with the points plus their cluster. This will be the only output parameter, and I collect it into the object out_shape.

Now I can start the real computation. The function arc.open allows to import in R the shapefile from the Toolbox in ArcGIS. If you want you can take a look at the structure of this object by simply inserting print(str(d)) right after it. This will print the structure of the object d in the dialog created in ArcGIS.
Now we have the function arc.select, which allows to extract from d only the variables we need and that are selected by the user on the Toolbox GUI.
At this point we need to create a data.frame that we are going to fill with only the variables the user selected in the Toolbox. The object variable is a list of strings, therefore we can use its elements to extract single columns from the object data, with the syntax data[,variable[1]].
Since we do not know how many variables will users select and we do not want to limit them, I created an if statement with a loop to attach additional columns to the object data_clust. Then I replaced the column names in data_clust with the names of the variables, this will help me in the next phase.
Now in fact, I want to produce histograms of the variables the user selected. This will allow me to check whether what I am about to do makes sense, and it is one of those things for which R excels. For this I can simply call the function plot and R will show it even if it is called from ArcGIS, as simple as that!! We only need to remember to insert dev.new() so that each plot is created separately and the user can see/save them all.
After this step we can call the function kmeans to cluster our data. Then we can collect the results in a new data.frame, and finally use the function arc.write to write the object out_shape with the results. As you can see we also need to specify the coordinates of each point and this can be done calling the function arc.shape.
Then we print the string "Done!!" and return the output parameters, that will be taken from ArcGIS and shown to the user.

Toolbox
Now that we've seen how to create the R script we can take a look at the Toolbox, since both things need to be done in parallel.
Creating a new Toolbox in ArcGIS is very simple, we just need to open ArcCatalog, click where we want to create it with the right mouse button and then select New->Toolbox.

Once this is done we will then need to add, within this Toolbox, a script. To do this we can again click with the right button on the Toolbox we just created and then select Add->Script...

At this point a dialog will appear where we can set the parameters of this script. First we add a title and a label and click proceed (my PC runs with Italian as the local language, sorry!!)

Then we need to select the R script we need to run. Since the creation of the Toolbox can also be done before taking care of the script, here we can select an R script not completed and ArcGIS will not have any problem. This is what I did to create this example, so that I could debug the R script using print calls and looking at the results on the ArcGIS dialog.

The next window is very important, because it allows us to set the input and output parameters that will then be passed to R. As we saw in the R script here I set 4 parameters, 3 inputs and 1 output. It is important that the order matches what we have in the R script, so for example the number of clusters is the second input.

The first parameter is the input data. For this I used the type "Table View", which allows the user to select a dataset s/he already imported in ArcGIS. I selected this because usually I first load data into ArcGIS, check them, and then perform some analysis. However, if you prefer I think you could also select the type shapefile, to allow users to select a shp directly from their PC.
The next parameter is the number of clusters, which is a simple number. Then we have the field variables. This is very important because we need to set it in a way that allow users to select variables directly from the dataset we are importing.

We can do that by setting the options "Filter" and "Obtained from" that you see in the image above. It is important that we set "Obtained from" with the name of our input data.
At this point we can set the output file, which is a shapefile.

One thing we could do is set the symbology for the shapefile that will be created at the end of the script. To do so we need to create and set a layer file. I did it by changing the symbology to another shapefile and then export it. The only problem is that this technique is not really flexible, meaning that if the layer is set for 5 clusters and users select 10, the symbology will still be with 5 colors. I am not sure if that can be changed or adapted somehow. If the symbology file is not provided the R script will still run correctly and produce a result, but this will not have any colors and users will need to set these afterwards, which probably is not a big deal.
Once this final step is done we can finish the creation of the tool and take a look at the resulting GUI:

Run the Toolbox
Now that we have created both the script and the Toolbox to run it, we can test it. I included a shapefile with the location of Earthquakes that I downloaded from the USGS website yesterday (01 July 2016). This way you can test the tool with real data. As variables you can select: depth, magnitude, distance from volcanoes, faults and tectonic plates. For more info on this dataset please look at one of my previous posts: http://r-video-tutorial.blogspot.ch/2015/06/cluster-analysis-on-earthquake-data.html
We only need to fill the values in the GUI and then click OK. You can see the result in the image below:

As you can see R first produces a histogram of the variable/s the user selects, which can be saved. Then creates a shapefile that is automatically imported in ArcGIS. Moreover, as you can see from the dialog box, we can use the function print to provide messages to the user. Here I put only some simple text, but it may well be some numerical results.

Source Code
The source code for this Toolbox is provided in my GitHub at this link:
https://github.com/fveronesi/Clustering_Toolbox