Saturday, 2 July 2016

Combining ArcGIS and R - Clustering Toolbox

Last year at the ESRI User Conference in San Diego, there was an announcement of an initiative to bridge ArcGIS and R. This became reality I think early this year with R-Bridge.
Basically, ESRI has created an R library that is able to communicate and exchange data between ArcGIS and R, so that we can create ArcGIS toolboxes using R scripts.

I am particularly interested in this application because R has become quite powerful for spatial data analysis in the last few years. However, I have the impression that within the geography community, R is still considered a bit of an outsider. This is because the main GIS application, i.e. ArcGIS, is based on Python and therefore courses in departments of geography and geomatics tend to focus on teaching Python, neglecting R. This I think is a mistake, since R in my opinion is easier to learn for people without a background in computer science, and has very powerful libraries for spatio-temporal data analysis.
For these reasons, the creation of the R-Bridge is particularly welcomed from my side because it will allow me to teach students how to create powerful new Toolboxes for ArcGIS based on scripts created in R. For example, this Autumn semester we will implement in the course of GIS III a module about geo-sensors, and then I will teach spatio-temporal data analysis using R within ArcGIS. This way students will learn the power of R starting from the familiar environment and user interface of ArcGIS.
Since I never worked with R-Bridge before, Today I started doing some testing and I decided that the best way to learn it was to create a simple Toolbox to do K-Means clustering on point shapefiles, which I think is a function not available in ArcGIS. In this post I will describe in details how to create the Toolbox and the R Script to perform the analysis.

R-Bridge Installation
Installing R-Bridge is extremely simple. TYou only need a recent  version of R (I have the 3.0.0) installed on your PC (32-bit or 64-bit, consistent with the version of ArcGIS you have installed) and ArcGIS 10.3.1 or more recent.
At this point you can download the installation files from the R-Bridge GitHub page: https://github.com/R-ArcGIS/r-bridge-install
You can unzip its content anywhere on your PC. At this point you need to run ArcGIS as aministrator (this is very important!!), and then in the ArcCatalog navigate to the folder where you unzip the zip you just downloaded.


Now you just need to run the script "Install R Bindings" and ArcGIS will take care of the rest. I found the process extremely easy!!

Getting Started
ESRI created two examples to help us get started with the development of packages for ArcGIS written in the R language. You can find it here: https://github.com/R-ArcGIS/r-sample-tools
When you unzip this you will find a folder named "Scripts" where you can find R scripts optimized for the use in ArcGIS. I started from these to learn how to create scripts that work.


Clustering Example - R Script
As I said, ESRI created a specific library for R to be able to communicate back and forth with ArcGIS: it is called "arcbinding" and it is installed during the installation process we completed before. This library has a series of functions that allow the R script to be run starting from the ArcGIS console and its GUI. For this reason the R script is a bit different compared to the one you would do to reach the same result outside of ArcGIS. Probably it is better if I just start including some code so that you can better understand.
Below is the full R script I used for this example:

### KMeans Clustering Toolbox
##Author: Fabio Veronesi
tool_exec <- function(in_params, out_params)
{
  if (!requireNamespace("sp", quietly = TRUE))
    install.packages("sp")
  require(sp)
 
  print("K-Means Clustering of Shapefiles")
  print("Author: Fabio Veronesi")
 
  source_dataset = in_params[[1]]
  nclust = in_params[[2]]
  variable = in_params[[3]]
 
  out_shape = out_params[[1]]
 
  ### Read Data
  arc.progress_label("Loading Dataset")
  d <- arc.open(source_dataset)
 
 
  ### Create a Data.Frame with the variables to cluster
  data <- arc.select(d, variable)
  data_clust <- data.frame(data[,variable[1]])
 
 
  if(length(variable)>1){
 for(i in 2:length(variable)){
  data_clust <- cbind(data_clust,data[,variable[i]])
 }
  }
 
  names(data_clust) <- variable
 
  for(i in 1:length(variable)){
 dev.new()
 plot(hist(data_clust[,i]),main=paste0("Histogram of ",variable[i]),xlab=variable[i])
  }
 
  clusters <- kmeans(data_clust, nclust)
 
  result <- data.frame(cluster=clusters$cluster)
 
  arc.write(out_shape, result, coords = arc.shape(data))
 
  print("Done!!")
  return(out_params)
}
Created by Pretty R at inside-R.org

As you can see, the whole script is wrapped in a function called tool_exec with two arguments in_params and out_params. These are the list of input and output parameters that will be passed to R from ArcGIS.
The next three lines are taken directly from the script that ESRI provides. Basically, if the user does not have the package sp installed, R will download, install and load it. You can copy and paste these lines if you need other packages installed on the user's machine to perform your analysis. In this case I am only using the function kmeans, available in the package stats, which is loaded by default in R.
At this point I inserted two print calls with the title and my name on them. This has no real purpose except to let you know that you can print information from R directly onto the dialog in ArcGIS with simple print calls. We will see at the end how they look.
Now we need to create an object for each input and output parameter you need. We will need to specify these in ArcGIS once we create the Toolbox. Since I want to cluster a shapefile, the first input parameter will be this object. Then I want the user to select the number of clusters, so I will create another option for this. Then I would like the user to be able to select the variables s/he wants to use for clustering, so I will need to create an option for this in ArcGIS and then collect it into the object variable. Finally, ArcGIS will save another shapefile with the points plus their cluster. This will be the only output parameter, and I collect it into the object out_shape.

Now I can start the real computation. The function arc.open allows to import in R the shapefile from the Toolbox in ArcGIS. If you want you can take a look at the structure of this object by simply inserting print(str(d)) right after it. This will print the structure of the object d in the dialog created in ArcGIS.
Now we have the function arc.select, which allows to extract from d only the variables we need and that are selected by the user on the Toolbox GUI.
At this point we need to create a data.frame that we are going to fill with only the variables the user selected in the Toolbox. The object variable is a list of strings, therefore we can use its elements to extract single columns from the object data, with the syntax data[,variable[1]].
Since we do not know how many variables will users select and we do not want to limit them, I created an if statement with a loop to attach additional columns to the object data_clust. Then I replaced the column names in data_clust with the names of the variables, this will help me in the next phase.
Now in fact, I want to produce histograms of the variables the user selected. This will allow me to check whether what I am about to do makes sense, and it is one of those things for which R excels. For this I can simply call the function plot and R will show it even if it is called from ArcGIS, as simple as that!! We only need to remember to insert dev.new() so that each plot is created separately and the user can see/save them all.
After this step we can call the function kmeans to cluster our data. Then we can collect the results in a new data.frame, and finally use the function arc.write to write the object out_shape with the results. As you can see we also need to specify the coordinates of each point and this can be done calling the function arc.shape.
Then we print the string "Done!!" and return the output parameters, that will be taken from ArcGIS and shown to the user.


Toolbox
Now that we've seen how to create the R script we can take a look at the Toolbox, since both things need to be done in parallel.
Creating a new Toolbox in ArcGIS is very simple, we just need to open ArcCatalog, click where we want to create it with the right mouse button and then select New->Toolbox.


Once this is done we will then need to add, within this Toolbox, a script. To do this we can again click with the right button on the Toolbox we just created and then select Add->Script...


At this point a dialog will appear where we can set the parameters of this script. First we add a title and a label and click proceed (my PC runs with Italian as the local language, sorry!!)


Then we need to select the R script we need to run. Since the creation of the Toolbox can also be done before taking care of the script, here we can select an R script not completed and ArcGIS will not have any problem. This is what I did to create this example, so that I could debug the R script using print calls and looking at the results on the ArcGIS dialog.


The next window is very important, because it allows us to set the input and output parameters that will then be passed to R. As we saw in the R script here I set 4 parameters, 3 inputs and 1 output. It is important that the order matches what we have in the R script, so for example the number of clusters is the second input.


The first parameter is the input data. For this I used the type "Table View", which allows the user to select a dataset s/he already imported in ArcGIS. I selected this because usually I first load data into ArcGIS, check them, and then perform some analysis. However, if you prefer I think you could also select the type shapefile, to allow users to select a shp directly from their PC.
The next parameter is the number of clusters, which is a simple number. Then we have the field variables. This is very important because we need to set it in a way that allow users to select variables directly from the dataset we are importing.


We can do that by setting the options "Filter" and "Obtained from" that you see in the image above. It is important that we set "Obtained from" with the name of our input data.
At this point we can set the output file, which is a shapefile.



One thing we could do is set the symbology for the shapefile that will be created at the end of the script. To do so we need to create and set a layer file. I did it by changing the symbology to another shapefile and then export it. The only problem is that this technique is not really flexible, meaning that if the layer is set for 5 clusters and users select 10, the symbology will still be with 5 colors. I am not sure if that can be changed or adapted somehow. If the symbology file is not provided the R script will still run correctly and produce a result, but this will not have any colors and users will need to set these afterwards, which probably is not a big deal.
Once this final step is done we can finish the creation of the tool and take a look at the resulting GUI:



Run the Toolbox
Now that we have created both the script and the Toolbox to run it, we can test it. I included a shapefile with the location of Earthquakes that I downloaded from the USGS website yesterday (01 July 2016). This way you can test the tool with real data. As variables you can select: depth, magnitude, distance from volcanoes, faults and tectonic plates. For more info on this dataset please look at one of my previous posts: http://r-video-tutorial.blogspot.ch/2015/06/cluster-analysis-on-earthquake-data.html
We only need to fill the values in the GUI and then click OK. You can see the result in the image below:


As you can see R first produces a histogram of the variable/s the user selects, which can be saved. Then creates a shapefile that is automatically imported in ArcGIS. Moreover, as you can see from the dialog box, we can use the function print to provide messages to the user. Here I put only some simple text, but it may well be some numerical results.

Source Code
The source code for this Toolbox is provided in my GitHub at this link:
https://github.com/fveronesi/Clustering_Toolbox


8 comments:

  1. Any easier solution is to do away with proprietary software altogether and use R as an integrated GIS. With new packages such as mapview for viewing data on a web-map this is easier than ever: https://cran.r-project.org/web/packages/mapview/index.html

    ReplyDelete
    Replies
    1. Thank you for your comments and the links you provide.

      Delete
  2. And this tutorial shows the beginnings of how to use R as an integrated GIS: https://github.com/Robinlovelace/Creating-maps-in-R - hope this is useful.

    ReplyDelete
  3. Have you found a way while using this bridge to apply a symbolology to the outputted shapefile? For instance, have polylines showing up in ArcPro automatically showing graduated colours, instead of a single symbol

    ReplyDelete
    Replies
    1. Not really.
      If you look at the sample tools ESRI provides on the R-Bridge website you will see that they are able to present their results with a symbology simply because they included some .lyr files that provide a symbology for the shapefiles saved after the R script.

      Delete
  4. This is an awesome post.Really very informative and creative contents. These concept is a good way to enhance the knowledge.I like it and help me to development very well.Thank you for this brief explanation and very nice information.Well, got a good knowledge.

    Data Science Online Training|
    R Programming Online Training|
    Hadoop Online Training

    ReplyDelete
  5. Really nice blog post.provided a helpful information.I hope that you will post more updates like this Data Science online Training Bangalore

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete

Note: only a member of this blog may post a comment.