R tutorial for Spatial Statistics: box-plot

Sunday, 25 March 2018

Data Visualization Website with Shiny

My second Shiny app is dedicated to data visualization.
Here users can simply upload any csv or txt file and create several plots:

Histograms (with option for faceting)

Barchart (with error bars, and option for color with dodging and faceting)

BoxPlots (with option for faceting)

Scatterplots (with options for color, size and faceting)

TimeSeries

Error bars in barcharts are computed with the mean_se function in ggplot2, which computes error bars as mean ± standard error. When the color option is set, barcharts are plotted one next to the other for each color (option dodging).
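As a rough illustration of what "mean ± standard error" means, here is a minimal base R sketch. The values are made up, and the calculation is written out by hand instead of calling ggplot2, so it only mirrors the description above.

```r
# Hypothetical measurements behind a single bar of the chart
values <- c(4.1, 5.3, 3.8, 6.0, 5.2)

n  <- length(values)
m  <- mean(values)
se <- sd(values) / sqrt(n)   # standard error of the mean

# Lower end, centre and upper end of the error bar
error_bar <- c(ymin = m - se, y = m, ymax = m + se)
print(error_bar)
```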

For scatterplots, if the option for faceting is provided, each plot will include a linear regression line.

Some examples are below:

For the time being there is no option for saving plots, apart from saving the image from the screen. However, I would like to implement an option to have plots in tiff at 300dpi, but all the code I tried so far did not work. I will keep trying.
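Outside Shiny, base R can write a 300 dpi TIFF with the tiff() graphics device. The sketch below shows only that step; the file name, the plot size and the built-in InsectSprays dataset are my assumptions, and how this would be wired into the app's download logic is not covered here.

```r
# Hypothetical output path in a temporary folder
out_file <- file.path(tempdir(), "boxplot_300dpi.tiff")

# The tiff device is not available on every R build, so check first
if (capabilities("tiff")) {
  # 7 x 5 inches at 300 dpi, with LZW compression to keep the file small
  tiff(out_file, width = 7, height = 5, units = "in", res = 300,
       compression = "lzw")
  boxplot(count ~ spray, data = InsectSprays)
  dev.off()   # closing the device writes the file
}
```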

The app can be accessed here: https://fveronesi.shinyapps.io/DataViz/

The R code is available here: https://github.com/fveronesi/Shiny_DataViz

Monday, 11 July 2016

The Power of ggplot2 in ArcGIS - The Plotting Toolbox

In this post I present my third experiment with R-Bridge. The plotting toolbox is a plug-in for ArcGIS 10.3.x that allows the creation of beautiful and informative plots, with ggplot2, directly from the ESRI ArcGIS console.
As always I not only provide the toolbox but also a dataset to try it out. Let's start from here...

Data
For testing the plotting tool, I downloaded some air pollution data from EPA (US Environmental Protection Agency), which provides open access to its database. I created a custom function to download data from EPA that you can find in this post.
Since I wanted to provide a relatively small dataset, I extracted values from only four states: California, New York, Iowa and Ohio. For each of these, I included time series for Temperature, CO, NO2, SO2 and Barometric Pressure. Finally, the coordinates of the points are the centroids of these four states. The image below depicts the location and the first lines of the attribute table. This dataset is provided both as a shapefile and as a CSV; either can be used with the plotting toolbox.

Toolbox
Now that we have seen the sample dataset, we can take a look at the toolbox. I included 5 packages to help summarize and visualize spatial data directly from the ArcGIS console.

I first included a package for summarizing our variables, this creates a table with some descriptive statistics. Then I included all the major plotting types I presented in my book "Learning R for Data Visualization [VIDEO]" edited by Packt Publishing, with some modifications to the scripts for adapting them to the R-Bridge Technology from ESRI. For more information, practical and theoretical, about each of these plots please refer to the book.

The tool can be downloaded from my GitHub page at:
https://github.com/fveronesi/PlottingToolbox

Summary
This is the first package I would like to describe simply because a data analysis should always start with a look at our variables with some descriptive statistics. As for all the tools presented here its use is simple and straightforward with the GUI presented below:

Here the user has to point to the dataset s/he wants to analyze in "Input Table". This can be a shapefile, either loaded already in ArcGIS or not, but it can also be a table, for example a CSV. That is the reason why I included a CSV version of the EPA dataset as well.
At this point the area in "Variable" will fill up with the column names of the input file; from here the user can select the variable s/he is interested in summarizing. The final step is the selection of the "Output Folder". Important: users need to first create the folder and then select it. This is because this parameter is set as input, so the folder needs to exist. I decided to do it this way because otherwise a new folder would have needed to be created for each new plot. This way all the summaries and plots can go into the same directory.
Let's take a look at the results:

The Summary package presents two tables, with all the variables we wanted to summarize, arranged one above the other. This is the output users will see on screen once the toolbox has completed its run. Then in the output folder, the R script will save this exact table in PDF format, plus a CSV with the raw data.

Histogram
As the name suggests, this tool provides access to the histogram plot in ggplot2. ArcGIS provides a couple of ways to represent data in histograms. The first is to right-click one of the columns in the attribute table; a drop-down menu will appear and from there users can click on "Statistics" to access some descriptive statistics and a histogram. Another way is through the "Geostatistical Analyst", which is an add-on for which users need an additional license. This has an "Explore Data" package from which it is possible to create histograms. The problem with both these methods is that the results are not, in my opinion at least, suitable for any publication. You can maybe share them with your colleagues, but I would not suggest using them for any article. This implies that ArcGIS users need to open another piece of software, maybe Excel, to create the plots they require, and we all know how painful it is in Excel to produce any decent plot, and histograms are probably the most painful to produce.
This changes now!!

By combining the power of R and ggplot2 with ArcGIS we can provide users with an easy-to-use way to produce beautiful visualizations directly within the ArcGIS environment and have them saved in jpeg at 300 dpi. This is what this set of packages does.
Let's now take a look at the interface for histograms:

As for Summary we first need to insert the input dataset, and then select the variable/s we want to plot. If two or more variables are selected, several plots will be created and saved in jpeg.
I also added some optional values to further customize the plots. The first is the faceting variable, which is a categorical variable. If this is selected the plot will have a histogram for each category; for example, in the sample dataset I have the categorical variable "state", with the name of the four states I included. If I select this for faceting the result will be the figure below:

Another option available here is the binwidth. The package ggplot2 usually sets this number as the range of the variable divided by 30. However, this can be customized by the user and this is what you can do with this option. Finally users need to specify an output folder where R will save a jpeg of the plots shown on screen.
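To get a feeling for the binwidth option, here is a small base R sketch. It computes the "range divided by 30" value described above and then bins the same data with a user-chosen width. The data and the custom width of 2 are assumptions for illustration, and hist() stands in for the ggplot2 call used by the toolbox.

```r
set.seed(1)
x <- rnorm(200, mean = 20, sd = 5)   # hypothetical variable

# Bin width following the "range divided by 30" description
default_width <- diff(range(x)) / 30

# A user-defined bin width instead
custom_width <- 2
breaks <- seq(floor(min(x)), ceiling(max(x)) + custom_width, by = custom_width)

# Count the observations falling in each custom bin (no plot drawn)
h <- hist(x, breaks = breaks, plot = FALSE)
print(default_width)
print(h$counts)
```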

Box-Plot
This is another great way to compare variables' distributions and as far as I know it cannot be done directly from ArcGIS. Let's take a look at this package:

This is again very easy to use. Users just need to set the input file, then the variable of interest and the categorical variable for the grouping. Here I am again using the states, so that I compare the distribution of NO2 across the four US states in my dataset.
The result is ordered by median values and is shown below:

As you can see I decided to plot the categories vertically, so as to accommodate long names. This of course can be changed by tweaking the R script. As for each package, this plot is saved in the output folder.
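The ordering by median can be sketched in base R with reorder(). This is not the toolbox script: the data frame is simulated, and I am assuming that "categories plotted vertically" means the state names run along the vertical axis.

```r
set.seed(42)
# Simulated NO2 values for the four states (not the EPA data)
air <- data.frame(
  state = rep(c("California", "New York", "Iowa", "Ohio"), each = 25),
  NO2   = c(rnorm(25, 15, 3), rnorm(25, 20, 4), rnorm(25, 8, 2), rnorm(25, 12, 3))
)

# Reorder the factor levels by the median NO2 of each state
air$state <- reorder(air$state, air$NO2, FUN = median)

# Horizontal boxes put the category names on the vertical axis
boxplot(NO2 ~ state, data = air, horizontal = TRUE, las = 1)
```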

Bar Chart
This tool is typically used to compare values across several categories, usually with one value per category. However, it may happen that the dataset contains multiple measurements for some categories, and I implemented a way to deal with that here. The GUI of this package is presented below:

The inputs are basically the same as for box-plots, the only difference is in the option "Average Values". If this is set, R will average the values of the variable for each unique category in the dataset. The results are again ordered, and are saved in jpeg in the output folder:
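What "Average Values" does can be approximated in base R with aggregate(). The readings below are invented and the column names are placeholders; the toolbox itself may compute the averages differently.

```r
# Invented readings, with several measurements for some states
readings <- data.frame(
  state = c("California", "California", "Iowa", "Iowa", "Iowa", "Ohio"),
  CO    = c(0.40, 0.60, 0.20, 0.30, 0.10, 0.35)
)

# One mean value per unique category
avg <- aggregate(CO ~ state, data = readings, FUN = mean)

# Order the bars by their value before plotting
avg <- avg[order(avg$CO), ]
barplot(avg$CO, names.arg = as.character(avg$state), ylab = "Mean CO")
```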

Scatterplot
This is another great way to visually analyze our data. The package ggplot2 allows the creation of highly customized plots and I tried to implement as much as I could in terms of customization in this toolbox. Let's take a look:

After selecting the input dataset the user can select what to plot. S/he can choose to only plot two variables, one on the X axis and one on the Y axis, or further increase the amount of information presented in the plot by including a variable that changes the color of the points and one for their size. Moreover, there is also the possibility to include a regression line. Color, size and regression line are optional, but I wanted to include them to present the full range of customizations that this package allows.
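For readers who want to reproduce a similar plot outside ArcGIS, the following is a sketch with ggplot2, not the toolbox script itself. The data are simulated, and the mapping of state to colour and SO2 to size is my own choice.

```r
library(ggplot2)

set.seed(7)
# Simulated values; column names only mimic the sample dataset
air <- data.frame(
  Temperature = runif(80, 5, 35),
  state       = rep(c("California", "New York", "Iowa", "Ohio"), each = 20),
  SO2         = runif(80, 1, 10)
)
air$NO2 <- 5 + 0.4 * air$Temperature + rnorm(80, sd = 2)

# X and Y variables, plus optional colour, size and a linear regression line
p <- ggplot(air, aes(x = Temperature, y = NO2)) +
  geom_point(aes(colour = state, size = SO2)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p)
```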

Once again this plot is saved in the output folder.

Time Series
The final type of plot I included here is the time series, which is also the one with the highest number of inputs from the user side. In fact, many spatial datasets include a temporal component, but often this is not standardized. By that I mean that in some cases the time variable has only a date, in some cases it also includes a time, and in other cases the format changes from dataset to dataset. For this reason it is difficult to create an R script that works with most datasets; therefore, for time-series plots users need to do some pre-processing. For example, if date and time are in separate columns, these need to be merged into one for this R script to work.
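The merging step can be done in R before loading the table. The sketch below assumes two character columns named date and time, which are placeholders; the actual column names depend on the dataset.

```r
# Placeholder table with date and time stored separately
obs <- data.frame(
  date = c("2014-01-01", "2014-01-01"),
  time = c("08:00:00", "14:30:00"),
  NO2  = c(12.1, 15.4),
  stringsAsFactors = FALSE
)

# Merge the two columns into a single text column
obs$datetime <- paste(obs$date, obs$time)

# Check that the merged column parses with the matching format string
obs$datetime_parsed <- as.POSIXct(obs$datetime,
                                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(obs)
```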
At this point the TimeSeries package can be started:

The first two columns are self-explanatory. Then users need to select the column with the temporal information, and then manually enter the format of this column.
In this case the format I have in the sample dataset is the following: 2014-01-01
Therefore I have the year with century, a minus sign, the month, another minus sign and the day. I need to use the symbols for each of these to allow R to recognize the temporal format of the file.
Common symbols are:
%Y - Year with century
%y - Year without century
%m - Month
%d - Day
%H - Hour as decimal number (00-23)
%M - Minute as decimal number (00-59)
%S - Second as decimal number

More symbols can be found at this page: http://www.inside-r.org/r-doc/base/strptime
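A quick way to check a format string before typing it into the tool is to try it in R. This small sketch uses the date from the sample dataset; the second format is a deliberately wrong one to show what a mismatch looks like.

```r
# The format of the sample dataset: year with century, month, day
d <- as.Date("2014-01-01", format = "%Y-%m-%d")
print(d)

# The same date written back with other symbols
print(format(d, "%d/%m/%y"))

# A format that does not match the text gives NA instead of a date
bad <- as.Date("2014-01-01", format = "%d/%m/%Y")
print(bad)
```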

The remaining two inputs are optional, but if one is selected the other needs to be provided. By "Subsetting Column" I mean a column with categorical information. For example, in my dataset I can generate a time-series for each US state, therefore my subsetting column is state. In the option "Subset" users need to manually write the category they want to use to subset their data. Here I just want to have the time-series for California, so I write California. You need to be careful here to write exactly the name you see in the attribute table because R is case sensitive; if you write california with a lower case c, R will be unable to produce the plot.
The result, again saved in jpeg automatically, is presented below:
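The effect of case sensitivity is easy to see with a tiny example. The table below is invented; only the column name state follows the sample dataset.

```r
# Invented table with a categorical column
air <- data.frame(
  state = c("California", "Ohio", "California"),
  NO2   = c(14.2, 11.8, 16.0),
  stringsAsFactors = FALSE
)

# Exact spelling: the rows for California are returned
exact <- air[air$state == "California", ]

# Lower case c: no row matches, so there is nothing to plot
wrong <- air[air$state == "california", ]

print(nrow(exact))
print(nrow(wrong))
```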

Monday, 25 April 2016

Learning R for Data Visualization [Video]

Last year Packt asked me to develop a video course to teach various techniques of data visualization in R. Since I love the idea of video courses and tutorials, and I also enjoy plotting data, I readily agreed.
The result is this course, published last March, which I will briefly present below.

The course is available here:
https://www.packtpub.com/big-data-and-business-intelligence/learning-r-data-visualization-video

I wanted to create a course that was easy to follow, and at the same time could provide a good basis even for the most advanced forms of data visualization available today in R.
Packt was interested in presenting ggplot2, which is definitely the most advanced way of creating static plots. Since I regularly use ggplot2 and I find it a tremendous tool, I was glad to be able to present its functionalities in more detail. Three chapters are dedicated to this package. Here I present all the most important types of plots: histograms, box-plots, scatterplots, bar-charts and time-series. Moreover, a whole chapter is dedicated to embellishing the default plots by adding elements, such as text labels and much more.

However, I am also very interested in interactive plotting, which I believe is now rapidly becoming commonplace for lots of applications. For this reason two chapters are completely dedicated to interactive plots. In the first I present the package rCharts, which is extremely powerful but also a bit tricky to use at times. In many cases there is little documentation to work with, and while developing the course I often found myself wandering through stackoverflow searching for answers. Luckily for all of us, Prof. Ramnath Vaidyanathan, the creator of rCharts, is always available to answer all the users' questions quickly and clearly. In chapter 5 the viewer will be able to start from zero and quickly create nice interactive versions of all the plots I covered with ggplot2.

The last chapter is dedicated to Shiny and it is aimed at the creation of a full website for importing and plotting data. Here the reader will first learn the basics of Shiny and then will write the code to create the website and add lots of interesting functionalities.

I hope this video course will help R users become familiar with data visualization.
I would also like to take this opportunity to stress that I am open to support viewers throughout the learning process, meaning that if you have any question about the material in the course you should not hesitate one second in contacting me at info@fabioveronesi.net

Thursday, 6 June 2013

Box-plot with R – Tutorial

Yesterday I wanted to create a box-plot for a small dataset to see the evolution of 3 stations through a 3-day period. I like box-plots very much because I think they are one of the clearest ways of showing trends in your data. R is extremely good for this type of plot and, for this reason, I decided to add a post on my blog to show how to create a box-plot, but also because I want to use my own blog to help me remember pieces of code that I might want to use in the future but that I tend to forget.

For this example I first created a dummy dataset using the function rnorm(), which generates random, normally distributed sequences. This function takes 3 arguments: the number of samples to create, the mean and the standard deviation of the distribution. For example:

rnorm(n=100,mean=3,sd=1)

This generates 100 numbers (floats to be exact), which have mean equal to 3 and standard deviation equal to 1.
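Because rnorm() draws random numbers, every run gives different values. If you want the same numbers each time, you can fix the seed first; the seed value 123 below is arbitrary. The sample mean and standard deviation will be close to, but not exactly, 3 and 1.

```r
# Fix the random seed so the same numbers are drawn on every run
set.seed(123)
x <- rnorm(n = 100, mean = 3, sd = 1)

# Check what was generated
print(length(x))
print(mean(x))
print(sd(x))
```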

To generate my dataset I used the following line of code:

data<-data.frame(Stat11=rnorm(100,mean=3,sd=2),
Stat21=rnorm(100,mean=4,sd=1),
Stat31=rnorm(100,mean=6,sd=0.5),
Stat41=rnorm(100,mean=10,sd=0.5),
Stat12=rnorm(100,mean=4,sd=2),
Stat22=rnorm(100,mean=4.5,sd=2),
Stat32=rnorm(100,mean=7,sd=0.5),
Stat42=rnorm(100,mean=8,sd=3),
Stat13=rnorm(100,mean=6,sd=0.5),
Stat23=rnorm(100,mean=5,sd=3),
Stat33=rnorm(100,mean=8,sd=0.2),
Stat43=rnorm(100,mean=4,sd=4))

This line creates a data.frame with 12 columns that looks like this:

Stat11	Stat21	Stat31	Stat41	Stat12	Stat22	Stat32	Stat42	Stat13	Stat23	Stat33	Stat43
5	2	9	-3	10	4	1	1	4	1	5	9
6	13	8	3	7	3	10	10	10	5	9	8
4	4	6	0	10	6	7	6	6	8	2	7
6	7	6	3	9	1	7	0	1	0	6	0
0	2	8	1	6	8	0	8	3	10	9	8
0	19	10	0	11	10	5	6	5	8	10	1
7	4	5	-5	7	0	3	5	2	5	5	3
4	12	9	-4	7	1	9	0	7	2	1	7
7	3	9	0	11	0	8	1	7	0	7	7
6	19	8	3	10	10	9	6	0	2	8	2
6	13	6	-5	12	8	1	4	0	4	5	10
8	11	6	-1	11	4	4	1	4	6	6	10
8	13	5	-5	7	10	0	4	2	7	3	1
2	8	5	-2	5	7	4	2	7	0	3	1
8	11	7	3	11	1	0	9	2	3	5	8
4	19	5	-1	11	6	3	4	9	5	9	0
2	9	5	-3	12	7	6	4	8	2	6	8
7	10	5	-4	8	9	6	9	1	4	3	4
…	…	…	…	…	…	…	…	…	…	…	…

As I mentioned before, this should represent 4 stations for which the measurements were replicated on 3 successive days.

Now, for the creation of the box-plot the simplest function is boxplot(), which can be called with the name of the dataset as its only argument:

boxplot(data)

This creates the following plot:

It is already a good plot, but it needs some adjustments. It is in black and white; the box-plots are evenly spaced, even though they come from 3 different replicates; there are no labels on the axes; and not all the station names are reported.

So now we need to start doing some tweaking.

First, I want to draw the names of the stations vertically, instead of horizontally. This can be easily done with the argument las. So now the call to the function boxplot() becomes:

boxplot(data, las = 2)

This generates the following plot:

Next, I want to change the names of the stations so that they look less confusing. To do that I can use the option names:

boxplot(data, las = 2, names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))

which generates this plot:

If the names are too long and they do not fit into the plot’s window you can enlarge the bottom margin by calling the function par() before boxplot():

par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2, names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))

Now I want to group the 4 stations so that the division in 3 successive days is clearer. To do that I can use the option at, which lets me specify the position, along the X axis, of each box-plot:

par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2, at = c(1,2,3,4, 6,7,8,9, 11,12,13,14), names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))

Here I am specifying that I want the first 4 box-plots at position x=1, x=2, x=3 and x=4, then I want to leave a space between the fourth and the fifth and place this last at x=6, and so on.
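The positions do not have to be typed by hand. As a small sketch, they can be computed from the number of stations, the number of days and the size of the gap; the variable names here are my own.

```r
n_stations <- 4   # boxes per group
n_days     <- 3   # number of groups
gap        <- 1   # empty positions between two groups

# Position inside each group plus the offset of the group
at <- rep(seq_len(n_stations), times = n_days) +
  rep((seq_len(n_days) - 1) * (n_stations + gap), each = n_stations)

print(at)
```

The resulting vector can then be passed to boxplot() through the option at.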

If you want to add colours to your box plot, you can use the option col and specify a vector with the colour numbers or the colour names. The available colour names are listed by the function colors().

Here is an example:

par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2, col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"),
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14),
names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
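Since the same four colours and four names repeat for each day, the long vectors can also be built with rep() and paste(). This is just a shorter way of writing the same call; the dataset below is a random stand-in with 12 columns so that the snippet runs on its own.

```r
set.seed(1)
# Stand-in dataset: 12 columns of random numbers
data <- as.data.frame(matrix(rnorm(12 * 100, mean = 5, sd = 2), ncol = 12))

station_cols <- c("red", "sienna", "palevioletred1", "royalblue2")
cols   <- rep(station_cols, times = 3)            # one colour per box
labels <- rep(paste("Station", 1:4), times = 3)   # one name per box

# Enlarge the bottom margin, draw the plot, then restore the old settings
old_par <- par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2, col = cols,
        at = c(1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14),
        names = labels)
par(old_par)
```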

Now, for the finishing touches, we can add some labels to the plot.

The common way to put labels on the axes of a plot is by using the arguments xlab and ylab.

Let’s try it:

par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, ylab = "Oxygen (%)", xlab = "Time", las = 2, col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"), at = c(1,2,3,4, 6,7,8,9, 11,12,13,14), names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))

I just added the two arguments ylab and xlab, but the result is not what I was expecting.

As you can see from the image above, the label on the Y axis is placed very well and we can keep it. On the other hand, the label on the X axis is drawn right below the station names and it does not look good.

To solve this it is better to delete the option xlab from the boxplot call and instead use an additional function called mtext(), which places text outside the plot area, but within the plot window. To place text within the plot area (where the box-plots are actually depicted) you need to use the function text().

The function mtext() requires 3 arguments: the label, the position and the line number.

An example of a call to the function mtext is the following:

mtext("Label", side = 1, line = 7)

The option side takes an integer between 1 and 4, with these meanings: 1=bottom, 2=left, 3=top, 4=right

The option line takes an integer with the line number, starting from 0 (which is the line closest to the plot axis). In this case I put the label on the 7th line from the X axis.

With these options you can produce box plots for every situation.

The following is just one example:

This is the script:

data<-data.frame(Stat11=rnorm(100,mean=3,sd=2),
Stat21=rnorm(100,mean=4,sd=1),
Stat31=rnorm(100,mean=6,sd=0.5),
Stat41=rnorm(100,mean=10,sd=0.5),
Stat12=rnorm(100,mean=4,sd=2),
Stat22=rnorm(100,mean=4.5,sd=2),
Stat32=rnorm(100,mean=7,sd=0.5),
Stat42=rnorm(100,mean=8,sd=3),
Stat13=rnorm(100,mean=6,sd=0.5),
Stat23=rnorm(100,mean=5,sd=3),
Stat33=rnorm(100,mean=8,sd=0.2),
Stat43=rnorm(100,mean=4,sd=4))




par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2,
col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"),
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14),
names = c("","","","","","","","","","","",""),
ylim=c(-6,18))

#Station labels
mtext("Station1", side=1, line=1, at=1, las=2, font=1, col="red")
mtext("Station2", side=1, line=1, at=2, las=2, font=2, col="sienna")
mtext("Station3", side=1, line=1, at=3, las=2, font=3, col="palevioletred1")
mtext("Station4", side=1, line=1, at=4, las=2, font=4, col="royalblue2")
mtext("Station1", side=1, line=1, at=6, las=2, font=1, col="red")
mtext("Station2", side=1, line=1, at=7, las=2, font=2, col="sienna")
mtext("Station3", side=1, line=1, at=8, las=2, font=3, col="palevioletred1")
mtext("Station4", side=1, line=1, at=9, las=2, font=4, col="royalblue2")
mtext("Station1", side=1, line=1, at=11, las=2, font=1, col="red")
mtext("Station2", side=1, line=1, at=12, las=2, font=2, col="sienna")
mtext("Station3", side=1, line=1, at=13, las=2, font=3, col="palevioletred1")
mtext("Station4", side=1, line=1, at=14, las=2, font=4, col="royalblue2")

#Axis labels
mtext("Time", side = 1, line = 6, cex = 2, font = 3)
mtext("Oxygen (%)", side = 2, line = 3, cex = 2, font = 3)

#In-plot labels
text(1,-4,"*")
text(6,-4,"*")
text(11,-4,"*")

text(2,9,"A",cex=0.8,font=3)
text(7,11,"A",cex=0.8,font=3)
text(12,15,"A",cex=0.8,font=3)
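The twelve mtext() calls for the station labels can also be collapsed into a single call, because mtext() accepts vectors for the text, the positions, the colours and the fonts. The following is a compact sketch of that idea, with a random stand-in dataset so that it runs on its own; it is an alternative way of writing the labels, not a change to the script above.

```r
set.seed(1)
# Stand-in dataset: 12 columns of random numbers
data <- as.data.frame(matrix(rnorm(12 * 100, mean = 5, sd = 2), ncol = 12))

at           <- c(1:4, 6:9, 11:14)
station_cols <- c("red", "sienna", "palevioletred1", "royalblue2")
labels       <- rep(paste0("Station", 1:4), times = 3)

old_par <- par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2, col = rep(station_cols, times = 3),
        at = at, names = rep("", 12))

# One call draws all twelve labels; font and colour cycle with the stations
mtext(labels, side = 1, line = 1, at = at, las = 2,
      font = rep(1:4, times = 3), col = rep(station_cols, times = 3))
par(old_par)
```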