Yesterday I wanted to create a box-plot for a small dataset
to see how 4 stations evolved over a 3-day period. I like box-plots
very much because I think they are one of the clearest ways of showing trends in
your data. R is extremely good at this type of plot, so I decided to write a
post on my blog showing how to create a box-plot, partly because I want to use
the blog to help me remember pieces of code that I might need in the future
but tend to forget.
For this example I first created a
dummy dataset using the function rnorm(), which generates random,
normally distributed values. This function takes 3 arguments: the number of
samples to create, the mean and the standard deviation of the distribution (the
last two default to 0 and 1). For example:
rnorm(n=100,mean=3,sd=1)
This generates 100 numbers (floats to be exact), drawn from a normal
distribution with mean equal to 3 and standard deviation equal to 1.
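Because rnorm() draws random values, every run gives slightly different numbers. The sketch below is not part of my original script; it just shows how set.seed() makes the draw repeatable and how to check the sample against the requested mean and standard deviation. The seed value 42 is an arbitrary choice.

```r
# Fix the random seed so the same "random" numbers come back on every run
# (42 is an arbitrary value chosen for this illustration)
set.seed(42)
x <- rnorm(n = 100, mean = 3, sd = 1)

length(x)  # number of samples drawn
mean(x)    # close to 3, but not exactly 3
sd(x)      # close to 1, but not exactly 1
```

The sample mean and standard deviation only approximate the values passed to rnorm(); the larger n is, the closer they tend to get.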
To generate my dataset I used the following line of code:
data<-data.frame(Stat11=rnorm(100,mean=3,sd=2),
Stat21=rnorm(100,mean=4,sd=1),
Stat31=rnorm(100,mean=6,sd=0.5),
Stat41=rnorm(100,mean=10,sd=0.5),
Stat12=rnorm(100,mean=4,sd=2),
Stat22=rnorm(100,mean=4.5,sd=2),
Stat32=rnorm(100,mean=7,sd=0.5),
Stat42=rnorm(100,mean=8,sd=3),
Stat13=rnorm(100,mean=6,sd=0.5),
Stat23=rnorm(100,mean=5,sd=3),
Stat33=rnorm(100,mean=8,sd=0.2),
Stat43=rnorm(100,mean=4,sd=4))
This line creates a data.frame with 12 columns that looks
like this:
Stat11 | Stat21 | Stat31 | Stat41 | Stat12 | Stat22 | Stat32 | Stat42 | Stat13 | Stat23 | Stat33 | Stat43
5      | 2      | 9      | -3     | 10     | 4      | 1      | 1      | 4      | 1      | 5      | 9
6      | 13     | 8      | 3      | 7      | 3      | 10     | 10     | 10     | 5      | 9      | 8
4      | 4      | 6      | 0      | 10     | 6      | 7      | 6      | 6      | 8      | 2      | 7
6      | 7      | 6      | 3      | 9      | 1      | 7      | 0      | 1      | 0      | 6      | 0
0      | 2      | 8      | 1      | 6      | 8      | 0      | 8      | 3      | 10     | 9      | 8
0      | 19     | 10     | 0      | 11     | 10     | 5      | 6      | 5      | 8      | 10     | 1
7      | 4      | 5      | -5     | 7      | 0      | 3      | 5      | 2      | 5      | 5      | 3
4      | 12     | 9      | -4     | 7      | 1      | 9      | 0      | 7      | 2      | 1      | 7
7      | 3      | 9      | 0      | 11     | 0      | 8      | 1      | 7      | 0      | 7      | 7
6      | 19     | 8      | 3      | 10     | 10     | 9      | 6      | 0      | 2      | 8      | 2
6      | 13     | 6      | -5     | 12     | 8      | 1      | 4      | 0      | 4      | 5      | 10
8      | 11     | 6      | -1     | 11     | 4      | 4      | 1      | 4      | 6      | 6      | 10
8      | 13     | 5      | -5     | 7      | 10     | 0      | 4      | 2      | 7      | 3      | 1
2      | 8      | 5      | -2     | 5      | 7      | 4      | 2      | 7      | 0      | 3      | 1
8      | 11     | 7      | 3      | 11     | 1      | 0      | 9      | 2      | 3      | 5      | 8
4      | 19     | 5      | -1     | 11     | 6      | 3      | 4      | 9      | 5      | 9      | 0
2      | 9      | 5      | -3     | 12     | 7      | 6      | 4      | 8      | 2      | 6      | 8
7      | 10     | 5      | -4     | 8      | 9      | 6      | 9      | 1      | 4      | 3      | 4
…      | …      | …      | …      | …      | …      | …      | …      | …      | …      | …      | …
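As I mentioned before, this should represent 4 stations for
which the measurements were repeated on 3 successive days. In the column names
the first digit is the station and the second is the day, so Stat21 is station 2
on day 1.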
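The column names appear to follow a station-then-day pattern (Stat11, Stat21, ... Stat43), so the same kind of data frame can also be built from two vectors of means and standard deviations. This is only a sketch of an alternative, not the code I actually used; the names means, sds, cols and data2 are made up for the illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# One mean and one standard deviation per column, in the same order as above
means <- c(3, 4, 6, 10, 4, 4.5, 7, 8, 6, 5, 8, 4)
sds   <- c(2, 1, 0.5, 0.5, 2, 2, 0.5, 3, 0.5, 3, 0.2, 4)

# Draw 100 values for each (mean, sd) pair; keep the result as a list
cols <- mapply(function(m, s) rnorm(100, mean = m, sd = s),
               means, sds, SIMPLIFY = FALSE)

# Column names: station number (1 to 4) followed by day number (1 to 3)
names(cols) <- paste0("Stat", rep(1:4, times = 3), rep(1:3, each = 4))

data2 <- as.data.frame(cols)
dim(data2)    # rows and columns
names(data2)  # Stat11, Stat21, Stat31, Stat41, Stat12, ...
```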
The simplest function for creating a box-plot
is boxplot(), which can be called with the name of the dataset as its only
argument:
boxplot(data)
This creates the following plot:
It is already a good plot, but it needs some adjustments. It is in black and white, the
box-plots are evenly spaced even though they come from 3 different days,
there are no labels on the axes, and not all of the station names are
shown.
So now we need to start doing some tweaking.
First, I want to draw the names of the stations vertically,
instead of horizontally. This can be easily done with the argument las: setting
las = 2 draws the axis labels perpendicular to the axes. So now the call to the
function boxplot() becomes:
boxplot(data, las = 2)
This generates the following plot:
Next, I want to change the names of the stations so that they
look less confusing. To do that I can use the option names:
boxplot(data, las = 2, names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
which generates this plot:
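Typing the same four names three times is error-prone. As a small sketch (the variable name station_names is my own choice for the illustration), rep() and paste() can build the same vector:

```r
# Build "Station 1" ... "Station 4" and repeat the block once per day
station_names <- rep(paste("Station", 1:4), times = 3)
station_names

# It can then be passed to boxplot() in place of the long c(...) vector:
# boxplot(data, las = 2, names = station_names)
```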
If the names are too long to fit into the plot
window, you can enlarge the bottom margin with the function par(), called
before boxplot(). Its argument mar sets the margins, in lines of text, in the
order bottom, left, top, right:
par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2, names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
Now I want to group the 4 stations so that the division into 3
successive days is clearer. To do that I can use the option at, which lets me
specify the position, along the X axis, of each box-plot:
par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2,
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14),
names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
Here I am specifying that I want the first 4 box-plots at
positions x=1, x=2, x=3 and x=4, then I want to leave a gap between the fourth
and the fifth by placing the fifth at x=6, and so on.
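The positions can also be computed instead of typed by hand, which helps if the number of stations or days changes. The following is only a sketch; n_stations, n_days, gap and positions are names invented for this illustration.

```r
n_stations <- 4  # box-plots per group
n_days     <- 3  # number of groups
gap        <- 1  # empty slots left between two groups

# Offset of each group along the X axis: 0, 5, 10 (repeated once per station)
day_offset <- rep((0:(n_days - 1)) * (n_stations + gap), each = n_stations)

# Position of each box-plot: station index plus the offset of its day
positions <- rep(1:n_stations, times = n_days) + day_offset
positions

# Then: boxplot(data, las = 2, at = positions)
```

If the last boxes end up cut off at the right edge of the plot, one thing to try is passing an explicit xlim to boxplot(), for example xlim = c(0, max(positions) + 1).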
If you want to add colours to your box plot, you can use the
option col and specify a vector with the colour numbers or the colour names.
You can find the colour numbers here, and the colour names here.
Here is an example:
par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2,
col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"),
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14),
names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
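As with the names, the colour vector repeats the same four entries once per day, so it can be built with rep(). The built-in function colors() returns the colour names R knows about, which gives a quick way to check for typos. This is a sketch; station_cols and box_cols are names I made up for it.

```r
# One colour per station
station_cols <- c("red", "sienna", "palevioletred1", "royalblue2")

# Check that every name is a colour name known to R
all(station_cols %in% colors())

# Repeat the four colours once per day (3 days)
box_cols <- rep(station_cols, times = 3)
box_cols

# Then: boxplot(data, las = 2, col = box_cols)
```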
Now, for the finishing touches, we can add some labels to
the plot.
The common way to put labels on the axes of a plot is by
using the arguments xlab and ylab.
Let’s try it:
par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, ylab = "Oxygen (%)", xlab = "Time", las = 2,
col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"),
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14),
names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
As you can see from the image above, the label on the Y axis
is placed well and we can keep it. On the other hand, the label on the X
axis is drawn right below the station names and it does not look good.
To solve this, it is better to delete the option xlab from the
boxplot call and instead use an additional function called mtext(), which places
text outside the plot area, but within the plot window. To place text within
the plot area (where the box-plots are actually drawn) you need to use the
function text().
The function mtext() requires 3 arguments: the label, the
position and the line number.
An example of a call to the function mtext is the following:
mtext("Label", side = 1, line = 7)
The option side takes an integer between 1 and 4, with these
meanings: 1=bottom, 2=left, 3=top, 4=right.
The option line takes the line number, counting outwards
from 0 (which is the line closest to the plot axis). In this case I put
the label on line 7 below the X axis.
With these options you can produce a box-plot for almost every situation.
The following is just one example:
This is the script:
data<-data.frame(Stat11=rnorm(100,mean=3,sd=2),
Stat21=rnorm(100,mean=4,sd=1),
Stat31=rnorm(100,mean=6,sd=0.5),
Stat41=rnorm(100,mean=10,sd=0.5),
Stat12=rnorm(100,mean=4,sd=2),
Stat22=rnorm(100,mean=4.5,sd=2),
Stat32=rnorm(100,mean=7,sd=0.5),
Stat42=rnorm(100,mean=8,sd=3),
Stat13=rnorm(100,mean=6,sd=0.5),
Stat23=rnorm(100,mean=5,sd=3),
Stat33=rnorm(100,mean=8,sd=0.2),
Stat43=rnorm(100,mean=4,sd=4))
par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, las = 2,
col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"),
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14),
names = c("","","","","","","","","","","",""),
ylim=c(-6,18))
#Station labels
mtext("Station1", side=1, line=1, at=1, las=2, font=1, col="red")
mtext("Station2", side=1, line=1, at=2, las=2, font=2, col="sienna")
mtext("Station3", side=1, line=1, at=3, las=2, font=3, col="palevioletred1")
mtext("Station4", side=1, line=1, at=4, las=2, font=4, col="royalblue2")
mtext("Station1", side=1, line=1, at=6, las=2, font=1, col="red")
mtext("Station2", side=1, line=1, at=7, las=2, font=2, col="sienna")
mtext("Station3", side=1, line=1, at=8, las=2, font=3, col="palevioletred1")
mtext("Station4", side=1, line=1, at=9, las=2, font=4, col="royalblue2")
mtext("Station1", side=1, line=1, at=11, las=2, font=1, col="red")
mtext("Station2", side=1, line=1, at=12, las=2, font=2, col="sienna")
mtext("Station3", side=1, line=1, at=13, las=2, font=3, col="palevioletred1")
mtext("Station4", side=1, line=1, at=14, las=2, font=4, col="royalblue2")
#Axis labels
mtext("Time", side = 1, line = 6, cex = 2, font = 3)
mtext("Oxygen (%)", side = 2, line = 3, cex = 2, font = 3)
#In-plot labels
text(1,-4,"*")
text(6,-4,"*")
text(11,-4,"*")
text(2,9,"A",cex=0.8,font=3)
text(7,11,"A",cex=0.8,font=3)
text(12,15,"A",cex=0.8,font=3)
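The twelve mtext() calls for the station labels can be collapsed into one, because mtext() accepts vectors for the text, at, col and font arguments and recycles them. The sketch below uses a small stand-in dataset called demo (not the data frame from the script) so that it runs on its own; positions, station_cols and station_labels are also names made up for the illustration.

```r
set.seed(1)  # arbitrary seed for the stand-in data

# Stand-in dataset: 12 columns of 20 random values each
demo <- as.data.frame(matrix(rnorm(12 * 20, mean = 5), ncol = 12))

positions      <- c(1:4, 6:9, 11:14)
station_cols   <- rep(c("red", "sienna", "palevioletred1", "royalblue2"), times = 3)
station_labels <- rep(paste0("Station", 1:4), times = 3)

# Enlarge the bottom margin and remember the previous settings
old_par <- par(mar = c(12, 5, 4, 2) + 0.1)

boxplot(demo, las = 2, col = station_cols, at = positions,
        names = rep("", 12))

# One vectorised call replaces the twelve separate mtext() calls
mtext(station_labels, side = 1, line = 1, at = positions, las = 2,
      font = rep(1:4, times = 3), col = station_cols)

# Restore the previous margins
par(old_par)
```

Saving the value returned by par() and restoring it afterwards keeps the larger margin from leaking into later plots.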
how do I change the y-axis? in the sense, I can see somehow even numbers between -5 and 0, 0 and 5, ... etc?
Hi Francesca,
you can do that in 2 steps: 1. insert the option yaxt="n" in the boxplot call so that it does not plot the y axis.
Ex.
boxplot(data,yaxt="n",las = 2, ..... )
2. use the function axis() to draw the y axis with the values you need.
Ex.
axis(side=2,at=seq(-5,20,1),las=2)
You can find more info here: http://www.statmethods.net/advgraphs/axes.html
I hope this helps,
Fabio
You are great! thank you so much!
That was very helpful. Would you be able to write a column on setting the internal area of plots?
The reason I am asking is I am pretty new to R and I don't think I would do a very good job of writing it up.
I finally got some sort of solution
plot.new()
ymax<-max(y)+.1
plot.window(xlim=c(0,10),ylim=c(0,ymax),xaxs="r",yaxs="r")
box()
The problem was that my plots were crowded right up to the edges of the box so i couldn't label points above the greatest point or to the right of the last most point. The data ranges in this example were x=1:9 and y=(.1, .4) - y being a probability.
plot(x, y, type='o', bty='o', pty='s', las=1)
points(x, y, cex = 1, col = "dark red", pch = 20)
and set pty='s' but I don't think that makes that much difference.
I have spent a lot of time googling and experimenting to solve this. Part of the problem is I didn't understand the terminology so I spent a lot of time adjusting margins with mar and mai.
Also I didn't really come to understand plot.window properly or that mar, mai and a lot of these others should be saved and set prior to window new and restored after.
many thanks for your column.
mine plot gets chopped off on the right when i add the at = c(1,2,3,4, 6,7,8,9, 11,12,13,14) line
hello,
I appreciated your article. I was wondering if you could help me with a problem I'm currently dealing with on R studio. I need to create boxplots for 2 different sampling years.
here is my question: how can I shift the arrangement of my two boxplots? instead of being the boxplot name "12" on the left part of my graph and "98" on the right part of the graph, i need to reverse this order: "12" should be on the right, while "98" on the left.
many many thanks for your help
I think you can solve your issue simply by reversing the order in the at option.
Instead of using: at=c(1,2,3,4,7,8,9,10,13,14,15,16)
you could use: at=rev(c(1,2,3,4,7,8,9,10,13,14,15,16))
This makes R plot the last boxplot in the first position.
The labels and the colours are matched to the boxes column by column, so they should move together with the boxes and keep pointing at the right data.
Thank you very much for putting this up. Super easy to follow with the way you coloured each part of the code & functions in this tutorial. Keep up the good work!
Thanks for this. I was able to recreate it on my machine, and now I know a lot more can be done with the boxplot