Showing posts with label linear model. Show all posts
Showing posts with label linear model. Show all posts

Friday, 21 July 2017

Power analysis and sample size calculation for Agriculture

Power analysis is extremely important in statistics since it allows us to calculate how many chances we have of obtaining realistic results. Sometimes researchers tend to underestimate this aspect and they are just interested in obtaining significant p-values. The problem with this is that a significance level of 0.05 does not necessarily mean that what you are observing is real.
In the book "Statistics Done Wrong" by Alex Reinhart (which you can read for free here: https://www.statisticsdonewrong.com/) this problem is discussed with an example where we can clearly see that a significance of 0.05 does not mean that we have 5% chances of getting it wrong, but actually we have closer to 30% chances of obtaining unrealistic results. This is because there are two types of errors in which we can incur (for example when performing an ANOVA), the type I (i.e. rejecting a null hypothesis when it is actually true) and type II (i.e. accepting a null hypothesis when it is actually false).

Image taken from: https://datasciencedojo.com/

The probability of incurring in a type I error is indicated by α (or significance level) and usually takes a value of 5%; this means that we are happy to consider a scenario where we have 5% chances of rejecting the null hypothesis when it is actually true. If we are not happy with this, we can further decrease this probability by decreasing the value of α (for example to 1%). On the contrary the probability of incurring in a type II error is expressed by β, which usually takes a value of 20% (meaning a power of 80%). This means we are happy to work assuming that we have a 20% chance of accepting the null hypothesis when it is actually false.
If our experiment is not designed properly we cannot be sure whether we actually incurred in one of these two errors. In other words, if we run a bad experiment and we obtain a insignificant p-value it may be that we incurred in a type II error, meaning that in reality our treatment works but its effect cannot be detected by our experiment. However, it may also be that we obtained a significant p-value but we incurred in a type I error, and if we repeat the experiment we will find different results.
The only way we can be sure to run a good experiment is by running a power analysis. By definition power is the probability of obtaining statistical significance (not necessarily a small p-value, but at least a realistic outcome). Power analysis can be used before an experiment to test whether our design has good chances of succeeding (a priori) or after to test whether the results we obtained were realistic.


Update 17/11/2017
How many subjects to compute a robust mean?

This is a question that I get sometimes when talking with students that are planning descriptive experiments. How many subjects do I need for the mean value I compute to be robust?

The answer is provided by Berkowitz here: www.columbia.edu/~mvp19/RMC/M6/M6.doc

The simplified formula to compute the minimum number of samples is:


where SD is the standard deviation and SE is the standard error. These values can be obtained from previous experiments, or from the literature. 

Effect Size

A simple and effective definition of effect size is provided in the book "Power Analysis for Experimental Research" by Bausell & Li. They say: 
"effect size is nothing more than a standardized measure of the size of the mean difference(s) among the study’s groups or of the strength of the relationship(s) among its variables".

Despite its simple definition the calculation of the effect size is not always straightforward and many indexes have been proposed over the years. Bausell & Li propose the following definition, in line with what proposed by Cohen in his "Statistical Power Analysis for the Behavioral Sciences":

where ES is the effect size (in Cohen this is referred as d). In this equation, Ya is the mean of the measures for treatment A, and Yb is the mean for treatment B. The denominator is the pooled standard deviation, which is computed as follows:



where SD are the standard deviation for treatments B and A, and n are the number of samples for treatment B and A.

This is the main definition but then every software or functions tend to use indexes correlated to this but not identical. We will see each way of calculating the effect size case by case.


One-Way ANOVA 

Sample size

For simple models the power calculating can be performed with the package pwr:

 library(pwr)  

In the previous post (Linear Models) we worked on a dataset where we tested the impact on yield of 6 levels of nitrogen. Let's assume that we need to run a similar experiment and we would like to know how many samples we should collect (or how many plants we should use in the glass house) for each level of nitrogen. To calculate this we need to do a power analysis.

To compute the sample size required to reach good power we can run the following line of code:

 pwr.anova.test(k=6, f=0.25, sig.level=0.05, power=0.8)  

Let's start describing the options from the end. We have the option power, to specify the power you require for your experiment. In general, this can be set to 0.8, as mentioned above. The significance level is alpha and usually we are happy to accept a significance of 5%. Another option is k, which is the number of groups in our experiment, in this case we have 6 groups.
Finally we have the option f, which is the effect size. As I mentioned above, there are many indexes to express the effect size and f is one of them.
According to Cohen, f can be expressed as:


where the numerator is the is the standard deviation of the effects that we want to test and the denominator is the common standard deviation. For two means, as in the equation we have seen above, f is simply equal to:



Clearly, before running the experiment we do not really know what the effect size would be. In some case we may have an idea, for example from previous experiments or a pilot study. However, most of the times we do not have a clue. In such cases we can use the classification suggested by Cohen, who considered the following values for f:

The general rule is that if we do not know anything about our experiment we should use a medium effect size, so in this case 0.25. This was suggested in the book Bausell & Li and it is based on a review of 302 studies in the social and behavioral sciences. for this reason it may well be that the effect size of your experiment would be different. However, if you do not have any additional information this is the only thing the literature suggest.

The function above returns the following output:

 > pwr.anova.test(k=6, f=0.25, sig.level=0.05, power=0.8)  
    Balanced one-way analysis of variance power calculation   
        k = 6  
        n = 35.14095  
        f = 0.25  
    sig.level = 0.05  
      power = 0.8  
 NOTE: n is number in each group  

In this example we would need 36 samples for each nitrogen level to achieve a power of 80% with a significance of 5%.


Power Calculation

As I mentioned above, sometimes we have a dataset we collected assuming we could reach good power but we are not actually sure if that is the case. In those instances what we can do is the a posteriori power analysis, where we basically compute the power for a model we already fitted.

As you remember is the previous post about linear models, we fitted the following:

 mod1 = aov(yield ~ nf, data=dat)  

To compute the power we achieved here we first need to calculate the effect size. As discussed above we have several options: d, f and another index called partial eta squared.
Let's start from d, which can be simply calculated using means and standard deviation of two groups, for example N0 (control) and N5:

 Control = dat[dat$nf=="N0","yield"]
 Treatment1 = dat[dat$nf=="N5","yield"]

 numerator = (mean(Treatment1)-mean(Control))
 denominator = sqrt((((length(Treatment1)-1)*sd(Treatment1)^2)+((length(Control)-1)*sd(Control)^2))/(length(Treatment1)+length(Control)-2))

 d = numerator/denominator

This code simply computes the numerator (difference in means) and the denominator (pooled standard deviation) and then computes the Cohen's d, you just need to change the vectors for objects Control and Treatment1. The effect size results in 0.38.

Again Cohen provides some values for the d, so that we can determine how large is our effects, which are presented below:

From this table we can see that our effect size is actually low, and not medium as we assumed for the a priori analysis. This is important because if we run the experiment with 36 samples per group we may end up with unrealistic results simply due to low power. For this reason it is my opinion that we should always be a bit more conservative and maybe include some additional replicates or blocks, just to account for potential unforeseen differences between our assumptions and reality.

The function to compute power is again pwr.anova.test, in which the effect size is expressed as f. We have two ways of doing that, the first is by using the d values we just calculated and halve it, so in this case f = 0.38/2 = 0.19. However, this will tell you the specific effects size for the relation between N0 and N5, and not for the full set of treatments.

NOTE:
At this link there is an Excel file that you can use to convert between indexes of effect size:
http://www.stat-help.com/spreadsheets/Converting%20effect%20sizes%202012-06-19.xls


Another way to get a fuller picture is by using the partial Eta Squared, which can be calculated using the sum of squares:


This will tell us the average effect size for all the treatments we applied, so not only for N5 compared to N0, but for all of them.
To compute the partial eta squared we first need to access the anova table, with the function anova:

 > anova(mod1)  
 Analysis of Variance Table  
 Response: yield  
       Df Sum Sq Mean Sq F value  Pr(>F)    
 nf      5  23987 4797.4 12.396 6.075e-12 ***  
 Residuals 3437 1330110  387.0             
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  

From this table we can extract the sum of squares for the treatment (i.e. nf) and the sum of squares of the residuals and then solve the equation above:

 > EtaSQ = 23987/(23987+1330110)  
 > print(EtaSQ)  
 [1] 0.01771439  

As for the other indexes, eta squares also has its table of interpretation:

The relation between f and eta squared is the following:


so to compute the f related to the full treatment we can simply do the following:

 > f = sqrt(EtaSQ / (1-EtaSQ))  
 > print(f)  
 [1] 0.1342902  

So now we have everything we need to calculate the power of our model:

 > pwr.anova.test(k=6, n=571, f=f, sig.level=0.05)  
    Balanced one-way analysis of variance power calculation   
        k = 6  
        n = 571  
        f = 0.1342902  
    sig.level = 0.05  
      power = 1  
 NOTE: n is number in each group  

To compute the power we need to run again the function pwr.anova.test, but this time without specifying the option power, but replacing it with the option n, which is the number of samples per group.
As you remember from the previous post this was an unbalanced design, so the number of samples per group is not constant. We could either use a vector as input for n, with all the samples per each group. In that case the function will return a power for each group. However, what I did here is putting the lowest number, so that we are sure to reach good power for the lowest sample size.

As you can see even with the small effect size we are still able to reach a power of 1, meaning 100%. This is because the sample size is more than adequate to catch even such a small effect. You could try to run again the sample size calculation to actually see what would be the minimum sample requirement for the observed effect size.


Linear Model

The method we have seen above is only valid for one-way ANOVAs. For more complex model, which may simply be ANOVA with two treatments we should use the function specific for linear models.

Sample Size

To calculate the sample size for this analysis we can refer once again to the package pwr, but now use the function pwr.f2.test.
Using this function is slightly more complex because here we start reasoning in terms of degrees of freedom for the F ratio, which can be obtained using the following equation:

From: https://cnx.org/contents/crKQFJtQ@14/F-Distribution-and-One-Way-ANO
where MS between is the mean square variance between groups and MS within is the mean square variance within each group.
These two terms have the following equations (again from: https://cnx.org/contents/crKQFJtQ@14/F-Distribution-and-One-Way-ANO) :



The degrees of freedom we need to consider are the denominators of the last two equations. For an a priori power analysis we need to input the option u, with the degrees of freedom of the numerator of the F ratio, thus MS between. As you can see this can be computed as k-1, for a one-way ANOVA.
For more complex model we need to calculate the degrees of freedom ourselves. This is not difficult because we can generate dummy datasets in R with the specific treatment structure we require, so that R will compute the degrees of freedom for us.
We can generate dummy dataset very easily with the function expand.grid:

 > data = expand.grid(rep=1:3, FC1=c("A","B","C"), FC2=c("TR1","TR2"))  
 > data  
   rep FC1 FC2  
 1  1  A TR1  
 2  2  A TR1  
 3  3  A TR1  
 4  1  B TR1  
 5  2  B TR1  
 6  3  B TR1  
 7  1  C TR1  
 8  2  C TR1  
 9  3  C TR1  
 10  1  A TR2  
 11  2  A TR2  
 12  3  A TR2  
 13  1  B TR2  
 14  2  B TR2  
 15  3  B TR2  
 16  1  C TR2  
 17  2  C TR2  
 18  3  C TR2  

Working with expand.grid is very simple. We just need to specify the level for each treatment and the number of replicates (or blocks) and the function will generate a dataset with every combination.
Now we just need to add the dependent variable, which we can generate randomly from a normal distribution:

 data$Y = rnorm(nrow(data))  

Now our dataset is ready so we can fit a linear model to it and generate the ANOVA table:

 > mod.pilot = lm(Y ~ FC1*FC2, data=data)  
 > anova(mod.pilot)  
 Analysis of Variance Table  
 Response: Y  
      Df Sum Sq Mean Sq F value Pr(>F)  
 FC1    2 0.8627 0.4314 0.3586 0.7059  
 FC2    1 3.3515 3.3515 2.7859 0.1210  
 FC1:FC2  2 1.8915 0.9458 0.7862 0.4777  
 Residuals 12 14.4359 1.2030  

Since this is a dummy dataset all the sum of squares and the other values are meaningless. We are only interested in looking at the degrees of freedom.
To calculate the sample size for this analysis we can refer once again to the package pwr, but now use the function pwr.f2.test, as follows:

 pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power=0.8)  

The first option in the function is u, which represents the degrees of freedom of the numerator of the F ratio. This is related to the degrees of freedom of the component we want to focus on. As you probably noticed from the model, we are trying to see if there is an interaction between two treatments. From the ANOVA table above we can see that the degrees of freedom of the interaction are equal to 2, so that it what we include as u.
Other options are again power and significance level, which we already discussed. Moreover, in this function the effect size is f2, which is again different from the f we've seen before. F2 again has its own table:

Since we assume we have no idea about the real effect size we use a medium value for the a priori testing.

The function returns the following table:

 > pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power=0.8)  
    Multiple regression power calculation   
        u = 2  
        v = 38.68562  
        f2 = 0.25  
    sig.level = 0.05  
      power = 0.8  

As you can see what the function is actually providing us is the value of the degrees of freedom for the denominator of the F test (with v), which results in 38.68, so 39 since we always round it by excess.
If we look to the equation to compute MS withing we can see that the degrees of freedom is given by n-k, meaning that to transform the degrees of freedom into a sample size we need to add what we calculated before for the option u. The sample size is then equal to n = v + u + 1, so in this case the sample size is equal 39 + 2 + 1 = 42

This is not the number of samples per group but it is the total number of samples.


Another way of looking at the problem would be to compute the total power of our model, and not just how much power we have to discriminate between levels of one of the treatments (as we saw above). To do so we can still use the function pwr.f2.test, but with some differences. The first is that we need to compute u using all elements in the model, so basically sum the decrees of freedom of the ANOVA table, or sum all the coefficients in the model minus the intercept:

 u = length(coef(mod3))-1  

Another difference is in how we compute the effects size f2. Before we used its relation with partial eta square, now we can use its relation with the R2 of the model:


With these additional element we can compute the power of the model.


Power Calculation

Now we look at estimating the power for a model we've already fitted, which can be done with the same function.
We will work with one of the models we used in the post about Linear Models:

 mod3 = lm(yield ~ nf + bv, data=dat)   

Once again we first need to calculate the observed effect size as the eta squared, using again the sum of squares:

 > Anova(mod3, type="III")  
 Anova Table (Type III tests)  
 Response: yield  
       Sum Sq  Df F value  Pr(>F)    
 (Intercept) 747872  1 2877.809 < 2.2e-16 ***  
 nf      24111  5  18.555 < 2.2e-16 ***  
 bv     437177  1 1682.256 < 2.2e-16 ***  
 Residuals  892933 3436              
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  

In this example, I used the function Anova (with option type="III") in the package car just to remind you that if you have an unbalanced design, like in this case, you should use the type III sum of squares.
From this table we can obtain the sum of squares we need to compute the eta squared, for example for nf we will use the following code:

 > EtaSQ = 24111/(24111+892933)  
 > EtaSQ  
 [1] 0.02629209  

Then we need to transform this into f2 (of f squared), which is what the pwr.f2.test function uses:

 > f2 = EtaSQ / (1-EtaSQ)  
 > f2  
 [1] 0.02700203  

The only thing we need to do now is calculating the value of v, i.e. the denominator degrees of freedom. This is equal to the n (number of samples) - u - 1, but a quick way of obtaining this number is looking at the anova table above and take the degrees of freedom of the residuals, i.e. 3436.

Now we have everything we need to obtain the observed power:

 > pwr.f2.test(u = 5, v = 3436, f2 = f2, sig.level = 0.05)  
    Multiple regression power calculation   
        u = 5  
        v = 3436  
        f2 = 0.02700203  
    sig.level = 0.05  
      power = 1  

which again returns a very high power, since we have a lot of samples.



Generalized Linear Models

For GLM we need to install the package lmsupport:

 #install.packages("lmSupport")  
 library(lmSupport)  

Sample Size

For calculating the sample size for GLM we can use the same procedure we used for linear models.


Power Calculation

For this example we are going to use one of the model we discussed in the post about GLM, using the dataset beall.webworms (n = 1300):

 dat = beall.webworms  
 pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))   

Once again we would need to compute effect size and degrees of freedom. As before, we can use the function anova to generate the data we need:

 > anova(pois.mod2)  
 Analysis of Deviance Table  
 Model: poisson, link: log  
 Response: y  
 Terms added sequentially (first to last)  
       Df Deviance Resid. Df Resid. Dev  
 NULL            1299   1955.9  
 block   12 122.040   1287   1833.8  
 spray    1 188.707   1286   1645.2  
 lead    1  42.294   1285   1602.8  
 spray:lead 1  4.452   1284   1598.4  

Let's say we are interested in looking at the interaction between spray and lead, its degrees of freedom are 1, so this is our u. On its side we also have the residuals degrees of freedom, so v is 1284.
The other thing we need is the effect size, which we can compute with the function modelEffectSizes from the package lmSupport:

 > modelEffectSizes(pois.mod2)  
 glm(formula = y ~ block + spray * lead, family = c("poisson"),   
   data = dat)  
 Coefficients  
       SSR df pEta-sqr dR-sqr  
 block 122.0402 12  0.0709   NA  
 spray 142.3487 1  0.0818 0.0849  
 lead  43.7211 1  0.0266 0.0261  
 Sum of squared errors (SSE): 1598.4  
 Sum of squared total (SST): 1675.9  

This function calculates the partial eta squares, and it works also for lm models. As you can see it does not provide the eta squared for the interaction, but just to be on the safe side we can use the lowest value (0.03) from the values provided for spray and lead.
Now that we have the observed eta squared we can use the function modelPower:

> modelPower(u=1, v=1284, alpha=0.05, peta2=0.03)
Results from Power Analysis

pEta2 = 0.030
u =     1 
v =     1284.0 
alpha = 0.050 
power = 1.000 

This function can take the option f2, as we've seen before for the package pwr. However, since computing the partial eta square is generally easier, we can use the option peta2 and use directly this index.
Once again our power is very high.


Note 12/12/2017

Please note that the line above only works with the older version of the package lmSupport (version 2.9.8). The new version features a different syntax.
You can download the old version from here: https://cran.r-project.org/src/contrib/Archive/lmSupport/lmSupport_2.9.8.tar.gz


Linear Mixed Effects Models

For power analysis with mixed effects models we would need to install the following packages:

 #install.packages("simr")  
 library(simr)  

In this example we will be working with models fitted with the package lme4, but what is discussed here should work also with models fitted with nlme.

Sample Size

A priori power analysis for mixed effect model is not easy. There are packages that should provide functions to do that (e.g. simstudy and longpower), but they are probably more related to the medical sciences and I found them difficult to use. For this reason I decided that probably the easiest way to test the power of an experiment for which we need to use a mixed-effect model (e.g. involving clustering or repeated measures) would be to use a dummy dataset again and simulation. However, please be advised that I am not 100% sure of the validity of this procedure.

To create the dummy dataset we can use the same procedure we employed above, with expand.grid:

 data = expand.grid(subject=1:5, treatment=c("Tr1", "Tr2", "Tr3"))  
 data$Y = numeric(nrow(data))  

In this case we are simulating a simple experiment with 5 subjects, 3 treatments and a within subject design, like a crossover I suppose.
As you can see the Y has not been drawn from a normal distribution, this is because for the time being it is just a list of zeroes. We need to create data for each treatment as follows:

 data[data$treatment=="Tr1","Y"] = rnorm(nrow(data[data$treatment=="Tr1",]), mean=20, sd=1)  
 data[data$treatment=="Tr2","Y"] = rnorm(nrow(data[data$treatment=="Tr2",]), mean=20.5, sd=1)  
 data[data$treatment=="Tr3","Y"] = rnorm(nrow(data[data$treatment=="Tr3",]), mean=21, sd=1)  

In these lines I created three samples, from normal distributions, which means differ by half their standard deviation. This (when SD = 1)  provides an effect size (d) of 0.5, so medium.

Now we can create the model:

 mod1 = lmer(Y ~ treatment + (1|subject), data=data)  
 summary(mod1)  

and then test its power with the function powerSim from the package simr. This function runs 1000 simulation and provide a measure for the power of the experiment:

 > powerSim(mod1, alpha=0.05)  
 Power for predictor 'treatment', (95% confidence interval):  
    25.90% (23.21, 28.73)  
 Test: Likelihood ratio  
 Based on 1000 simulations, (84 warnings, 0 errors)  
 alpha = 0.05, nrow = 15  
 Time elapsed: 0 h 3 m 2 s  
 nb: result might be an observed power calculation  
 Warning message:  
 In observedPowerWarning(sim) :  
  This appears to be an "observed power" calculation  

From this output we can see that our power is very low, so we probably need to increase the number of subjects and then try again the simulation.

Let's now look at repeated measures. In this case we do not only have the effect size to account for in the data, but also the correlation between in time between measures.

 library(mvtnorm)  
   
 sigma <- matrix(c(1, 0.5, 0.5, 0.5,   
          0.5, 1, 0.5, 0.5,  
          0.5, 0.5, 1, 0.5,  
                0.5, 0.5, 0.5 ,1 ), ncol=4, byrow=T)  
   
   
 data = expand.grid(subject=1:4, treatment=c("Tr1", "Tr2", "Tr3"), time=c("t1","t2","t3","t4"))  
 data$Y = numeric(nrow(data))  
   
 T1 = rmvnorm(4, mean=rep(20, 4), sigma=sigma)  
 T2 = rmvnorm(4, mean=rep(20.5, 4), sigma=sigma)  
 T3 = rmvnorm(4, mean=rep(21, 4), sigma=sigma)  
   
 data[data$subject==1&data$treatment=="Tr1","Y"] = T1[,1]  
 data[data$subject==2&data$treatment=="Tr1","Y"] = T1[,2]  
 data[data$subject==3&data$treatment=="Tr1","Y"] = T1[,3]  
 data[data$subject==4&data$treatment=="Tr1","Y"] = T1[,4]  
   
 data[data$subject==1&data$treatment=="Tr2","Y"] = T2[,1]  
 data[data$subject==2&data$treatment=="Tr2","Y"] = T2[,2]  
 data[data$subject==3&data$treatment=="Tr2","Y"] = T2[,3]  
 data[data$subject==4&data$treatment=="Tr2","Y"] = T2[,4]  
   
 data[data$subject==1&data$treatment=="Tr3","Y"] = T3[,1]  
 data[data$subject==2&data$treatment=="Tr3","Y"] = T3[,2]  
 data[data$subject==3&data$treatment=="Tr3","Y"] = T3[,3]  
 data[data$subject==4&data$treatment=="Tr3","Y"] = T3[,4]  
   
   
 modRM = lmer(Y ~ treatment + (time|subject), data=data)  
 summary(modRM)  
   
 powerSim(modRM, alpha=0.05)  

In this case we need to use the function rmvnorm to draw, from a normal distribution, samples with a certain correlation. For this example I followed the approach suggested by William Huber here: https://stats.stackexchange.com/questions/24257/how-to-simulate-multivariate-outcomes-in-r/24271#24271

Basically, if we assume a correlation equal to 0.5 between time samples (which is what the software G*Power does for repeated measures), we first need to create a symmetrical matrix in sigma. This will allow rmvnorm to produce values from distributions with standard deviation equal to 1 and 0.5 correlation.
A more elegant approach is the one suggested by Ben Amsel on his blog: https://cognitivedatascientist.com/2015/12/14/power-simulation-in-r-the-repeated-measures-anova-5/

 sigma = 1 # population standard deviation  
 rho = 0.5 #Correlation between repeated measurs  
 # create k x k matrix populated with sigma  
 sigma.mat <- rep(sigma, 4)  
 S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))  
 # compute covariance between measures  
 Sigma <- t(S) * S * rho   
 # put the variances on the diagonal   
 diag(Sigma) <- sigma^2   

The result is the same but at least here you can specify different values for SD and correlation.

The other elementS the function needs are mean values, for which I used the same as before. This should guarantee a difference of around half a standard deviation between treatments.
The remaining of the procedure is the same we used before with no changes.

As I said before, I am not sure if this is the correct way of computing power for linear mixed effects models. It may be completely or partially wrong, and if you know how to do this or you have comments please do not hesitate to write them below.


Power Analysis

As we have seen with the a priori analysis, computing the power of mixed effects models is actually very easy with the function powerSim.




References

PWR package Vignette: https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html

William E. Berndtson (1991). "A simple, rapid and reliable method for selecting or assessing the number of replicates for animal experiments"
http://scholars.unh.edu/cgi/viewcontent.cgi?article=1312&context=nhaes

NOTE:
This paper is what some of my colleagues, who deal particularly with animal experiments, use to calculate how many subjects to use for their experiments. The method presented here is base on the coefficient of variation (CV%), which is something that also in agriculture is often used to estimate the number of replicates needed.




Berkowitz J. "Sample Size Estimation" - http://www.columbia.edu/~mvp19/RMC/M6/M6.doc

This document gives you some rule of thumb to determine the sample size for a number of experiments.


Update 26/07/2017

For computing effect size automatically you also have the option to use the package sjstats. This has function to compute eta-squared, partial eta-squared and others, but it also has an option to print a comprehensive ANOVA table with everything you get from a normal call to anova plus the effects sizes.
You can find some example on this blog post from the author of the package Daniel Lüdecke here:



Final Note about the use of CV% 

AS I mentioned above, CV% and the percentage of difference between means is one of the indexes used to estimate the number of replicates needed to run experiments. For this reason I decided to create some code to test whether power analysis and the method based on CV% provide similar results.
Below is the function I created for this:

 d.ES = function(n, M, SD, DV){  
 M1=M  
 M2=M+(SD/DV)  
 PC.DIFF = (abs(M1-M2)/((M1+M2)/2))*100  
 numerator = (mean(M2)-mean(M1))  
 denominator = sqrt((((n-1)*SD^2)+((n-1)*SD^2))/(n+n-2))  
 ES=numerator/denominator  
 samp = sapply(ES, function(x){pwr.anova.test(k=2, f=x/2, sig.level=0.05, power=0.8)$n})  
 CV1=SD/M1  
 return(list(EffectSize=ES, PercentageDifference=PC.DIFF, CV.Control=CV1*100, n.samples=samp))  
 }  

This function takes 4 arguments: number of samples (n), mean of control (M), standard deviation (here we assume the standard deviation to be identical between groups), and DV, which indicates the number of times to divide the standard deviation to compute the mean of the treatment. If DV is equal to 2 then the mean of the treatment will be half the mean of control.

The equation for the percentage of difference was taken from: https://www.calculatorsoup.com/calculators/algebra/percent-difference-calculator.php

Now we can use this function to estimate Effect Size, percentage of differences in means, CV% and number of samples from power analysis (assuming an ANOVA with 2 groups).

The first example looks at changing the standard deviation, and keeping everything else constant:

 > d.ES(n=10, M=20, SD=seq(1, 15, by=1), DV=8)  
 $EffectSize  
  [1] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000 0.16666667  
  [7] 0.14285714 0.12500000 0.11111111 0.10000000 0.09090909 0.08333333  
 [13] 0.07692308 0.07142857 0.06666667  
 $PercentageDifference  
  [1] 0.623053 1.242236 1.857585 2.469136 3.076923 3.680982 4.281346 4.878049  
  [9] 5.471125 6.060606 6.646526 7.228916 7.807808 8.383234 8.955224  
 $CV.Control  
  [1] 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75  
 $n.samples  
  [1]  10.54166  39.15340  86.88807 153.72338 239.65639 344.68632  
  [7] 468.81295 612.03614 774.35589 955.77211 1156.28483 1375.89403  
 [13] 1614.59972 1872.40183 2149.30043  

If you look at the tables presented in the paper by Berndtson you will see that the results are similar in terms of number of samples.

Larger differences are seen when we look at changes in mean, while everything else stays constant:

 > d.ES(n=10, M=seq(1,25,1), SD=5, DV=8)  
 $EffectSize  
 [1] 0.125  
 $PercentageDifference  
  [1] 47.619048 27.027027 18.867925 14.492754 11.764706 9.900990 8.547009  
  [8] 7.518797 6.711409 6.060606 5.524862 5.076142 4.694836 4.366812  
 [15] 4.081633 3.831418 3.610108 3.412969 3.236246 3.076923 2.932551  
 [22] 2.801120 2.680965 2.570694 2.469136  
 $CV.Control  
  [1] 500.00000 250.00000 166.66667 125.00000 100.00000 83.33333 71.42857  
  [8] 62.50000 55.55556 50.00000 45.45455 41.66667 38.46154 35.71429  
 [15] 33.33333 31.25000 29.41176 27.77778 26.31579 25.00000 23.80952  
 [22] 22.72727 21.73913 20.83333 20.00000  
 $n.samples  
 [1] 1005.615  

In this case the mean of the treatment is again 1/8 of the mean of the control, and the standard deviation is fixed at 5. Since the difference in means is the same, and the standard deviation is constant, the effect size also stays constant at 0.125, so very small.
However, both percentage of difference and CV% change quite a bit and therefore the estimates from Berndtson could differ.



Monday, 10 July 2017

Generalized Linear Models and Mixed-Effects in Agriculture

After publishing my previous post, I realized that it was way too long and so I decided to split it in 2-3 parts. If you think something is missing in the explanation here it may be related to the fact that this was originally part of the previous post (http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html), so please look there first (otherwise please post your question in the comment section and I will try to answer).


Dealing with non-normal data – Generalized Linear Models

As you remember, when we first introduced the simple linear model (Linear Models in Agriculture) we defined a set of assumptions that need to be met to apply this model. In the same post, we talked about methods to deal with deviations from assumptions of independence, equality of variances and balanced designs and the fact that, particularly if our dataset is large, we may reach robust results even if our data are not perfectly normal. However, there are datasets for which the target variable has a completely different distribution from the normal, this means that also the error terms will not be normally distributed.
In these cases we need to change our modelling method and employ generalized linear models (GLM). Common scenarios where GLM should be considered are studies where the variable of interest is binary, for example presence or absence of a species, or where we are interested in modelling counts, for example the number of insects present in a particular location. In these cases, where the target variable is not continuous but rather discrete or categorical, the assumption of normality is usually not met. In this section we will focus on the two scenarios mentioned above, but GLM can be used to deal with data distributed in many different ways, and we will introduce how to deal with more general cases.


Count Data

Data of this type, i.e. counts or rates, are characterized by the fact that their lower bound is always zero. This does not fit well with a normal linear model, where the regression line may well estimate negative values. For this type of variable we can employ a Poisson Regression, which fits the following model:

As you can see the equation is very similar to the standard linear model, the difference is that to insure that all Y are positive (since we cannot have negative values for count data) we are estimating the log of Y. This is referred to as the link function, meaning the transformation of y that we need to make to insure linearity of the response. For the model we are going to look at below, which are probably the most common, the link function is implicitly called, meaning that glm call the right function for us and we do not have to specify it explicitly. However, we can do that if needed.

From this equation you may think that instead of using glm we could log transform y and run a normal lm. The problem with that would be that lm is assuming a normal error term with constant variance (as we saw with the plot fitted versus residuals), but for this model this assumption would be violated. That is why we need to use glm.

In R fitting this model is very easy. For this example we are going to use another dataset available in the package agridat, named beall.webworms, which represents counts of webworms in a beet field, with insecticide treatments:

 > dat = beall.webworms  
 > str(dat)  
 'data.frame':  1300 obs. of 7 variables:  
  $ row : int 1 2 3 4 5 6 7 8 9 10 ...  
  $ col : int 1 1 1 1 1 1 1 1 1 1 ...  
  $ y  : int 1 0 1 3 6 0 2 2 1 3 ...  
  $ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ...  
  $ trt : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...  
  $ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...  
  $ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...  
   

We can check the distribution of our data with the function hist:

 hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms")  


We are going to fit a simple model first to see how to interpret its results, and then compare it with a more complex model:

 pois.mod = glm(y ~ trt, data=dat, family=c("poisson"))  

As you can see the model features a new option called family, here you specify the distribution of the error term, in this case a poisson distribution. We should also specify a log link function as we saw before, but this is the default setting so there is no need to include it.

Once again the function summary will show some useful details about this model:

 > summary(pois.mod)  
   
 Call:  
 glm(formula = y ~ trt, family = c("poisson"), data = dat)  
   
 Deviance Residuals:   
   Min    1Q  Median    3Q   Max   
 -1.6733 -1.0046 -0.9081  0.6141  4.2771   
   
 Coefficients:  
       Estimate Std. Error z value Pr(>|z|)    
 (Intercept) 0.33647  0.04688  7.177 7.12e-13 ***  
 trtT2    -1.02043  0.09108 -11.204 < 2e-16 ***  
 trtT3    -0.49628  0.07621 -6.512 7.41e-11 ***  
 trtT4    -1.22246  0.09829 -12.438 < 2e-16 ***  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
   
 (Dispersion parameter for poisson family taken to be 1)  
   
   Null deviance: 1955.9 on 1299 degrees of freedom  
 Residual deviance: 1720.4 on 1296 degrees of freedom  
 AIC: 3125.5  
   
 Number of Fisher Scoring iterations: 6  
   


Update 08/12/2017

Note on interpretation

To interpret the coefficients of the model we need to remember that this GLM uses a log link function. Therefore for example the -1.02 is log transformed, so the coefficient for T2 would be exp(-1.02)=0.36.
In terms of interpretation, we can say that the number of worms for T2 is 0.36 times the number of worms for T1 (this is because the coefficient is always referred to the reference level). So there is a decrease, and that is why the coefficient is negative. 




The first valuable information is related to the residuals of the model, which should be symmetrical as for any normal linear model. From this output we can see that minimum and maximum, as well as the first and third quartiles, are similar, so this assumption is confirmed. Then we can see that the variable trt (i.e. treatment factor) is highly significant for the model, with very low p-values. The statistical test in this case is not a t-test, as in the output of the function lm, but a Wald Test (Wald Test). This test computes the probability that the coefficient is 0, if the p is significant it means the chances the coefficient is zero are very low so the variable should be included in the model since it has an effect on y.
Another important information is the deviance, particularly the residual deviance. As a general rule, this value should be lower or in line than the residuals degrees of freedom for the model to be good. In this case the fact that the residual deviance is high (even though not dramatically) may suggests the explanatory power of the model is low. We will see below how to obtain p-value for the significance of the model.

We can use the function plot to obtain more info about how the model fits our data:

 par(mfrow=c(2,2))  
 plot(pois.mod)  

This creates the following plot, where the four outputs are included in the same image:


This plots tell a lot about the goodness of fit of the model. The first image in top left corner is the same we created for lm (i.e. residuals versus fitted values). This again does not show any trend, just a general underestimation. Then we have the normal QQ plot, where we see that the residuals are not normal, which violates one of the assumptions of the model. Even though we are talking about non-linear error term, we are "linearizing" the model with the link function and by specifying a different family for the error term. Therefore, we still need to obtain normal residuals.

The effects of the treatments are all negative and referred to the first level T1, meaning for example that a change from T1 to T2 will decrease the count by 1.02. We can check this effect by estimating changes between T1 and T2 with the function predict, and the option newdata:

 > predict(pois.mod, newdata=data.frame(trt=c("T1","T2")))  
      1     2   
  0.3364722 -0.6839588  
   

Another important piece of information are the Null and Residuals deviances, which allow us to compute the probability that this model is better than the Null hypothesis, which states that a constant (with no variables) model would be better.
We can compute the p-value of the model with the following line:

 > 1-pchisq(deviance(pois.mod), df.residual(pois.mod))  
 [1] 1.709743e-14  
   

This p-value is very low, meaning that this model fits the data well. However, it may not be the best possible model, and we can use the AIC parameter to compare it to other models. For example, we could include more variables:

 pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))  

How does this new model compare with the previous?

 > AIC(pois.mod, pois.mod2)  
      df   AIC  
 pois.mod  4 3125.478  
 pois.mod2 16 3027.438  
   

As you can see the second model has a lower AIC, meaning that fits the data better than the first.

One of the assumptions of the Poisson distribution is that its mean and variance have the same value. We can check by simply comparing mean and variance of our data:

 > mean(dat$y)  
 [1] 0.7923077  
 > var(dat$y)  
 [1] 1.290164  
   

In cases such as this when the variance is larger than the mean (in this case we talk about overdispersed count data) we should employ different methods, for example a quasipoisson distribution:

 pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson"))  

The summary function provides us with the dispersion parameter, which for a Poisson distribution should be 1:

 > summary(pois.mod3)  
   
 Call:  
 glm(formula = y ~ trt, family = c("quasipoisson"), data = dat)  
   
 Deviance Residuals:   
   Min    1Q  Median    3Q   Max   
 -1.6733 -1.0046 -0.9081  0.6141  4.2771   
   
 Coefficients:  
       Estimate Std. Error t value Pr(>|t|)    
 (Intercept) 0.33647  0.05457  6.166 9.32e-10 ***  
 trtT2    -1.02043  0.10601 -9.626 < 2e-16 ***  
 trtT3    -0.49628  0.08870 -5.595 2.69e-08 ***  
 trtT4    -1.22246  0.11440 -10.686 < 2e-16 ***  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
   
 (Dispersion parameter for quasipoisson family taken to be 1.35472)  
   
   Null deviance: 1955.9 on 1299 degrees of freedom  
 Residual deviance: 1720.4 on 1296 degrees of freedom  
 AIC: NA  
   
 Number of Fisher Scoring iterations: 6  
   

Since the dispersion parameter is 1.35, we can conclude that our data are not terrible overdispersed, so maybe a Poisson regression would still be appropriate for this dataset.


Update 28/07/2017 - Overdispersion Test

In the package AER there is a function to directly test for overdispersion. The procedure to do so is quite simple:

First of all we install the package:

 install.packages("AER")  
 library(AER)  

The we run the following line, which test whether the dispersion parameter is higher than 1, which is the assumption for the Poisson regression:

 > dispersiontest(pois.mod, alternative="greater")  
     Overdispersion test  
 data: pois.mod  
 z = 6.0532, p-value = 7.101e-10  
 alternative hypothesis: true dispersion is greater than 1  
 sample estimates:  
 dispersion   
  1.350551   

As you can see the alternative hypothesis is that the dispersion parameter is higher than 1. Since the p-value is very low we can accept this alternative hypothesis and therefore use other forms of modelling. Since the dispersion is still close to 1, I think we can use the quasipoisson family. However, if this was higher it would have been better to use the negative binomial family with the function glm.nb in the package MASS, see below.



Another way of directly comparing the two models is with the analysis of deviance, which can be performed with the function anova:

 > anova(pois.mod, pois.mod2, test="Chisq")  
 Analysis of Deviance Table  
 Model 1: y ~ trt  
 Model 2: y ~ block + spray * lead  
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
 1   1296   1720.4               
 2   1284   1598.4 12  122.04 < 2.2e-16 ***  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  

This test compares the residual deviance of the two models to see whether they are different and calculates a p-values. In this case the p-value is highly significant, meaning that the models are different. Since we already compared the AIC, we can conclude that pois.mod2 is significantly (low p-value) better (lower AIC) than pois.mod.

However, there are cases where the data are very overdispersed. In those cases, when we see that the distribution has lots of peaks we need to employ the negative binomial regression, with the function glm.nb available in the package MASS:

 library(MASS)  
   
 NB.mod1 = glm.nb(y ~ trt, data=dat)  


NOTE:
For GLM it is possible to also compute pseudo R-Squared to ease the interpretation of their accuracy. This can be done with the function pR2 from the package pscl. Please read below (Logistic Regression section) for an example on the use of this function.


Logistic Regression

Another popular form of regression that can be tackled with GLM is the logistic regression, where the variable of interest is binary (0 or 1, presence or absence and any other binary outcome). In this case the regression model takes the following equation:

Again, the equation is identical to the standard linear model, but what we are computing from this model is the log of the probability that one of the two outcomes will occur, also referred as logit function.

Update 07/02/2018

By reviewing some literature I realized that the term logistic regression can be confusing. Sometimes it is used indistinctly to indicate both model with a binary outcome, and models that involve proportions. However, this is probably not totally correct because binary outcome follow a Bernoulli distribution, and in such cases we should probably talk about Bernoulli regression. In R there is no distinction between the two, and both models can be fitted with the option family="binomial", but in other software there is, e.g. Genstat. In Genstat to run a regression with binary outcome you would select a Bernoulli distribution, while for proportions you would select a Binamial distribution. 

In regards to the link function, even though the logit is probably the most commonly used, some authors also employ the probit transformation. According to Geyer (2003, http://www.stat.umn.edu/geyer/5931/mle/glm.pdf) but are legitimate transformations and they should not differ much in terms of fitting. The regression coefficient would be different when using the probit compared to the logit. However, if we use the function predict (as suggested here) and we do not rely on the coefficient, we should come up with very similar solutions.


For this example we are going to use another dataset available in the package agridat called johnson.blight, where the binary variable of interest is the presence or absence of blight (either 0 or 1) in potatoes:

 > dat = johnson.blight  
 > str(dat)  
 'data.frame':  25 obs. of 6 variables:  
  $ year  : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...  
  $ area  : int 0 0 0 0 50 810 120 40 0 0 ...  
  $ blight : int 0 0 0 0 1 1 1 1 0 0 ...  
  $ rain.am : int 8 9 9 6 16 10 12 10 11 8 ...  
  $ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ...  
  $ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...  
   

In R fitting this model is very easy. In this case we are trying to see if the presence of blight is related to the number of rainy days in April and May (column rain.am):

 mod9 = glm(blight ~ rain.am, data=dat, family=binomial)  

we are now using the binomial distribution for a logistic regression. To check the model we can rely again on summary:

 > summary(mod9)  
   
 Call:  
 glm(formula = blight ~ rain.am, family = binomial, data = dat)  
   
 Deviance Residuals:   
   Min    1Q  Median    3Q   Max   
 -1.9395 -0.6605 -0.3517  1.0228  1.6048   
   
 Coefficients:  
       Estimate Std. Error z value Pr(>|z|)   
 (Intercept) -4.9854   2.0720 -2.406  0.0161 *  
 rain.am    0.4467   0.1860  2.402  0.0163 *  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
   
 (Dispersion parameter for binomial family taken to be 1)  
   
   Null deviance: 34.617 on 24 degrees of freedom  
 Residual deviance: 24.782 on 23 degrees of freedom  
 AIC: 28.782  
   
 Number of Fisher Scoring iterations: 5  
   

This table is very similar to the one created for count data, so a lot of the discussion above can be used here. The main difference is in the way we can interpret the coefficients, because we need to remember that here we are calculating the logit function of the probability, so 0.4467 (coefficient for rain.am) is not the actual probability associated with an increase in rain. However, what we can say by just looking at the coefficients is that rain has a positive effect on blight, meaning that more rain increases the chances of finding blight in potatoes. 

To estimate probabilities we need to use the function predict (we could do it manually: Logistic Regression but this is easier):

 > predict(mod9, type="response")  
      1     2     3     4     5     6     7   
 0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722   
      8     9     10     11     12     13     14   
 0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141   
     15     16     17     18     19     20     21   
 0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722   
     22     23     24     25   
 0.48214935 0.59273722 0.98109229 0.89680283  
   

This calculates the probability associated with the values of rain in the dataset. To know the probability associated with new values of rain we can again use predict with the option newdata:

 >predict(mod9,newdata=data.frame(rain.am=c(15)),type="response")   
     1   
 0.8475443  
   

This tells us that when rain is equal to 15 days between April and May, we have 84% chances of finding blight (i.e. chances of finding 1) in potatoes.

We could use the same method to compute probabilities for a series of values of rain to see what is the threshold of rain that increases the probability of blight above 50%:

 prob.NEW = predict(mod9,newdata=data.frame(rain.am=1:30),type="response")   
 plot(1:30, prob.NEW, xlab="Rain", ylab="Probability of Blight")  
 abline(h=0.5)  

As you can see we are using once again the function predict, but in this case we are estimating the probabilities for increasing values of rain. Then we are plotting the results:

From this plot it is clear that we reach a 50% probability at around 12 rainy days between April and May.

Update 26/07/2017

Another simpler way to create the plot above would be with the function plotPredy, in the package rcompanion:

 install.packages("rcompanion")  
 library(rcompanion)  
   
 plotPredy(data = dat,  
      y   = blight,  
      x   = rain.am,  
      model = mod9,  
      type = "response",  
      xlab = "Rain",  
      ylab = "Blight")  

which creates the following plot:




To assess the accuracy of the model we can use two approaches, the first is based on the deviances listed in the summary. The Residual deviance compares this model with the one that fits the data perfectly. So if we calculate the following p-value (using the deviance and df in the summary table for residuals):

 > 1-pchisq(24.782, 23)  
 [1] 0.3616226  
   

We see that because it is higher than 0.05 we cannot reject that this model fits the data as well as the perfect model, therefore our model seems to be good.

We can repeat the same procedure for the Null hypothesis, which again tests whether this model fits the data well:

 > 1-pchisq(34.617, 24)  
 [1] 0.07428544  
   

Since this is again not significant it suggests (contrary to what we obtained before) that this model is not very good.

An additional and probably easier to understand way to assess the accuracy of a logistic model is calculating the pseudo R2, which can be done by installing the package pscl:

 install.packages("pscl")  
 library(pscl)  
   

Now we can run the following function:

 > pR2(mod9)  
     llh   llhNull     G2  McFadden    r2ML    r2CU   
   -12.3910108 -17.3086742  9.8353268  0.2841155  0.3252500  0.4338984  
   

From this we can see that our model explains around 30-40% of the variation in blight, which is not particularly good. We can use this index to compare models, as we did for AIC.
Each of these R squared is computed in a different way and you can read the documentation to know more. In general, one of the most commonly reported is the McFadden, which however tends to be conservative, and the r2ML. In this paper you can find a complete overview/comparison of various pseudo R-Squared: http://www.glmj.org/archives/articles/Smith_v39n2.pdf


Update 26/07/2017

In the package rcompanion there is the function nagelkerke, which computes other pseudo R squared, like McFadden, Cox and Snell and Nagelkerke:

 > nagelkerke(mod9)  
 $Models  
                          
 Model: "glm, blight ~ rain.am, binomial, dat"  
 Null: "glm, blight ~ 1, binomial, dat"     
   
 $Pseudo.R.squared.for.model.vs.null  
                Pseudo.R.squared  
 McFadden               0.284116  
 Cox and Snell (ML)          0.325250  
 Nagelkerke (Cragg and Uhler)     0.433898  
   
 $Likelihood.ratio.test  
  Df.diff LogLik.diff Chisq  p.value  
    -1   -4.9177 9.8353 0.0017119  
   


Update 26/07/2016 - Proportions

Proportions are generally in the range between 0 and 1, and so they may fit in the binomial family. In fact, one way of modelling proportion is to use the same glm code we saw for the logistic regression.
Let's look at one example:

 > data = crowder.seeds  
 > str(data)  
 'data.frame':  21 obs. of 5 variables:  
  $ plate : Factor w/ 21 levels "P1","P10","P11",..: 1 12 15 16 17 18 19 20 21 2 ...  
  $ gen  : Factor w/ 2 levels "O73","O75": 2 2 2 2 2 1 1 1 1 1 ...  
  $ extract: Factor w/ 2 levels "bean","cucumber": 1 1 1 1 1 1 1 1 1 1 ...  
  $ germ  : int 10 23 23 26 17 8 10 8 23 0 ...  
  $ n   : int 39 62 81 51 39 16 30 28 45 4 ...  

We can load the dataset crowder.seeds from agridat. In here the variable germ is the number of seeds that germinated, while n is the total number of seeds. Thus, we can obtain the proportion of seeds that germinated and we can try to model it.
The syntax to do so is the following:

 > mod1 = glm(cbind(germ, n) ~ gen + extract, data=data, family="binomial")  
 > summary(mod1)  
 Call:  
 glm(formula = cbind(germ, n) ~ gen + extract, family = "binomial",   
   data = data)  
 Deviance Residuals:   
   Min    1Q  Median    3Q   Max   
 -1.5431 -0.5006 -0.1852  0.3968  1.4796   
 Coefficients:  
         Estimate Std. Error z value Pr(>|z|)    
 (Intercept)   -1.0594   0.1326 -7.989 1.37e-15 ***  
 genO75      0.1128   0.1311  0.860   0.39    
 extractcucumber  0.5232   0.1233  4.242 2.22e-05 ***  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
 (Dispersion parameter for binomial family taken to be 1)  
   Null deviance: 33.870 on 20 degrees of freedom  
 Residual deviance: 14.678 on 18 degrees of freedom  
 AIC: 104.65  
 Number of Fisher Scoring iterations: 4  

Notice the use of cbind within the formula to calculate proportions.

Another technique you could use when dealing with proportions is the beta-regression. This can be fitted with the function betareg, in the package betareg. You can find more info at the following links:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.8394&rep=rep1&type=pdf

A good tutorial on beta regression written by Salvatore S. Mangiafico can be found here:
http://rcompanion.org/handbook/J_02.html

The sample code to perform a beta regression on these data is the following:

 #Beta-Regression  
 y.transf.betareg <- function(y){  
   n.obs <- sum(!is.na(y))  
   (y * (n.obs - 1) + 0.5) / n.obs  
 }  
 y.transf.betareg(data$germ/data$n)  
 mod2 = betareg(y.transf.betareg(data$germ/data$n) ~ gen + extract, data=data)  
 summary(mod2)  

I had to transform the proportion first using a correction suggested here https://stackoverflow.com/questions/26385617/proportion-modeling-betareg-errors
This is because one sample had value 0, and betareg does not work with either values 0 or 1.

Dealing with other distributions and transformation

As mentioned, GLM can be used for fitting linear models not only in the two scenarios we described above, but in any occasion where data do not comply with the normality assumption. For example, we can look at another dataset available in agridat, where the variable of interest is slightly non-normal:

 > dat = hughes.grapes  
 > str(dat)  
 'data.frame':  270 obs. of 6 variables:  
  $ block  : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ...  
  $ trt   : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ...  
  $ vine  : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ...  
  $ shoot  : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ...  
  $ diseased: int 1 2 0 0 3 0 7 0 1 0 ...  
  $ total  : int 14 12 12 13 8 9 8 10 14 10 ...  
   

The variable total presents a skewness of 0.73, which means that probably with a transformation it should fit with a normal distribution. However, for the sake of the discussion we will assume it cannot be transformed. So now our problem is identify the best distribution for our data, to do so we can use the function descdist in the package fitdistrplus we already loaded:

 descdist(dat$total, discrete = FALSE)  

this returns the following plot:


Where we can see that our data (blue dot) are close to normal and maybe closer to a gamma distribution. So now we can further check this using another function from the same package:

 plot(fitdist(dat$total,distr="gamma"))  

which creates the following plot:


From this we can see that in fact our data seem to be close to a gamma distribution, so now we can proceed with modelling:

 mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity))  

in the option family we included the name of the distribution, plus a link function that is used if we want to transform our data (in this case the function identity is for leaving data not transformed).

This is what we do to model other types of data that do not fit with a normal distribution. Other possible families supported by GLM are:

binomial, gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial, quasipoisson

Other possible link functions (which availability depends on the family) are:

logit, probit, cauchit, cloglog, identity, log, sqrt, 1/mu^2, inverse.



Generalized Linear Mixed Effects models

As linear model, linear mixed effects model need to comply with normality. If our data deviates too much we need to apply the generalized form, which is available in the package lme4:

 install.packages("lme4")  
 library(lme4)  

For this example we will use again the dataset johnson.blight:

 dat = johnson.blight  

Now we can fit a GLME model with random effects for area, and compare it with a model with only fixed effects:

 mod10 = glm(blight ~ precip.m, data=dat, family="binomial")  
   
 mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial")   
   
 > AIC(mod10, mod11)  
    df    AIC  
 mod10 2 37.698821  
 mod11 3 9.287692  
   

As you can see this new model reduces the AIC substantially.

The same function can be used for Poisson regression, but it does not work for quasipoisson overdispersed data. However, within lme4 there is the function glmer.nb for negative binomial mixed effect. The syntax is the same as glmer, except that in glmer.nb we do not need to include family.