Monday, June 11, 2012

Random regression coefficients using lme4

What's the gain over lm()?
By Ben Ogorek

Random effects models have always intrigued me. They offer the flexibility of many parameters under a single unified, cohesive and parsimonious system. But with the growing size of data sets and increased ability to estimate many parameters with a high level of accuracy, will the subtleties of the random effects analysis be lost?

In this article, we will look at an example that could be analyzed either with a traditional regression approach, using lm(), or with a more sophisticated approach using random effects via the lme4 package by Douglas Bates, Martin Maechler and Ben Bolker (2011). Then I'll pose the question: is it worth it?

Random effects for the uninitiated

For the uninitiated in random effects models, suppose we have the linear model

yj = α + βxj + εj


for j = 1,...,J, where εj is iid gaussian noise. But also suppose that this pattern repeats itself for some set of units i = 1,...,N. If the units are different, then maybe their intercepts and slopes are different, and we'd have to rewrite the equation as

yij = αi + βixij + εij


where again the εijs are iid gaussian variables.
But if every unit has its own α and β, and each unit is part of some stable population, shouldn't there be a common thread underlying the parameters? And if you know something about α1 and α2, maybe you should also know something about α3 - not because of dependence - but because α1 and α2 give you an idea of what is typical of αs among units. Expand this and you'll get an idea of what is atypical as well. 


Placing a joint probability distribution over these parameters is a more formal way to accomplish this. If you have a probability distribution (for better or for worse), then you're "random." Welcome to the world of random coefficients.
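In the model above, one natural way to do this (and roughly the structure lmer() will assume later in this post) is to suppose the unit-level parameters are themselves draws from normal distributions, e.g.

αi ~ N(μα, σα²) and βi ~ N(μβ, σβ²),

possibly correlated with each other but independent of the εijs, with the population-level means and variances treated as fixed unknowns to estimate.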

Setting up

Let's experiment. First, start with a clean R session. The code below sets up the shell of our future data set: 30 primary units with 15 measurements each - a nice rectangular structure.


rm(list = ls())
set.seed(5432)
 
J <- 15  # measurements per primary unit
N <- 30  # number of primary units
 
test.df <- data.frame(unit = sort(rep(1:N, J)),  # unit id: each of 1:N repeated J times
                      J = rep(1:J, N),           # measurement index within unit
                      x = rnorm(n = J * N))      # regressor: one draw per row

Next, we'll generate data from our above model, where βi ~ N(3, 0.2²), αi = 1 for all i, and εij ~ N(0, 0.75²).

beta <- 3 + .2 * rnorm(N)            # unit-specific slopes: beta_i ~ N(3, .2^2)
test.df$beta <- beta[test.df$unit]   # attach each unit's slope to its rows
test.df$y <- 1 + test.df$x * test.df$beta + .75 * rnorm(n = J * N)  # common intercept of 1
head(test.df, 18)

 The last line of code allows us to view the first 18 records.

> head(test.df, 18)
   unit  J           x     beta           y
1     1  1 -0.11300084 3.306203  0.26360591
2     1  2  1.28001884 3.306203  4.37991072
3     1  3 -0.99839530 3.306203 -2.41951563
4     1  4  0.92637466 3.306203  3.66430487
5     1  5  0.81537679 3.306203  3.10488869
6     1  6  1.60623340 3.306203  6.89280816
7     1  7 -0.12976856 3.306203  1.20187488
8     1  8  0.18872381 3.306203  2.23705005
9     1  9  0.22488639 3.306203  1.80095766
10    1 10 -0.23402213 3.306203  0.68786438
11    1 11  0.68923084 3.306203  3.10347646
12    1 12  0.05152819 3.306203  2.14595637
13    1 13  0.86556045 3.306203  4.74330873
14    1 14 -0.46953578 3.306203 -0.83396291
15    1 15 -0.45621596 3.306203 -0.33406847
16    2  1 -1.17441143 2.936359 -3.48678813
17    2  2 -0.40591147 2.936359 -0.03560251
18    2  3  0.69477215 2.936359  2.59961931

Notice how βi changes only as the unit id changes, whereas y and x vary at the row level.

Separate regressions

Suppose we run separate regressions for each unit using a loop of lm() calls:

beta.hat <- list()
for (i in 1:N) {
  unit.lm <- lm(y ~ x, data = subset(test.df, unit == i))  # least squares for unit i only
  beta.hat[i] <- coef(unit.lm)[2]                          # keep the slope estimate
}
beta.hat <- as.numeric(beta.hat)
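For what it's worth, the same slopes can be collected a bit more compactly with sapply(); this is purely a stylistic alternative to the loop above and produces the same beta.hat vector:

# Equivalent to the loop above: one least-squares slope per unit
beta.hat <- sapply(1:N, function(i)
  coef(lm(y ~ x, data = subset(test.df, unit == i)))[2])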

The cost of having to estimate the βis shows up in the histograms below: the estimated βis exhibit noticeably more variability than the true slopes.

par(mfrow = c(2, 1))
hist(beta, main = "Histogram of Actual Slopes", col = "blue",
     xlab = expression(beta[i]), cex.axis = 1.5, cex.lab = 1.5,
     breaks = seq(from = 2.4, to = 3.6, by = .1) )
hist(as.numeric(beta.hat), main = "Histogram of Estimated Slopes",
     xlab = expression(hat(beta)[i]), col = "blue", cex.axis = 1.5,
     cex.lab = 1.5, breaks = seq(from = 2.4, to = 3.6, by = .1) )


Random effects modeling using lme4

If you haven't already, install the lme4 package with

install.packages("lme4")


The flagship function of the lme4 package is lmer(), a likelihood-based system for estimating random effects models. Its formula notation works like lm()'s for fixed effects, but if you try to run a basic lm()-style model with it, you'll get an error message - lmer() needs random effects! Fixed effects are entered just as in lm(). For random effects, the form is

 (formula for random terms | unit for which these terms apply).

Starting on the left side of the bar, the formula for a random intercept by itself is simply "1". The formula for a random regression coefficient for a variable x, without the corresponding random intercept, is "0 + x". Random intercepts are included by default, so "x" and "1 + x" are equivalent specifications of both a random slope and a random intercept.

Random effects must vary at a coarser grain than the finest level of the data, or else they'd be confounded (with εij in our case). The second part of the random-effects specification, after the bar, names a variable that identifies this coarser grain - in our case, the unit identifier.
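To make the syntax concrete, here are a few specifications using our unit variable (these are formulas only, shown for illustration; only the versions appearing later in the post are actually fit):

y ~ x + (1 | unit)      # random intercept only
y ~ x + (0 + x | unit)  # random slope for x, no random intercept
y ~ x + (x | unit)      # random intercept and random slope
y ~ x + (1 + x | unit)  # identical to the line above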

Below we apply lmer() to our generated data. For demonstration purposes, both a random intercept and a random slope are specified, even though we know the intercept is fixed. As a general point about lmer(), we need to include the mean of our random βis as a fixed effect. This is accomplished by adding x as a fixed regressor alongside its specification as a random effect (think random deviations from a fixed mean).

library(lme4)
re.lm <- lmer(y ~ x + (1+x|unit), data = test.df) 
summary(re.lm)

Note that lmer() returns an S4 object, so running names(re.lm) to explore its structure won't be of any help. The summary() function, however, gives us the overview we need.
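If you do want to poke around the fitted object directly, base R's S4 tools are one option (a small aside, not needed for anything below):

class(re.lm)      # the S4 class of the fitted model object
slotNames(re.lm)  # the S4 analogue of names()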

> summary(re.lm)
Linear mixed model fit by REML 
Formula: y ~ x + (1 + x | unit) 
   Data: test.df 
  AIC  BIC logLik deviance REMLdev
 1034 1059 -511.1     1013    1022
Random effects:
 Groups   Name        Variance  Std.Dev. Corr   
 unit     (Intercept) 0.0040782 0.063861        
          x           0.0134348 0.115909 -1.000 
 Residual             0.5427579 0.736721        
Number of obs: 450, groups: unit, 30

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.02645    0.03704   27.71
x            3.05527    0.04059   75.28

Correlation of Fixed Effects:
  (Intr)
x -0.099

Compare these estimates to their true values: βi ~ N(3, 0.2²) and εij ~ N(0, 0.75²). Note that the ε error variance has an estimate much closer to its true value than the other two parameters. This shouldn't come as a surprise; there are 15 times more opportunities (450 vs. 30) to observe variation in the random variable εij than in βi.
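If you would rather pull these estimates out programmatically than read them off the printout, lme4 provides accessor functions; the two below cover the pieces discussed here:

fixef(re.lm)    # estimated fixed effects: the intercept and the mean slope
VarCorr(re.lm)  # estimated variance components for the unit-level terms and the residual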

At 1.03, the true common intercept of 1.00 is recovered without much error, but what can lmer() say about whether the intercept is random? Since a random variable with zero variance is, for all practical purposes, a fixed constant, we want to test the hypothesis

H0: σα² = 0

versus the alternative

H1: σα² > 0.

We can see that the estimated variance for the random intercept, at 0.004, is much less than that of the random slope, at 0.014. Due to the importance of the zero-variance hypothesis, I would have liked to see it included as part of the default output. But with a little extra work, we can search for evidence of positive variance. It starts by rerunning lmer() without the random intercept.

re.lm <- lmer(y ~ x + (0+x|unit), data = test.df)
summary(re.lm)

Linear mixed model fit by REML 
Formula: y ~ x + (0 + x | unit) 
   Data: test.df 
  AIC  BIC logLik deviance REMLdev
 1031 1048 -511.7     1014    1023
Random effects:
 Groups   Name Variance Std.Dev.
 unit     x    0.013324 0.11543 
 Residual      0.547969 0.74025 
Number of obs: 450, groups: unit, 30

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.02755    0.03538   29.05
x            3.05594    0.04056   75.34

Correlation of Fixed Effects:
  (Intr)
x 0.076 

We could get a p-value by subtracting the log likelihoods, multiplying by two, and comparing to a χ² distribution (the two fits actually differ by two parameters here - the intercept variance and its correlation with the slope - and testing a variance at its boundary of zero makes the naive χ² reference conservative). But I'll be lazy and notice that both the AIC and BIC are smaller for the second model, and that the log likelihood barely changed at all (knowing the true answer helps too). We conclude the intercept has zero variance and can be represented with a single fixed parameter.
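For the non-lazy, here is a sketch of that computation (not part of the original analysis; both fits share the same fixed-effects structure, so the REML caveat in the next paragraph does not bite):

re.lm.full  <- lmer(y ~ x + (1 + x | unit), data = test.df)
re.lm.slope <- lmer(y ~ x + (0 + x | unit), data = test.df)
lrt <- as.numeric(2 * (logLik(re.lm.full) - logLik(re.lm.slope)))
pchisq(lrt, df = 2, lower.tail = FALSE)  # naive p-value; conservative, since
                                         # sigma^2_alpha = 0 sits on the boundary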

Maybe it's been a while since you've done a likelihood ratio test in a regression context, but you're not in least squares land anymore. You'll notice at the top of the output above that the model was fit by REML, a variation on maximum likelihood; a search for the maximum was conducted behind the scenes, and lme4's "Computational Methods" vignette discusses the related complexities. A warning from Douglas Bates in the vignette "lmer for SAS PROC MIXED Users" (the SASmixed package, 2011): when using REML (the default), model comparisons based on AIC, BIC or the log likelihood should only be made when the competing models have identical fixed effect structures.

The lme4 gain

Here "gain" refers the reduction of mean squared error from estimating the βis from lmer() rather than separate lm()s. The data was generated according to lmer()'s assumptions, so the question is, how much more information can we extract?

Before answering this question, we first need to decide what it means to "estimate" an individual βi. Remember, these are random variables from the standpoint of lmer() and unknown fixed constants from the standpoint of lm().



The coef() function in lme4 returns the posterior modes of the βis. That is, for a given i, knowledge of the overall distribution of the βis, estimated as N(3.06, 0.12²), is updated via Bayes' rule conditional on the values of yij and xij for j = 1,...,J. A few lines of output from the coef() function are shown below.

coef(re.lm)

> coef(re.lm)
$unit
   (Intercept)        x
1     1.027547 3.093437
2     1.027547 3.016321
3     1.027547 3.120390
4     1.027547 3.030907
5     1.027547 2.947606

After fumbling with the structure above, I eventually managed to extract the posterior modes of the βis and store them in the vector re.beta.

re.beta <- coef(re.lm)$unit[,"x"]

Recall that the lm() estimates of the βis are stored in the vector beta.hat. We can plot the two sets of estimates against each other using the code below:

par(mfrow = c(1,1))
plot(re.beta ~ beta.hat, xlim = c(2.4, 3.6), ylim = c(2.4,3.6), 
           pch = 20, col = "blue",  cex = 1.6, cex.axis = 1.2,
           cex.lab = 1.5, xlab = expression(hat(beta)[i]), 
           ylab =  "", 
           main = "Posterior modes versus least squares estimates")  
title(ylab = expression(paste("mode of ", hat(italic(p)),
           "(", beta[i], "|", bold(y)[i], ",", bold(x)[i], ")")),
      cex.lab = 1.5, mgp = c(2.4, 0, 0))
abline(h = 3.0, v = 3.0, col = "red", lty = 2)
abline(h = 3.05594, col = "grey", lty = 3)

The above graph contains (x, y) pairs of the form


(lm estimate of βi, lmer estimate of βi). 


Notice how there is more variation along the x-axis than along the y-axis; lmer()'s estimates exhibit less variability than lm()'s.


Also notice the red dotted lines at 3.0, the true mean of the βi population, and how the points fall relative to them. The estimates from lm() fall more or less symmetrically about the vertical line x = 3.0, whereas the estimates from lmer() tend to fall above the horizontal line y = 3.0. The light grey dotted line corresponds to lmer()'s estimate of the mean of the βis, which at 3.06 is slightly higher than the true value. The lmer() estimates are much more symmetrically distributed about this line, illustrating an important point: lmer()'s estimates are shrunk towards the estimated population mean. This is a conditional bias given the population mean estimate. (I believe there is no bias unconditionally, but I'm a little rusty on my REML theory.)

To get an overall picture of both the variability and bias of the two estimators of βi for our given sample, we turn to Mean Squared Error (MSE). Recall that MSE = Bias² + Variance. The MSE of the two estimators can be computed with the following code:


re.mse <- mean( (re.beta - beta)^2 )
lm.mse <- mean( (beta.hat - beta)^2 )
 
cat("Mean Squared Error for first pass linear models:", lm.mse, 
  "\n                 versus random effects modeling:", re.mse)

The above code produces the output

Mean Squared Error for first pass linear models: 0.03892798 
                 versus random effects modeling: 0.01715932

showing that lmer() is the clear winner. 
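Since MSE = Bias² + Variance, it is also easy to check where each estimator's error comes from in this sample. The helper below is a sketch (mse.parts() is a name introduced here, not from the original post):

mse.parts <- function(est, truth) {
  err <- est - truth
  bias.sq <- mean(err)^2                 # squared average error
  variance <- mean((err - mean(err))^2)  # variance of the errors (dividing by n)
  c(bias.sq = bias.sq, variance = variance, mse = bias.sq + variance)
}
mse.parts(beta.hat, beta)  # separate lm() fits
mse.parts(re.beta, beta)   # lmer() posterior modes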


But least-squares regression estimates are consistent, and the βis will be estimated with increasing precision as the within-unit sample size J increases. Let's see what happens as J is increased to 30 and then to 100.
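The J = 30 and J = 100 results below come from rerunning the steps above with J reset. One compact way to organize such a rerun is to wrap the whole experiment in a function; the sketch below is one possibility (simulate.mse() is a name introduced here, and with a different seed the exact numbers will differ):

simulate.mse <- function(J, N = 30) {
  sim.df <- data.frame(unit = rep(1:N, each = J), x = rnorm(J * N))
  beta <- 3 + .2 * rnorm(N)
  sim.df$y <- 1 + sim.df$x * beta[sim.df$unit] + .75 * rnorm(J * N)
  # Separate least-squares fits, one slope per unit
  beta.hat <- sapply(1:N, function(i)
    coef(lm(y ~ x, data = sim.df[sim.df$unit == i, ]))[2])
  # Random-slope-only model, matching the final specification above
  re.beta <- coef(lmer(y ~ x + (0 + x | unit), data = sim.df))$unit[, "x"]
  c(lm.mse = mean((beta.hat - beta)^2), re.mse = mean((re.beta - beta)^2))
}
simulate.mse(J = 30)
simulate.mse(J = 100)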
J = 30


Mean Squared Error for first pass linear models: 0.0148474 
                 versus random effects modeling: 0.008529688


J = 100


Mean Squared Error for first pass linear models: 0.00569652 
                 versus random effects modeling: 0.005613028


By J = 100, the advantage of lmer() over lm() is negligible.

Discussion

We considered the random effects model in a very specific context. Still, we arrive at the following discussion points.


Does using lmer() make one a bayesian? 

No. But with model parameters that are random, posterior distributions in use, and estimates shrunk towards a more stable target, life suddenly feels more bayesian. Bates (2006) even recommends Markov Chain Monte Carlo, via lme4's mcmcsamp() function, to investigate the properties of the random coefficients.

My academic advisor Len Stefanski convinced me that simply using Bayes' rule does not make one a bayesian, and I suppose that holds true for other tools bayesians use like shrinkage and Markov Chain Monte Carlo. 


If those units indexed by i and described by βi constitute a conceptually infinite population, then probability theory is still conceptually aligned with relative frequency, and the frequentists should be happy. When I studied this subject at NC State, in a frequentist context, much more emphasis was placed on the fixed parameters like the mean and variance of the βis than on improved estimates of the βis themselves. 

Particularities of lmer()

From a 2006 post by Douglas Bates on the r-help mailing list (see the References), you can tell that a lot of people have had different ideas about the direction lme4 and lmer() should take, and that not all of those ideas were particularly helpful.


As for my own opinions, I would like to see tests for zero variance components, and to be able to fit a model with only fixed effects so that testing against a null model without any random effects is easier. I also could not find a predict() function available, so I don't see how you can use lme4 to truly predict out of sample (i.e., to obtain a posterior estimate of βi* for a unit i* not used in estimation). 


Overall, I believe lmer() is an excellent function within an excellent package. I hope it continues to be enhanced by the larger community.

Is it worth it?

I guess it depends. If you have lots of secondary data points for every primary unit, then the incremental benefit of a true random effects analysis will probably be small. Gumpertz and Pantula (1989) even show how you can use the least-squares regression estimates to estimate the mean and covariance matrix of the random effects distribution, with decent large-sample properties.


But if you have high measurement error, small secondary sample sizes, or both, then random effects analysis can really make a difference by sharing information among subjects and providing a stable target to shrink regression estimates towards. Also in these cases, the estimation and testing of variance components may have their own implications.


I'll be continuing to look for my own "killer app" for random effects analysis. Please share if you have one of your own!

All highlighted R code segments were created by Pretty R at inside-R.org. Keep up with this and other great articles relating to R on R-bloggers, and follow me on Twitter (@baogorek) for my latest research updates.




References
  • Douglas Bates (2006). "[R] lemer, p-values and all that." The R-help Archives. Date of posting: Fri May 19 22:40:27 CEST 2006. URL https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html  (accessed June 10, 2012).
  • Douglas Bates, Martin Maechler and Ben Bolker (2011). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-42. URL http://CRAN.R-project.org/package=lme4
  • M. L. Gumpertz and S. G. Pantula (1989) . "A simple approach to inferences in random coefficient models." The American Statistician, Volume 43, pp 203–210.
  • Original by Littell, Milliken, Stroup and Wolfinger; modifications by Douglas Bates (2011). SASmixed: Data sets from "SAS System for Mixed Models". R package version 1.0-1. URL http://CRAN.R-project.org/package=SASmixed
  • Pretty R Syntax Highlighter. inside-R.org. URL http://www.inside-r.org/pretty-r (accessed June 10, 2012).
  • R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/