Anything but R-bitrary: June 2012

What's the gain over lm()?
By Ben Ogorek

Random effects models have always intrigued me. They offer the flexibility of many parameters under a single unified, cohesive and parsimonious system. But with the growing size of data sets and increased ability to estimate many parameters with a high level of accuracy, will the subtleties of the random effects analysis be lost?

In this article, we will look an example that could be analyzed with either a traditional regression approach, using lm(), or a more sophisticated approach using random effects via the lme4 package by Douglas Bates, Martin Maechler and Ben Bolker (2011). And then I'll pose the question, "is it worth it?"

Random effects for the uninitiated

For the uninitiated in random effects models, suppose we have the linear model

y_j = βx_j+ ε_j

for j = 1,...,J, where ε_j is iid gaussian noise. But also suppose that this pattern repeats itself for some set of units i = 1,...,n. If the units are different, then maybe their slope and intercepts are different, and we'd have to rewrite the equation as

y_ij = α_i + β_ix_ij + ε_ij

where again the ε_ijs are iid gaussian variables.

But if every unit has its own α and β, and each unit is part of some stable population, shouldn't there be a common thread underlying the parameters? And if you know something about α₁ and α₂, maybe you should also know something about α₃ - not because of dependence - but because α₁ and α₂ give you an idea of what is typical of αs among units. Expand this and you'll get an idea of what is atypical as well.

Placing a joint probability distribution over these parameters is a more formal way to accomplish this. If you have a probability distribution (for better or for worse), then you're "random." Welcome to the world of random coefficients.

Setting up

Let's experiment. First, start with a clean R session. Below we set up the shell of our future data set. The code below produces 30 primary units and 15 measurements within each primary unit - a nice rectangular structure.

rm(list = ls())
set.seed(5432)
 
J <- 15
N <- 30
 
test.df <- data.frame( unit = sort(rep(c(1:N),J)),

                          J = rep(c(1:J),N) , x = rnorm(n = J*N) )

Next, we'll generate data from our above model, where β_i ~ N(3, .2²), α_i = 1 for all i, and ε_ij ~ N(0, .75²).

beta <- 3 + .2*rnorm(N)

test.df$beta <- beta[test.df$unit]
test.df$y <- 1 + test.df$x * test.df$beta + .75*rnorm(n = J*N)
head(test.df, 18)

The last line of code allows us to view the first 18 records.

> head(test.df, 18)
   unit  J           x     beta           y
1     1  1 -0.11300084 3.306203  0.26360591
2     1  2  1.28001884 3.306203  4.37991072
3     1  3 -0.99839530 3.306203 -2.41951563
4     1  4  0.92637466 3.306203  3.66430487
5     1  5  0.81537679 3.306203  3.10488869
6     1  6  1.60623340 3.306203  6.89280816
7     1  7 -0.12976856 3.306203  1.20187488
8     1  8  0.18872381 3.306203  2.23705005
9     1  9  0.22488639 3.306203  1.80095766
10    1 10 -0.23402213 3.306203  0.68786438
11    1 11  0.68923084 3.306203  3.10347646
12    1 12  0.05152819 3.306203  2.14595637
13    1 13  0.86556045 3.306203  4.74330873
14    1 14 -0.46953578 3.306203 -0.83396291
15    1 15 -0.45621596 3.306203 -0.33406847
16    2  1 -1.17441143 2.936359 -3.48678813
17    2  2 -0.40591147 2.936359 -0.03560251
18    2  3  0.69477215 2.936359  2.59961931

Notice how β_i changes only as the unit id changes, whereas y and x vary at the row level.

Separate regressions

Suppose we run seperate regressions for each unit using a loop of lm() calls:

beta.hat <- list()
for(i in 1:N){
  unit.lm <- lm(y ~ x, data = subset(test.df, unit == i) )
  beta.hat[i] <- coef(unit.lm)[2]
}
beta.hat <- as.numeric(beta.hat)

The cost of having to estimate the β_is can be seen by the histograms below, whereas the estimated β_is exhibit noticeably more variability.

par(mfrow = c(2, 1))
hist(beta, main = "Histogram of Actual Slopes", col = "blue",

     xlab = expression(beta[i]), cex.axis = 1.5, cex.lab = 1.5,

     breaks = seq(from = 2.4, to = 3.6, by = .1) )

hist(as.numeric(beta.hat), main = "Histogram of Estimated Slopes",

     xlab = expression(hat(beta)[i]), col = "blue", cex.axis = 1.5,

     cex.lab = 1.5, breaks = seq(from = 2.4, to = 3.6, by = .1) )

Random effects modeling using lme4

if you haven't already, install the lme4 package using the command

install.packages("lme4").

The flagship function of the lme4 package is the lmer() function, a likelihood based system for estimating random effects models. Its formula notation works like lm()'s for fixed effects, but if you try to run a basic lm() model in it, you'll get an error message - lmer() needs random effects! When you have fixed effects, you do enter them as in lm(). For random effects, the form is

(formula for random terms | unit for which these terms apply).

Starting on the left side of the bar, the formula for a random intercept, by itself, is simply "1". The formula for a random regression coeficient for a variable x, without the corresponding random intercept, is "0 + x". Random intercepts are included by default, so "x" and "1 + x" are equivalent specifications of both a random slope and a random intercept.

Random effects must vary at a coarser grain than at the finest level, or else they'd be confounded (with ε_ij in our case). The second part of the random formula specification requires a variable that identifies this coarser grain, which in our case is the unit identifier.

Below we apply lmer() to our generated data. For demonstration purposes, both a random intercept and slope are specified, even though we know the intercept is fixed. And as a general point about lmer(), we'll need to include the mean of our random β_is as fixed effect. This is accomplished by adding x as a fixed regressor alongside its specification as a random effect (think random deviations from a fixed mean).

library(lme4)
re.lm <- lmer(y ~ x + (1+x|unit), data = test.df) 
summary(re.lm)

Note that lme4 is an S4 class, so running names(re.lm) to explore the structure won't be of any help. The summary function, however, gets us the overview we need.

> summary(re.lm)
Linear mixed model fit by REML 
Formula: y ~ x + (1 + x | unit) 
   Data: test.df 
  AIC  BIC logLik deviance REMLdev
 1034 1059 -511.1     1013    1022
Random effects:
 Groups   Name        Variance  Std.Dev. Corr   
 unit     (Intercept) 0.0040782 0.063861        
          x           0.0134348 0.115909 -1.000 
 Residual             0.5427579 0.736721        
Number of obs: 450, groups: unit, 30

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.02645    0.03704   27.71
x            3.05527    0.04059   75.28

Correlation of Fixed Effects:
  (Intr)
x -0.099

The colors are to link the estimates to their true values based on β_i ~ N(3, .2²) and
ε_ij ~ N(0, .75²). Note that the ε error variance has an estimate much closer to its true value than the other two parameters. This shouldn't come as a surprise; there are 15 times more opportunities (450 vs. 30) to observe variation in the random variable ε_ij than in β_i.

At 1.03, the true common intercept of 1.00 is recovered without much error, but what can lmer() say about whether the intercept is random? Since a random variable with zero variance is for all practical purposes a fixed constant, we want to test the hypothesis:

H₀: σ_α² = 0

versus the alternative

H₁: σ_α ² > 0.

We can see that the estimated variance for the random intercept, at 0.004, is much less than that of the random slope, at 0.014. Due to the importance of the zero-variance hypothesis, I would have liked to see it included as part of the default output. But with a little extra work, we can search for evidence of positive variance. It starts by rerunning lmer() without the random intercept.

re.lm <- lmer(y ~ x + (0+x|unit), data = test.df)

summary(re.lm)

Linear mixed model fit by REML 
Formula: y ~ x + (0 + x | unit) 
   Data: test.df 
  AIC  BIC logLik deviance REMLdev
 1031 1048 -511.7     1014    1023
Random effects:
 Groups   Name Variance Std.Dev.
 unit     x    0.013324 0.11543 
 Residual      0.547969 0.74025 
Number of obs: 450, groups: unit, 30

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.02755    0.03538   29.05
x            3.05594    0.04056   75.34

Correlation of Fixed Effects:
  (Intr)
x 0.076

We could get a p-value by subtracting the log likelihoods, multiplying by two, and comparing to a χ² with 1 df, but I'll be lazy and notice that both the AIC and BIC are smaller for the second model, and that the log likelihood barely chaged at all (knowing the true answer helps too) . We conclude the intercept has zero variance and can be represented with a single fixed parameter.

Maybe it's been a while since you've done a likelihood ratio test in a regression context, but you're not in least squares land anymore. You'll notice at the top of the output above that the model was fit by REML, a variation on maximum likelihood. A search for the maximum was conducted behind the scenes, and lme4's "Computational Methods" vignette discusses the related complexities. A warning from Douglas Bates within the vignette "lmer for SAS PROC MIXED Users" (the SASmixed package, 2011): when using REML (the default), model comparisons based on AIC, BIC or the log likelihood should only be made when the competing models having identical fixed effect structures.

The lme4 gain

Here "gain" refers the reduction of mean squared error from estimating the β_is from lmer() rather than separate lm()s. The data was generated according to lmer()'s assumptions, so the question is, how much more information can we extract?

Before answering this question, we first need to decide what it means to "estimate" an individual β_i. Remember, these are random variables from the standpoint of lmer() and unknown fixed constants from the standpoint of lm().

(image taken from http://en.wikipedia.org/wiki/File:Comparison_mean_median_mode.svg)

The coef() function in lme4 returns the posterior modes of the β_is. That is for a given i, knowledge of the overall distribution of β_i, which is N~ (3.06, .12²) is updated via Bayes' rule conditional on the values of y_ij and x_ij for j = 1,...,J. Below a few lines of output from the coef() function are shown.

coef(re.lm)

> coef(re.lm)
$unit
   (Intercept)        x
1     1.027547 3.093437
2     1.027547 3.016321
3     1.027547 3.120390
4     1.027547 3.030907
5     1.027547 2.947606

After fumbling with the structure above, I eventually managed to extract the posterior modes of β_i and store it in the vector re.beta.

re.beta <- coef(re.lm)$unit[,"x"]

Recall that the lm() estimates of β_i are stored in the vector beta.hat. We can plot them using the code below:

par(mfrow = c(1,1))
plot(re.beta ~ beta.hat, xlim = c(2.4, 3.6), ylim = c(2.4,3.6),

           pch = 20, col = "blue",  cex = 1.6, cex.axis = 1.2,

           cex.lab = 1.5, xlab = expression(hat(beta)[i]),

           ylab =  "",

           main = "Posterior modes versus least squares estimates")

title(ylab =  expression(paste("mode of ", hat(italic(p)),

           "(", beta[i], "|", bold(y)[i], ",", bold(x)[i],")"),

            col.main = "red" ) , cex.lab = 1.5, mgp = c(2.4,0,0))

abline(h = 3.0, v = 3.0, col = "red", lty = 2)
abline(h = 3.05594, col = "grey", lty = 3)

The above graph contains (x,y) pairs of the form

(lm estimate of β_i,lmer estimate of β_i).

Notice how there is more variation in the x-axis than the y-axis; lmer's estimates exhibit less variability than lm.

Also notice the red dotted lines at 3.0, the true mean of the β_i population, and how the points fall relative to them. The estimates from lm() fall more or less symmetrically about the vertical line x = 3.0, whereas the estimates from lmer() tend to fall above the horizontal line y = 3.0. The light grey dotted line corresponds to the estimated mean of the β_is by lmer(), which at 3.06 is slightly higher than the true value. The lmer() estimates are much more symmetrically distributed about this line, illustrating an important point: lmer()'s estimates are shrunk towards the population mean estimate. This is a conditional bias given the population mean estimate. (I believe there is no bias unconditionally, but I'm a little rusty on my REML theory.)

To get an overall picture of both the variability and bias of the two estimators of β_i for our given sample, we turn to Mean Squared Error (MSE). Recall that MSE = Bias² + Variance. The MSE of the two estimators can be computed with the following code:

re.mse <- mean( (re.beta - beta)^2 )
lm.mse <- mean( (beta.hat - beta)^2 )
 
cat("Mean Squared Error for first pass linear models:", lm.mse, 
  "\n                 versus random effects modeling:", re.mse)

The above code produces the output

Mean Squared Error for first pass linear models: 0.03892798 
                 versus random effects modeling: 0.01715932

showing that lmer() is the clear winner.

But least-squares regression estimates are consistent and the β_is will be estimated with increasing precision as the within-subject sample size J increases. Let's see what happens as J is increased to 30 and then 100.
J = 30

Mean Squared Error for first pass linear models: 0.0148474 
                 versus random effects modeling: 0.008529688

J = 100

Mean Squared Error for first pass linear models: 0.00569652 
                 versus random effects modeling: 0.005613028

By J = 100, the superiority of lmer() to lm() is negligible.

Discussion

We considered the random effects model in a very specific context. Still, we arrive at the following discussion points.

Does using lmer() make one a bayesian?

No. But with model parameters that are random, the use of posterior distributions, shrinkage of estimates towards a more stable target, life suddenly feels more bayesian. Bates (2006) even recommends Markov Chain Monte Carlo to investigate the properties of the random coefficients through lme4's mcmcsamp() function.

My academic advisor Len Stefanski convinced me that simply using Bayes' rule does not make one a bayesian, and I suppose that holds true for other tools bayesians use like shrinkage and Markov Chain Monte Carlo.

If those units indexed by i and described by β_i constitute a conceptually infinite population, then probability theory is still conceptually aligned with relative frequency, and the frequentists should be happy. When I studied this subject at NC State, in a frequentist context, much more emphasis was placed on the fixed parameters like the mean and variance of the β_is than on improved estimates of the β_is themselves.

Particularities of lmer()

From the following post from Douglas Bates back in 2006, you can tell that a lot of people have had different ideas about the direction lme4 and lmer() should go, and not all of them particularly helpful.

As for my own opinions, I would like to see tests for zero variance components, and to be able to fit a model with only fixed effects so that testing against a null model without any random effects is easier. I also could not find a predict() function available, so I don't see how you can use lme4 to truly predict out of sample (i.e., to obtain a posterior estimate of β_i* for a unit i* not used in estimation).

Overall, I believe lmer() is an excellent function within an excellent package. I hope it continues to be enhanced by the larger community.

Is it worth it?

I guess it depends. If you have lots of secondary data points for every primary unit, then the incremental benefit of a true random effects analysis will probably be small. Gumpertz and Pantula (1989) even show how you can use the least square regression estimates to estimate the mean and covariance matrix of random effects distribution, with decent large sample properties.

But if you have high measurement error, small secondary sample sizes, or both, then random effects analysis can really make a difference by sharing information among subjects and providing a stable target to shrink regression estimates towards. Also in these cases, the estimation and testing of variance components may have their own implications.

I'll be continuing to look for my own "killer app" for random effects analysis. Please share if you have one of your own!

All highlighted R-code segments were created by Pretty R at inside-R.org. Keep up with ours and other great articles relating to R on R-bloggers, and follow me on Twitter (@baogorek) for my latest research updates.

References

Douglas Bates (2006). "[R] lemer, p-values and all that." The R-help Archives. Date of posting: Fri May 19 22:40:27 CEST 2006. URL https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html (accessed June 10, 2012).
Douglas Bates, Martin Maechler and Ben Bolker (2011). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-42. URL http://CRAN.R-project.org/package=lme4
M. L. Gumpertz and S. G. Pantula (1989) . "A simple approach to inferences in random coefficient models." The American Statistician, Volume 43, pp 203–210.
Original by Littel, Milliken, Stroup and Wolfinger modifications by Douglas Bates (2011). SASmixed: Data sets from "SAS System for Mixed Models". R package version 1.0-1. URL http://CRAN.R-project.org/package=SASmixed
Pretty R Syntax Highlighter. inside-R.org. URL http://www.inside-r.org/pretty-r (accessed June 10, 2012).
R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/

Anything but R-bitrary

Monday, June 11, 2012

Random regression coefficients using lme4