Sunday, April 8, 2012

Interpreting R's lm with categorical predictors

What's with those estimates?
By Ben Ogorek

In R, categorical variables can be added to a regression using the lm() function without a hint of extra work. But have you ever looked at the resulting estimates and wondered exactly what they were?

First, let's define a data set. (I recommend running this article's R code, interactively, as it is displayed.)

set.seed(12255)
n <- 30
AOV.df <- data.frame(category = c(rep("category1", n),
                                  rep("category2", n),
                                  rep("category3", n)),
                            j = c(1:n, 1:n, 1:n),
                            y = c(8.0 + 2.0*rnorm(n),
                                  9.5 + 2.0*rnorm(n),
                                  11.0 + 2.0*rnorm(n)))
Using mathematical notation, the above data are described by y1j ~ N(8, 2²), y2j ~ N(9.5, 2²), and y3j ~ N(11, 2²), for categories i = 1, 2, 3 and within-category observations j = 1, 2, ..., 30. Below, we visualize the data using code modified from Winston Chang's Cookbook for R.

library(ggplot2)
ggplot(AOV.df, aes(x = y, fill = category)) +
  geom_histogram(binwidth = 1.5, alpha = .5, position = "identity")
Clearly there is separation, although less so between categories 1 and 2. Regression can recover the true means, up to sampling noise, by adding the categorical variable to the lm function.

AOV.lm <- lm(y ~ category, data = AOV.df)
summary(AOV.lm)

But why does the output look like this?

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         8.4514     0.3405  24.823  < 2e-16 ***
categorycategory2   0.8343     0.4815   1.733   0.0867 .  
categorycategory3   3.0017     0.4815   6.234 1.58e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.865 on 87 degrees of freedom
Multiple R-squared: 0.3225, Adjusted R-squared: 0.307 
F-statistic: 20.71 on 2 and 87 DF,  p-value: 4.403e-08 

The estimate of the residual standard error is 1.865, close to the true value of 2. But what about the intercept of 8.45, category 2's estimate of 0.83, and the lack of a category 1 estimate? Why is category 3's associated p-value so very small, whereas category 2's is only marginally significant? Surely there is real separation between categories 2 and 3. 

You might imagine that the intercept is estimating 8, category 2's parameter is estimating 1.5 (so that 8 + 1.5 = 9.5), and category 3's parameter is estimating 3.0 (so that 8 + 3 = 11). But why?
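As a quick numerical check of that guess (a minimal sketch reusing the AOV.df and AOV.lm objects from above), compare the raw category means to the reported coefficients:

# Sample means of y within each category
means <- with(AOV.df, tapply(y, category, mean))
means

# The intercept is exactly the first sample mean, and the other two
# coefficients are exactly differences from it
coef(AOV.lm)
c(means["category1"],
  means["category2"] - means["category1"],
  means["category3"] - means["category1"])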

The Classical Effects Model

The classical effects model, as presented in Rawlings, Pantula, and Dickey (1998, Section 9.2), describes the data generating mechanism as yij = μ + τi + εij, for categories i = 1, 2, 3 and within-category observations j = 1, ..., 30. The effects model is nice conceptually because you can think of μ as a baseline level, and τ1, τ2, and τ3 as deviations from that baseline.

But this model has an embarrassing little secret: you can't actually estimate the parameters. And it's not just that you can't know them without error; the data literally do not contain the information needed to estimate all of the parameters. In statistical terminology, these theoretical quantities are not estimable.

When you consider that there are only three category means but four location parameters (μ, τ1, τ2, and τ3), this should make sense: the classical effects model is overparameterized.
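One way to see the overparameterization in action (a small sketch that builds explicit 0/1 dummy variables from AOV.df; the dummies data frame is just an illustrative name) is to hand lm an intercept plus all three indicators and watch it refuse to estimate one of them:

dummies <- data.frame(y  = AOV.df$y,
                      d1 = as.numeric(AOV.df$category == "category1"),
                      d2 = as.numeric(AOV.df$category == "category2"),
                      d3 = as.numeric(AOV.df$category == "category3"))

# The three indicators sum to one, duplicating the intercept column,
# so lm() aliases the last one and reports NA for its coefficient
coef(lm(y ~ d1 + d2 + d3, data = dummies))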

If an overparameterized model seems pointless, the means model presents an alternative: yij = μi + εij. If you guessed that the μi's are just the category means, you'd be right. So why doesn't the lm function use the μi's as parameters?
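As an aside, you can fit the means model in R by dropping the intercept; a minimal sketch, again using AOV.df:

# With no intercept, each category gets its own coefficient,
# and those coefficients are exactly the three sample means
means.lm <- lm(y ~ 0 + category, data = AOV.df)
coef(means.lm)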

It turns out that there is some utility in the effects model after all. Even though you can't estimate μ, τ1, τ2, and τ3 individually, consider what happens when you take a difference of observations from two categories:

y2j - y1j = (μ + τ2 + ε2j) - (μ + τ1 + ε1j)
          = τ2 - τ1 + (ε2j - ε1j).

The term in the parentheses is just noise (a random variable with an expectation of zero), which means that y2j - y1j is an estimate of τ2 - τ1. The function τ2 - τ1 is estimable! Linking the means model with the classical effects model, we have μ2 - μ1 = τ2 - τ1.

Arguably, the classical effects model more clearly articulates the quantities we are interested in: the differences between the groups. And when there are multiple factors, additive effects provide a way to simplify a model. The number of cell means grows exponentially with the number of factors, but in the absence of interaction, the number of effects grows linearly with the number of factors (for instance, five three-level factors imply 3^5 = 243 cell means but only 1 + 5×2 = 11 parameters in an additive effects model). Rawlings, Pantula, and Dickey (citing Hocking, 1985) suggest that the means model is "more direct," but that the effects model "conveys the structure of the data, which in turn generates logical hypotheses and sums of squares in the analysis."

Interpreting lm

If you tried to estimate the parameters of the overparameterized classical effects model with standard least squares, you'd run into problems, specifically with the matrix inversion that is involved. There are ways around this. One is to attack the inversion problem directly with a generalized inverse, or pseudoinverse (Strang, 1993). Another is to use a reparameterization.
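Here is a small sketch of the generalized-inverse route, reusing AOV.df and AOV.lm from above and the ginv function from the MASS package (which ships with R): the hand-built design matrix with an intercept plus all three indicators is rank deficient, yet a generalized inverse still yields a least-squares solution whose estimable functions agree with lm.

library(MASS)  # for ginv()

# Overparameterized design: intercept plus one indicator per category
X <- cbind(intercept = 1,
           cat1 = as.numeric(AOV.df$category == "category1"),
           cat2 = as.numeric(AOV.df$category == "category2"),
           cat3 = as.numeric(AOV.df$category == "category3"))
y <- AOV.df$y

qr(X)$rank  # 3 -- one less than the 4 columns, so t(X) %*% X is singular

# solve(t(X) %*% X) would fail, but a generalized inverse yields one of the
# infinitely many least-squares solutions
b <- drop(ginv(t(X) %*% X) %*% t(X) %*% y)

# Estimable functions are the same for every solution; tau2 - tau1 here
# matches the categorycategory2 coefficient from lm()
b[3] - b[2]
coef(AOV.lm)["categorycategory2"]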

We already saw that there is a relationship between the parameters in the cell means model and the classical effects model. Indeed, the cell means parameterization is one way to proceed, but it's not the one that R uses. R's lm function uses a reparameterization called the reference cell model, where one of the τi's is set to zero to allow for a solution. Rawlings, Pantula, and Dickey say it is usually the last τi, but in the case of the lm function, it is actually the first.
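You can see this choice directly in R (a quick sketch; contr.treatment and model.matrix are base R, and AOV.lm is the fit from above): the first level gets a row of zeros, making it the reference cell.

options("contrasts")   # unordered factors default to contr.treatment
contr.treatment(3)     # level 1 is the row of zeros -- the reference cell

head(model.matrix(AOV.lm))  # the columns lm() actually built from category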

With τ1 set to zero, the mean of category 1, μ + τ1, is really just μ, or μ* as we'll call it, because our constraint was really quite arbitrary. Think back to our lm output: you guessed it, μ* is the intercept, and it is also the mean of category 1. Its p-value is for a test of whether the true category 1 mean equals 0, which is not likely to be useful.

Also, since τ1 is set to zero, τ2 - τ1, estimable from the difference between the category means, is really just τ2, or τ2* since (again) the constraint was arbitrary. Remembering our lm output, the quantity being estimated for category 2 is the difference in means between category 2 and category 1. Similarly, category 3's coefficient is the difference between the category 3 mean and the category 1 mean. The p-values for these tests are more likely to be meaningful, as they are testing group-level differences. Now it should be clear why category 2's coefficient has a larger p-value than category 3's (refer to the plot) and why there is no category 1 coefficient.
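Because the choice of reference cell is arbitrary, you can change it; here is a hedged sketch that uses base R's relevel function (wrapping the column in factor() in case it was stored as character, and copying to a hypothetical AOV.df2 so the original data frame is untouched) to make category3 the baseline:

AOV.df2 <- AOV.df
AOV.df2$category <- relevel(factor(AOV.df2$category), ref = "category3")

# With category3 as the reference cell, the intercept is now the category 3
# mean, and the other coefficients are differences from that mean
coef(lm(y ~ category, data = AOV.df2))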

To finish the discussion, set n = 100000 (for this larger run, category 3's true mean was also raised from 11 to 12, as the output below reflects). After regenerating the data and rerunning lm, we arrive at the following:

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.997530   0.006334  1262.7   <2e-16 ***
categorycategory2 1.499163   0.008957   167.4   <2e-16 ***
categorycategory3 4.004012   0.008957   447.0   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.003 on 299997 degrees of freedom
Multiple R-squared: 0.4048, Adjusted R-squared: 0.4048 
F-statistic: 1.02e+05 on 2 and 299997 DF,  p-value: < 2.2e-16

It is clear that the intercept is indeed estimating the mean of category 1, and that category 2's coefficient is estimating 1.5 = 9.5 - 8, the difference between category 2's and category 1's means. Finally, category 3's coefficient is estimating 4 = 12 - 8, the difference between category 3's and category 1's means.

Discussion

This article was meant to answer the question of why the lm function works the way it does for a single-factor regressor, written for those more familiar with regression and less familiar with the classical analysis of variance. We did not cover switching between parameterizations, getting p-values for the various testing combinations, or what happens when there are multiple factors. For more information, this article from UCLA's Academic Technology Services discusses R's contrasts function and other tools for switching between parameterizations.

All highlighted R-code segments were created by Pretty R at inside-R.org. Keep up with ours and other great articles relating to R on R-bloggers, and follow me on Twitter (@baogorek) for my latest research updates.

References


  • Hocking, R. R. (1985). The analysis of linear models. Brooks/Cole, Monterey, California, USA.
  • Plotting Distributions (ggplot2). Cookbook for R. URL http://wiki.stdout.org/rcookbook/Graphs/Plotting%20distributions%20%28ggplot2%29/ (accessed April 6, 2012).
  • Pretty R Syntax Highlighter. inside-R.org. URL http://www.inside-r.org/pretty-r (accessed April 8, 2012).
  • Rawlings, J. O., S. G. Pantula, and D. A. Dickey. (1998). Applied regression analysis: a research tool. Second edition. Springer-Verlag, New York, New York, USA.
  • R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
  • R Learning Module: Coding for categorical variables in regression models. UCLA: Academic Technology Services, Statistical Consulting Group. URL http://www.ats.ucla.edu/stat/r/modules/dummy_vars.htm (accessed April 7, 2012).
  • Strang, G. (1993). "The fundamental theorem of linear algebra." American Mathematical Monthly, Volume 100, Number 9, pp. 848-859.
  • Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer New York.