Stepwise Regression and Automated Model Selection

I recently came upon this interesting blog post, which builds on a FAQ on the Stata website.

I learned about the step, add1, and drop1 functions in R. Here’s an example using step with simulated data:

set.seed(100)
# y and the three candidate regressors are all independent draws,
# so the "true" model is the intercept-only model
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
y <- rnorm(100)
dataset <- data.frame(x1=x1, x2=x2, x3=x3)
fit <- lm(y ~ ., data=dataset)  # regress y on x1, x2, and x3
summary(step(fit))              # backward stepwise selection by AIC

And the output is

Start:  AIC=20.68
y ~ x1 + x2 + x3

       Df Sum of Sq    RSS    AIC
- x1    1   0.37431 113.89 19.006
- x2    1   0.38008 113.90 19.012
- x3    1   1.12722 114.64 19.665
<none>              113.52 20.677

Step:  AIC=19.01
y ~ x2 + x3

       Df Sum of Sq    RSS    AIC
- x2    1   0.30453 114.19 17.273
- x3    1   1.03862 114.93 17.914
<none>              113.89 19.006

Step:  AIC=17.27
y ~ x3

       Df Sum of Sq    RSS    AIC
- x3    1     1.463 115.66 16.547
<none>              114.19 17.273

Step:  AIC=16.55
y ~ 1


Call:
lm(formula = y ~ 1, data = dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5931 -0.6086 -0.0097  0.7544  3.3874 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08325    0.10809   -0.77    0.443

Residual standard error: 1.081 on 99 degrees of freedom

We start with a regression of y on all three variables and a constant. Each variable is then dropped individually, and the AIC is computed for each of the resulting regressions. The AIC is lowest when x1 is dropped, so the model with x2 and x3 as regressors is the best two-variable model. Next, x2 and x3 are each dropped from that model; the lowest AIC comes from dropping x2, leaving a regression of y on x3 alone. Finally, x3 is dropped as well, and since the intercept-only model has the lowest AIC of any model considered, we are left with

lm(y ~ 1)

as the best model. Since y was generated independently of the regressors, that is the right answer: in this case, stepwise model selection did its job.
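
For completeness, here is a minimal sketch of the single-step helpers mentioned above, run on the same simulated objects (the printed tables are omitted here). drop1() scores each term in the full model for removal, and add1() scores each candidate term for addition to an intercept-only model.

drop1(fit, test="F")                          # AIC (and F test) for dropping each regressor
fit0 <- lm(y ~ 1, data=dataset)               # intercept-only starting point
add1(fit0, scope = ~ x1 + x2 + x3, test="F")  # AIC for adding each regressor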

All of this got me thinking about model selection in general. A quote from Ronan Conroy represents a common view on model selection:

They end with a quote from Henderson and Velleman’s paper “Building multiple regression models interactively” (1981, Biometrics 37: 391–411): “The data analyst knows more than the computer,” and they add “failure to use that knowledge produces inadequate data analysis”.

Personally, I would no more let an automatic routine select my model than I would let some best-fit procedure pack my suitcase.

Does the data analyst really have the sort of information necessary to specify the model? Why do we use statistical methods at all if we know best? Maybe I’m just not as skilled as the average data analyst, but I struggle with the specification of models, and I think it’s a good idea to listen to the data.

I’m not saying I can’t specify a model if I have to. I’ve got a PhD, so I can wave my hands and come up with a justification for pretty much any model. I’m just not convinced that doing so is meaningful. The purpose of model selection is not to choose which variables, nonlinearities, and interactions should be added to the model (that’s easy), but rather to choose which variables, nonlinearities, and interactions you should leave out.

In principle, you should estimate a model that encompasses everything. In practice, the limitations of our datasets don’t allow us to do that, because we’ll be overwhelmed by parameter estimation error and collinearity. There’s a vaguely defined notion that “the benefits of dropping that term from the model outweigh the potential costs of doing so,” and that is what model selection amounts to. Economic theory tells us which variables shouldn’t be omitted from the model, and the signs of some of the coefficients, but it doesn’t provide much guidance when you’re deciding what to leave out. Ultimately, model selection is done for statistical reasons, and it should be resolved using statistical methods. Maybe the currently available procedures aren’t optimal, and maybe there is a criterion for model selection with a better economic interpretation, but we need to use a formal, explicit approach. “That’s the model because I said so” isn’t helpful.

I agree that a data analyst can suggest additional models to estimate. But if you’re saying a data analyst can overrule the results of automated model selection, or skip that step altogether, I’m not on board. Otherwise we end up in a world where one model specification gives one result, another model specification gives a conflicting result, and we say the empirical results are “inconclusive” (leading everyone to conclude that their preferred answer is correct).

It goes without saying that the points in that FAQ are valid. You should not trust the reported R-squared values, the F-statistics don’t have their claimed distributions after selection, and so on. On the other hand, one difference between automated model selection procedures and data analyst model selection is that the sources of bias are explicit for the former and impossible to quantify for the latter. Data analyst model selection is particularly messy when the choice of model is the result of data mining that takes place as researchers look for the most publishable result.
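
To make that point concrete, here is a rough simulation sketch of my own (not from the FAQ) showing how selection on pure noise inflates apparent significance. The sample size, the number of candidate regressors, the number of replications, and the 5% cut-off are all arbitrary illustrative choices.

set.seed(1)
# In each replication: y and ten candidate regressors are pure noise,
# step() picks a model by AIC, and we check whether any retained
# regressor looks "significant" at the 5% level.
hits <- replicate(200, {
  d <- data.frame(y = rnorm(50),
                  matrix(rnorm(50 * 10), nrow = 50,
                         dimnames = list(NULL, paste0("x", 1:10))))
  chosen <- step(lm(y ~ ., data = d), trace = 0)
  cf <- summary(chosen)$coefficients
  any(cf[rownames(cf) != "(Intercept)", "Pr(>|t|)"] < 0.05)
})
mean(hits)  # typically well above the nominal 5%, even though nothing is real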

Last Update: 2015-07-02