Taking A Deep Dive Into systemfit
The systemfit package for R description says
Econometric estimation of simultaneous systems of linear and nonlinear equations using Ordinary Least Squares (OLS), Weighted Least Squares (WLS), Seemingly Unrelated Regressions (SUR), Two-Stage Least Squares (2SLS), Weighted Two-Stage Least Squares (W2SLS), and Three-Stage Least Squares (3SLS).
It provides convenient facilities for estimation of systems of equations using R. For a high-level overview of its use, see the vignette.
Let’s explore a bit the systemfit
function. (Note the creative reuse of the package name as the name of the main function in the package!) Although we’ll work with the dataset provided by the authors of the package, we’ll dig deeper into its functionality than in the provided examples. That will remove its black box nature, giving us a better understanding of what’s going on.
Note that the dataset is quite short (\(T=20\)). There are two equations in the system: \[c_{t} = \alpha_{0} + \alpha_{1} p_{t} + \alpha_{2} y_{t} + \varepsilon_{dt}\] \[c_{t} = \beta_{0} + \beta_{1} p_{t} + \beta_{2} f_{t} + \beta_{3} T_{t} + \varepsilon_{st}\] \(c\) is per capita food consumption, \(p\) is the relative price of food, \(y\) is disposable income, \(f\) is the relative price of food received by farmers, and \(T\) is a linear time trend. The first equation is the demand equation (because it includes income) and the second is the supply equation (because it includes the farm price and a time trend to account for technology).
Load the data and print it out:
library(systemfit)
data("Kmenta")
Kmenta
OLS estimation. Let’s start by estimating the model using equation-by-equation OLS.
fit.demand <- lm(consump ~ price + income, data=Kmenta)
fit.supply <- lm(consump ~ price + farmPrice + trend, data=Kmenta)
print(fit.demand)
print(fit.supply)
Now estimate the equations as one system with a call to systemfit
. You can confirm that the estimates are the same. Strictly speaking, the method="OLS"
argument is not needed, since that’s the default. I decided to include it anyway to be explicit.
eq.demand <- consump ~ price + income
eq.supply <- consump ~ price + farmPrice + trend
fit.sys <- systemfit(list(demand=eq.demand, supply=eq.supply), method="OLS", data=Kmenta)
print(fit.sys)
The first argument to systemfit
defines the system. Note that eq.demand
and eq.supply
do not do any estimation - they are formulas that, when combined with a dataset, are used to do the estimation.
It’s perfectly acceptable to define our system of two equations as
list(eq.demand, eq.supply)
As a practical matter, you probably want to give the equations names so the output is easier to work with. That’s why we defined the system as
list(demand=eq.demand, supply=eq.supply)
Benefits of systemfit. What’s the benefit to estimating the system using systemfit
rather than estimating the individual equations by OLS? If all you want is the coefficients of the equations, there is no benefit. The advantage is that you get other stuff for free if you use systemfit
. Here are a few of the more useful options.
To get the full (stacked) vector of coefficients with proper names:
fit.sys$coefficients
To get just the first equation:
fit.sys$eq[[1]]
Although this prints out the equation as if it’s a regular call to the lm
function, it’s a systemfit.equation
object, with other information attached:
class(fit.sys$eq)
class(fit.sys$eq[[1]])
names(fit.sys$eq[[1]])
fit.sys$eq[[1]]$eqnLabel
fit.sys$eq[[1]]$eqnNo
I’m not sure why the names of the equations aren’t included, but a call to lapply
can be used to print them out:
lapply(fit.sys$eq, function(z) { z$eqnLabel })
Residual covariance matrix. One reason to use systemfit
for estimation is that you’re using it as the first step in SUR estimation. The residual covariance matrix used (the covariances of the residuals across equations) is
fit.sys$residCov
Note that this is not the same output as creating a matrix to hold the residuals of the two equations and calculating the covariance matrix using the cov
function:
ols.residuals <- cbind(residuals(fit.demand), residuals(fit.supply))
cov(ols.residuals)
Although this approach is fine asymptotically, it matters a lot which degrees of freedom adjustment you apply with such a small sample. Looking at the source code (line 848 of file systemfit.R in the version currently installed on my computer), there’s a reference to a function named .calcResidCov
, called using the code:
results$residCov <- .calcResidCov( resids, methodResidCov = control$methodResidCov,
validObsEq = validObsEq, nCoefEq = nCoefLiEq, xEq = xMatEq,
centered = control$centerResiduals, solvetol = control$solvetol )
The default value of systemfit.control$methodResidCov
is "geomean"
(geometric mean approach). Lines 40-42 of calcResidCov.R contain these lines to do the covariance calculations:
result[ i, j ] <-
sum( resids[ validObsAll, i ] * resids[ validObsAll, j ] ) /
sqrt( ( sum( validObsAll ) - nCoefEq[i] ) * ( sum( validObsAll ) - nCoefEq[j] ) )
The degrees of freedom adjustment for equation i
is the number of estimated parameters in equation i
. There are three estimated parameters in equation 1, and four estimated parameters in equation 2. Thus, the formulas for calculating the residual variance terms are \[\frac{\sum_{t=1}^{20} e_{1t}^{2}}{17}\] for equation 1, \[\frac{\sum_{t=1}^{20} e_{2t}^{2}}{16}\] for equation 2, and \[\frac{\sum_{t=1}^{20} e_{1t} e_{2t}}{\sqrt{(17)(16)}}\] for the covariance term.
After going through that long detour, let’s reproduce the elements of the residual covariance matrix:
sum(residuals(fit.demand)^2)/17
sum(residuals(fit.supply)^2)/16
sum(residuals(fit.demand)*residuals(fit.supply))/sqrt(17*16)
fit.sys$residCov
The cov
function uses a denominator of 19 (which is \(T-1\)). Reproducing the output of the call to cov
:
sum(residuals(fit.demand)^2)/19
sum(residuals(fit.supply)^2)/19
sum(residuals(fit.demand)*residuals(fit.supply))/19
cov(ols.residuals)
Of course, with even a moderate sample size of \(T=500\), the exact nature of the degrees of freedom adjustment would not be important. The denominator used to compute the covariance term would be 499 vs \(\sqrt{(497)(496)} = 496.5\), a difference of only 0.5%.
Whether or not it matters in practice, calling cov
doesn’t account for the fact that the residuals are the output of an estimated regression model rather than variables in your dataset.
Coefficient covariance matrix. This is typically denoted something like \(\widehat{V}_{OLS}\). The most common use would be as an input in a Wald or Hausman test statistic calculation.
fit.sys$coefCov
The non-zero elements are pulled directly from the individual estimated equations:
vcov(fit.demand)
vcov(fit.supply)
You can compute \(\widehat{V}_{OLS}\) for the first equation using the formula \(s^{2} \left(X^{\prime} X\right)^{-1}\), where \(s^{2}\) is given by the formula above:
s2 <- sum(residuals(fit.demand)^2)/17
x1 <- model.matrix(fit.sys$eq[[1]])
s2 * solve(t(x1) %*% x1)
fit.sys$coefCov
The upper-left block of the latter matrix is equal to the matrix we calculated.
SUR estimation.
2SLS estimation.