6  Distribution of Regression Coefficients

Reading: If you want a little more information about the topics in this chapter, take a look at Dougherty 2.3 and 2.5 (pages 118 - 125 and 130 - 139).

6.1 Chapter Preview

At this point, we know a lot about using OLS to fit linear models. We know how to calculate the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) and we know how to interpret them. For instance, if we estimate \(\widehat{final\_grade} = 60 + 4\ study\_hours\), that would mean that someone who studies zero hours would get a 60 on average, and for every extra hour someone studies, they should expect to earn 4 more points.
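
For example, here's a minimal sketch in R of how you'd get estimates like these with lm() (the data below are made up, so the estimates won't come out to exactly 60 and 4):

```r
# Hypothetical sample of study hours and final grades
study_hours <- c(0, 1, 2, 3, 5, 6, 8, 10)
final_grade <- c(59, 63, 68, 73, 79, 85, 92, 100)

# Fit final_grade = beta0 + beta1 * study_hours + u by OLS
fit <- lm(final_grade ~ study_hours)
coef(fit)  # first value is beta0-hat (intercept), second is beta1-hat (slope)
```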

But how certain are we about those estimates? If the estimate is precise enough that we’re extremely confident the true effect \(\beta_1\) is between 3.9 and 4.1, perhaps you’d spend some extra time studying. But if the estimate is imprecise and all we can claim is that \(\beta_1\) is likely somewhere between -6 and 14, then it could very well be true that studying has a negative effect on your final grade!

In this chapter, I’ll explain how we can measure the precision of the regression estimates and what assumptions are necessary to do that. In particular, I’ll explain how the distribution of \(\hat{\beta_1}\) is normal, centered on the true value \(\beta_1\), with variance \(\frac{\sigma_u^2}{\sum_i (x_i - \bar{x})^2}\). Since \(u\) is unobservable, we can’t calculate the variance of \(\hat{\beta_1}\) directly, but we can estimate it using the residuals \(e_i\). So at the end of the chapter, I’ll show that you can measure the precision of \(\hat{\beta_1}\) using what we call its standard error: \(se(\hat{\beta_1}) = \sqrt{\frac{\sum_i e_i^2}{(n - 2) \sum_i (x_i - \bar{x})^2}}\). You’ll use this standard error in hypothesis tests to test things like whether \(\beta_1 > 0\) (are we extremely confident that studying improves a person’s final grade?).

This chapter is a little math heavy. I’d suggest that you carefully step through all the proofs so that you understand them thoroughly.

6.2 Proof: OLS \(\hat{\beta_1}\) is unbiased

In classwork 5, you showed that \(\hat{\beta_1} = \beta_1 + \frac{\sum_i(x_i - \bar{x})u_i}{\sum_i(x_i - \bar{x})^2}\). Again letting \(w_i = \frac{x_i - \bar{x}}{\sum_i (x_i - \bar{x})^2}\), we have:

\[ \hat{\beta_1} = \beta_1 + \sum_{i=1}^n w_i u_i \tag{6.1}\]

In this section, I’ll prove that this OLS estimator \(\hat{\beta_1}\) is an unbiased estimator for \(\beta_1\), that is, \(E[\hat{\beta_1}] = \beta_1\).

The first step is to take expectations of both sides of Equation 6.1:

\[ E[\hat{\beta_1}] = E[\beta_1 + \sum_{i=1}^n w_i u_i] \]

And then distribute the expectation across the sum on the right-hand side. Also note that \(\beta_1\) is a constant: it’s the true amount by which we can expect \(Y\) to increase given a one-unit increase in \(X\). So \(E[\beta_1] = \beta_1\), just like \(E[3] = 3\).

\[ E[\hat{\beta_1}] = \beta_1 + E[ \sum_{i=1}^n w_i u_i] \]

Again distributing the expectation across a sum (the expectation of a sum is the same as the sum of expectations):

\[E[\hat{\beta_1}] = \beta_1 + \sum_{i=1}^n E[w_i u_i]\]

From here, we have a single goal: show that \(\sum_{i=1}^n E[w_i u_i]\) is zero. Then we’ll have achieved what we set out to do, which was to show that \(E[\hat{\beta_1}] = \beta_1\). Here’s the plan of attack: show that the conditional expectation of \(w_i u_i\) is zero, which implies that the unconditional expectation \(E[w_i u_i]\) is zero too, by the law of iterated expectations (see the last chapter)!

If we take the conditional expectation of \(w_i u_i\) conditioned on the explanatory variables \(X\), we can treat \(w_i\) as a constant because it’s a function of only \(X\) (recall that \(w_i = \frac{x_i - \bar{x}}{\sum_i (x_i - \bar{x})^2}\)). Since we can treat \(w_i\) as a constant, we can bring it outside of the expectation:

\[E[w_i u_i | X] = w_i E[u_i | X]\]

And here is the key assumption: in econometrics, we have to assume \(E[u_i | X] = 0\). This assumption is called exogeneity. If exogeneity holds, then:

\[\begin{align} E[w_i u_i | X] &= w_i \times 0\\ &= 0 \end{align}\]

And since the conditional expectation of \(w_i u_i\) is a constant (zero), the law of iterated expectations tells us that the unconditional expectation is that same constant: \(E[w_i u_i] = E\left[E[w_i u_i | X]\right] = E[0] = 0\). So:

\[\begin{align} E[\hat{\beta_1}] &= \beta_1 + \sum_{i=1}^n E[w_i u_i]\\ &= \beta_1 + \sum_i 0\\ &= \beta_1 \end{align}\]

And we’ve proven that \(\hat{\beta_1}\) is an unbiased estimator of \(\beta_1\) (as long as one crucial assumption holds, which is \(E[u_i | X] = 0\), called exogeneity).
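
You can convince yourself of this result with a quick Monte Carlo simulation. Here's a sketch in R (all parameter values are made up): we generate many samples in which exogeneity holds by construction, estimate \(\hat{\beta_1}\) in each sample, and check that the average of the estimates is close to the true \(\beta_1\).

```r
set.seed(42)
n     <- 50                          # observations per sample
beta0 <- 60                          # true (made-up) intercept
beta1 <- 4                           # true (made-up) slope
x     <- runif(n, 0, 10)             # hold x fixed across replications

beta1_hats <- replicate(5000, {
  u <- rnorm(n, mean = 0, sd = 8)    # exogeneity holds: u is drawn independently of x
  y <- beta0 + beta1 * x + u
  coef(lm(y ~ x))[2]                 # slope estimate from this sample
})

mean(beta1_hats)                     # should be very close to the true beta1 = 4
```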

Exercise 1: \(\hat{\beta_1}\) is an unbiased estimator of \(\beta_1\) as long as this key assumption holds:

  1. homoskedasticity: \(E[u_i | X] = 0\)

  2. exogeneity: \(E[u_i | X] = 0\)

  3. homoskedasticity: \(Var(u_i | X) = 0\)

  4. exogeneity: \(Var(u_i | X) = 0\)

You can use the same techniques to show that if exogeneity holds, \(\hat{\beta_0}\) is also unbiased (proof on Dougherty pages 121-122).

6.3 Distribution of \(\hat{\beta_1}\)

We’ve shown that \(\hat{\beta_1}\) is a weighted sum of \(u_i\) added to a constant \(\beta_1\):

\[\hat{\beta_1} = \beta_1 + \sum_i w_i u_i\]

And so, \(\hat{\beta_1}\) will have a normal distribution if \(u\) has a normal distribution (which was one of the assumptions of the model from classwork 5). But even if \(u\) doesn’t have a normal distribution, \(\hat{\beta_1}\) will still be approximately normally distributed as long as the sample size is large enough, because we can invoke a central limit theorem.
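
Here's a small simulation sketch in R that illustrates the central limit theorem point (all numbers are made up): even when the errors are drawn from a decidedly non-normal distribution, the simulated sampling distribution of \(\hat{\beta_1}\) lines up closely with a normal distribution.

```r
set.seed(7)
n <- 50
x <- runif(n, 0, 10)

beta1_hats <- replicate(5000, {
  u <- runif(n, min = -10, max = 10)  # deliberately non-normal (uniform) errors
  y <- 60 + 4 * x + u
  coef(lm(y ~ x))[2]
})

# Normal quantile-quantile plot: points hugging the line suggest normality
qqnorm(beta1_hats)
qqline(beta1_hats)
```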

Exercise 2: Under OLS assumptions, \(\hat{\beta_1}\) is a random variable with this type of distribution:

  1. F distribution

  2. T distribution

  3. Type 1 Extreme Value distribution

  4. Normal distribution

6.4 Variance of \(\hat{\beta_1}\)

Here’s a proof that the variance of \(\hat{\beta_1}\) is \(\frac{\sigma_u^2}{\sum_i (x_i - \bar{x})^2}\).

Start with the familiar formula for \(\hat{\beta_1}\):

\[\hat{\beta_1} = \beta_1 + \sum_i w_i u_i\]

Take the variance of both sides and recognize that \(\beta_1\) is a constant that has zero variance:

\[Var(\hat{\beta_1}) = Var\left(\sum_i w_i u_i\right)\]

Recall the definition of the variance of a random variable: \(Var(Z) = E\left[ \left (Z - E[Z] \right )^2 \right]\)

\[Var(\hat{\beta_1}) = E\left[\left(\sum_i w_i u_i - E\left[\sum_i w_i u_i\right]\right)^2\right]\]

By exogeneity, we’ve already shown that \(E\left[\sum_i w_i u_i\right] = 0\). So we have

\[Var(\hat{\beta_1}) = E\left[\left(\sum_i w_i u_i\right)^2\right]\]

Which “foils” out to:

\[Var(\hat{\beta_1}) = E\left[\sum_i w_i^2 u_i^2 + 2 \sum_i \sum_{j > i} w_i w_j u_i u_j\right]\]
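
To see where that expansion comes from, consider just two terms (\(n = 2\)):

\[\left(w_1 u_1 + w_2 u_2\right)^2 = w_1^2 u_1^2 + w_2^2 u_2^2 + 2 w_1 w_2 u_1 u_2\]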

The expected value of a sum is the same as the sum of the expected values:

\[Var(\hat{\beta_1}) = \sum_i E\left[w_i^2 u_i^2\right] + 2 \sum_i \sum_{j > i} E\left[w_i w_j u_i u_j\right]\]

We’re stuck unless we consider the conditional expectations instead of the unconditional ones. If we can show that the conditional expectations are constants, then the unconditional expectations are the same constants:

\[\sum_i E\left[w_i^2 u_i^2 | X\right] = \sum_i w_i^2 E[u_i^2 | X]\]

\[2 \sum_i \sum_{j > i} E\left[w_i w_j u_i u_j | X\right] = 2 \sum_i \sum_{j > i} w_i w_j E[u_i u_j | X]\]

Note: \(Var(u_i | X) = E\left[(u_i - E[u_i | X])^2 | X\right]\), and since we’re assuming exogeneity holds, \(Var(u_i | X) = E[u_i^2 | X]\). Here we make our next assumption, called homoskedasticity: that \(Var(u_i | X)\) is a constant, which we’ll call \(\sigma_u^2\).

In the same way, note that \(Cov(u_i, u_j | X) = E\left[(u_i - E[u_i | X])(u_j - E[u_j|X])|X\right]\), and with exogeneity, \(Cov(u_i, u_j | X) = E[u_i u_j | X]\). Our next big assumption is that the errors are not autocorrelated, so that \(Cov(u_i, u_j | X) = 0\) for all \(i \neq j\).

So under these two assumptions of homoskedasticity and no autocorrelation,

\[Var(\hat{\beta_1}) = \sigma^2_u \sum_i w_i^2 + 0\]

Since \(w_i = \frac{x_i - \bar{x}}{\sum_i (x_i - \bar{x})^2}\), we have \(\sum_i w_i^2 = \frac{\sum_i (x_i - \bar{x})^2}{\left(\sum_i (x_i - \bar{x})^2\right)^2} = \frac{1}{\sum_i (x_i - \bar{x})^2}\). And finally:

\[Var(\hat{\beta_1}) = \frac{\sigma^2_u}{\sum_i (x_i - \bar{x})^2}\]

And the standard deviation of \(\hat{\beta_1}\) is the square root of the variance:

\[sd(\hat{\beta_1}) = \sqrt{\frac{\sigma^2_u}{\sum_i (x_i - \bar{x})^2}} \tag{6.2}\]
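
As a sanity check, here's a simulation sketch in R (made-up numbers again) that compares the variance formula above with the variance of \(\hat{\beta_1}\) across many simulated samples:

```r
set.seed(123)
n       <- 50
x       <- runif(n, 0, 10)           # x held fixed across replications
sigma_u <- 8                         # made-up true error standard deviation

# Theoretical variance from the formula above
sigma_u^2 / sum((x - mean(x))^2)

# Variance of beta1-hat across many simulated samples
beta1_hats <- replicate(5000, {
  u <- rnorm(n, sd = sigma_u)
  y <- 60 + 4 * x + u
  coef(lm(y ~ x))[2]
})
var(beta1_hats)                      # should be close to the theoretical value
```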

6.5 Standard Errors

There’s just one last problem: \(u\) is unobservable, so we can’t calculate \(\sigma^2_u\) or \(sd(\hat{\beta_1})\) directly. Instead, we can estimate \(sd(\hat{\beta_1})\) by using \(\hat{\sigma}^2_e\) as an approximation for \(\sigma^2_u\), and this estimate of the standard deviation of \(\hat{\beta_1}\) is what we call the standard error of \(\hat{\beta_1}\).

The sample variance of residuals \(e_i\) is \(\hat{\sigma}^2_e = \frac{\sum_i (e_i - \bar{e})^2}{n-1}\). Recall that \(\bar{e} = 0\), so we have \[\hat{\sigma}^2_e = \frac{\sum_i e_i^2}{n-1}\]

To estimate \(\sigma^2_u\) using \(\hat{\sigma}^2_e\), we lose another degree of freedom and divide by \(n-2\) instead of \(n-1\). Another way of thinking about losing the degree of freedom:

“Which line is likely to be closer to the points representing the sample of observations on X and Y, the true line \(y_i = \beta_0 + \beta_1 x_i\) or the regression line \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\)? The answer is the regression line, because by definition it is drawn in such a way as to minimize the sum of the squares of the distances between it and the observations.” (Dougherty, page 135)

So the variance of \(e_i\) will always slightly underestimate the variance of \(u_i\).

Therefore \(\sigma^2_u\) is estimated by \(\frac{\sum_i e_i^2}{n - 2}\), which we can plug into Equation 6.2 to turn the unobservable standard deviation of \(\hat{\beta_1}\) into its standard error, which we can calculate from the sample data:

\[se(\hat{\beta_1}) = \sqrt{\frac{\sum_i e_i^2}{(n - 2) \sum_i (x_i - \bar{x})^2}}\]
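
If you want to check this formula against R's built-in regression output, here's a sketch with made-up data: the standard error computed by hand from the residuals should match the "Std. Error" that summary() reports for the slope.

```r
set.seed(1)
n <- 30
x <- runif(n, 0, 10)
y <- 60 + 4 * x + rnorm(n, sd = 8)   # made-up data
fit <- lm(y ~ x)

e <- resid(fit)

# Standard error of beta1-hat computed from the formula above
sqrt(sum(e^2) / ((n - 2) * sum((x - mean(x))^2)))

# The same number as reported in R's coefficient table
coef(summary(fit))["x", "Std. Error"]
```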

Exercise 3: Fill in the blanks in the table below. Then use the formula above to calculate \(se(\hat{\beta_1})\). (Hint: your first step is to find \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\). Use one of the formulas we’ve used in previous chapters to find \(\hat{\beta_1}\) and \(\hat{\beta_0}\), or use functions in R.)

| \(x\) | \(y\) | \(\hat{y}\) | \(e = y - \hat{y}\) | \(e^2\) |
|-------|-------|-------------|---------------------|---------|
| 1     | 1.5   | ___         | ___                 | ___     |
| 2     | 0     | ___         | ___                 | ___     |
| 4     | 1     | ___         | ___                 | ___     |
| 5     | 3.5   | ___         | ___                 | ___     |

6.6 Summary

So now you know quite a bit about the distribution of the random variable \(\hat{\beta_1}\): its expected value (the true value \(\beta_1\)), the shape of the distribution (normal), and its variance. You also know a formula for estimating its standard deviation, which we call its standard error.

6.7 Classwork