1  Least Squares

1.1 Overview

In this chapter, I will remind you about the method of least squares. I’ll start from scratch, so if EC320 is not top of mind, don’t worry!

In section 1.2, I’ll introduce the method of least squares as a method for combining observations to make a guess about a linear relationship. In section 1.3, I’ll derive the OLS estimators from scratch, using the definition of the method of least squares. Finally, in section 1.4, I’ll work through a numerical example where you’ll practice finding \(\hat{\beta_0}\) and \(\hat{\beta_1}\) with 3 observations.

1.1.1 Key Terms and Notation

| Symbol | Meaning | Example |
|--------|---------|---------|
| \(\beta_0\) | Intercept parameter in a linear model | \(y_i = \beta_0 + \beta_1 x_i + u_i\) |
| \(\beta_1\) | Slope parameter in a linear model | see above |
| \(y_i\) | Dependent variable, outcome variable | see above |
| \(x_i\) | Explanatory variable | see above |
| \(u_i\) | Unobservable term, disturbance, shock | see above |
| \(\hat{\beta_0}\) | Estimate of the intercept | \(y_i = \hat{\beta_0} + \hat{\beta_1} x_i + e_i\) |
| \(\hat{\beta_1}\) | Estimate of the slope | see above |
| \(\hat{y_i}\) | Fitted value, prediction | \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) |
| \(e_i\) | Residual | \(e_i = y_i - \hat{y_i}\) |

1.2 Least Squares as the Combination of Observations

Suppose education (x) has a linear effect on wage (y). If someone has zero years of education, they will earn $5 per hour on average, and every extra year of education adds an extra 50 cents to their hourly wage. Then a linear model is the correct specification:

\[wage_i = \beta_0 + \beta_1 education_i + u_i\]

Where \(\beta_0 = 5\) and \(\beta_1 = 0.50\).

If I took some data on the education and earnings of a bunch of people, I could use OLS to estimate \(\beta_0\) and \(\beta_1\). I’ll put hats on the betas to indicate they are estimates: \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are our estimates of the true parameters \(\beta_0\) and \(\beta_1\). We might get \(\hat{\beta_0} = 4\) and \(\hat{\beta_1} = 0.75\) instead of the true values of the parameters \(\beta_0 = 5\) and \(\beta_1 = 0.50\).
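To make that thought experiment concrete, here’s a minimal R sketch, assuming we simulate our own education and wage data from the true model above. The sample size and the distribution of the unobservable term are my choices for illustration, not part of the chapter.

```r
# Simulate a sample from the true model: wage = 5 + 0.50 * education + u
set.seed(320)                                 # so the simulation is reproducible
n         <- 100                              # hypothetical sample size
education <- sample(0:16, n, replace = TRUE)  # years of education
u         <- rnorm(n, mean = 0, sd = 2)       # unobservable term
wage      <- 5 + 0.50 * education + u

# Estimate the intercept and slope by least squares
fit <- lm(wage ~ education)
coef(fit)  # close to the true values 5 and 0.50, but not exactly equal
```

Run it a few times with different seeds and the estimates bounce around the true values: that gap between \(\hat{\beta_0}, \hat{\beta_1}\) and \(\beta_0, \beta_1\) is exactly the point of the paragraph above.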

\(\beta_0\) is the true value of the intercept: if x takes on the value 0, \(\beta_0\) is the value we expect y to take. In mathematical terms, this is a conditional expectation: \(E[y | x = 0] = \beta_0\), which reads “the expectation of y, given that x is 0, is \(\beta_0\).” And \(\beta_1\) is the true effect of x on y: if x increases by one unit, \(\beta_1\) is the amount by which y is expected to increase. In mathematical terms: \(E[y | x = \alpha + 1] - E[y | x = \alpha] = \beta_1\) for any \(\alpha\).
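To make this concrete with the wage example from above, where \(\beta_0 = 5\) and \(\beta_1 = 0.50\), the expected wage at a given level of education is \(5 + 0.50 \times education\). So, for example,

\[E[wage | education = 0] = 5 = \beta_0,\]

\[E[wage | education = 13] - E[wage | education = 12] = 11.50 - 11 = 0.50 = \beta_1.\]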

The method of least squares was first published by the French mathematician Adrien-Marie Legendre in 1805, though there is controversy over whether he invented it first or whether the German mathematician and physicist Carl Friedrich Gauss did. The method of least squares founded the study of statistics, which at the time was called “the combination of observations,” because that’s what least squares helps you do: combine observations to understand a true underlying process. Least squares helped solve two huge scientific problems at the beginning of the 1800s:

  1. There’s a field of science called geodesy that was, at the time, concerned with measuring the circumference of the globe. Scientists had measurements of distances between cities and of the angles of the stars at each city, taken by different observers using different procedures. But until least squares, they had no agreed-upon way to combine those observations.

  2. Ceres (the largest object in the asteroid belt between Mars and Jupiter) had just been discovered. “Speculation about extra-terrestrial life on other planets was open to debate, and the potential new discovery of such a close neighbour to Earth was the buzz of the scientific community,” write Lim et al. (2021). Astronomers wanted to figure out the position and orbit of Ceres, but couldn’t agree on how to extrapolate its path from only a few noisy observations, until least squares came along.

The method of least squares quickly became the dominant way to solve this statistical problem and remains dominant today.

One reason the method of least squares is so popular is that it’s simple and mathematically tractable. The entire procedure can be summed up in one statement: the method of least squares fits a linear model by minimizing the sum of the squared residuals.

In the next few videos, I’ll show you that, for a simple regression, I can take that statement of the method of least squares and derive the following two formulas, one for the estimate of the intercept and one for the estimate of the slope:

\[\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}\]

\[\hat{\beta_1} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}\]
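As a preview of where the derivation ends up, here’s a small R sketch that re-creates the simulated education and wage data from the earlier sketch, plugs them directly into these two formulas, and checks that lm() reports the same numbers.

```r
# Re-create the simulated sample from the earlier sketch
set.seed(320)
n         <- 100
education <- sample(0:16, n, replace = TRUE)
wage      <- 5 + 0.50 * education + rnorm(n, mean = 0, sd = 2)

# Plug the data into the two formulas above
b1_hat <- (sum(education * wage) - n * mean(education) * mean(wage)) /
  (sum(education^2) - n * mean(education)^2)
b0_hat <- mean(wage) - b1_hat * mean(education)

c(intercept = b0_hat, slope = b1_hat)  # formula-based estimates
coef(lm(wage ~ education))             # lm() reports the same two numbers
```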

1.3 Deriving OLS Estimators \(\hat{\beta_0}\) and \(\hat{\beta_1}\)

3: Residuals are vertical distances: \(e_i = y_i - \hat{y_i}\)

4: OLS as \(\displaystyle\min_{\hat{\beta_0}, \hat{\beta_1}} \sum_i e_i^2 = \min_{\hat{\beta_0}, \hat{\beta_1}} \sum_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2\)

5: \(e_i^2 = y_i^2 - 2 \hat{\beta_0} y_i - 2 \hat{\beta_1} x_i y_i + 2 \hat{\beta_0} \hat{\beta_1} x_i + \hat{\beta_0}^2 + \hat{\beta_1}^2 x_i^2\)

6: Some summation rules

You can come back and reference these summation rules whenever you need them later on.

7: Taking first order conditions (a condensed sketch of the algebra in videos 7-9 appears after this list)

8: Simplifying the FOC for \(\hat{\beta_0}\)

9: Simplifying the FOC for \(\hat{\beta_1}\)
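To help you follow along, here is a condensed sketch of the algebra in videos 7-9, starting from the minimization problem in video 4; the videos fill in the intermediate steps. The main summation facts it uses are \(\sum_i (a_i + b_i) = \sum_i a_i + \sum_i b_i\), \(\sum_i c\, a_i = c \sum_i a_i\), and \(\sum_i c = nc\) for any constant \(c\). Setting the derivative of \(\sum_i e_i^2\) with respect to each estimate equal to zero gives the two first order conditions:

\[-2 \sum_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0 \quad \text{and} \quad -2 \sum_i x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0\]

The first condition simplifies to \(n\bar{y} - n\hat{\beta_0} - \hat{\beta_1} n\bar{x} = 0\), which gives \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\). Substituting that into the second condition and solving for \(\hat{\beta_1}\) gives

\[\hat{\beta_1} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}.\]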

1.4 Numerical Example

10: Calculate \(\hat{\beta_0}\) and \(\hat{\beta_1}\) for a 3 observation example (a worked R sketch follows this list)

11: Calculate fitted values \(\hat{y_i}\) and residuals \(e_i\) for a 3 observation example

12: \(u_i\) versus \(e_i\)
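To give you something to check your work against, here is an R sketch of the kind of calculation videos 10-12 walk through, using three made-up observations (these data points are hypothetical, not the ones from the videos):

```r
# Three hypothetical observations
x <- c(1, 2, 3)
y <- c(2, 5, 6)
n <- length(x)

# Slope and intercept from the least squares formulas
b1_hat <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
b0_hat <- mean(y) - b1_hat * mean(x)
c(b0_hat, b1_hat)  # 1/3 and 2 with these numbers

# Fitted values and residuals
y_hat <- b0_hat + b1_hat * x
e     <- y - y_hat
sum(e)  # essentially zero: the FOC for the intercept forces residuals to sum to 0

# Note: the residuals e_i are computed from the data and the estimates;
# the u_i in the model are never observed.
```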

1.5 Exercises

Classwork 1: Deriving OLS Estimators

Koans 1-3: Vectors, Tibbles, and Pipes

Classwork 2: lm and qplot

Koans 4-7: dplyr

Classwork 3: dplyr murder mystery

1.6 References

Dougherty (2016) Chapter 1: Simple Regression Analysis

Lim et al. (2021)