3  Estimators

Reading: If you want a little more information about the topics in this chapter, take a look at Dougherty R.7 (pages 33 - 34) and then Dougherty Chapter 1 (pages 85 - 112).

3.1 Chapter Preview

In this chapter I’ll spend some time talking about estimators of variance, covariance, and correlation when you have sample data about random variables. Then I’ll move on to an introduction of Ordinary Least Squares (OLS) to fit a linear model of the relationship between two (or more) random variables. We’ll spend the rest of the course on this topic.

3.2 Estimators of Variance, Covariance, and Correlation

In Classwork 2, we discovered that if you want to estimate the expected value of a random variable, you can take its sample mean. That is, the sample mean \(\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\) is an unbiased and low-variance estimator for \(\mu_X\).

Note that \(x_i\) now refers to an observation in a sample of the random variable \(X\) and not the potential outcomes of the random variable \(X\) like in Chapter 1. \(x_i\) will continue to refer to an observation in a sample for the rest of this workbook.

If you also wanted to estimate the variance of \(X\), \(\sigma^2_X\), or the covariance between two random variables \(X\) and \(Y\), \(\sigma_{XY}\), the following estimators are unbiased (proof in Dougherty Appendix R.1):

  • The variance of a random variable \(X\) is \(\sigma^2_X = E[(X - \mu_X)^2]\) and it can be estimated with the formula \(\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2\). Notice how the formula mirrors the variance formula, but replaces \(\mu_X\) with its estimator \(\bar{x}\) and takes an average instead of an expectation. One difference is that it divides by \(n - 1\) instead of \(n\) as in the sample mean formula. The reason is that, because \(\bar{x}\) sits in the middle of the sample by definition, \(\sum_{i = 1}^n (x_i - \bar{x})^2\) will always be a little smaller than it would be if we could use \(\mu_X\) instead, and dividing by \(n - 1\) corrects for that.

  • The covariance of two random variables \(X\) and \(Y\) is \(\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]\) and it can be estimated with the formula \(\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})\).

A note on notation: when we’re referring to an estimator throughout this course, we’ll put a hat on top of the symbol. So the variance of the random variable is \(\sigma_X^2\) and the estimator for the variance is \(\hat{\sigma}_X^2\). Likewise, the covariance of two random variables is \(\sigma_{XY}\) and the estimator for that covariance is \(\hat{\sigma}_{XY}\).

Finally, the correlation of two random variables \(X\) and \(Y\) is \(\rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}}\) and is estimated by \(\hat{\rho}_{XY} = \frac{\hat{\sigma}_{XY}}{\sqrt{\hat{\sigma}_X^2 \hat{\sigma}_Y^2}}\), that is, the estimate of the covariance divided by the square root of the product of the estimates of the variances of \(X\) and \(Y\).

Example: Consider again the random variable “outcome of a roll of a die”. Suppose we rolled the die five times and got this sample: {1, 1, 6, 6, 5}. To estimate the expected value and variance of the dice roll, use the sample mean and the formula above for \(\hat{\sigma}_X^2\).

# sample mean: estimate of the expected value
(1 / 5) * sum(c(1, 1, 6, 6, 5))
[1] 3.8
# estimate of the variance
(1 / 4) * sum((c(1, 1, 6, 6, 5) - 3.8) ^ 2)
[1] 6.7

Example: A shortcut is to use the R functions mean() and var(), which use the same formulas.

# attach the tidyverse so we can use the pipe
library(tidyverse)

# sample mean: estimate of the expected value
c(1, 1, 6, 6, 5) %>% mean()
[1] 3.8
# estimate of the variance
c(1, 1, 6, 6, 5) %>% var()
[1] 6.7
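
The same approach works for the covariance and correlation estimators. Here is a minimal sketch (the two samples below are made up for illustration): the estimates are computed directly from the formulas above and then checked against the built-in R functions cov() and cor(), which use the same formulas.

# made-up samples of X and Y
x <- c(2, 4, 4, 7, 8)
y <- c(1, 3, 5, 6, 9)
n <- length(x)

# estimate of the covariance, following the formula above
cov_hat <- (1 / (n - 1)) * sum((x - mean(x)) * (y - mean(y)))

# estimate of the correlation: the covariance estimate divided by the
# square root of the product of the variance estimates
cor_hat <- cov_hat / sqrt(var(x) * var(y))

# the built-in functions give the same numbers
cov_hat
cov(x, y)
cor_hat
cor(x, y)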

Exercise 1: Suppose we again rolled the die five times and got this different sample: {2, 2, 3, 1, 2}. What would our estimate be for the expected value and variance of the dice roll? You can either use the formulas above or, as a shortcut, you can use the R functions mean() and var().

Exercise 2: Consider two random variables X and Y and suppose we have a sample of X: {2, 2, 3, 1, 2} and a sample of Y: {3, 3, 9, 7, 7}. Estimate the covariance and correlation between X and Y using either the formulas above or the R functions cov() and cor().

3.3 Estimators of Parameters of a Linear Model

We’ll spend the rest of this course exploring how to use Ordinary Least Squares (OLS) to fit a linear model like this:

\[y_i = \beta_0 + \beta_1 x_i + u_i\]

That is, if we hypothesize that some random variable \(Y\) depends on another random variable \(X\) and that there’s a linear relationship between them, \(\beta_0\) and \(\beta_1\) are the parameters that describe the nature of that relationship.

Given a sample of X and Y, we’ll derive unbiased estimators for the intercept \(\beta_0\) and slope \(\beta_1\). Those estimators help us combine observations of X and Y to estimate underlying relationships between those variables.
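
To make that concrete before the derivation, here is a minimal simulation sketch (the true parameter values, sample size, number of replications, and seed below are arbitrary choices for illustration): we repeatedly draw samples from a known linear model, fit it with R’s lm() function, and check that the estimates center on the true intercept and slope.

# make the sketch reproducible
set.seed(42)

# true (made-up) parameters of the linear model y = b0 + b1 * x + u
b0 <- 2
b1 <- 0.5

# draw 1000 samples of size 50, estimating the intercept and slope each time
estimates <- replicate(1000, {
  x <- runif(50, min = 0, max = 10)
  u <- rnorm(50, mean = 0, sd = 1)
  y <- b0 + b1 * x + u
  coef(lm(y ~ x))
})

# the averages of the estimates should be close to the true b0 and b1
rowMeans(estimates)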

The method of least squares was first published by the French mathematician Adrien-Marie Legendre in 1805, but there is controversy over whether he or the German mathematician and physicist Carl Friedrich Gauss invented it first. Regardless, the method of least squares founded the study of statistics, which was then called “the combination of observations,” because that’s what least squares helps you do: combine observations to estimate some underlying relationship. Least squares helped to solve two huge scientific problems at the beginning of the 1800s:

  1. There’s a field of science called geodesy that was, at the time, concerned with measuring the circumference of the globe. Scientists had measurements of distances between cities and of the angles of stars observed at each city, taken by different observers using different procedures, but until least squares there was no way to combine those observations into a single estimate.

  2. Ceres (the largest object in the asteroid belt between Mars and Jupiter) was discovered. “Speculation about extra-terrestrial life on other planets was open to debate, and the potential new discovery of such a close neighbour to Earth was the buzz of the scientific community,” (least_squares_web?). Astronomers wanted to figure out the position and orbit of Ceres but couldn’t work them out from only a few noisy observations, until least squares came along.

The method of least squares quickly became the dominant way to solve this statistical problem and remains dominant today.

3.3.1 Summation Rules
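
Three facts about summations come up repeatedly in the derivation that follows. For a constant \(a\), \(\sum_{i = 1}^n a = na\) and \(\sum_{i = 1}^n a x_i = a \sum_{i = 1}^n x_i\), and for two vectors, \(\sum_{i = 1}^n (x_i + y_i) = \sum_{i = 1}^n x_i + \sum_{i = 1}^n y_i\). No similar rule holds for products, however: in general \(\sum_{i = 1}^n x_i y_i \neq \sum_{i = 1}^n x_i \sum_{i = 1}^n y_i\). Here is a quick numeric check in R (the vectors and the constant below are made up for illustration):

# made-up vectors and constant for checking the summation rules
x <- c(1, 2, 3)
y <- c(4, 5, 6)
a <- 10

# constants factor out of a sum
sum(a * x) == a * sum(x)
# sums distribute over addition
sum(x + y) == sum(x) + sum(y)
# but the sum of a product is NOT the product of the sums
sum(x * y) == sum(x) * sum(y)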

3.3.2 Notation: Linear Models

Here’s a preview of the notation I’ll use in the next part of this chapter. The idea is that some true linear model relates \(X\) and \(Y\): \(y_i = \beta_0 + \beta_1 x_i + u_i\). When I’m referring to our estimates of the intercept and slope, I’ll put hats over the \(\beta\)s and replace the disturbance \(u_i\) with the residual \(e_i\), because the two turn out to be quite different conceptually: \(y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i\).

Symbol | Meaning | Example
\(\beta_0\) | Intercept parameter in a linear model | \(y_i = \beta_0 + \beta_1 x_i + u_i\)
\(\beta_1\) | Slope parameter in a linear model | see above
\(y_i\) | Dependent variable or outcome variable | see above
\(x_i\) | Explanatory variable | see above
\(u_i\) | Unobservable term, disturbance, shock | see above
\(\hat{\beta}_0\) | Estimate of the intercept | \(y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i\)
\(\hat{\beta}_1\) | Estimate of the slope | see above
\(e_i\) | Residuals | see above

Exercise 3: OLS residuals are the (vertical/horizontal) distances between the observation and the (true/estimated) linear model.

3.4 Least Squares as the Combination of Observations

One reason the method of least squares is so popular is that it’s simple and mathematically tractable: the entire procedure can be summed up in a single statement. The method of least squares fits the linear model that minimizes the sum of the squared residuals.

In the next few videos, we’ll see that for a simple regression, we can take that statement of the method of least squares and derive:

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

\[\hat{\beta}_1 = \frac{\sum_i x_i y_i - n \bar{x}\bar{y}}{\sum_i x_i^2 - n \bar{x}^2}\]
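
To see these formulas at work, here is a small sketch (the sample below is made up for illustration) that computes \(\hat{\beta}_1\) and \(\hat{\beta}_0\) directly from the formulas and then checks the result against R’s lm() function, which fits a linear model by least squares.

# made-up sample of x and y
x <- c(1, 2, 3, 4, 5)
y <- c(2, 2, 4, 5, 7)
n <- length(x)

# slope estimate: (sum of x_i * y_i minus n * xbar * ybar) over
#                 (sum of x_i^2 minus n * xbar^2)
b1_hat <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x ^ 2) - n * mean(x) ^ 2)

# intercept estimate: ybar minus b1_hat * xbar
b0_hat <- mean(y) - b1_hat * mean(x)

# lm() produces the same intercept and slope
c(b0_hat, b1_hat)
coef(lm(y ~ x))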

3.5 Deriving OLS Estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\)

Exercise 4: What does OLS do?

  1. It implements complex machine learning algorithms.

  2. It allows us to create new variables by multiplying existing ones.

  3. It estimates the relationship between variables using maximum likelihood estimation.

  4. It allows us to combine observations to estimate a linear model of two (or more) variables.

Exercise 5: How does OLS work?

  1. It estimates the intercept and slope as the values which minimize the sum of the squared residuals.

  2. It uses neural networks to predict future values.

  3. It works by maximizing the likelihood function.

Exercise 6: (T/F) The residuals \(e_i\) are equal to: \(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\).

Exercise 7: (T/F) For vector \(x\), it is always true that: \(\sum_{i = 1}^n x_i = x_1 + x_2 + ... + x_n\)

Exercise 8: (T/F) For vectors \(x\) and \(y\), it is always true that \(\sum_{i=1}^n x_i y_i = \sum_{i=1}^n x_i \sum_{i=1}^n y_i\)

3.6 Classwork