Fitting a Linear Model

Probability Review

A random variable is any variable whose value cannot be predicted exactly. For example:

  • The message you get in a fortune cookie
  • The time you spend searching for your keys after you’ve misplaced them
  • The number of customers who enter a small retail store on a given day

A coin flip is also a random variable. Consider this “coin flip game”:

  • You earn $1 if the coin lands on heads
  • You earn $0 if the coin lands on tails

How much money should you expect to earn on average each time you play? That’s the expected value, defined as: \[E[X] = \sum_i x_i p_i\]

where:

  • \(x_i\) is each possible outcome
  • \(p_i\) is the probability that each outcome occurs

You can think of the expected value as the long-run average payoff: if you played the game many times, it’s how much you’d earn per round on average.
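As a sketch of this definition in R (using a fair six-sided die rather than the coin game, so the questions below stay open), the expected value is just the probability-weighted sum of the outcomes, and a long simulation approaches it:

```r
# Possible outcomes of a fair six-sided die and their probabilities
outcomes <- 1:6
probs <- rep(1/6, 6)

# E[X] = sum of x_i * p_i
ev <- sum(outcomes * probs)
ev
# [1] 3.5

# The long-run average of many simulated rolls gets close to E[X]
mean(sample(outcomes, size = 100000, replace = TRUE, prob = probs))
```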

Question 1

If it costs 30 cents to play one round of the coin flip game, do you expect to come out ahead?

Select (yes/no): you expect to (gain/lose) ___ each round.

Question 2

Suppose it still costs 30 cents to play, but now the coin is not fair. The coin lands on heads ___ of the time, making the expected value exactly 30 cents.

Variance

We measure how spread out a random variable is using its variance. Variance tells us how far, on average, the random variable is from its mean.

\[\begin{align} Var(X) = \sigma_X^2 &= E\left[(X - E[X])^2\right]\\ &= (x_1 - E[X])^2 p_1 + (x_2 - E[X])^2 p_2 + ... + (x_n - E[X])^2 p_n\\ &= \sum_{i = 1}^n (x_i - E[X])^2 p_i \end{align}\]

We square the differences so that:

  • negative and positive deviations don’t cancel out
  • larger deviations count more than smaller ones
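The variance formula follows the same pattern. Continuing the made-up die example (not Question 3's variable):

```r
outcomes <- 1:6
probs <- rep(1/6, 6)

ev <- sum(outcomes * probs)           # E[X] = 3.5

# Var(X) = sum of (x_i - E[X])^2 * p_i
vx <- sum((outcomes - ev)^2 * probs)
vx
# [1] 2.916667
```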

Question 3

Let \(X\) be a random variable that takes on values 0 through 4, each with equal probability. What is the expected value and variance of \(X\)?

Estimators

Now suppose you have a sample of data on a random variable \(X\). An estimator is a rule for producing your best guess of a population value (like \(E[X]\) or \(\text{Var}(X)\)) given sample data.

Estimator for the expected value

The best estimator for the expected value is the sample mean: \[\bar{x} = \frac{\sum_i x_i}{n}\] where \(x_i\) is each observation in the sample and \(n\) is the sample size.

Estimator for the variance

Recall: \[\text{Var}(X) = E[(X - E[X])^2]\]

So, based on sample data, a natural estimator is: \[\frac{\sum_i (x_i - \bar{x})^2}{n - 1}\] Using \(n - 1\) instead of \(n\) is called Bessel’s correction. The idea is:

  • We don’t know the true mean \(E[X]\), so we use \(\bar{x}\) instead
  • Because \(\bar{x}\) is, by definition, the center of the sample, the squared deviations will be slightly smaller than if we could use \(E[X]\) itself
  • Dividing by \(n-1\) corrects for this and gives a better estimate of the true variance of the random variable
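Both estimators can be built from sum() and length() and checked against mean() and var(). A quick sketch with a made-up sample (not Question 4's x):

```r
y <- c(2, 4, 4, 6, 9)    # hypothetical sample

# Sample mean: sum of observations divided by n
n <- length(y)
ybar <- sum(y) / n
ybar                     # same as mean(y)
# [1] 5

# Sample variance with Bessel's correction (divide by n - 1)
s2 <- sum((y - ybar)^2) / (n - 1)
s2                       # same as var(y)
# [1] 7
```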

Question 4

Consider the sample x below. Using the R functions sum() and length(), find the estimates for \(E[X]\) and \(\text{Var}(X)\).

x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)

# estimate for E[X]:
# ___

# estimate for Var(X):
# ___

Note: mean() and var() are functions in R that take vectors and give you the sample mean and sample variance.

mean(x)
[1] 2
var(x)
[1] 0.6666667

Least-Squares Estimators

The least-squares estimator finds a line of best fit for a data set.

Question 5

Given the line \(y = 3x + 5\), find:

  • the slope (when x increases by 1, what is the change to y)?
  • the y-intercept (when x is equal to zero, what is y)?

Least Squares Example

Suppose you have data on the number of years of education someone has and their earnings, like this:

educ earn
12 45000
16 75000
16 60000

We model the true relationship as: \[\text{earn}_i = \beta_0 + \beta_1 \text{educ}_i + u_i\]

Using sample data, we estimate: \[\text{earn}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{educ}_i + e_i\]

Where the hats indicate estimates of the true relationship (\(\hat{\beta}_0\) is the y-intercept; \(\hat{\beta}_1\) is the slope), and \(e_i\) are the residuals (vertical distance between the observation and the line of best fit).

\(\hat{\beta}_1\), as the slope parameter, may or may not have a causal interpretation: a one-unit increase in X (one more year of education) may or may not cause earnings (Y) to be \(\hat{\beta}_1\) higher on average.

The slope estimator is: \[\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}\]

And the estimator for the intercept parameter (average earnings of someone with 0 years of education) is given by: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
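These two formulas can be sketched in R on a small made-up data set (hypothetical hours studied and exam scores, not Question 6's data) and checked against lm():

```r
# Hypothetical data: hours studied (x) and exam score (y)
x <- c(1, 2, 3, 4, 5)
y <- c(52, 55, 61, 64, 68)

# Slope: sample covariance of x and y over sample variance of x
b1 <- cov(x, y) / var(x)

# Intercept: ybar minus b1 times xbar
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)
#   b0   b1
# 47.7  4.1

# lm() produces the same least-squares estimates
coef(lm(y ~ x))
```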

Question 6

Using the sample data below, compute \(\hat{\beta}_1\) and \(\hat{\beta}_0\) using:

  • cov() for sample covariance
  • var() for sample variance
  • mean() for sample mean
educ <- c(8, 8, 12, 12, 12, 12, 16, 16, 16, 18, 18)
earn <- c(60, 62, 72, 70, 79, 69, 72, 89, 105, 110, 106)

# b1
# ___

# b0
# ___

Question 7

Now check your work using lm(). This function finds the parameters for the line of best fit using least squares.

  • lm() takes a formula of the form y ~ x as the first argument
  • It takes a data set (tibble) as the second argument
  • The . tells R to pipe the data set into the second argument of lm instead of the first
library(tidyverse)

# Read the qelp docs on lm():
?qelp::lm

# tibble(___) %>%
#   lm(earnings ~ education, data = .)

Interpretation:

  • Someone with 0 years of education is predicted to earn ___ thousand dollars per year
  • One additional year of education is associated with ___ thousand more dollars per year in earnings

Download this Assignment

Here’s a link to download this assignment.

Autograder

Here’s the autograder for this assignment.