Fitting a Linear Model

Probability Review

A random variable is any variable whose value cannot be predicted exactly. For example:

  • The message you get in a fortune cookie
  • The time you spend searching for your keys after you’ve misplaced them
  • The number of customers who enter a small retail store on a given day

A coin flip is also a random variable. Consider this “coin flip game”:

  • You earn $1 if the coin lands on heads
  • You earn $0 if the coin lands on tails

How much money should you expect to earn on average each time you play? That’s the expected value, defined as: \[E[X] = \sum_i x_i p_i\]

where:

  • \(x_i\) is each possible outcome
  • \(p_i\) is the probability that each outcome occurs

You can think of the expected value as the long-run average payoff: if you played the game many times, it’s how much you’d earn per round on average.
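As a sketch of this definition in R (using a fair six-sided die rather than the coin game, so the questions below stay open), the expected value is just the probability-weighted sum of the outcomes, and a long simulation approaches it:

```r
# Possible outcomes of a fair six-sided die and their probabilities
outcomes <- 1:6
probs <- rep(1/6, 6)

# E[X] = sum of x_i * p_i
ev <- sum(outcomes * probs)
ev
# [1] 3.5

# The long-run average of many simulated rolls gets close to E[X]
mean(sample(outcomes, size = 100000, replace = TRUE, prob = probs))
```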

Question 1

If it costs 30 cents to play one round of the coin flip game, do you expect to come out ahead?

Select (yes/no): you expect to (gain/lose) ___ each round.

Question 2

Suppose it still costs 30 cents to play, but now the coin is not fair. The coin lands on heads ___ of the time, making the expected value exactly 30 cents.

Variance

We measure how spread out a random variable is using its variance. Variance tells us how far, on average, the random variable is from its mean.

\[\begin{align} Var(X) = \sigma_X^2 &= E\left[(X - E[X])^2\right]\\ &= (x_1 - E[X])^2 p_1 + (x_2 - E[X])^2 p_2 + ... + (x_n - E[X])^2 p_n\\ &= \sum_{i = 1}^n (x_i - E[X])^2 p_i \end{align}\]

We square the differences so that:

  • negative and positive deviations don’t cancel out
  • larger deviations count more than smaller ones
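The variance formula follows the same pattern. Continuing the made-up die example (not Question 3's variable):

```r
outcomes <- 1:6
probs <- rep(1/6, 6)

ev <- sum(outcomes * probs)           # E[X] = 3.5

# Var(X) = sum of (x_i - E[X])^2 * p_i
vx <- sum((outcomes - ev)^2 * probs)
vx
# [1] 2.916667
```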

Question 3

Let \(X\) be a random variable that takes on values 0 through 4, each with equal probability. What is the expected value and variance of \(X\)?

Estimators

Now suppose you have a sample of data on a random variable \(X\). An estimator is a rule for producing your best guess of a population value (like \(E[X]\) or \(\text{Var}(X)\)) given sample data.

Estimator for the expected value

The best estimator for the expected value is the sample mean: \[\bar{x} = \frac{\sum_i x_i}{n}\] where \(x_i\) is each observation in the sample and \(n\) is the sample size.

Estimator for the variance

Recall: \[\text{Var}(X) = E[(X - E[X])^2]\]

So, based on sample data, a natural estimator is: \[\frac{\sum_i (x_i - \bar{x})^2}{n - 1}\] Using \(n - 1\) instead of \(n\) is called Bessel’s correction. The idea is:

  • We don’t know the true mean \(E[X]\), so we use \(\bar{x}\) instead
  • Because \(\bar{x}\) is, by definition, the center of the sample, the squared deviations will be slightly smaller than if we could use \(E[X]\) itself
  • Dividing by \(n-1\) corrects for this and gives a better estimate of the true variance of the random variable
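Both estimators can be built from sum() and length() and checked against mean() and var(). A quick sketch with a made-up sample (not Question 4's x):

```r
y <- c(2, 4, 4, 6, 9)    # hypothetical sample

# Sample mean: sum of observations divided by n
n <- length(y)
ybar <- sum(y) / n
ybar                     # same as mean(y)
# [1] 5

# Sample variance with Bessel's correction (divide by n - 1)
s2 <- sum((y - ybar)^2) / (n - 1)
s2                       # same as var(y)
# [1] 7
```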

Question 4

Consider the sample x below. Using the R functions sum() and length(), find the estimates for \(E[X]\) and \(\text{Var}(X)\).

x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)

# estimate for E[X]:
# ___

# estimate for Var(X):
# ___

Note: mean() and var() are functions in R that take vectors and give you the sample mean and sample variance.

mean(x)
[1] 2
var(x)
[1] 0.6666667

Least-Squares Estimators

The least-squares estimator finds a line of best fit for a data set.

Question 5

Given the line \(y = 3x + 5\), find:

  • the slope (when x increases by 1, what is the change to y)?
  • the y-intercept (when x is equal to zero, what is y)?

Least Squares Example

Suppose you have data on the number of years of education someone has and their earnings, like this:

educ earn
12 45000
16 75000
16 60000

We model the true relationship as: \[\text{earn}_i = \beta_0 + \beta_1 \text{educ}_i + u_i\]

Using sample data, we estimate: \[\text{earn}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{educ}_i + e_i\]

Where the hats indicate estimates of the true relationship (\(\hat{\beta}_0\) is the y-intercept; \(\hat{\beta}_1\) is the slope), and \(e_i\) are the residuals (vertical distance between the observation and the line of best fit).

\(\hat{\beta}_1\), as the slope parameter, may or may not have a causal interpretation: a one-unit increase in X (one more year of education) may or may not cause earnings (Y) to be \(\hat{\beta}_1\) higher on average.

The slope estimator is: \[\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}\]

And the estimator for the intercept parameter (average earnings of someone with 0 years of education) is given by: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
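These two formulas can be sketched in R on a small made-up data set (hypothetical hours studied and exam scores, not Question 6's data) and checked against lm():

```r
# Hypothetical data: hours studied (x) and exam score (y)
x <- c(1, 2, 3, 4, 5)
y <- c(52, 55, 61, 64, 68)

# Slope: sample covariance of x and y over sample variance of x
b1 <- cov(x, y) / var(x)

# Intercept: ybar minus b1 times xbar
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)
#   b0   b1
# 47.7  4.1

# lm() produces the same least-squares estimates
coef(lm(y ~ x))
```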

Question 6

Using the sample data below, compute \(\hat{\beta}_1\) and \(\hat{\beta}_0\) using:

  • cov() for sample covariance
  • var() for sample variance
  • mean() for sample mean
educ <- c(8, 8, 12, 12, 12, 12, 16, 16, 16, 18, 18)
earn <- c(60, 62, 72, 70, 79, 69, 72, 89, 105, 110, 106)

# b1
# ___

# b0
# ___

Question 7

Now check your work using lm(). This function finds the parameters for the line of best fit using least squares.

  • lm() takes a formula of the form y ~ x as the first argument
  • It takes a data set (tibble) as the second argument
  • The . tells R to pipe the data set into the second argument of lm instead of the first
library(tidyverse)

# Read the qelp docs on lm():
?qelp::lm

# tibble(___) %>%
#   lm(earnings ~ education, data = .)

Interpretation:

  • Someone with 0 years of education is predicted to earn ___ thousand dollars per year
  • One additional year of education is associated with ___ thousand more dollars per year in earnings

Download this Assignment

Here’s a link to download this assignment.

Autograder

Here’s the autograder for this assignment.