x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
# estimate for E[X]:
# ___
# estimate for Var(X):
# ___
Fitting a Linear Model
Probability Review
A random variable is any variable whose value cannot be predicted exactly. For example:
- The message you get in a fortune cookie
- The time you spend searching for your keys after you’ve misplaced them
- The number of customers who enter a small retail store on a given day
A coin flip is also a random variable. Consider this “coin flip game”:
- You earn $1 if the coin lands on heads
- You earn $0 if the coin lands on tails
How much money should you expect to earn on average each time you play? That’s the expected value, defined as: \[E[X] = \sum_i x_i p_i\]
where:
- \(x_i\) is each possible outcome
- \(p_i\) is the probability that each outcome occurs
You can think of the expected value as the long-run average payoff: if you played the game many times, it’s how much you’d earn per round on average.
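The coin flip game above can be sketched directly from the formula. This is a minimal example, assuming a fair coin as described:

```r
# Payoffs of the coin flip game: $1 for heads, $0 for tails
x <- c(1, 0)      # possible outcomes
p <- c(0.5, 0.5)  # probability of each outcome (fair coin assumed)

# E[X] = sum over outcomes of x_i * p_i
EX <- sum(x * p)
EX  # 0.5: you expect to earn 50 cents per round on average
```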
Question 1
If it costs 30 cents to play one round of the coin flip game, do you expect to come out ahead?
Select (yes/no): you expect to (gain/lose) ___ each round.
Question 2
Suppose it still costs 30 cents to play, but now the coin is not fair. The coin lands on heads ___ of the time, making the expected value exactly 30 cents.
Variance
We measure how spread out a random variable is using its variance. Variance tells us how far, on average, the random variable is from its mean.
\[\begin{align} Var(X) = \sigma_X^2 &= E\left[(X - E[X])^2\right]\\ &= (x_1 - E[X])^2 p_1 + (x_2 - E[X])^2 p_2 + ... + (x_n - E[X])^2 p_n\\ &= \sum_{i = 1}^n (x_i - E[X])^2 p_i \end{align}\]
We square the differences so that:
- negative and positive deviations don’t cancel out
- larger deviations count more than smaller ones
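The same coin flip payoffs give a quick check of the variance formula, term by term:

```r
# Variance of the coin flip payoff, following the formula above
x <- c(1, 0)      # possible outcomes
p <- c(0.5, 0.5)  # probability of each outcome (fair coin assumed)

EX <- sum(x * p)             # E[X] = 0.5
VarX <- sum((x - EX)^2 * p)  # (1 - 0.5)^2 * 0.5 + (0 - 0.5)^2 * 0.5
VarX  # 0.25
```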
Question 3
Let \(X\) be a random variable that takes on values 0 through 4, each with equal probability. What is the expected value and variance of \(X\)?
Estimators
Now suppose you have a sample of data on a random variable \(X\). An estimator is a rule for producing your best guess of a population value (like \(E[X]\) or \(\text{Var}(X)\)) given sample data.
Estimator for the expected value
The best estimator for the expected value is the sample mean: \[\bar{x} = \frac{\sum_i x_i}{n}\] where \(x_i\) is each observation in the sample and \(n\) is the sample size.
Estimator for the variance
Recall: \[\text{Var}(X) = E[(X - E[X])^2]\]
So, based on sample data, a natural estimator is: \[\frac{\sum_i (x_i - \bar{x})^2}{n - 1}\] Using \(n - 1\) instead of \(n\) is called Bessel’s correction. The idea is:
- We don’t know the true mean \(E[X]\), so we use \(\bar{x}\) instead
- Because \(\bar{x}\) is, by definition, in the middle of the sample, the squared deviations will be a little smaller than if we could use \(E[X]\) itself
- Dividing by \(n-1\) corrects for this and gives a better estimate of the true variance of the random variable
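Both estimators can be written out by hand and checked against R's built-ins. This sketch uses a small made-up sample (not the assignment data):

```r
# A small made-up sample for illustration
s <- c(2, 4, 4, 6)
n <- length(s)

xbar <- sum(s) / n                 # sample mean
s2 <- sum((s - xbar)^2) / (n - 1)  # sample variance with Bessel's correction

# The built-ins compute the same quantities (mean() and var() in base R
# also use the n - 1 denominator for the variance)
xbar  # matches mean(s)
s2    # matches var(s)
```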
Question 4
Consider the sample x defined in the code chunk at the top of this page. Using the R functions sum() and length(), find the estimates for \(E[X]\) and \(\text{Var}(X)\).
Note: mean() and var() are R functions that take a vector and return the sample mean and sample variance.
mean(x)
[1] 2
var(x)
[1] 0.6666667
Least-Squares Estimators
The least-squares estimator finds a line of best fit for a data set.
Question 5
Given the line \(y = 3x + 5\), find:
- the slope (when x increases by 1, by how much does y change?)
- the y-intercept (when x equals zero, what is y?)
Least Squares Example
Suppose you have data on the number of years of education someone has and their earnings, like this:
| educ | earn |
|---|---|
| 12 | 45000 |
| 16 | 75000 |
| 16 | 60000 |
We model the true relationship as: \[\text{earn}_i = \beta_0 + \beta_1 \text{educ}_i + u_i\]
Using sample data, we estimate: \[\text{earn}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{educ}_i + e_i\]
Where the hats indicate estimates of the true relationship (\(\hat{\beta}_0\) is the y-intercept; \(\hat{\beta}_1\) is the slope), and \(e_i\) are the residuals (vertical distance between the observation and the line of best fit).
\(\hat{\beta}_1\), as the slope parameter, may or may not have a causal interpretation: a one-unit increase in X (one more year of education) is associated with earnings (Y) that are \(\hat{\beta}_1\) higher on average, but that association may or may not be causal.
The slope estimator is: \[\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}\]
And the estimator for the intercept parameter (average earnings of someone with 0 years of education) is given by: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
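As a sketch, these two formulas applied to the three-row example table above (not the Question 6 data) give:

```r
# The three-row example data from the table above
educ <- c(12, 16, 16)
earn <- c(45000, 75000, 60000)

# Slope: Cov(x, y) / Var(x). R's cov() and var() both use the n - 1
# denominator, so it cancels in the ratio.
b1 <- cov(educ, earn) / var(educ)  # 5625

# Intercept: ybar - b1 * xbar
b0 <- mean(earn) - b1 * mean(educ)  # -22500
```

So for this tiny sample, each extra year of education is associated with $5,625 more in earnings; the intercept is a large negative number because no one in the sample is near 0 years of education.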
Question 6
Using the sample data below, compute \(\hat{\beta}_1\) and \(\hat{\beta}_0\) using:
- cov() for sample covariance
- var() for sample variance
- mean() for sample mean
educ <- c(8, 8, 12, 12, 12, 12, 16, 16, 16, 18, 18)
earn <- c(60, 62, 72, 70, 79, 69, 72, 89, 105, 110, 106)
# b1
# ___
# b0
# ___
Question 7
Now check your work using lm(). This function finds the parameters for the line of best fit using least squares.
- lm() takes a formula of the form y ~ x as the first argument
- It takes a data set (tibble) as the second argument
- The . tells R to pipe the data set into the second argument of lm instead of the first
library(tidyverse)
# Read the qelp docs on lm():
?qelp::lm
# tibble(___) %>%
# lm(earnings ~ education, data = .)
Interpretation:
- Someone with 0 years of education is predicted to earn ___ thousand dollars per year
- One additional year of education is associated with ___ thousand more dollars per year in earnings
Download this Assignment
Here’s a link to download this assignment.
Autograder
Here’s the autograder for this assignment.