4 Ordinary Least Squares in R
Probability Review
A random variable is any variable whose value cannot be predicted exactly. For example:
- The message you get in a fortune cookie
- The time you spend searching for your keys after you’ve misplaced them
- The number of customers who enter a small retail store on a given day
A coin flip is also a random variable. Consider this “coin flip game”:
- You earn $1 if the coin lands on heads
- You earn $0 if the coin lands on tails
How much money should you expect to earn on average each time you play? That’s the expected value, defined as: \[E[X] = \sum_i x_i p_i\]
where:
- \(x_i\) is each possible outcome
- \(p_i\) is the probability that outcome occurs
You can think of the expected value as the long-run average payoff: if you played the game many times, it’s how much you’d earn per round on average.
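To make this concrete, here is the coin flip game's expected value computed directly in base R:

```r
# Expected value of the coin flip game: sum of outcome * probability
outcomes <- c(1, 0)      # $1 for heads, $0 for tails
probs    <- c(0.5, 0.5)  # a fair coin
sum(outcomes * probs)
# [1] 0.5
```

So a fair $1-or-nothing flip is worth 50 cents per round on average.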
Question 1
If it costs 30 cents to play one round of the coin flip game, do you expect to come out ahead?
Select (yes/no): you expect to (gain/lose) ___ each round.
Question 2
Suppose it still costs 30 cents to play, but now the coin is not fair. The coin lands on heads ___ of the time, making the expected value exactly 30 cents.
Variance
We measure how spread out a random variable is using its variance. Variance tells us how far, on average, the random variable is from its mean.
\[\begin{align} Var(X) = \sigma_X^2 &= E\left[(X - E[X])^2\right]\\ &= (x_1 - E[X])^2 p_1 + (x_2 - E[X])^2 p_2 + ... + (x_n - E[X])^2 p_n\\ &= \sum_{i = 1}^n (x_i - E[X])^2 p_i \end{align}\]
We square the differences so that:
- negative and positive deviations don’t cancel out
- larger deviations count more than smaller ones
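Continuing the coin flip example, the variance can be computed the same way: weight each squared deviation from \(E[X]\) by its probability. A quick sketch in base R:

```r
# Variance of the coin flip game
outcomes <- c(1, 0)
probs    <- c(0.5, 0.5)
ev    <- sum(outcomes * probs)           # E[X] = 0.5
var_x <- sum((outcomes - ev)^2 * probs)  # weighted squared deviations
var_x
# [1] 0.25
```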
Question 3
Let \(X\) be a random variable that takes on values 0 through 4, each with equal probability. What is the expected value and variance of \(X\)?
Estimators
Now suppose you have a sample of data on a random variable \(X\). An estimator is a rule for producing your best guess of a population value (like \(E[X]\) or \(\text{Var}(X)\)) given sample data.
Estimator for the expected value
The best estimator for the expected value is the sample mean: \[\bar{x} = \frac{\sum_i x_i}{n}\] where \(x_i\) is each observation in the sample and \(n\) is the sample size.
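For example, on a small made-up sample (not the assignment data), the formula above is just sum() divided by length():

```r
# Sample mean "by hand" on a toy vector
y <- c(2, 4, 6, 8)
sum(y) / length(y)
# [1] 5
mean(y)  # built-in, same answer
# [1] 5
```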
Estimator for the variance
Recall: \[\text{Var}(X) = E[(X - E[X])^2]\]
So, based on sample data, a natural estimator is: \[\frac{\sum_i (x_i - \bar{x})^2}{n - 1}\] Using \(n - 1\) instead of \(n\) is called Bessel’s correction. The idea is:
- We don’t know the true mean \(E[X]\), so we use \(\bar{x}\) instead
- Because \(\bar{x}\) is computed from the sample, it tends to make the squared deviations a little too small: \(\bar{x}\) by definition is right in the middle of the sample
- Dividing by \(n-1\) corrects for this and gives a better estimate of the true variance
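On the same kind of toy sample, you can verify that dividing by \(n - 1\) reproduces what R's built-in var() computes:

```r
# Sample variance with Bessel's correction, on a toy vector
y <- c(2, 4, 6, 8)
sum((y - mean(y))^2) / (length(y) - 1)
# [1] 6.666667
var(y)  # built-in, same answer
# [1] 6.666667
```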
Question 4
Consider the sample x below. Using the R functions sum() and length(), find the estimates for \(E[X]\) and \(\text{Var}(X)\).
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
# estimate for E[X]:
# ___
# estimate for Var(X):
# ___
Note: mean() and var() are functions in R that take vectors and give you the sample mean and sample variance. Use them to check your work:
mean(x)
[1] 2
var(x)
[1] 0.6666667
Least-Squares Estimators
Suppose you have data on the number of years of education someone has and their earnings, like this:
| educ | earn |
|---|---|
| 12 | 45000 |
| 16 | 75000 |
| 16 | 60000 |
We model the true relationship as: \[\text{earn}_i = \beta_0 + \beta_1 \text{educ}_i + u_i\]
Using sample data, we estimate: \[\text{earn}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{educ}_i + e_i\]
Where the hats indicate estimates of the true relationship, and \(e_i\) are the residuals (vertical distance between the observation and the line of best fit).
\(\hat{\beta}_1\), the slope parameter, may or may not have a causal interpretation. Descriptively, a one-unit increase in X (one more year of education) is associated with \(\hat{\beta}_1\) higher earnings (Y) on average.
The slope estimator is: \[\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}\]
And the estimator for the intercept parameter (average earnings of someone with 0 years of education) is given by: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
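As a worked example, here are those two formulas applied to the three-observation education/earnings table above (not the Question 5 data):

```r
# OLS slope and intercept "by hand" for the small table above
educ <- c(12, 16, 16)
earn <- c(45000, 75000, 60000)
b1 <- cov(educ, earn) / var(educ)
b0 <- mean(earn) - b1 * mean(educ)
b1
# [1] 5625
b0
# [1] -22500
```

A negative intercept here just reflects extrapolating the fitted line far outside the range of the data: no one in this tiny sample has anywhere near 0 years of education.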
Question 5
Using the sample data below, compute \(\hat{\beta}_1\) and \(\hat{\beta}_0\) using:
- cov() for the sample covariance
- var() for the sample variance
- mean() for the sample mean
educ <- c(8, 8, 12, 12, 12, 12, 16, 16, 16, 18, 18)
earn <- c(60, 62, 72, 70, 79, 69, 72, 89, 105, 110, 106)
# b1
# ___
# b0
# ___
Question 6
Now check your work using lm(). This function finds the parameters for the line of best fit using least squares.
- lm() takes a formula of the form y ~ x as the first argument
- It takes a data set (tibble) as the second argument
- The . tells R to pipe the data set into the second argument of lm() instead of the first
library(tidyverse)
# Read the qelp docs on lm():
?qelp::lm
# tibble(___) %>%
# lm(earnings ~ education, data = .)
Interpretation:
- Someone with 0 years of education is predicted to earn ___ thousand dollars per year
- One additional year of education is associated with ___ thousand more dollars per year in earnings
Love Island Project
A Love Island superfan (not me, to be clear) recorded detailed data on contestants across three seasons of the show. We’ll use this data set to practice dplyr, ggplot2, and linear regression in R.
Run this to get started:
love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv") %>%
  mutate(win = if_else(outcome %in% c("winner", "runner up", "third place"), 1, 0))
Rows: 96 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, outcome, profession, region_origin_UK, gender, first_arrive
dbl (6): day_left, age, day_joined, n_dates, n_challenges_won, series
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question 7: dplyr
Answer these questions using dplyr verbs.
count() will be especially useful for a couple of the questions here. Recall that count(x) is equivalent to group_by(x) %>% summarize(n = n()).
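As a quick illustration of that equivalence on a tiny made-up tibble:

```r
library(dplyr)

toy <- tibble(x = c("a", "a", "b"))

# count() ...
toy %>% count(x)
# ... is shorthand for group_by() + summarize():
toy %>% group_by(x) %>% summarize(n = n())
# both return a tibble with counts: x = "a" (n = 2), x = "b" (n = 1)
```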
a. Out of the three seasons, how many people won (got third place, runner up, or winner)?
# love %>%
# ___
b. What is the minimum, maximum, and median age of contestants?
# love %>%
# ___
c. Are male contestants, on average, older than female contestants?
# love %>%
# ___
d. What are the three most common professions among contestants?
# love %>%
# ___
e. Continuing from part d, what are the two most common professions for male contestants and what are the two most common professions for female contestants?
# love %>%
# ___
f. What region of the UK are most of the contestants from?
# love %>%
# ___
Question 8: ggplot2
Create a visualization to explore whether age affects someone’s chances of winning.
Make a scatterplot with:
- age on the x-axis
- win (0 or 1) on the y-axis
Since win only takes values 0 and 1, many points will overlap if you use geom_point(). Instead, use geom_jitter() to add a small amount of random noise.
Add a line of best fit using geom_smooth(method = lm).
# love %>%
# ___
Question 9: lm
- Use lm() to estimate the model \(\text{win}_i = \beta_0 + \beta_1 \text{age}_i + u_i\). The baseline probability that someone who is zero years old wins is ___, which (is/is not) statistically significantly different from zero: the p-value (is/is not) less than 0.05. A one year increase in age means someone's probability of winning increases by ___, which (is/is not) statistically significantly different from zero.
# love %>%
# ___ %>%
# broom::tidy()
- Use lm() to estimate the model \(\text{win}_i = \beta_0 + \beta_1 \text{day joined}_i + u_i\). The baseline probability that someone who joins the show on day zero wins is ___, which (is/is not) statistically significantly different from zero. A one day increase in day joined means someone's probability of winning (increases/decreases) by ___, which (is/is not) statistically significantly different from zero.
# love %>%
# ___ %>%
# broom::tidy()
Download this Assignment
Here’s a link to download this assignment.
Autograder
Here’s the autograder for this assignment.