4 Ordinary Least Squares in R
Probability Review
A random variable is any variable whose value cannot be predicted exactly. For example:
- The message you get in a fortune cookie
- The time you spend searching for your keys after you’ve misplaced them
- The number of customers who enter a small retail store on a given day
A coin flip is also a random variable. Consider this “coin flip game”:
- You earn $1 if the coin lands on heads
- You earn $0 if the coin lands on tails
How much money should you expect to earn on average each time you play? That’s the expected value, defined as: \[E[X] = \sum_i x_i p_i\]
where:
- \(x_i\) is each possible outcome
- \(p_i\) is the probability that outcome occurs
You can think of the expected value as the long-run average payoff: if you played the game many times, it’s how much you’d earn per round on average.
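To make this concrete, here is the coin flip game's expected value computed directly in base R:

```r
# Expected value of the coin flip game: sum of outcome * probability
outcomes <- c(1, 0)      # $1 for heads, $0 for tails
probs    <- c(0.5, 0.5)  # a fair coin
sum(outcomes * probs)
# [1] 0.5
```

So a fair $1-or-nothing flip is worth 50 cents per round on average.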
Question 1
If it costs 30 cents to play one round of the coin flip game, do you expect to come out ahead?
Select (yes/no): you expect to (gain/lose) ___ each round.
Question 2
Suppose it still costs 30 cents to play, but now the coin is not fair. The coin lands on heads ___ of the time, making the expected value exactly 30 cents.
Variance
We measure how spread out a random variable is using its variance. Variance tells us how far, on average, the random variable is from its mean.
\[\begin{align} Var(X) = \sigma_X^2 &= E\left[(X - E[X])^2\right]\\ &= (x_1 - E[X])^2 p_1 + (x_2 - E[X])^2 p_2 + ... + (x_n - E[X])^2 p_n\\ &= \sum_{i = 1}^n (x_i - E[X])^2 p_i \end{align}\]
We square the differences so that:
- negative and positive deviations don’t cancel out
- larger deviations count more than smaller ones
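Continuing the coin flip example, the variance can be computed the same way: weight each squared deviation from \(E[X]\) by its probability. A quick sketch in base R:

```r
# Variance of the coin flip game
outcomes <- c(1, 0)
probs    <- c(0.5, 0.5)
ev    <- sum(outcomes * probs)           # E[X] = 0.5
var_x <- sum((outcomes - ev)^2 * probs)  # weighted squared deviations
var_x
# [1] 0.25
```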
Question 3
Let \(X\) be a random variable that takes on values 0 through 4, each with equal probability. What is the expected value and variance of \(X\)?
Estimators
Now suppose you have a sample of data on a random variable \(X\). An estimator is a rule for producing your best guess of a population value (like \(E[X]\) or \(\text{Var}(X)\)) given sample data.
Estimator for the expected value
The best estimator for the expected value is the sample mean: \[\bar{x} = \frac{\sum_i x_i}{n}\] where \(x_i\) is each observation in the sample and \(n\) is the sample size.
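For example, on a small made-up sample (not the assignment data), the formula above is just sum() divided by length():

```r
# Sample mean "by hand" on a toy vector
y <- c(2, 4, 6, 8)
sum(y) / length(y)
# [1] 5
mean(y)  # built-in, same answer
# [1] 5
```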
Estimator for the variance
Recall: \[\text{Var}(X) = E[(X - E[X])^2]\]
So, based on sample data, a natural estimator is: \[\frac{\sum_i (x_i - \bar{x})^2}{n - 1}\] Using \(n - 1\) instead of \(n\) is called Bessel’s correction. The idea is:
- We don’t know the true mean \(E[X]\), so we use \(\bar{x}\) instead
- Because \(\bar{x}\) is computed from the sample, it tends to make the squared deviations a little too small: \(\bar{x}\) by definition is right in the middle of the sample
- Dividing by \(n-1\) corrects for this and gives a better estimate of the true variance
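On the same kind of toy sample, you can verify that dividing by \(n - 1\) reproduces what R's built-in var() computes:

```r
# Sample variance with Bessel's correction, on a toy vector
y <- c(2, 4, 6, 8)
sum((y - mean(y))^2) / (length(y) - 1)
# [1] 6.666667
var(y)  # built-in, same answer
# [1] 6.666667
```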
Question 4
Consider the sample x below. Using the R functions sum() and length(), find the estimates for \(E[X]\) and \(\text{Var}(X)\).
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
# estimate for E[X]:
# ___
# estimate for Var(X):
# ___
Note: mean() and var() are functions in R that take vectors and give you the sample mean and sample variance. Use them to check your work:
mean(x)
[1] 2
var(x)
[1] 0.6666667
Least-Squares Estimators
Suppose you have data on the number of years of education someone has and their earnings, like this:
| educ | earn |
|---|---|
| 12 | 45000 |
| 16 | 75000 |
| 16 | 60000 |
We model the true relationship as: \[\text{earn}_i = \beta_0 + \beta_1 \text{educ}_i + u_i\]
Using sample data, we estimate: \[\text{earn}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{educ}_i + e_i\]
Where the hats indicate estimates of the true relationship, and \(e_i\) are the residuals (vertical distance between the observation and the line of best fit).
\(\hat{\beta}_1\), the slope parameter, may or may not have a causal interpretation. Descriptively, a one-unit increase in X (one more year of education) is associated with \(\hat{\beta}_1\) higher earnings (Y) on average.
The slope estimator is: \[\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}\]
And the estimator for the intercept parameter (average earnings of someone with 0 years of education) is given by: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
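As a worked example, here are those two formulas applied to the three-observation education/earnings table above (not the Question 5 data):

```r
# OLS slope and intercept "by hand" for the small table above
educ <- c(12, 16, 16)
earn <- c(45000, 75000, 60000)
b1 <- cov(educ, earn) / var(educ)
b0 <- mean(earn) - b1 * mean(educ)
b1
# [1] 5625
b0
# [1] -22500
```

A negative intercept here just reflects extrapolating the fitted line far outside the range of the data: no one in this tiny sample has anywhere near 0 years of education.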
Question 5
Using the sample data below, compute \(\hat{\beta}_1\) and \(\hat{\beta}_0\) using:
- cov() for the sample covariance
- var() for the sample variance
- mean() for the sample mean
educ <- c(8, 8, 12, 12, 12, 12, 16, 16, 16, 18, 18)
earn <- c(60, 62, 72, 70, 79, 69, 72, 89, 105, 110, 106)
# b1
# ___
# b0
# ___
Question 6
Now check your work using lm(). This function finds the parameters for the line of best fit using least squares.
- lm() takes a formula of the form y ~ x as the first argument
- It takes a data set (tibble) as the second argument
- The . tells R to pipe the data set into the second argument of lm() instead of the first
library(tidyverse)
# Read the qelp docs on lm():
?qelp::lm
# tibble(___) %>%
# lm(earnings ~ education, data = .)
Interpretation:
- Someone with 0 years of education is predicted to earn ___ thousand dollars per year
- One additional year of education is associated with ___ thousand more dollars per year in earnings
Love Island Project
A Love Island superfan (not me, to be clear) recorded detailed data on contestants across three seasons of the show. We’ll use this data set to practice dplyr, ggplot2, and linear regression in R.
Run this to get started:
love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv") %>%
  mutate(win = if_else(outcome %in% c("winner", "runner up", "third place"), 1, 0))
Rows: 96 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, outcome, profession, region_origin_UK, gender, first_arrive
dbl (6): day_left, age, day_joined, n_dates, n_challenges_won, series
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question 7: dplyr
Answer these questions using dplyr verbs.
count() will be especially useful for a couple of the questions here. Recall that count(x) is equivalent to group_by(x) %>% summarize(n = n()).
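As a quick illustration of that equivalence on a tiny made-up tibble:

```r
library(dplyr)

toy <- tibble(x = c("a", "a", "b"))

# count() ...
toy %>% count(x)
# ... is shorthand for group_by() + summarize():
toy %>% group_by(x) %>% summarize(n = n())
# both return a tibble with counts: x = "a" (n = 2), x = "b" (n = 1)
```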
a. Out of the three seasons, how many people won (got third place, runner up, or winner)?
# love %>%
# ___
b. What is the minimum, maximum, and median age of contestants?
# love %>%
# ___
c. Are male contestants, on average, older than female contestants?
# love %>%
# ___
d. What are the three most common professions among contestants?
# love %>%
# ___
e. Continuing from part d, what are the two most common professions for male contestants and what are the two most common professions for female contestants?
# love %>%
# ___
f. What region of the UK are most of the contestants from?
# love %>%
# ___
Question 8: ggplot2
Create a visualization to explore whether age affects someone’s chances of winning.
Make a scatterplot with:
- age on the x-axis
- win (0 or 1) on the y-axis
Since win only takes values 0 and 1, many points will overlap if you use geom_point(). Instead, use geom_jitter() to add a small amount of random noise.
Add a line of best fit using geom_smooth(method = lm).
# love %>%
# ___
Question 9: lm
- Use lm() to estimate the model \(\text{win}_i = \beta_0 + \beta_1 \text{age}_i + u_i\). The baseline probability that someone who is zero years old wins is ___, which (is/is not) statistically significantly different from zero: the p-value (is/is not) less than 0.05. A one year increase in age means someone's probability of winning increases by ___, which (is/is not) statistically significantly different from zero.
# love %>%
# ___ %>%
# broom::tidy()
- Use lm() to estimate the model \(\text{win}_i = \beta_0 + \beta_1 \text{day joined}_i + u_i\). The baseline probability that someone who joins the show on day zero wins is ___, which (is/is not) statistically significantly different from zero. A one day increase in day joined means someone's probability of winning (increases/decreases) by ___, which (is/is not) statistically significantly different from zero.
# love %>%
# ___ %>%
# broom::tidy()
Download this Assignment
Here’s a link to download this assignment.
Autograder
Here’s the autograder for this assignment.