Love Island lm() Project

In this assignment, we’ll continue learning how to fit linear models. We’ll start with a quick review of random variables and how sample data can be used to estimate key properties like expected value and variance. We’ll also practice running hypothesis tests using those estimates.

Next, you’ll complete a short data project that reinforces what you’ve learned in dplyr and ggplot2. Finally, we’ll fit a linear probability model and use hypothesis tests to interpret the results.

Statistics Review

Question 1

Consider a lottery ticket: you win nothing 90% of the time, and you win $1000 10% of the time.

What are the potential outcomes?
What is the probability of each of those outcomes?
What is the expected value of this random variable? Recall: $E[X] = \sum_i x_i p_i$.
What is the variance of this random variable (the average squared distance of an outcome from the mean)? Recall: $\text{Var}(X) = \sum_i (x_i - E[X])^2 p_i$.
Suppose you don’t know the chances of winning, but you get 20 tickets and win $1000 one of those 20 times. Based on that data, what is your estimate of the expected value of a lottery ticket? Recall, the estimator for the expected value of a random variable is its sample mean.
Based on your 20 tickets, what is your estimate of the variance? Recall, the estimator for variance is $\frac{1}{n-1} \sum_i (x_i - \bar{x})^2$.
Hypothesis testing: based on your data, can you reject the hypothesis that the lottery ticket has an expected value of $100? Here’s how a hypothesis test proceeds:

Define the hypothesis. The null hypothesis will be that the lottery ticket has an expected value of 100 ($\mu = 100$); the alternative is that it doesn’t ($\mu \neq 100$).
Compute the test statistic. The formula we’ll use is: $t = \frac{\bar{x} - \mu_0}{s/ \sqrt{n}}$ where $\bar{x}$ is the sample mean, $\mu_0$ is the value from the null hypothesis, $s$ is the estimate for the standard deviation (take the square root of the estimate for the variance), and $n$ is the sample size. Why do we use this formula? We want to know whether our sample mean is “far” from what the null hypothesis claims. The numerator $\bar{x} - \mu_0$ is how far off our estimate is from the null hypothesis. But “far” depends on how noisy the estimate is. If the lottery outcomes vary a lot, $\bar{x}$ will naturally bounce around from sample to sample. The quantity $\frac{s}{\sqrt{n}}$ is the standard error of the sample mean, which estimates the typical amount $\bar{x}$ differs from $\mu$ just due to randomness. The larger the variance, the larger the standard error. But the larger the sample size, the lower the standard error. So $t$ measures the distance from the null $\div$ typical random error. That makes $t$ a “how many standard errors away” score. Calculate $t$.
Finally, compare $t$ to a critical value. For a two-sided test at the 5% level with $n - 1 = 19$ degrees of freedom, we reject the null hypothesis if $t < -2.093$ or $t > 2.093$. That means: if the null is true, only 5% of samples would produce a $t$ value whose magnitude is bigger than 2.093.
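To see the mechanics of these steps in R, here is a minimal sketch using a different, hypothetical game: it pays $50 with probability 0.2 and nothing otherwise, and we imagine a made-up sample of 10 plays with 3 wins. The numbers and variable names are illustrative assumptions, not part of the assignment.

```r
# Hypothetical game (not the lottery above): pays $50 with probability 0.2
outcomes <- c(0, 50)
probs    <- c(0.8, 0.2)

# Population quantities, using the formulas for E[X] and Var(X)
ev       <- sum(outcomes * probs)            # expected value
variance <- sum((outcomes - ev)^2 * probs)   # variance

# Made-up sample: 10 plays, 3 wins
x <- c(50, 0, 0, 50, 0, 0, 0, 50, 0, 0)
n <- length(x)

x_bar <- mean(x)   # estimator of the expected value
s2    <- var(x)    # estimator of the variance (divides by n - 1)

# Test H0: mu = 10 against Ha: mu != 10
mu_0   <- 10
t_stat <- (x_bar - mu_0) / (sqrt(s2) / sqrt(n))

# Two-sided 5% critical value with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)
reject <- abs(t_stat) > t_crit
```

With 20 observations, the same call `qt(0.975, df = 19)` returns approximately 2.093, which is the critical value used in this question. As a cross-check, the built-in `t.test(x, mu = mu_0)` reports the same $t$ statistic.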

Data Project

A Love Island superfan (not me, to be clear) recorded detailed data on contestants across three seasons of the show. We’ll use this data set to practice dplyr, ggplot2, and linear regression in R.

Run this to get started:

library(tidyverse)

love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv") %>% 
  mutate(win = if_else(outcome %in% c("winner", "runner up", "third place"), 1, 0))

Question 2: Dplyr

Answer these questions using dplyr verbs.

count() will be especially useful for a couple of the questions here. Recall that count(x) is equivalent to group_by(x) %>% summarize(n = n()).
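As a quick illustration of that equivalence, here is a small sketch on a made-up tibble (the `pets` data and its `species` column are hypothetical and unrelated to the love data set):

```r
library(dplyr)

# Hypothetical toy data (not the love data set)
pets <- tibble(
  species = c("cat", "dog", "cat", "bird", "dog", "cat")
)

# count() tallies the rows in each group
with_count <- pets %>%
  count(species)

# The longer equivalent: group, then summarize with n()
with_summarize <- pets %>%
  group_by(species) %>%
  summarize(n = n())

# sort = TRUE orders the groups from most to least common
pets %>%
  count(species, sort = TRUE)
```

Both `with_count` and `with_summarize` contain one row per species and a column `n` holding the number of rows in that group.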

Out of the three seasons, how many people won (got third place, runner up, or winner)? Use the win variable I created.

# love %>%
#   ___

What are the minimum, maximum, and median ages of contestants?

# love %>%
#   ___

Are male contestants, on average, older than female contestants?

# love %>%
#   ___

What are the three most common professions among contestants?

# love %>%
#   ___

Continuing from the previous part, what are the two most common professions for male contestants and what are the two most common professions for female contestants?

# love %>%
#   ___

What region of the UK are most of the contestants from?

# love %>%
#   ___

Question 3: Ggplot2

Create a visualization to explore whether age affects someone’s chances of winning. We learned that visualizing the relationship between a discrete variable (win) and a continuous variable (age) is best done with a boxplot or a violin plot, but here we’ll break that rule because the focus will be on the line of best fit.

Make a scatterplot with:

age on the x-axis
win (0 or 1) on the y-axis

Since win only takes values 0 and 1, many points will overlap if you use geom_point(). Instead, use geom_jitter() to add a small amount of random noise.

Add a line of best fit using geom_smooth(method = lm).

# love %>%
#   ___
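To show what jittering does to a 0/1 variable, here is a minimal sketch on simulated data. The `toy` data frame and its `hours` and `passed` columns are hypothetical, and the `width`, `height`, and `alpha` values are just one reasonable choice, not a required setting.

```r
library(ggplot2)

# Hypothetical toy data: a binary outcome and a continuous predictor
set.seed(1)
toy <- data.frame(hours = round(runif(60, 0, 10)))
toy$passed <- rbinom(60, size = 1, prob = toy$hours / 10)

p <- ggplot(toy, aes(x = hours, y = passed)) +
  # Add a little random noise so overlapping 0/1 points become visible
  geom_jitter(width = 0.1, height = 0.05, alpha = 0.6) +
  # Straight line of best fit from a linear model
  geom_smooth(method = "lm", se = FALSE)

p
```

Keeping `height` small leaves the points clearly clustered around 0 and 1, so the plot still reads as a binary outcome.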

Question 4: lm: win ~ age

Use lm() to estimate the model $\text{win}_i = \beta_0 + \beta_1 \text{age}_i + u_i$.

# love %>%
#   ___(___, data = .) %>%
#   broom::tidy()

Interpret your lm() results.

We’ve found that the baseline probability that someone who is zero years old wins is ____. What’s wrong with this value as a probability? Why do linear probability models give you such values?
A one year increase in age means someone’s probability of winning is estimated to increase by ____.

lm() hypothesis testing: Test whether age affects the probability of winning (that is, $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$).

Read the test statistic from the broom::tidy() table. This is the number of standard errors the estimate is away from 0.
Compare this to the 5% critical value of 1.985 (if the null is true, only 5% of samples would produce a $t$ value whose magnitude is bigger than 1.985).
Reject or fail to reject: if $|t| > 1.985$, reject the null hypothesis: age seems to affect the probability of winning in this linear probability model. Otherwise, fail to reject: age may not have an effect on the probability of winning.

A quick way to check your answer: broom::tidy() also gives you p-values. If the p-value is less than 0.05, you can reject the null at the 5% level. Otherwise, fail to reject.
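The same procedure can be sketched on R’s built-in mtcars data, used here purely for illustration: `am` is a 0/1 variable and `wt` is a continuous one, so `am ~ wt` is also a linear probability model. The critical value depends on the residual degrees of freedom, so in this smaller sample it is about 2.04 rather than 1.985.

```r
library(broom)

# Built-in mtcars data, used only as an illustration:
# am is 0/1 (manual transmission), wt is weight in 1000 lbs
fit <- lm(am ~ wt, data = mtcars)
results <- tidy(fit)

# Row for the slope coefficient
slope <- results[results$term == "wt", ]

# The test statistic is the estimate divided by its standard error
t_stat <- slope$estimate / slope$std.error

# Two-sided 5% critical value for this sample's residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# The two decision rules agree
reject_by_t <- abs(t_stat) > t_crit
reject_by_p <- slope$p.value < 0.05
```

Here `t_stat` matches the `statistic` column of the broom::tidy() table, and comparing $|t|$ to the critical value gives the same decision as comparing the p-value to 0.05.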

Use lm() to estimate the model $\text{win}_i = \beta_0 + \beta_1 \text{day\_joined}_i + u_i$.

# love %>%
#   ___ %>%
#   broom::tidy()

The baseline probability that someone who joins the show on day zero wins is ____. This (is/is not) a valid probability.
A one day increase in day_joined means someone’s probability of winning (increases/decreases) by ___.
Hypothesis test: look at the p-value for day_joined. Can you reject the null hypothesis that day_joined does not affect the probability of winning?

Download this Assignment

Here’s a link to download this assignment.

Autograder

Here’s the autograder for this assignment.