1.5 `lm()` with Love Island

The data for this assignment: a Love Island superfan recorded very detailed data on every contestant over the course of three seasons of the show. I had never seen Love Island before, but I was still pretty tickled by the idea of a data project on the topic. So I started watching the first episode of the 2016 series, and I would like to make one thing clear: while I think this is a fun, interesting data project, I do not necessarily recommend this show for everyone. If you’re looking to get into reality TV, start with the Golden Bachelor series instead.

Run this to get started:

library(tidyverse)
love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv")

1. Dplyr: answer these questions using dplyr verbs. `count` is especially useful for a couple of the questions here. Recall that `count(x)` is equivalent to `group_by(x) %>% summarize(n = n())`.
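If the equivalence between `count` and `group_by` plus `summarize` is hard to picture, here is a minimal sketch on a made-up tibble. The `toy` data and its `sex` column are assumptions for illustration only; they are not taken from love.csv.

```r
library(dplyr)

# Toy data (hypothetical, not from love.csv)
toy <- tibble(sex = c("female", "male", "female", "male", "female"))

# Two ways to count rows per group
by_count <- toy %>% count(sex)
by_group <- toy %>% group_by(sex) %>% summarize(n = n())

# Both have one row per value of sex, with the row count in column n
by_count
by_group
```

Adding `sort = TRUE` inside `count()` orders the result from most to least common, which is handy for "most common" questions.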

1.1 Out of the three seasons, how many people won?

1.2 What is the minimum, maximum, and median age of contestants?

1.3 Are male contestants, on average, older than female contestants?

1.4 What are the three most common professions among contestants?

1.5 Continuing from 1.4, what are the three most common professions for male contestants and what are the three most common professions for female contestants?

1.6 What region of the UK are most of the contestants from?

1.7 Love Island seems to introduce people in waves, so people enter at different times. Show that in 2016, 42.3% of the cast arrived on day 1. Then in 2017, 34.4% arrived on day 1, and then in 2018, 28.9% arrived on day 1.
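One way to get a percentage within each group is to take the mean of a logical condition, since TRUE counts as 1 and FALSE as 0. The sketch below uses made-up data; the column name `season` is an assumption, so check the actual column names in `love` before adapting it.

```r
library(dplyr)

# Toy data (hypothetical): the season and the day each person joined
toy <- tibble(
  season     = c(2016, 2016, 2016, 2016, 2017, 2017),
  day_joined = c(1, 1, 1, 9, 1, 12)
)

# mean() of a logical vector is the share of TRUE values
shares <- toy %>%
  group_by(season) %>%
  summarize(pct_day1 = 100 * mean(day_joined == 1))

shares
```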

2 Explore: which characteristics seem to help people win the show?

2.1 There aren’t really enough winners in only 3 seasons to do much inference, so we’ll expand the definition of “win” to include the runners-up and third-place finishers.

Go up to the top of this document, where love was defined, and add a new column win that takes 1 if the person had an outcome of “winner”, “runner up”, or “third place”, and 0 otherwise. These functions may be useful: mutate, if_else, and the logical operator for “or”: |. Run that code to update love to include that new column.
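As a reminder of how `mutate`, `if_else`, and `|` fit together, here is a small sketch on made-up data. The `toy` tibble, the `top_two` column, and the outcome values "dumped" and "walked" are hypothetical, chosen only to illustrate the pattern.

```r
library(dplyr)

# Toy data (hypothetical outcome values)
toy <- tibble(outcome = c("winner", "dumped", "runner up", "walked"))

# if_else(condition, value_if_true, value_if_false); | means "or"
toy <- toy %>%
  mutate(top_two = if_else(outcome == "winner" | outcome == "runner up", 1, 0))

toy
```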

Answer:

love %>% 
  mutate(win = str_detect(outcome, "winner|runner up|third place")) %>%
  lm(win ~ age, data = .) %>%
  broom::tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -0.493     0.351      -1.41  0.163 
2 age           0.0288    0.0147      1.95  0.0539

2.2 Create a visualization to explore if age affects someone’s chances of winning

First, create a scatterplot with age on the x-axis and win (0 or 1) on the y-axis. Since win only takes values of 0 and 1, many points would overlap using regular geom_point(). We’ll use geom_jitter() instead, which adds small random noise to prevent overlapping. Add a line of best fit to see if there’s a trend using geom_smooth(method = lm). Experiment with the height and width parameters in geom_jitter() to control how much “jitter” is added. Try values between 0 and 0.2 for both parameters.
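The general shape of such a plot might look like the sketch below. It runs on a small made-up data frame (the `toy` values are invented), and the jitter amounts are just one choice from the suggested range.

```r
library(ggplot2)

# Toy data (hypothetical, not from love.csv)
toy <- data.frame(
  age = c(20, 22, 22, 24, 25, 27, 29, 31),
  win = c(0, 0, 1, 0, 0, 1, 0, 1)
)

p <- ggplot(toy, aes(x = age, y = win)) +
  geom_jitter(height = 0.05, width = 0.1) +  # small random noise so points don't stack
  geom_smooth(method = "lm", se = FALSE)     # straight line of best fit, no confidence band

p
```

Because the jitter is random, the points land in slightly different places each time the plot is drawn; the fitted line does not change, since it is computed from the original data.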

2.3 Use linear regression to analyze if age affects winning chances

We’ll use the lm() function, which stands for “linear model” (aka linear regression). This statistical tool helps us understand what kind of linear relationship exists between variables. In this case, we want to see if age (“independent” AKA “explanatory” variable) affects winning (dependent variable).

The model we’ll estimate is: \(win = \beta_0 + \beta_1 age + u\) Where:

\(\beta_0\) is the baseline expected probability of winning for someone who is 0 years old (intercept)
\(\beta_1\) tells us how much the expected probability of winning changes for each year increase in age
\(u\) represents all other factors that affect winning that we haven’t included in the model

Steps:

Fit the model by running lm(win ~ age, data = love)
Use broom::tidy() to get a summary of the results (you’ll need to install the package broom by running install.packages("broom"), but leave this out of the document: you can’t compile a document that contains lines of code that install packages)
Look at the coefficient for age (\(\beta_1\)) and its p-value
If the p-value is less than 0.05, we can say age has a statistically significant effect on winning chances at the 5% level
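The steps above can be sketched on made-up data as follows. This version pulls the coefficient table out with base R's `coef(summary(fit))` so that it runs without extra packages; `broom::tidy(fit)` reports the same numbers as a tibble. The `toy` data frame is invented, so its estimates say nothing about the real `love` data.

```r
# Toy data (hypothetical, not from love.csv)
toy <- data.frame(
  age = c(20, 22, 22, 24, 25, 27, 29, 31),
  win = c(0, 0, 1, 0, 0, 1, 0, 1)
)

# Fit win = beta_0 + beta_1 * age + u
fit <- lm(win ~ age, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
results <- coef(summary(fit))

beta1   <- results["age", "Estimate"]   # change in expected win probability per year of age
p_value <- results["age", "Pr(>|t|)"]   # p-value for the null hypothesis beta_1 = 0

beta1
p_value < 0.05   # TRUE would mean: reject "no effect" at the 5% level
```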

Show your R code and fill in the blanks to interpret the results: For each year increase in age, the probability of winning changes by ___ (\(\beta_1\)). Since the p-value is ___, we (select one: reject/fail to reject) the idea that age has no effect at the 5% significance level.

2.4 Analyze whether being added to the show earlier helps contestants win

The show works like this: some contestants are on the show from day 1, and others are added as the show goes on. day_joined tells you what day the contestant is added. Follow the same steps as 2.3, but now look at the relationship between day of entry and winning chances:

Create a visualization (like in 2.2) but with entry day on the x-axis
Fit a linear model: \(win = \beta_0 + \beta_1 day\_joined + u\)
Interpret the results:
- What does \(\beta_1\) tell us about how entry timing affects winning?
- Is this effect statistically significant at the 5% level (p < 0.05)?

2.5 One more question: Does being a model help your chances of winning?

To identify all the people whose profession has to do with modeling, you can use str_detect.
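A minimal sketch of the `str_detect` pattern, on made-up data: the column name `profession` and the job titles are assumptions, so check what the column is actually called in `love`. Lowercasing first with `str_to_lower` is one way to catch both "Model" and "model".

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical job titles)
toy <- tibble(profession = c("Model", "Personal trainer", "Glamour model", "Barman"))

# str_detect() returns TRUE wherever the pattern appears in the string
toy <- toy %>%
  mutate(is_model = if_else(str_detect(str_to_lower(profession), "model"), 1, 0))

toy
```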
