2.4 Standard Errors
To get started, attach the tidyverse to your session and read the love island data set:

```r
library(tidyverse)

love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv") %>%
  mutate(win = str_detect(outcome, "winner|runner up|third place"))
```
Then estimate the linear probability model: \[\text{win}_i = \beta_0 + \beta_1 \text{age}_i + u_i\]
```r
love %>%
  lm(win ~ age, data = .) %>%
  broom::tidy()
```

```
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -0.493     0.351      -1.41  0.163
2 age           0.0288    0.0147      1.95  0.0539
```
In assignment 2.3 you learned how to get the estimates for the coefficients \(\hat{\beta_0} = -0.493\) and \(\hat{\beta_1} = 0.0288\): they come from OLS formulas for the estimators of the parameters of the line of best fit.
In this assignment, we'll explore the other values in the table: the standard errors (`std.error`), the test statistics (`statistic`), and the p-values (`p.value`).
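As a quick refresher, those two estimates can be reproduced by hand from sample moments. Here is a minimal sketch using the standard simple-regression formulas (the `as.numeric()` call just converts the logical `win` to 0/1; `win_num` is a name introduced only for this example):

```r
# Reproduce the OLS estimates by hand:
# beta1_hat = cov(x, y) / var(x); beta0_hat = mean(y) - beta1_hat * mean(x)
love %>%
  mutate(win_num = as.numeric(win)) %>%
  summarize(
    beta1_hat = cov(age, win_num) / var(age),
    beta0_hat = mean(win_num) - beta1_hat * mean(age)
  )
```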
OLS Hypothesis Tests
Exercise 1: The regression output shows \(\hat{\beta_1} = 0.0288\) with a standard error of 0.0147.
a. Calculate the t-statistic for \(\hat{\beta_1}\). Show your work.
b. The sample size of \(n = 96\) is large. Is the t-statistic greater than 1.96?
c. Based on your answers to (a) and (b), do we reject \(H_0: \beta_1 = 0\) at the 5% significance level? That is, does age have a statistically significant effect on the probability of winning at the 5% level?
d. Compare your answers to the `lm()` results table. Did you get the same statistic? An equivalent way to do the hypothesis test is to look at the p-value from the regression output and compare it to the significance level. According to the p-value, does age have a statistically significant effect on the probability of winning at the 10% level? What about the 1% level?
Answers:
- t-statistic: \(\frac{\hat{\beta}_1}{SE}\) =
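To check your arithmetic, the statistic can be recomputed directly from the `broom::tidy()` table, since it only divides one column by another (a quick sketch; `t_by_hand` is a name introduced here for illustration):

```r
# Recompute each t-statistic by hand: the estimate divided by its standard error
love %>%
  lm(win ~ age, data = .) %>%
  broom::tidy() %>%
  mutate(t_by_hand = estimate / std.error)
```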
Assumptions for Standard Errors
For our standard errors and hypothesis tests to be valid, we need several key assumptions:
- The linear model is an accurate model of the true data generating process
- Exogeneity: \(E[u_i|X] = 0\)
- Homoskedasticity: \(Var(u_i|X) = \sigma^2_u\) (constant variance)
- No Autocorrelation: \(Cov(u_i, u_j|X) = 0\) for \(i \neq j\), which could fail for time series data
- Normality: Either \(u\) is normally distributed, or the sample size is large enough for the Central Limit Theorem
The first three assumptions can be quite heroic depending on the application. We can use heteroskedasticity-robust standard errors to help mitigate the heteroskedasticity issue, but if the linear model is inaccurate or if exogeneity fails, OLS is biased and standard errors are wrong.
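For example, one common way to get heteroskedasticity-robust standard errors in R is the sandwich package together with `lmtest::coeftest()`. A minimal sketch, assuming both packages are installed (neither ships with the tidyverse):

```r
library(sandwich)  # robust variance-covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients

# Refit the model, then swap in the heteroskedasticity-robust (HC1)
# variance matrix when computing standard errors and t-statistics
model <- lm(win ~ age, data = love)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
```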
Exercise 2: For each scenario below, identify which OLS assumption might be violated:
- Analyzing house prices where expensive houses tend to have more variable prices
- Time series data where this year’s GDP is closely related to last year’s GDP
- A very small sample of only 10 observations
Answer:
Formula for Standard Errors
Let's derive the formula for the standard error of \(\hat{\beta_1}\) step by step. Under the assumptions above, \[Var(\hat{\beta_1}) = \frac{\sigma^2_u}{\sum_i (x_i - \bar{x})^2}.\] Since \(\sigma^2_u\) is unknown, we estimate it with \(\hat{\sigma}^2_u = \frac{\sum_i e_i^2}{n - 2}\), where \(e_i\) are the residuals, which gives \[SE(\hat{\beta_1}) = \sqrt{\frac{\sum_i e_i^2 / (n - 2)}{\sum_i (x_i - \bar{x})^2}}.\]
Exercise 3: Using the love island data, calculate in R:
- \(\sum_i (x_i - \bar{x})^2\) where x is age
- \(\sum_i e_i^2\) using the residuals
- The standard error of \(\hat{\beta_1}\) using the formula above
- Verify your answer matches the regression output
```r
love %>%
  mutate(age_demean = age - mean(age))
```

```
# A tibble: 96 × 14
   name              outcome     day_left   age profession
   <chr>             <chr>          <dbl> <dbl> <chr>
 1 Cara De La Hoyde  winner            45    26 circus performer
 2 Nathan Massey     winner            45    25 carpenter
 3 Alex Bowen        runner up         45    24 model
 4 Olivia Buckland   runner up         45    22 sales executive
 5 Kady McDermott    third place       45    20 makeup artist
 6 Scott Thomas      third place       45    28 nightclub promoter
 7 Adam Maxted       dumped            45    24 wrestler
 8 Katie Salmon      dumped            45    20 glamour model
 9 Emma Jane Woodham dumped            41    19 project manager
10 Terry Walsh       dumped            41    28 carpenter
   region_origin_UK gender first_arrive day_joined n_dates n_challenges_won
   <chr>            <chr>  <chr>             <dbl>   <dbl>            <dbl>
 1 South            female YES                   1      NA               NA
 2 South            male   YES                   1      NA               NA
 3 Midlands         male   NO                   18      NA               NA
 4 South            female YES                   1      NA               NA
 5 South            female NO                    3      NA               NA
 6 North            male   YES                   1      NA               NA
 7 Ireland          male   NO                   11      NA               NA
 8 North            female NO                   34      NA               NA
 9 South            female NO                   27      NA               NA
10 South            male   NO                    3      NA               NA
   series win   age_demean
    <dbl> <lgl>      <dbl>
 1   2016 TRUE       2.32
 2   2016 TRUE       1.32
 3   2016 TRUE       0.323
 4   2016 TRUE      -1.68
 5   2016 TRUE      -3.68
 6   2016 TRUE       4.32
 7   2016 FALSE      0.323
 8   2016 FALSE     -3.68
 9   2016 FALSE     -4.68
10   2016 FALSE      4.32
# ℹ 86 more rows
```
Answer:

```r
# a: the sum of squared deviations of age from its mean
# a <- love %>%
#   summarize(a = sum(___)) %>%
#   pull(a)

# b: the sum of squared residuals
# b <- love %>%
#   lm(___) %>%
#   residuals() %>%
#   .^2 %>%
#   ___()

# The standard error of beta1_hat by hand:
# sqrt(___)

# Verify against the regression output:
# love %>%
#   lm(win ~ age, data = .) %>%
#   broom::tidy() %>%
#   select(std.error) %>%
#   slice(2)
```
Standard Error Intuition
The formula for \(Var(\hat{\beta_1})\) tells us three key things about precision:
- More observations (larger n) → more precise estimates
- More variance in x → more precise estimates
- More variance in u → less precise estimates
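A small simulation can make the first claim concrete. This is a sketch with made-up data: the data generating process, the sample sizes, and the helper name `se_for_n` are all illustrative, not taken from the love island data.

```r
# Sketch: simulate y = -0.5 + 0.03 x + u for increasing sample sizes
# and watch the standard error of the slope estimate shrink
set.seed(320)

se_for_n <- function(n) {
  x <- rnorm(n, mean = 25, sd = 3)  # regressor with fixed variance
  u <- rnorm(n, sd = 0.5)           # homoskedastic errors
  y <- -0.5 + 0.03 * x + u
  broom::tidy(lm(y ~ x)) %>%
    filter(term == "x") %>%
    pull(std.error)
}

tibble(n = c(50, 200, 800)) %>%
  mutate(se_beta1 = map_dbl(n, se_for_n))
```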
Exercise 4: Explain how we know these three key things about the precision of our estimates by referring to the formula for the standard error.
Answer:
- n is in the denominator, so for large n, the standard error ___.
- the variance of u is in the numerator, so for large variance in u, the standard error ___.
- the variance of x is in the denominator, so for large variance in x, the standard error ___.