2.4 Standard Errors
To get started, attach the tidyverse to your session and read the love island data set:

```r
library(tidyverse)

love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv") %>%
  mutate(win = str_detect(outcome, "winner|runner up|third place"))
```
Then estimate the linear probability model: \[\text{win}_i = \beta_0 + \beta_1 \text{age}_i + u_i\]
```r
love %>%
  lm(win ~ age, data = .) %>%
  broom::tidy()
```

```
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -0.493     0.351      -1.41  0.163
2 age           0.0288    0.0147      1.95  0.0539
```
In assignment 2.3 you learned how to get the estimates for the coefficients \(\hat{\beta_0} = -0.493\) and \(\hat{\beta_1} = 0.0288\): they come from OLS formulas for the estimators of the parameters of the line of best fit.
In this assignment, we'll explore the other values in the table: the standard errors (`std.error`), the test statistics (`statistic`), and the p-values (`p.value`).
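As a quick refresher, those two estimates can be reproduced by hand from sample moments. Here is a minimal sketch using the standard simple-regression formulas (the `as.numeric()` call just converts the logical `win` to 0/1; `win_num` is a name introduced only for this example):

```r
# Reproduce the OLS estimates by hand:
# beta1_hat = cov(x, y) / var(x); beta0_hat = mean(y) - beta1_hat * mean(x)
love %>%
  mutate(win_num = as.numeric(win)) %>%
  summarize(
    beta1_hat = cov(age, win_num) / var(age),
    beta0_hat = mean(win_num) - beta1_hat * mean(age)
  )
```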
OLS Hypothesis Tests
Exercise 1: The regression output shows \(\hat{\beta_1} = 0.0288\) with a standard error of 0.0147.
a. Calculate the t-statistic for \(\hat{\beta_1}\). Show your work.
b. The sample size of \(n = 96\) is large. Is the t-statistic greater than 1.96?
c. Based on your answers to (a) and (b), do we reject \(H_0: \beta_1 = 0\) at the 5% significance level? That is, does age have a statistically significant effect on the probability of winning at the 5% level?
d. Compare your answers to the `lm()` results table. Did you get the same statistic? An equivalent way to do the hypothesis test is to look at the p-value from the regression output and compare it to the significance level. According to the p-value, does age have a statistically significant effect on the probability of winning at the 10% level? What about the 1% level?
Answers:
- t-statistic: \(\frac{\hat{\beta}_1}{SE}\) =
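To check your arithmetic, the statistic can be recomputed directly from the `broom::tidy()` table, since it only divides one column by another (a quick sketch; `t_by_hand` is a name introduced here for illustration):

```r
# Recompute each t-statistic by hand: the estimate divided by its standard error
love %>%
  lm(win ~ age, data = .) %>%
  broom::tidy() %>%
  mutate(t_by_hand = estimate / std.error)
```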
Assumptions for Standard Errors
For our standard errors and hypothesis tests to be valid, we need several key assumptions:
- The linear model is an accurate model of the true data generating process
- Exogeneity: \(E[u_i|X] = 0\)
- Homoskedasticity: \(Var(u_i|X) = \sigma^2_u\) (constant variance)
- No Autocorrelation: \(Cov(u_i, u_j|X) = 0\) for \(i \neq j\), which could fail for time series data
- Normality: Either \(u\) is normally distributed, or the sample size is large enough for the Central Limit Theorem
The first three assumptions can be quite heroic depending on the application. We can use heteroskedasticity-robust standard errors to help mitigate the heteroskedasticity issue, but if the linear model is inaccurate or if exogeneity fails, OLS is biased and standard errors are wrong.
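For example, one common way to get heteroskedasticity-robust standard errors in R is the sandwich package together with `lmtest::coeftest()`. A minimal sketch, assuming both packages are installed (neither ships with the tidyverse):

```r
library(sandwich)  # robust variance-covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients

# Refit the model, then swap in the heteroskedasticity-robust (HC1)
# variance matrix when computing standard errors and t-statistics
model <- lm(win ~ age, data = love)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
```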
Exercise 2: For each scenario below, identify which OLS assumption might be violated:
- Analyzing house prices where expensive houses tend to have more variable prices
- Time series data where this year’s GDP is closely related to last year’s GDP
- A very small sample of only 10 observations
Answer:
Formula for Standard Errors
Let's derive the formula for the standard error of \(\hat{\beta_1}\) step by step. Under the assumptions above, \[Var(\hat{\beta_1}) = \frac{\sigma^2_u}{\sum_i (x_i - \bar{x})^2}.\] Since \(\sigma^2_u\) is unknown, we estimate it with \(\hat{\sigma}^2_u = \frac{\sum_i e_i^2}{n - 2}\), where \(e_i\) are the residuals, which gives \[SE(\hat{\beta_1}) = \sqrt{\frac{\sum_i e_i^2 / (n - 2)}{\sum_i (x_i - \bar{x})^2}}.\]
Exercise 3: Using the love island data, calculate in R:
- \(\sum_i (x_i - \bar{x})^2\) where x is age
- \(\sum_i e_i^2\) using the residuals
- The standard error of \(\hat{\beta_1}\) using the formula above
- Verify your answer matches the regression output
```r
love %>%
  mutate(age_demean = age - mean(age))
```

```
# A tibble: 96 × 14
   name              outcome     day_left   age profession
   <chr>             <chr>          <dbl> <dbl> <chr>
 1 Cara De La Hoyde  winner            45    26 circus performer
 2 Nathan Massey     winner            45    25 carpenter
 3 Alex Bowen        runner up         45    24 model
 4 Olivia Buckland   runner up         45    22 sales executive
 5 Kady McDermott    third place       45    20 makeup artist
 6 Scott Thomas      third place       45    28 nightclub promoter
 7 Adam Maxted       dumped            45    24 wrestler
 8 Katie Salmon      dumped            45    20 glamour model
 9 Emma Jane Woodham dumped            41    19 project manager
10 Terry Walsh       dumped            41    28 carpenter
   region_origin_UK gender first_arrive day_joined n_dates n_challenges_won
   <chr>            <chr>  <chr>             <dbl>   <dbl>            <dbl>
 1 South            female YES                   1      NA               NA
 2 South            male   YES                   1      NA               NA
 3 Midlands         male   NO                   18      NA               NA
 4 South            female YES                   1      NA               NA
 5 South            female NO                    3      NA               NA
 6 North            male   YES                   1      NA               NA
 7 Ireland          male   NO                   11      NA               NA
 8 North            female NO                   34      NA               NA
 9 South            female NO                   27      NA               NA
10 South            male   NO                    3      NA               NA
   series win   age_demean
    <dbl> <lgl>      <dbl>
 1   2016 TRUE       2.32
 2   2016 TRUE       1.32
 3   2016 TRUE       0.323
 4   2016 TRUE      -1.68
 5   2016 TRUE      -3.68
 6   2016 TRUE       4.32
 7   2016 FALSE      0.323
 8   2016 FALSE     -3.68
 9   2016 FALSE     -4.68
10   2016 FALSE      4.32
# ℹ 86 more rows
```
Answer:

```r
# a: the sum of squared deviations of age from its mean
# a <- love %>%
#   summarize(a = sum(___)) %>%
#   pull(a)

# b: the sum of squared residuals
# b <- love %>%
#   lm(___) %>%
#   residuals() %>%
#   .^2 %>%
#   ___()

# The standard error of beta1_hat by hand:
# sqrt(___)

# Verify against the regression output:
# love %>%
#   lm(win ~ age, data = .) %>%
#   broom::tidy() %>%
#   select(std.error) %>%
#   slice(2)
```
Standard Error Intuition
The formula for \(Var(\hat{\beta_1})\) tells us three key things about precision:
- More observations (larger n) → more precise estimates
- More variance in x → more precise estimates
- More variance in u → less precise estimates
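A small simulation can make the first claim concrete. This is a sketch with made-up data: the data generating process, the sample sizes, and the helper name `se_for_n` are all illustrative, not taken from the love island data.

```r
# Sketch: simulate y = -0.5 + 0.03 x + u for increasing sample sizes
# and watch the standard error of the slope estimate shrink
set.seed(320)

se_for_n <- function(n) {
  x <- rnorm(n, mean = 25, sd = 3)  # regressor with fixed variance
  u <- rnorm(n, sd = 0.5)           # homoskedastic errors
  y <- -0.5 + 0.03 * x + u
  broom::tidy(lm(y ~ x)) %>%
    filter(term == "x") %>%
    pull(std.error)
}

tibble(n = c(50, 200, 800)) %>%
  mutate(se_beta1 = map_dbl(n, se_for_n))
```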
Exercise 4: Explain how we know these three key things about the precision of our estimates by referring to the formula for the standard error.
Answer:
- n is in the denominator, so for large n, the standard error ___.
- the variance of u is in the numerator, so for large variance in u, the standard error ___.
- the variance of x is in the denominator, so for large variance in x, the standard error ___.