Take the cholera example from chapter 11 of the workbook. Suppose John Snow observed that Lambeth customers had 10 cholera deaths per thousand households before Lambeth moved its pipes upstream (1849), and 8 cholera deaths per thousand households after (1854). Meanwhile, similar households with other water companies had 12 cholera deaths per thousand households in 1849 and 20 cholera deaths per thousand households in 1854.
Deaths per thousand households | Lambeth | Other Companies |
---|---|---|
1849 | 10 | 12 |
1854 | 8 | 20 |
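The text above does not say which comparison is intended, so the following is only a sketch under the assumption that the goal is a difference-in-differences: the change for Lambeth minus the change for the other companies. The object names (`deaths`, `change`, `did`) are illustrative and not part of the classwork.

```r
# Death rates per thousand households, copied from the table above.
deaths <- matrix(
  c(10, 12,
     8, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1849", "1854"), c("Lambeth", "Other"))
)

# Change over time within each group (1854 minus 1849).
change <- deaths["1854", ] - deaths["1849", ]

# Assumed comparison: Lambeth's change minus the other companies' change.
did <- unname(change["Lambeth"] - change["Other"])

print(change)
print(did)
```

Under that assumption, the calculation nets out whatever changed for everyone between 1849 and 1854, leaving the part of the change that is specific to Lambeth.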
Run this code to get started:
library(tidyverse)
friends <- read_csv("https://raw.githubusercontent.com/cobriant/dplyrmurdermystery/master/friends.csv")
This part of the project is inspired by Andrew J. Hill's paper "The Girl Next Door: The Effect of Opposite Gender Friends on High School Achievement," published in AEJ: Applied (2015). It’s an interesting read, but reading it closely is not a prerequisite for completing this classwork, and may even make you more confused about it! I suggest reading the paper only after you finish the classwork.
Data:
I artificially generated some survey data with variables:
Estimate this model and interpret the coefficients by filling in the blanks below.
\(\text{GPA}_i = \beta_0 + \beta_1 \text{friends_opposite}_i + \beta_2 \text{friends_same}_i + \beta_3 \text{friends_same}_i^2 + u_i\)
(Recall: to include a squared term in lm, use I(): lm(y ~ x + I(x^2), data = .))
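To see why I() matters, here is a minimal sketch on made-up data. The data frame `toy` and the coefficients used to simulate it are hypothetical and have nothing to do with the friends data. Inside a formula, `^` is formula syntax rather than arithmetic, so `x^2` without I() collapses back to `x`.

```r
set.seed(1)

# Hypothetical toy data with a curved relationship between x and y.
n <- 200
x <- sample(0:6, n, replace = TRUE)
y <- 1 + 0.6 * x - 0.05 * x^2 + rnorm(n, sd = 0.3)
toy <- data.frame(x = x, y = y)

# With I(), the squared term enters as its own regressor.
fit <- lm(y ~ x + I(x^2), data = toy)
print(names(coef(fit)))

# Without I(), the formula parser drops the square and fits only x.
fit_wrong <- lm(y ~ x + x^2, data = toy)
print(names(coef(fit_wrong)))
```

The first fit reports three coefficients (intercept, x, and I(x^2)), while the second silently reports only two.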
Interpretation: If you have no friends at all, you’re expected to have a GPA of ____. For every extra friend of the opposite sex you have, your GPA is expected to (circle one: rise/fall) by ____, which (circle one: is/is not) statistically different from zero at the 5% level. If you have one friend of the same sex, your GPA is predicted to be ____ points (circle one: lower/higher) than someone with no friends of the same sex. If you have four friends of the same sex, your GPA is predicted to be ____ points (circle one: lower/higher) than someone with no friends of the same sex.
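For the last two blanks, it may help to write out what the fitted model implies. Holding friends_opposite fixed, the intercept and the friends_opposite term cancel, so the predicted GPA gap between a student with \(s\) friends of the same sex and a student with none is (here \(s\) is just shorthand for friends_same):

\[
\widehat{\text{GPA}}(s) - \widehat{\text{GPA}}(0) = \hat{\beta}_2 s + \hat{\beta}_3 s^2
\]

Setting \(s = 1\) gives \(\hat{\beta}_2 + \hat{\beta}_3\), and setting \(s = 4\) gives \(4\hat{\beta}_2 + 16\hat{\beta}_3\).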
Why would we include a squared term in this model?
Since I generated the data, I can tell you the true effect of friends of the opposite sex on GPA is not estimated well in the previous question.
In truth, friends_opposite has a coefficient of -0.08, friends_same has a coefficient of 0.5, and friends_same^2 has a coefficient of -0.08. Compare the true value for \(\beta_1\), the effect of friends_opposite on GPA, to the value you estimated in question 2. Did you estimate \(\beta_1\) to be too low or too high?
In the data generating process, I created a variable strict_parents that took on a 0 or a 1. It influenced GPA and also friends_opposite, and I omitted it from your version of the data. Is it causing omitted variable bias? If so, what direction is the bias? Explain your answers.
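One way to organize the reasoning is the standard omitted variable bias formula. As a simplification, ignore the friends_same terms and let \(\delta\) (a symbol introduced here, not in the workbook) stand for the effect of strict_parents on GPA. Then the naive estimator converges to:

\[
\text{plim}\, \hat{\beta}_1 = \beta_1 + \delta \, \frac{\text{Cov}(\text{friends_opposite}, \text{strict_parents})}{\text{Var}(\text{friends_opposite})}
\]

The sign of the bias is therefore the sign of \(\delta\) times the sign of the covariance, and the bias vanishes only if one of the two is zero.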
We’ll use bus_stop_opposite as an instrument for friends_opposite to try to isolate the effect of the exogenous variation in friends_opposite. In the workbook, we learned that bus_stop_opposite is a valid instrument if it’s relevant, exogenous, and excludable. Explain how bus_stop_opposite could plausibly satisfy each of those conditions.
Estimate the first stage: \(\text{friends opposite} = \gamma_0 + \gamma_1 \text{bus stop opposite} + \gamma_2 \text{friends same} + \gamma_3 \text{friends same}^2 + v\) and interpret the relevance of the instrument.
Estimate the second stage: \(\text{GPA} = \alpha_0 + \alpha_1 \widehat{\text{friends opposite}} + \alpha_2 \text{friends same} + \alpha_3 \text{friends same}^2 + w\). Does our IV estimate of the effect of friends_opposite improve compared to the naive model estimate?
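The two-stage procedure can be sketched on simulated data, where the true effect is known. Everything below is hypothetical: the variables `z` (instrument), `hidden` (an unobserved confounder), `x`, and `y`, along with every coefficient, are invented for illustration and are not the friends data. The sketch also leaves out the friends_same controls that the classwork includes in both stages.

```r
set.seed(42)
n <- 5000
true_effect <- -0.1                 # hypothetical true coefficient on x

z <- rbinom(n, 1, 0.5)              # instrument: shifts x, no direct effect on y
hidden <- rbinom(n, 1, 0.5)         # confounder that the analyst never observes
x <- 1 + 2 * z - 1.5 * hidden + rnorm(n)
y <- 3 + true_effect * x + hidden + rnorm(n, sd = 0.5)
sim <- data.frame(y, x, z)

# Naive regression: biased because hidden is omitted and correlated with x.
naive <- lm(y ~ x, data = sim)

# First stage: regress the endogenous variable on the instrument.
first_stage <- lm(x ~ z, data = sim)
sim$x_hat <- fitted(first_stage)

# Second stage: replace x with its first-stage fitted values.
second_stage <- lm(y ~ x_hat, data = sim)

naive_est <- unname(coef(naive)["x"])
iv_est <- unname(coef(second_stage)["x_hat"])
print(round(c(true = true_effect, naive = naive_est, iv = iv_est), 3))
```

In this simulated setup the second-stage coefficient lands much closer to the true effect than the naive one. The standard errors that lm reports for a hand-built second stage are not valid, because they ignore the fact that x_hat is itself estimated; a dedicated routine such as ivreg() from the AER package handles that correction.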