Part 1: Diff-in-diff

Take the cholera example from chapter 11 of the workbook. Suppose John Snow observed that Lambeth customers had 10 cholera deaths per thousand households in 1849, before Lambeth moved its pipes upstream, and 8 cholera deaths per thousand households in 1854, after the move. Meanwhile, similar households served by other water companies had 12 cholera deaths per thousand households in 1849 and 20 cholera deaths per thousand households in 1854.

                            Lambeth    Other Companies
1849 deaths per thousand       10            12
1854 deaths per thousand        8            20

a) What is the diff-in-diff estimate of the effect of cleaner water on cholera deaths?
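
As a reminder of the arithmetic, here is a minimal sketch of a diff-in-diff calculation from four group means. The helper name did_estimate and all four numbers are made up for illustration; they are not the cholera figures from the table.

```r
# Hypothetical helper (not from the workbook): diff-in-diff from four group means.
did_estimate <- function(treat_before, treat_after, control_before, control_after) {
  change_treated <- treat_after - treat_before      # change over time, treated group
  change_control <- control_after - control_before  # change over time, comparison group
  change_treated - change_control                   # difference of the two changes
}

# Made-up numbers for illustration only; not the cholera data.
example_did <- did_estimate(
  treat_before = 30, treat_after = 25,
  control_before = 28, control_after = 33
)
example_did
```

The comparison group's change stands in for what would have happened to the treated group without treatment, so the estimate is the treated group's change minus that counterfactual change.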

b) What is the key assumption for the validity of the diff-in-diff estimator?

c) Would the diff-in-diff estimator be valid in this case if Lambeth customers were wealthier than the “Other Companies” group, on average? Why or why not?

d) Would the diff-in-diff estimator be valid in this case if many wealthy people became Lambeth customers between 1849 and 1854 in order to get cleaner water? Why or why not?

Part 2: Instrumental Variables Example

Run this code to get started:

library(tidyverse)
friends <- read_csv("https://raw.githubusercontent.com/cobriant/dplyrmurdermystery/master/friends.csv")

This part of the project is inspired by a paper by Andrew J. Hill, “The Girl Next Door: The Effect of Opposite Gender Friends on High School Achievement,” published in AEJ Applied (2015). It’s an interesting read, but reading it closely is not a prerequisite for completing this classwork, and may even make you more confused about it! I suggest reading the paper only after you finish the classwork.

Data:

I artificially generated some survey data with variables:

a) Naive OLS Model

Estimate this model and interpret the coefficients by filling in the blanks below.

\(\text{GPA}_i = \beta_0 + \beta_1 \text{friends_opposite}_i + \beta_2 \text{friends_same}_i + \beta_3 \text{friends_same}_i^2 + u_i\)

(Recall: to include a squared term in lm, use I(): lm(y ~ x + I(x^2), data = .))
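
To see why I() matters, here is a small sketch using mtcars, which ships with base R and only stands in for the classwork data. Inside a formula, ^ is a formula operator rather than arithmetic, so wt^2 without I() collapses back to wt and the squared term silently disappears.

```r
# mtcars is a stand-in dataset; the variables here are unrelated to the classwork.
fit_with_I    <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # squared term is kept
fit_without_I <- lm(mpg ~ wt + wt^2, data = mtcars)     # wt^2 is read as just wt

names(coef(fit_with_I))     # three coefficients, including I(wt^2)
names(coef(fit_without_I))  # only the intercept and wt
```

Checking names(coef(...)) after fitting is a quick way to confirm that the squared term actually made it into the model.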

Interpretation: If you have no friends at all, you’re expected to have a GPA of ____. For every extra friend of the opposite sex you have, your GPA is expected to (circle one: rise/fall) by ____, which (circle one: is/is not) statistically different from zero at the 5% level. If you have one friend of the same sex, your GPA is predicted to be ____ points (circle one: lower/higher) than someone with no friends of the same sex. If you have four friends of the same sex, your GPA is predicted to be ____ points (circle one: lower/higher) than someone with no friends of the same sex.

Why would we include a squared term in this model?

b) Thinking about causality

Since I generated the data, I can tell you that the naive model in part a) does not estimate the true effect of friends_opposite on GPA well.

In truth:

  • friends_opposite has a coefficient of -0.08,
  • friends_same has a coefficient of 0.5,
  • friends_same^2 has a coefficient of -0.08.

Compare the true value for \(\beta_1\), the effect of friends_opposite on GPA, to the value you estimated in part a). Did you estimate \(\beta_1\) to be too low or too high?

In the data generating process, I created a variable strict_parents that took on a 0 or a 1. It influenced GPA and also friends_opposite, and I omitted it from your version of the data. Is it causing omitted variable bias? If so, what direction is the bias? Explain your answers.
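
If it helps to see omitted variable bias in action, here is a generic simulation sketch. The variables x, y, and z and every coefficient are made up, and the signs are chosen arbitrarily; they are not meant to match strict_parents, so you still need to reason about the signs in our setting yourself.

```r
set.seed(1)
n <- 10000

# Assumed (arbitrary) signs for this sketch: z raises x and z raises y.
z <- rbinom(n, size = 1, prob = 0.5)   # binary confounder
x <- 1 + 1 * z + rnorm(n)              # z shifts x
y <- 1 + 0.5 * x + 2 * z + rnorm(n)    # true coefficient on x is 0.5

short_fit <- lm(y ~ x)       # z omitted
long_fit  <- lm(y ~ x + z)   # z included

coef(short_fit)["x"]  # pulled away from 0.5 by the omitted z
coef(long_fit)["x"]   # close to 0.5
```

The direction of the bias is the sign of (the effect of the omitted variable on the outcome) times (its correlation with the included regressor). Flipping either sign in the simulation flips the direction.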

c) Valid Instruments

We’ll use bus_stop_opposite as an instrument for friends_opposite to try to isolate the effect of the exogenous variation in friends_opposite. In the workbook, we learned that bus_stop_opposite is a valid instrument if it’s relevant, exogenous, and excludable. Explain how bus_stop_opposite could plausibly satisfy each of those conditions.

d) First Stage

Estimate the first stage: \(\text{friends_opposite}_i = \gamma_0 + \gamma_1 \text{bus_stop_opposite}_i + \gamma_2 \text{friends_same}_i + \gamma_3 \text{friends_same}_i^2 + v_i\) and interpret the relevance of the instrument.
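
Here is a sketch of what a relevance check can look like, run on simulated stand-in data rather than friends. The data frame sim and every coefficient in it are made up; only the column names are borrowed from the equation above.

```r
set.seed(42)
n <- 2000

# Simulated stand-in data; every number below is made up for illustration.
sim <- data.frame(
  bus_stop_opposite = rpois(n, 2),
  friends_same      = rpois(n, 3)
)
sim$friends_opposite <- 1 + 0.6 * sim$bus_stop_opposite +
  0.2 * sim$friends_same + rnorm(n)

first_stage <- lm(
  friends_opposite ~ bus_stop_opposite + friends_same + I(friends_same^2),
  data = sim
)

# t statistic on the instrument; with a single instrument, F = t^2.
t_instrument <- summary(first_stage)$coefficients["bus_stop_opposite", "t value"]
f_instrument <- t_instrument^2
f_instrument
```

A common rule of thumb treats a first-stage F statistic above roughly 10 as evidence that the instrument is not weak; treat that as a guideline, not a formal test.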

e) Second Stage

Estimate the second stage: \(\text{GPA}_i = \alpha_0 + \alpha_1 \widehat{\text{friends_opposite}}_i + \alpha_2 \text{friends_same}_i + \alpha_3 \text{friends_same}_i^2 + w_i\). Does our IV estimate of the effect of friends_opposite improve compared to the naive model estimate?
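
The mechanics of the two stages can be sketched on simulated stand-in data. Everything below is an assumption for illustration: the data frame sim, the unobserved confounder, and all coefficients are made up, and the simulated true effect of friends_opposite is set to -0.1, which is not the value in friends.

```r
set.seed(42)
n <- 5000

# Simulated stand-in data; all coefficients are made up for illustration.
confounder        <- rnorm(n)        # unobserved in practice
bus_stop_opposite <- rpois(n, 2)     # instrument: shifts friends_opposite only
friends_same      <- rpois(n, 3)     # control (no effect on GPA in this sketch)
friends_opposite  <- 1 + 0.6 * bus_stop_opposite + confounder + rnorm(n)
GPA <- 3 - 0.1 * friends_opposite + 0.5 * confounder + rnorm(n, sd = 0.5)
sim <- data.frame(GPA, friends_opposite, friends_same, bus_stop_opposite)

# Naive OLS: biased because the confounder is omitted.
naive <- lm(GPA ~ friends_opposite + friends_same + I(friends_same^2), data = sim)

# First stage: predict the endogenous variable from the instrument and controls.
first_stage <- lm(
  friends_opposite ~ bus_stop_opposite + friends_same + I(friends_same^2),
  data = sim
)
sim$friends_opposite_hat <- fitted(first_stage)

# Second stage: replace the endogenous variable with its fitted values.
second_stage <- lm(
  GPA ~ friends_opposite_hat + friends_same + I(friends_same^2),
  data = sim
)

naive_est <- unname(coef(naive)["friends_opposite"])
iv_est    <- unname(coef(second_stage)["friends_opposite_hat"])
c(naive = naive_est, iv = iv_est)   # simulated true effect is -0.1
```

Running the second stage by hand gives the right point estimate, but its reported standard errors treat the fitted values as if they were data and are therefore not correct. A dedicated routine such as ivreg() in the AER package adjusts them.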