#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Intro to the Tidyverse by Colleen O'Briant
# Koan #11: lm()
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# In order to progress:
# 1. Read all instructions carefully.
# 2. When you come to an exercise, fill in the blank, un-comment the line
# (Ctrl/Cmd Shift C), and execute the code in the console (Ctrl/Cmd Return).
# If the piece of code spans multiple lines, highlight the whole chunk or
# simply put your cursor at the end of the last line.
# 3. Save (Ctrl/Cmd S).
# 4. Test that your answers are correct (Ctrl/Cmd Shift T).
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# In this koan, you'll learn how to use lm() to estimate models with:
# * log transformations,
# * a squared term,
# * no intercept,
# * categorical variables (called "factors"),
# * and interactions between variables.
# Run this code to get started:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(gapminder)
# Read the qelp docs on lm():
?qelp::lm
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# ----- Estimates, Standard Errors, and R-Squared -----
# 1. Simple Regression: How much can a $1 increase to a country's GDP per ------
# capita be expected to boost that country's life expectancy?
# Recall the lm() syntax is: lm(y ~ x, data = gapminder)
#01@
# lm(__)
# A $1 increase to a country's GDP per capita can be associated with a __ year
# increase in life expectancy.
#@01
# lm() prints very limited information. Usually we'd at least want to know the
# standard errors along with the coefficients. Practice piping the lm() call
# into `broom::tidy(conf.int = TRUE)` to get a tidied results table that shows
# you standard errors, test statistics, p-values, and even 95% confidence
# intervals.
# 2. Pipe the simple regression from question 1 into ---------------------------
# `broom::tidy(conf.int = TRUE)`. Is the coefficient on `gdpPercap` greater
# than 0 at the 1% level?
#02@
# lm(__) %>%
# broom::tidy(conf.int = TRUE)
# The coefficient on `gdpPercap` __ greater than 0 at the 1% level.
# (In the blank above, write "is" or "is not".)
#@02
# 3. To see the regression's r-squared, pipe the lm() call into ----------------
# `broom::glance()`.
#03@
# lm(__) %>%
# broom::glance()
# This model explains __% of the variation in lifeExp.
#@03
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# ----- Fitted Values and Residuals -----
#
# In this class, we'll sometimes need to access fitted values and residuals from
# an lm() call. You should use the functions fitted.values() and residuals() to
# do that.
# 4. Pipe the simple regression into fitted.values() to get a vector of --------
# length 1704 (the same length as the number of rows of gapminder).
#04@
# lm(__) %>%
# __
#@04
# 5. Pipe the simple regression into residuals() to get another vector ---------
# of length 1704. Recall that the fitted values plus the residuals will be equal
# to the observed values for lifeExp.
#05@
# lm(__) %>%
# __
#@05
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# ----- Transforming Variables ------
#
# 6. Take the simple regression we've been working on and apply a log ----------
# transformation to both the dependent variable and the explanatory variable by
# wrapping the variable names in the function log():
# lm(log(y) ~ log(x), data = gapminder).
#06@
# lm(__) %>%
# broom::tidy(conf.int = TRUE)
# When GDP per capita increases by 1%, life expectancy will increase by __%.
#@06
# 7. Instead of applying log() to both the dependent and explanatory -----------
# variable, apply it only to the explanatory variable. Does the R-squared
# improve?
#07@
# lm(__) %>%
# broom::glance()
#@07
# 8. Instead of doing a log transformation, try taking the square --------------
# of gdpPercap. You'll need to wrap the transformation in I(), which "inhibits
# the interpretation" of arithmetic operators like "+", "-", "*", and "^" in
# formulas. So the formula will look like: lm(y ~ I(x^2), data = gapminder).
# What's the R-squared now? Does a log transformation, square transformation,
# or no transformation at all of gdpPercap seem to be the best fit?
#08@
# lm(__) %>%
# broom::glance()
# This model explains __% of the variation in lifeExp.
#@08
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# 9. If you wanted to omit the intercept term, you can do that like this: ------
# lm(y ~ 0 + x, data = gapminder). Take the simple regression and omit
# the intercept.
#09@
# lm(__) %>%
# broom::tidy(conf.int = TRUE)
#@09
# 10. Take the simple regression we've been working on and add a second --------
# explanatory variable: 'year'.
#10@
# lm(__ ~ __ + __, data = __) %>%
# broom::tidy(conf.int = TRUE)
# Every year, a country's life expectancy is expected to increase by __ years.
#@10
# 11. Take out 'year' from the model and add 'continent'. This variable --------
# is different from the numerical variables we've been working with (lifeExp,
# gdpPercap, and year are all numeric). 'continent' is a "factor" variable,
# which means the data is categorical rather than numeric. When you add
# 'continent', lm() lets different continents have different intercepts.
#11@
# lm(__ ~ __ + __, data = __) %>%
# broom::tidy(conf.int = TRUE)
# What happened to Africa? One level has to be omitted to avoid the dummy
# variable trap. So the intercept is the OLS prediction for the intercept for
# Africa. It's saying that if a country in Africa has a GDP per capita of $0
# (which is nonsense), the life expectancy of the people in that country is
# expected to be 47.9 years. The intercept for a country in the Americas is
# 47.9 + 13.6 = 61.5 years.
# Our model predicts that a country in Asia with $0 GDP per capita will have a
# life expectancy of __ years.
#@11
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# In the previous question, we learned that you can allow different continents
# to have different *intercepts* by including 'continent' as an explanatory
# variable.
# You can also allow different continents to have different *slopes* by
# including an interaction between 'continent' and another variable like
# 'gdpPercap'.
# 12. Estimate a model of lifeExp where the explanatory variable is ------------
# the interaction between 'gdpPercap' and 'continent'. Hint: an interaction
# between x and z is done like this: lm(y ~ x:z, data = gapminder)
#12@
# lm(__) %>%
# broom::tidy(conf.int = TRUE)
# According to our model, a $1 larger GDP per capita can be associated with a
# __ year higher life expectancy in Europe.
#@12
# Note: When you include interactions between variables, you usually also
# include the variables by themselves: y ~ x + z + x:z. That way, you're letting
# both the slope and the intercept vary by the variables. A shorthand for that
# specification is this: y ~ x*z.
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Great work! You're one step closer to tidyverse enlightenment. Make sure to
# return to this topic to meditate on it later.
# If you're ready, you can move on to koan 12: stats in R.