#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Intro to the Tidyverse by Colleen O'Briant
# Koan #5: summarize and group_by
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# In order to progress:
# 1. Read all instructions carefully.
# 2. When you come to an exercise, fill in the blank, un-comment the line
# (Ctrl/Cmd Shift C), and execute the code in the console (Ctrl/Cmd Return).
# If the piece of code spans multiple lines, highlight the whole chunk or
# simply put your cursor at the end of the last line.
# 3. Save (Ctrl/Cmd S).
# 4. Test that your answers are correct (Ctrl/Cmd Shift T).
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# In this koan, you'll learn the next two dplyr verbs:
# summarize() and group_by().
# Run this code to get started.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(gapminder)
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# ----- summarize() -----
# 'summarize()' reduces a tibble down to a customized summary.
# For example, when you want to know the minimum value of a variable,
# or the maximum, or the mean, or the median, you can use 'summarize()'.
gapminder %>%
  summarize(lifeExp_min = min(lifeExp), lifeExp_max = max(lifeExp))
## # A tibble: 1 × 2
## lifeExp_min lifeExp_max
## <dbl> <dbl>
## 1 23.6 82.6
# The output is a tibble with one column per summary statistic. Make sure to
# name each column (here, lifeExp_min and lifeExp_max); otherwise dplyr uses
# the expression itself as the column name.
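# A small sketch of why naming matters (this example is an addition to the
# koan, and 'unnamed' and 'named' are just illustrative variable names): if you
# leave a summary unnamed, dplyr falls back to the expression itself as the
# column name, which is awkward to refer to later.
library(dplyr)
library(gapminder)
unnamed <- gapminder %>%
  summarize(mean(lifeExp))
named <- gapminder %>%
  summarize(lifeExp_mean = mean(lifeExp))
names(unnamed) # "mean(lifeExp)"
names(named)   # "lifeExp_mean"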
# 1. Take 'gapminder', filter for only observations in Africa, -----------------
# and summarize to find the:
# median life expectancy,
# median population, and
# median GDP per capita.
#1@
# gapminder %>%
# __
#@1
# 2. Take 'gapminder', add a new column (mutate) for the total GDP, ------------
# and summarize to find the mean and median total GDP. Try to recall how to use
# mutate before looking back at the previous koan.
#2@
# gapminder %>%
# __
#@2
# Read the qelp docs on 'summarize()':
?qelp::summarize
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# ----- group_by() -----
# The fifth dplyr function we'll learn is 'group_by()'. It assigns each row of
# your data to a bucket (group) based on the variable you specify. The rows
# themselves are not reordered.
# For example, this code groups gapminder into buckets by year:
gapminder %>%
  group_by(year)
## # A tibble: 1,704 × 6
## # Groups: year [12]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
# There's no obvious difference between a grouped tibble and an ungrouped tibble
# except that a grouped tibble has a special attribute called Groups:
# A tibble: 1,704 x 6
# Groups: year [12] <---- here's the attribute!
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.8 8425333 779.
# 2 Afghanistan Asia 1957 30.3 9240934 821.
# 3 Afghanistan Asia 1962 32.0 10267083 853.
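# If you'd rather check the grouping in code than read it off the printout,
# dplyr has a few helpers. This is a sketch added for illustration ('by_year'
# is a made-up name):
library(dplyr)
library(gapminder)
by_year <- gapminder %>%
  group_by(year)
is_grouped_df(by_year) # is the tibble grouped?
group_vars(by_year)    # names of the grouping variables
n_groups(by_year)      # number of buckets
# Grouping changes no values and does not reorder rows:
identical(by_year$country, gapminder$country)
# 'ungroup()' removes the grouping again:
by_year %>%
  ungroup() %>%
  is_grouped_df()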
# 'group_by()' isn't very useful on its own. To see how powerful it is, pair it
# with 'summarize()':
gapminder %>%
  group_by(year) %>%
  summarize(lifeExp_median = median(lifeExp))
## # A tibble: 12 × 2
## year lifeExp_median
## <int> <dbl>
## 1 1952 45.1
## 2 1957 48.4
## 3 1962 50.9
## 4 1967 53.8
## 5 1972 56.5
## 6 1977 59.7
## 7 1982 62.4
## 8 1987 65.8
## 9 1992 67.7
## 10 1997 69.4
## 11 2002 70.8
## 12 2007 71.9
# On its own, 'summarize()' outputs a tibble with *one row*. But in conjunction
# with 'group_by()', 'summarize()' outputs a tibble with the same number of rows
# as there are buckets (groups).
# The code above outputs a summary that tells us what the median life expectancy
# is in our data *for each year*. It's as if R sorted our observations (rows)
# into buckets by the grouping variable and visited each bucket individually to
# calculate the summary statistic before reporting the results.
# Grouped summaries are profoundly useful. Working with data, you'll use them
# all the time.
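# One way to convince yourself of the "one row per bucket" rule is to compare
# the number of rows in a grouped summary with the number of distinct values of
# the grouping variable. A sketch, with 'by_continent' as an illustrative name:
library(dplyr)
library(gapminder)
by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(lifeExp_max = max(lifeExp))
nrow(by_continent)              # rows in the summary
n_distinct(gapminder$continent) # buckets; the two numbers should match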
# 3. Take 'gapminder', filter for only observations in Africa, -----------------
# and summarize to find the median life expectancy, population, and
# GDP per capita *for each country*.
#3@
# gapminder %>%
# __
#@3
# 4. Summarize 'gapminder' to find the mean GDP per capita for each ------------
# continent, for each year (use 2 variables inside 'group_by').
#4@
# gapminder %>%
# __
#@4
# Read the qelp docs on 'group_by()':
?qelp::group_by
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# ----- count() -----
#
# Oftentimes you'll need to answer questions like, "How many observations
# do I have in each continent?"
# You *could* 'filter' for each continent and then use 'nrow':
gapminder %>%
  filter(continent == "Africa") %>%
  nrow()
## [1] 624
gapminder %>%
  filter(continent == "Americas") %>%
  nrow()
## [1] 300
gapminder %>%
  filter(continent == "Asia") %>%
  nrow()
## [1] 396
# But this is not the best way: it's repetitive. If 'continent'
# took on 100 values, you'd have to copy-paste the code above 100 times!
# Instead, 'group_by' continent and 'summarize()' with the function 'n()',
# which counts the number of observations in the grouping context:
gapminder %>%
  group_by(continent) %>%
  summarize(n_observations = n())
## # A tibble: 5 × 2
## continent n_observations
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
# Grouping by a variable and counting with 'n()' is such a common task that
# there's a shortcut: 'count()' is equivalent to group_by + summarize + n
# (it names the count column 'n'):
gapminder %>%
  count(continent)
## # A tibble: 5 × 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
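# 'count()' also takes a couple of optional arguments that save a step. A
# sketch (the names 'by_hand', 'shortcut', and 'sorted' are illustrative):
library(dplyr)
library(gapminder)
by_hand <- gapminder %>%
  group_by(continent) %>%
  summarize(n = n())
shortcut <- gapminder %>%
  count(continent)
identical(by_hand$n, shortcut$n) # the counts agree
# 'sort = TRUE' puts the largest groups first, and 'name' renames the count
# column:
sorted <- gapminder %>%
  count(continent, sort = TRUE, name = "n_observations")
sorted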
# 5. How many observations are there from each country? ------------------------
#5@
# gapminder %>%
# __
#@5
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Great work! You're one step closer to tidyverse enlightenment. Make sure to
# return to this topic to meditate on it later.
# If you're ready, you can move on to the next koan: arrange and slice.