K05_dplyr2.R

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#                   Intro to the Tidyverse by Colleen O'Briant
#                        Koan #5: summarize and group_by
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# In order to progress:
# 1. Read all instructions carefully.
# 2. When you come to an exercise, fill in the blank, un-comment the line
#    (Ctrl/Cmd Shift C), and execute the code in the console (Ctrl/Cmd Return).
#    If the piece of code spans multiple lines, highlight the whole chunk or
#    simply put your cursor at the end of the last line.
# 3. Save (Ctrl/Cmd S).
# 4. Test that your answers are correct (Ctrl/Cmd Shift T).

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# In this koan, you'll learn the next two dplyr verbs:
# summarize() and group_by().

# Run this code to get started.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(gapminder)

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                            ----- summarize() -----

# 'summarize()' reduces a tibble down to a customized summary.

# For example, when you want to know the minimum value of a variable,
# or the maximum, or the mean, or the median, you can use 'summarize()'.

gapminder %>%
  summarize(lifeExp_min = min(lifeExp), lifeExp_max = max(lifeExp))

## # A tibble: 1 × 2
##   lifeExp_min lifeExp_max
##         <dbl>       <dbl>
## 1        23.6        82.6

# The output is a tibble with one column per summary statistic. Make
# sure to name each column (here, lifeExp_min and lifeExp_max).
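
# A minimal sketch of the same idea on a toy tibble, small enough to check
# every number by hand. The tibble 'toy' and its columns are made up for
# illustration; they are not part of gapminder.

library(dplyr)

toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 60)
)

# Each named argument becomes one column of the one-row summary.
toy_summary <- toy %>%
  summarize(
    value_mean = mean(value),
    value_median = median(value),
    n_rows = n()
  )

toy_summary

# Here value_mean is 18.8, value_median is 10, and n_rows is 5.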

# 1. Take 'gapminder', filter for only observations in Africa, -----------------
# and summarize to find the:
#    median life expectancy,
#    median population, and
#    median GDP per capita.

#1@

# gapminder %>%
# __

#@1

# 2. Take 'gapminder', add a new column (mutate) for the total GDP, ------------
# and summarize to find the mean and median total GDP. Try to recall how to use
# mutate before looking back at the previous koan.

#2@

# gapminder %>%
# __

#@2

# Read the qelp docs on 'summarize()':

?qelp::summarize

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                             ----- group_by() -----

# The fifth dplyr function we'll learn is 'group_by()'. It assigns each row of
# your data to a bucket (group) based on the variable you specify. The rows
# themselves are not reordered.

# For example, this code groups gapminder into buckets by year:

gapminder %>%
  group_by(year)

## # A tibble: 1,704 × 6
## # Groups:   year [12]
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

# When printed, a grouped tibble looks just like an ungrouped tibble, except
# for an extra 'Groups:' line in the header. That line names the grouping
# variable and reports the number of groups:

#  A tibble: 1,704 x 6
#  Groups:   year [12] <---- here's the attribute!
#    country     continent  year lifeExp      pop gdpPercap
#    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#  1 Afghanistan Asia       1952    28.8  8425333      779.
#  2 Afghanistan Asia       1957    30.3  9240934      821.
#  3 Afghanistan Asia       1962    32.0 10267083      853.
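
# To check the grouping in code rather than by reading the printout, dplyr
# provides a few helper functions. A small sketch on a made-up tibble ('toy'
# is hypothetical, not part of gapminder):

library(dplyr)

toy <- tibble(
  year = c(1952, 1952, 1957, 1957),
  lifeExp = c(30, 40, 35, 45)
)

toy_grouped <- toy %>%
  group_by(year)

is_grouped_df(toy)          # FALSE: no grouping yet
is_grouped_df(toy_grouped)  # TRUE
group_vars(toy_grouped)     # "year"
n_groups(toy_grouped)       # 2, one group per distinct year

# The rows and columns themselves are unchanged, and 'ungroup()' removes the
# grouping again.
toy_ungrouped <- toy_grouped %>%
  ungroup()

is_grouped_df(toy_ungrouped)  # FALSE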

# 'group_by()' isn't very useful on its own. To see how powerful it is, pair it
# with 'summarize()':

gapminder %>%
  group_by(year) %>%
  summarize(lifeExp_median = median(lifeExp))

## # A tibble: 12 × 2
##     year lifeExp_median
##    <int>          <dbl>
##  1  1952           45.1
##  2  1957           48.4
##  3  1962           50.9
##  4  1967           53.8
##  5  1972           56.5
##  6  1977           59.7
##  7  1982           62.4
##  8  1987           65.8
##  9  1992           67.7
## 10  1997           69.4
## 11  2002           70.8
## 12  2007           71.9

# On its own, 'summarize()' outputs a tibble with *one row*. But in conjunction
# with 'group_by()', 'summarize()' outputs a tibble with the same number of rows
# as there are buckets (groups).

# The code above outputs a summary that tells us what the median life expectancy
# is in our data *for each year*. It's as if R sorted our observations (rows)
# into buckets by the grouping variable and visited each bucket individually to
# calculate the summary statistic before reporting the results.
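
# The bucket picture can be checked by hand on a toy tibble (made up for
# illustration): compute the median inside each bucket with 'filter()', then
# compare with the grouped summary.

library(dplyr)

toy <- tibble(
  year = c(1952, 1952, 1952, 1957, 1957, 1957),
  lifeExp = c(30, 40, 50, 35, 45, 70)
)

# Visit each bucket individually:
bucket_1952 <- toy %>%
  filter(year == 1952) %>%
  summarize(lifeExp_median = median(lifeExp))

bucket_1957 <- toy %>%
  filter(year == 1957) %>%
  summarize(lifeExp_median = median(lifeExp))

# Let group_by + summarize do the same work in one step:
toy_medians <- toy %>%
  group_by(year) %>%
  summarize(lifeExp_median = median(lifeExp))

bucket_1952$lifeExp_median  # 40
bucket_1957$lifeExp_median  # 45
toy_medians$lifeExp_median  # 40 45, the same values, one per year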

# Grouped summaries are profoundly useful. Working with data, you'll use them
# all the time.

# 3. Take 'gapminder', filter for only observations in Africa, -----------------
# and summarize to find the median life expectancy, population, and
# GDP per capita *for each country*.

#3@

# gapminder %>%
# __

#@3


# 4. Summarize 'gapminder' to find the mean GDP per capita for each ------------
# continent, for each year (use 2 variables inside 'group_by').

#4@

# gapminder %>%
# __

#@4


# Read the qelp docs on 'group_by()':
?qelp::group_by
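
# A sketch of what happens to the grouping after a summary over two grouping
# variables, using a made-up tibble. This assumes dplyr 1.0.0 or later, where
# 'summarize()' has a '.groups' argument.

library(dplyr)

toy <- tibble(
  continent = c("A", "A", "A", "B", "B", "B"),
  year = c(1952, 1952, 1957, 1952, 1957, 1957),
  value = c(1, 3, 5, 10, 20, 40)
)

# By default, 'summarize()' peels off the last grouping variable, so the
# result is still grouped by 'continent' (dplyr may print a message saying so).
toy_by_both <- toy %>%
  group_by(continent, year) %>%
  summarize(value_mean = mean(value))

group_vars(toy_by_both)  # "continent"

# '.groups = "drop"' returns an ungrouped tibble instead.
toy_dropped <- toy %>%
  group_by(continent, year) %>%
  summarize(value_mean = mean(value), .groups = "drop")

group_vars(toy_dropped)  # character(0)

# Either way there is one row per continent-year combination (4 rows here).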

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                              ----- count() -----

# Oftentimes you'll need to answer questions like, "How many observations
# do I have in each continent?"

# You *could* 'filter' for each continent and then use 'nrow':

gapminder %>%
  filter(continent == "Africa") %>%
  nrow()

## [1] 624

gapminder %>%
  filter(continent == "Americas") %>%
  nrow()

## [1] 300

gapminder %>%
  filter(continent == "Asia") %>%
  nrow()

## [1] 396

# But this is not the best way: it's repetitive. If 'continent'
# took on 100 values, you'd have to copy-paste the code above 100 times!

# Instead, 'group_by()' continent and 'summarize()' with the function 'n()',
# which counts the number of observations in each group:

gapminder %>%
  group_by(continent) %>%
  summarize(n_observations = n())

## # A tibble: 5 × 2
##   continent n_observations
##   <fct>              <int>
## 1 Africa               624
## 2 Americas             300
## 3 Asia                 396
## 4 Europe               360
## 5 Oceania               24

# Grouping by a variable and counting 'n()' is such a common task,
# there's an even simpler way to do it: 'count()' is equivalent
# to group_by + summarize + n:

gapminder %>%
  count(continent)

## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
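
# A quick check of that equivalence on a made-up tibble. The 'name' and 'sort'
# arguments used at the end are arguments of 'count()'; the data is
# hypothetical.

library(dplyr)

toy <- tibble(continent = c("A", "B", "B", "C", "C", "C"))

long_way <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

short_way <- toy %>%
  count(continent)

identical(long_way$n, short_way$n)  # TRUE: both give 1 2 3

# 'name' renames the count column, and 'sort = TRUE' puts the largest
# groups first.
toy_sorted <- toy %>%
  count(continent, name = "n_observations", sort = TRUE)

toy_sorted$continent       # "C" "B" "A"
toy_sorted$n_observations  # 3 2 1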

# 5. How many observations are there from each country? ------------------------

#5@

# gapminder %>%
# __

#@5

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Great work! You're one step closer to tidyverse enlightenment. Make sure to
# return to this topic to meditate on it later.

# If you're ready, you can move on to the next koan: arrange and slice.

colleenobriant

2022-08-24