K04_dplyr1.R

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#                   Intro to the Tidyverse by Colleen O'Briant
#                      Koan #4: filter, select, and mutate
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# In order to progress:
# 1. Read all instructions carefully.
# 2. When you come to an exercise, fill in the blank, un-comment the line
#    (Ctrl/Cmd Shift C), and execute the code in the console (Ctrl/Cmd Return).
#    If the piece of code spans multiple lines, highlight the whole chunk or
#    simply put your cursor at the end of the last line.
# 3. Save (Ctrl/Cmd S).
# 4. Test that your answers are correct (Ctrl/Cmd Shift T).

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# In this koan, you'll learn the first three dplyr verbs: 'filter', 'select',
# and 'mutate'.

# dplyr is a package for data manipulation (the name is supposed to evoke
# "data pliers"). It's included in the tidyverse, so you automatically have
# access to all the dplyr functions whenever you attach the tidyverse with
# 'library(tidyverse)'.

# dplyr is a SQL implementation. What is SQL? It stands for "Structured Query
# Language": it's a programming language for answering questions ("queries")
# about a dataset. With SQL (and with dplyr), you can transform your data into a
# neat table of results to answer just about any question you have about your
# data!

# As we'll see in this exercise and the next, you can use dplyr on a
# basic demographic dataset to answer questions like...
# - What were the 5 richest countries in the 1950s?
# - Which continent has the highest life expectancy on average?
# - What year did Mexico have the highest population growth?

# SQL is wonderful because in order to answer all these questions, you only need
# to learn 7 functions! dplyr is the same way, except dplyr is EVEN MORE
# wonderful because, where SQL is very *structured*, dplyr is not. You can use
# the 7 dplyr functions in whichever order feels straightforward and logical.
# This helps with readability and clarity (two of our biggest goals with writing
# good code).

# In this exercise, we'll learn the first 3 dplyr functions: 'filter()',
# 'select()', and 'mutate()'.

# In the next exercise, we'll learn two more: 'summarize()' and 'group_by()'.

# Finally, we'll finish up by learning the last two: 'arrange()' and 'slice()'.

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Run this code to get started and to view the dataset:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# Un-comment the next line to install a new package 'gapminder', then re-comment
# it so you don't keep installing it every time you run this script or test
# this koan.
# install.packages("gapminder")
library(gapminder)
# view(gapminder)

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                              ----- filter() -----

# The first dplyr function we'll learn is 'filter()'. It lets us filter
# our tibble by some logical condition, like "continent is equal to Europe".
# This gives us all the rows (observations) that have Europe as the continent.

gapminder %>%
  filter(continent == "Europe")

## # A tibble: 360 × 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Albania Europe     1952    55.2 1282697     1601.
##  2 Albania Europe     1957    59.3 1476505     1942.
##  3 Albania Europe     1962    64.8 1728137     2313.
##  4 Albania Europe     1967    66.2 1984060     2760.
##  5 Albania Europe     1972    67.7 2263554     3313.
##  6 Albania Europe     1977    68.9 2509048     3533.
##  7 Albania Europe     1982    70.4 2780097     3631.
##  8 Albania Europe     1987    72   3075321     3739.
##  9 Albania Europe     1992    71.6 3326498     2497.
## 10 Albania Europe     1997    73.0 3428038     3193.
## # … with 350 more rows

# You can filter on more than one logical condition at a time. This will
# return a tibble with all the gapminder observations from Asia in the year
# 1952. Recall that character strings ("Asia") need quotes, but variable names
# and numbers don't (continent, year, 1952).

gapminder %>%
  filter(continent == "Asia", year == 1952)

## # A tibble: 33 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan      Asia       1952    28.8   8425333      779.
##  2 Bahrain          Asia       1952    50.9    120447     9867.
##  3 Bangladesh       Asia       1952    37.5  46886859      684.
##  4 Cambodia         Asia       1952    39.4   4693836      368.
##  5 China            Asia       1952    44   556263527      400.
##  6 Hong Kong, China Asia       1952    61.0   2125900     3054.
##  7 India            Asia       1952    37.4 372000000      547.
##  8 Indonesia        Asia       1952    37.5  82052000      750.
##  9 Iran             Asia       1952    44.9  17272000     3035.
## 10 Iraq             Asia       1952    45.3   5441766     4130.
## # … with 23 more rows

# Read the qelp docs on 'filter()':

?qelp::filter

# 1. Filter gapminder for all the observations from Europe in 2007. ------------

#1@

# __

#@1

# 2. Filter gapminder for all the observations where lifeExp was exactly -------
# equal to 68 years old.

#2@

# __

#@2

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                          ----- Logical Operators -----

# The operator '==' reads "is equal to".
# But that's not the only logical operator you can use with 'filter()'.

# '!=': "not equal to"
# '>', '>=', '<', '<=': "greater than", "greater than or equal to", etc.
# '%in%': for "in"

# 3. Filter gapminder for a short list of the richest countries in Asia. -------

#3@

# gapminder %>%
#   filter(continent == __, gdpPercap __ 30000)

#@3

# 4. Filter gapminder for observations *IN* the United States, Germany, --------
# and Brazil.

#4@

# gapminder %>%
#   filter(country __ c("United States", "Germany", "Brazil"))

#@4

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                              ----- select() -----
#
# The function 'select()' lets you select columns (variables) of a tibble
# by name. No quotes are necessary because they're variable names.

gapminder %>%
  select(country, continent, year)

## # A tibble: 1,704 × 3
##    country     continent  year
##    <fct>       <fct>     <int>
##  1 Afghanistan Asia       1952
##  2 Afghanistan Asia       1957
##  3 Afghanistan Asia       1962
##  4 Afghanistan Asia       1967
##  5 Afghanistan Asia       1972
##  6 Afghanistan Asia       1977
##  7 Afghanistan Asia       1982
##  8 Afghanistan Asia       1987
##  9 Afghanistan Asia       1992
## 10 Afghanistan Asia       1997
## # … with 1,694 more rows

# 5: Select the last 3 variables of 'gapminder' by name: -----------------------

#5@

# gapminder %>%
#   select(__, __, __)

#@5

# You can also use 'select()' to rename variables. For example, to create
# a tibble where the variable 'lifeExp' is changed to the name
# 'life_expectancy', you can do this:

gapminder %>%
  select(life_expectancy = lifeExp)

## # A tibble: 1,704 × 1
##    life_expectancy
##              <dbl>
##  1            28.8
##  2            30.3
##  3            32.0
##  4            34.0
##  5            36.1
##  6            38.4
##  7            39.9
##  8            40.8
##  9            41.7
## 10            41.8
## # … with 1,694 more rows

# Make sure to read the qelp docs on 'select()':

?qelp::select

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                              ----- mutate() -----

# Use 'mutate()' to add new variables to your tibble. Those new variables
# can even be transformations of other existing variables. For example,
# we can create a variable called 'total_gdp' that is the product of
# 'pop' and 'gdpPercap'. I've piped it into 'view()' so you can verify that
# it worked.

# gapminder %>%
#   mutate(total_gdp = pop*gdpPercap) %>%
#   view()

# 6. Use 'mutate()' to create a new variable 'pop_in_thousands'. ---------------
# So if the observation has 'pop' = 97,462, then 'pop_in_thousands' will be
# 97.462.

#6@

# gapminder %>%
#   mutate(__)

#@6

# Read the qelp docs on 'mutate()':

?qelp::mutate

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Great work! You're one step closer to tidyverse enlightenment. Make sure to
# return to this topic to meditate on it later.

# If you're ready, you can move on to the next koan: summarize and group_by.

K04_dplyr1.R

colleenobriant

2022-08-24