# Intro to the Tidyverse by Colleen O'Briant
# Koan #4: filter, select, and mutate
# In order to progress:
# 1. Read all instructions carefully.
# 2. When you come to an exercise, fill in the blank, un-comment the line
# (Ctrl/Cmd Shift C), and execute the code in the console (Ctrl/Cmd Return).
# If the piece of code spans multiple lines, highlight the whole chunk or
# simply put your cursor at the end of the last line.
# 3. Save (Ctrl/Cmd S).
# 4. Test that your answers are correct (Ctrl/Cmd Shift T).
# In this koan, you'll learn the first three dplyr verbs: 'filter', 'select',
# and 'mutate'.
# dplyr is a package for data manipulation (the name is supposed to evoke
# "data pliers"). It's included in the tidyverse, so you automatically have
# access to all the dplyr functions whenever you attach the tidyverse with
# 'library(tidyverse)'.
# dplyr is modeled on SQL. What is SQL? It stands for "Structured Query
# Language": it's a programming language for answering questions ("queries")
# about a dataset. With SQL (and with dplyr), you can transform your data into a
# neat table of results to answer just about any question you have about your
# data!
# As we'll see in this exercise and the next, you can use dplyr on a
# basic demographic dataset to answer questions like...
# - What were the 5 richest countries in the 1950s?
# - Which continent has the highest life expectancy on average?
# - What year did Mexico have the highest population growth?
# SQL is wonderful because in order to answer all these questions, you only need
# to learn 7 functions! dplyr is the same way, except dplyr is EVEN MORE
# wonderful because, where SQL is very *structured*, dplyr is not. You can use
# the 7 dplyr functions in whichever order feels straightforward and logical.
# This helps with readability and clarity (two of our biggest goals with writing
# good code).
# In this exercise, we'll learn the first 3 dplyr functions: 'filter()',
# 'select()', and 'mutate()'.
# In the next exercise, we'll learn two more: 'summarize()' and 'group_by()'.
# Finally, we'll finish up by learning the last two: 'arrange()' and 'slice()'.
# Run this code to get started and to view the dataset:
library(tidyverse)
# Un-comment the next line to install a new package 'gapminder', then re-comment
# it so you don't keep installing it every time you run this script or test
# this koan.
# install.packages("gapminder")
library(gapminder)
# view(gapminder)
# ----- filter() -----
# The first dplyr function we'll learn is 'filter()'. It lets us filter
# our tibble by some logical condition, like "continent is equal to Europe".
# This gives us all the rows (observations) that have Europe as the continent.
gapminder %>%
  filter(continent == "Europe")
## # A tibble: 360 × 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Albania Europe     1952    55.2 1282697     1601.
##  2 Albania Europe     1957    59.3 1476505     1942.
##  3 Albania Europe     1962    64.8 1728137     2313.
##  4 Albania Europe     1967    66.2 1984060     2760.
##  5 Albania Europe     1972    67.7 2263554     3313.
##  6 Albania Europe     1977    68.9 2509048     3533.
##  7 Albania Europe     1982    70.4 2780097     3631.
##  8 Albania Europe     1987    72   3075321     3739.
##  9 Albania Europe     1992    71.6 3326498     2497.
## 10 Albania Europe     1997    73.0 3428038     3193.
## # … with 350 more rows
# You can filter on more than one logical condition at a time. This will
# return a tibble with all the gapminder observations from Asia in the year
# 1952. Recall that character strings ("Asia") need quotes, but variable names
# and numbers don't (continent, year, 1952).
gapminder %>%
  filter(continent == "Asia", year == 1952)
## # A tibble: 33 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan      Asia       1952    28.8   8425333      779.
##  2 Bahrain          Asia       1952    50.9    120447     9867.
##  3 Bangladesh       Asia       1952    37.5  46886859      684.
##  4 Cambodia         Asia       1952    39.4   4693836      368.
##  5 China            Asia       1952    44   556263527      400.
##  6 Hong Kong, China Asia       1952    61.0   2125900     3054.
##  7 India            Asia       1952    37.4 372000000      547.
##  8 Indonesia        Asia       1952    37.5  82052000      750.
##  9 Iran             Asia       1952    44.9  17272000     3035.
## 10 Iraq             Asia       1952    45.3   5441766     4130.
## # … with 23 more rows
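# A quick sketch of what the comma means: conditions separated by commas are
# combined with AND, so a row must pass every one of them. Writing '&' gives
# the same result, and '|' means OR. The tibble 'fruit_sales' is made up for
# illustration and is not part of gapminder.
library(dplyr)  # already attached if you ran library(tidyverse)
fruit_sales <- tibble(
  fruit = c("apple", "apple", "pear", "pear"),
  year  = c(2020, 2021, 2020, 2021)
)
fruit_sales %>%
  filter(fruit == "apple", year == 2021)     # AND: keeps one row
fruit_sales %>%
  filter(fruit == "apple" & year == 2021)    # same single row as above
fruit_sales %>%
  filter(fruit == "apple" | year == 2021)    # OR: keeps three rows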
# Read the qelp docs on 'filter()'.
# 1. Filter gapminder for all the observations from Europe in 2007. ------------
# __
# 2. Filter gapminder for all the observations where lifeExp was exactly -------
# equal to 68 years old.
# __
# ----- Logical Operators -----
# The operator '==' reads "is equal to".
# But that's not the only logical operator you can use with 'filter()'.
# '!=': "not equal to"
# '>', '>=', '<', '<=': "greater than", "greater than or equal to", etc.
# '%in%': for "in"
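# To see what these operators do on their own, you can try them on plain
# vectors in the console. Each comparison returns TRUE or FALSE for every
# element, and 'filter()' keeps the rows where the result is TRUE. The vectors
# 'ages' and 'pets' are made up for illustration.
ages <- c(12, 30, 45)
ages != 30                            # TRUE FALSE TRUE
ages >= 30                            # FALSE TRUE TRUE
pets <- c("cat", "dog", "fish")
pets %in% c("dog", "fish", "bird")    # FALSE TRUE TRUE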
# 3. Filter gapminder for a short list of the richest countries in Asia. -------
# gapminder %>%
#   filter(continent == __, gdpPercap __ 30000)
# 4. Filter gapminder for observations *IN* the United States, Germany, --------
# and Brazil.
# gapminder %>%
#   filter(country __ c("United States", "Germany", "Brazil"))
# ----- select() -----
# The function 'select()' lets you select columns (variables) of a tibble
# by name. No quotes are necessary because they're variable names.
gapminder %>%
  select(country, continent, year)
## # A tibble: 1,704 × 3
##    country     continent  year
##    <fct>       <fct>     <int>
##  1 Afghanistan Asia       1952
##  2 Afghanistan Asia       1957
##  3 Afghanistan Asia       1962
##  4 Afghanistan Asia       1967
##  5 Afghanistan Asia       1972
##  6 Afghanistan Asia       1977
##  7 Afghanistan Asia       1982
##  8 Afghanistan Asia       1987
##  9 Afghanistan Asia       1992
## 10 Afghanistan Asia       1997
## # … with 1,694 more rows
# 5. Select the last 3 variables of 'gapminder' by name. -----------------------
# gapminder %>%
#   select(__, __, __)
# You can also use 'select()' to rename variables. For example, to create
# a tibble where the variable 'lifeExp' is changed to the name
# 'life_expectancy', you can do this:
gapminder %>%
  select(life_expectancy = lifeExp)
## # A tibble: 1,704 × 1
##    life_expectancy
##              <dbl>
##  1            28.8
##  2            30.3
##  3            32.0
##  4            34.0
##  5            36.1
##  6            38.4
##  7            39.9
##  8            40.8
##  9            41.7
## 10            41.8
## # … with 1,694 more rows
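# Notice that renaming inside 'select()' keeps only the columns you list,
# which is why the result above has a single column. To keep other variables
# too, list them alongside the renamed one. A small sketch on a made-up tibble
# ('scores' is not part of gapminder):
library(dplyr)  # already attached if you ran library(tidyverse)
scores <- tibble(name = c("Ana", "Ben"), pts = c(10, 12), yr = c(2020, 2021))
scores %>%
  select(points = pts)          # one column: points
scores %>%
  select(name, points = pts)    # two columns: name, points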
# Make sure to read the qelp docs on 'select()'.
# ----- mutate() -----
# Use 'mutate()' to add new variables to your tibble. Those new variables
# can even be transformations of other existing variables. For example,
# we can create a variable called 'total_gdp' that is the product of
# 'pop' and 'gdpPercap'. I've piped it into 'view()' so you can verify that
# it worked.
# gapminder %>%
#   mutate(total_gdp = pop * gdpPercap) %>%
#   view()
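# A smaller sketch of the same idea on a made-up tibble ('orders' is not part
# of gapminder). 'mutate()' returns a new tibble with the extra column and
# leaves the original alone, so assign the result if you want to keep it.
library(dplyr)  # already attached if you ran library(tidyverse)
orders <- tibble(price = c(2, 5), quantity = c(10, 3))
orders_total <- orders %>%
  mutate(total = price * quantity)
orders_total$total    # 20 15
names(orders)         # still just "price" "quantity"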
# 6. Use 'mutate()' to create a new variable 'pop_in_thousands'. ---------------
# So if the observation has 'pop' = 97,462, then 'pop_in_thousands' will be
# 97.462.
# gapminder %>%
#   mutate(__)
# Read the qelp docs on 'mutate()'.
# Great work! You're one step closer to tidyverse enlightenment. Make sure to
# return to this topic to meditate on it later.
# If you're ready, you can move on to the next koan: summarize and group_by.