K02_tibble.R

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#                   Intro to the Tidyverse by Colleen O'Briant
#                                Koan #2: Tibbles
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# In order to progress:
# 1. Read all instructions carefully.
# 2. When you come to an exercise, fill in the blank, un-comment the line
#    (Ctrl/Cmd Shift C), and execute the code in the console (Ctrl/Cmd Return).
#    If the piece of code spans multiple lines, highlight the whole chunk or
#    simply put your cursor at the end of the last line.
# 3. Save (Ctrl/Cmd S).
# 4. Test that your answers are correct (Ctrl/Cmd Shift T).

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Tibbles:

# In the previous koan, we learned how to hold data in vectors. But data
# usually comes to us in a table format (like a spreadsheet). When you're
# working with a tibble, you want to think of it as just a bunch of column
# vectors.

# Load the tidyverse to get started:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                         ----- Building a tibble -----
#
# To show you what a tibble is, I'll build one. Suppose I have some data on US
# GDP per capita and life expectancy since 1957. Here's the data in written
# form:
#
# In 1957, US GDP per capita was $14,847 and life expectancy was 69.5 years.
# In 1977, US GDP per capita was $24,073 and life expectancy was 73.4 years.
# In 1997, US GDP per capita was $35,767 and life expectancy was 76.8 years.

# Here's how to put that data into a tibble in R. Read this carefully and run
# the code (highlight the code you want to run or put your cursor at the end
# of an expression and hit Cmd-Return):

us_data <- tibble(
  year = c(1957, 1977, 1997),
  gdpPercap = c(14847, 24073, 35767),
  lifeExp = c(69.5, 73.4, 76.8)
)

us_data

## # A tibble: 3 × 3
##    year gdpPercap lifeExp
##   <dbl>     <dbl>   <dbl>
## 1  1957     14847    69.5
## 2  1977     24073    73.4
## 3  1997     35767    76.8

# Notice that in order to build a tibble, you define vectors with 'c' that
# become the columns of the table. The columns have names.

# Now take some data for Canada:
# In 1957, GDP per capita in Canada was $12,490 and life expectancy was 70.0 years.
# In 1977, GDP per capita in Canada was $22,091 and life expectancy was 74.2 years.
# In 1997, GDP per capita in Canada was $28,955 and life expectancy was 78.0 years.

# 1. Create a tibble that holds the canada_data: -------------------------------

#1@

# canada_data <- tibble(
#   year      = __,
#   gdpPercap = __,
#   lifeExp   = __
# )

# canada_data

#@1

# Tibbles are pretty strict. They have 2 rules you must follow:
#  1. Each column must be named ('year', 'gdpPercap', and 'lifeExp' are the
#     names here). If you try to define a column without giving it a name,
#     'tibble' will come up with the name for you.
#  2. Each column must have the same number of rows (each column vector has 3
#     elements, so each column vector has 3 rows, and the tibble has 3 rows). If
#     you try to define a column that's shorter than the others, 'tibble' will
#     either recycle the values of that column to make it long enough, or it
#     will give you an error.

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                       ----- Inspecting your tibble -----

# There are a few ways to see the data held in the tibble. The first way is
# to print the tibble to your console either by using the function 'print',
# or calling the dataset's name:

print(us_data)

## # A tibble: 3 × 3
##    year gdpPercap lifeExp
##   <dbl>     <dbl>   <dbl>
## 1  1957     14847    69.5
## 2  1977     24073    73.4
## 3  1997     35767    76.8

us_data

## # A tibble: 3 × 3
##    year gdpPercap lifeExp
##   <dbl>     <dbl>   <dbl>
## 1  1957     14847    69.5
## 2  1977     24073    73.4
## 3  1997     35767    76.8

# A tibble: 4 x 3
#    year  gdpPercap lifeExp
#    <dbl>    <dbl>    <dbl>
# 1  1957     14847    69.5
# 2  1977     24073    73.4
# 3  1997     35767    76.8
# 4  2017     60062    78.5

# Another way is to use 'view()' to look at the data in a separate tab.

# view(us_data)

# Un-comment this line of code with Ctrl/Cmd Shift C. Then after running it,
# re-comment it using the same hotkeys. Why did I comment out view()?
# It doesn't play nicely with my koan tests (it opens a view tab each test,
# which will be slow and painful for you). It's best to always comment out
# view() calls in the koans.

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                         ----- Tibble Dimensions -----

# You can use the functions 'nrow()' and 'ncol' to find the dimensions of the
# tibble. For instance, this tells us that us_data has 3 rows:

nrow(us_data)

## [1] 3

# 2. How many columns does us_data have? ---------------------------------------

#2@

# ncol(us_data) == __

#@2

?qelp::nrow
?qelp::ncol

# In a tibble, columns are called *variables*. Rows are called *observations*.
# This is very important to understand. In our tibble, the *variables* (columns)
# are 'year','gdpPercap', and 'lifeExp': they're the things that were measured
# for each *observation* (row): for 1957, 1977, and 1997.

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                         ----- Variable Names -----

# You can get the variable (column) names of a tibble using 'names()':

# 3. What does names() return when you apply it to us_data? --------------------

#3@

# names(us_data) == c("year", "gdpPercap", __)

#@3

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#                         ----- Adding Rows -----

# Suppose you want to append a row (or several) to your dataset. You can do that
# using the function 'add_row()'. It takes the tibble as the first argument and
# then variable names and values to construct the new row. Suppose we got a new
# observation (row) about Canada: In 2007, GDP per capita in Canada was $36,319
# and life expectancy was 80.7 years.


#4@

# add_row(
#   canada_data,
#   year = __,
#   gdpPercap = __,
#   lifeExp = __
# )

#@4

# Make sure to read the qelp docs for 'tibble':
?qelp::tibble

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Great work! You're one step closer to tidyverse enlightenment. Make sure to
# return to this topic to meditate on it later.

# If you're ready, you can move on to the next koan: Piping.

K02_tibble.R

colleenobriant

2022-09-08