The output of most of the R chunks isn’t included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.

This is an R Markdown document. Follow the link to learn more about R Markdown and the notebook format used during the workshop.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The startup message tells you which packages were actually loaded: installing tidyverse installs ~25 packages, but library(tidyverse) only loads the core ones listed. Tidyverse packages also tend to be verbose in warning you when functions with the same name exist in multiple packages, as in the Conflicts section of the message.
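As a small sketch of what the Conflicts message means in practice, the example below calls both versions of filter() with an explicit package prefix. The toy tibble and its column x are made up for illustration and are not part of the workshop data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not part of the workshop dataset
toy <- tibble(x = 1:5)

# After library(dplyr), an unqualified filter() refers to dplyr's version
kept <- filter(toy, x > 3)

# Writing the package name explicitly gives the same result and removes any ambiguity
kept_explicit <- dplyr::filter(toy, x > 3)

# The masked function is still available through its namespace prefix.
# stats::filter() applies a linear filter (here a 2-point moving average) to a numeric series.
smoothed <- stats::filter(1:5, rep(1 / 2, 2))

kept
smoothed
```

Masking only changes which function an unqualified name points to; nothing is removed from R.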

Background

Tidyverse packages do a few things:

Tidyverse packages are particularly useful for:

Data

Let’s import the data we’ll be using. The data is from the Stanford Open Policing Project and includes vehicle stops by the Evanston police in 2017. We’re reading the data in from a URL directly.

We’re going to use the read_csv function from the readr package, which is part of the tidyverse. The read_csv function works like read.csv except it has some different defaults, guesses data types a bit differently, and produces a tibble instead of a normal data frame (details coming).

police <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/ev_police.csv")
## Rows: 14792 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): beat, subject_race, subject_sex, department_name, type, violation...
## dbl   (8): raw_row_number, location, department_id, vehicle_year, raw_Driver...
## lgl   (9): subject_age, citation_issued, warning_issued, contraband_found, c...
## date  (1): date
## time  (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The output message tells you what data type read_csv guessed for each column based on the format of the values. “chr” is character or text data, “dbl” is numeric (short for double, a technical term for a type of number), and “lgl” is logical/Boolean (TRUE/FALSE). Note that it also automatically identified the date and time values and converted them to date and time objects rather than leaving them as string/character data.

We can also manually specify column types for cases where the guess that read_csv makes is wrong. We use the col_types argument (similar to colClasses for read.csv). Let’s make location character data, since it holds zip codes, and zip codes should not be treated as numbers.

police <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/ev_police.csv",
                   col_types=c("location"="c"))
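To try col_types without downloading anything, here is a minimal sketch on a made-up two-row CSV. The column names zip and stops and their values are hypothetical; I() (available in readr 2.0 and later) marks the string as literal data rather than a file path. The example also shows the more explicit cols() spelling next to the compact one-letter form used above.

```r
library(readr)

# Hypothetical inline CSV; I() tells readr to treat the string as data, not a path
csv_text <- I("zip,stops\n60201,12\n02139,3\n")

# Compact form: one-letter type codes by column name ("c" = character)
zips_compact <- read_csv(csv_text, col_types = c(zip = "c"))

# More explicit form of the same request, using cols() and col_character()
zips_explicit <- read_csv(csv_text, col_types = cols(zip = col_character()))

# Stored as text, so the leading zero in the second zip code is kept
zips_explicit$zip
```

Columns that are not named in col_types are still guessed, so only the columns you care about need to be listed.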

EXERCISE 1

Remember: you need to have loaded tidyverse, so execute the cells above.

We have a dataset that includes ISO two-letter country codes. The country code for Namibia is NA, so we don’t want to read “NA” in as missing, which it does by default (see how “NA” is grayed out in the output below for the Namibia country code?).

Look at the documentation (help) page for read_csv. You can open it by typing ?read_csv in the console. The na argument determines what values are imported as missing NA.

Change the code below so that only empty strings “” and “N/A” values are imported as missing (not “NA”). Look at fix_data after importing so you can check the values.

fix_data <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/missing.csv")
## Rows: 8 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country_name, country_code, continent
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fix_data

You can write your code here:

Tibbles

You may have noticed above that read_csv imported the data as something called a tibble. Tibbles are the tidyverse version of a data frame. You can use them as you would a data frame (they are one), but they behave in slightly different ways.

police

The most observable difference is that tibbles will only print 10 rows and the columns that will fit in your console. When they print, they print a list of column names and the types of the columns that are shown.

To view the dataset, use View():

View(police)

When using [] notation to subset them, they will always return a tibble. In contrast, data frames sometimes return a data frame and sometimes return just a vector.

police[, 1]
as.data.frame(police)[, 1]
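The same difference can be checked on a tiny example. This is a sketch using a made-up tibble (the columns id and name are hypothetical), and it also shows how to get a vector out of a tibble when that is what you want.

```r
library(tibble)

# Hypothetical toy data
tb <- tibble(id = 1:3, name = c("a", "b", "c"))
df <- as.data.frame(tb)

# Single-column [ ] on a tibble still gives a tibble
is_tibble(tb[, 1])

# The same call on a plain data frame drops down to a vector
is.data.frame(df[, 1])

# drop = FALSE asks a plain data frame to stay a data frame
is.data.frame(df[, 1, drop = FALSE])

# To pull a column out of a tibble as a vector, use [[ ]] or $
tb[["id"]]
tb$name
```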

dplyr

dplyr is the core package of the tidyverse. It includes functions for working with tibbles (or any data frames). While you can still use base R operations on tibbles/data frames, such as using $ and [] subsetting like we did above, dplyr provides alternatives to all of the common data manipulation tasks.

Here, we’re just going to look at the basics of subsetting data to get a feel for how tidyverse functions typically work. Next session, we’ll get into variations on subsetting data and some other dplyr functions.

Before we start, let’s remember what columns are in our data:

names(police)

select

The select() function lets us choose which columns (or variables) we want to keep in our data.

The data frame is the first input, and the name of the column is the second. We do not have to put quotes around the column name.

select(police, subject_race)

If we want to select additional columns, we can just list the column names as additional inputs, each column name separated by commas:

select(police, subject_race, outcome)

As with [] indexing, columns will be returned in the order specified:

select(police, subject_sex, subject_race, date)

We could also use the column index number if we wanted to instead. We don’t need to put the values in c() like we would with [] (but we could).

select(police, 1, 4, 10)
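As a quick check that names and positions are interchangeable, the sketch below selects the same columns three ways from a made-up tibble (the columns x, y, and z are hypothetical).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy <- tibble(x = 1:2, y = c("a", "b"), z = c(TRUE, FALSE))

by_name  <- select(toy, z, x)      # columns come back in the order requested
by_index <- select(toy, 3, 1)      # positions refer to the current column order
by_vec   <- select(toy, c(3, 1))   # wrapping positions in c() is allowed but not required

names(by_name)
names(by_index)
names(by_vec)
```

Names are usually the safer choice, since positions silently point at different columns if the column order of the data changes.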

Yes, there are other ways to specify which columns you want. We’ll cover those next session.

EXERCISE 2

Remember: you need to have loaded tidyverse, and the police data, so execute the cells above.

Convert this base R expression: police[,c("violation", "citation_issued", "warning_issued")] to use select() instead to do the same thing. You can write your code here:

filter

To choose which rows should remain in our data, we use filter(). As with [], we write expressions that evaluate to TRUE or FALSE for each row. Like select(), we can use the column names without quotes.

filter(police, location == "60202")

Note that we use == to test for equality and get TRUE/FALSE output. You can also write more complicated expressions – anything that will evaluate to a vector of TRUE/FALSE values.

filter(police, is.na(beat))

Variables (columns) that are already logical (TRUE/FALSE values), can be used to filter:

filter(police, contraband_found)
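One detail worth knowing when comparing filter() with [] subsetting is how missing values are handled. The sketch below uses a made-up data set (the columns race and searched are hypothetical) and a plain data frame for the base R side of the comparison.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with a missing value in each column
toy <- tibble(
  race     = c("white", NA, "black", "white"),
  searched = c(TRUE, FALSE, NA, FALSE)
)
toy_df <- as.data.frame(toy)

# filter() keeps only rows where the condition is TRUE; rows where it is NA are dropped
with_filter <- filter(toy, race == "white")

# Base [ ] on a data frame returns a row of NAs wherever the condition evaluates to NA
with_brackets <- toy_df[toy_df$race == "white", ]

nrow(with_filter)
nrow(with_brackets)

# To keep the rows with missing values, ask for them explicitly
with_missing <- filter(toy, race == "white" | is.na(race))
nrow(with_missing)
```

The same rule applies to logical columns: filter(toy, searched) keeps only the rows where searched is TRUE, not the row where it is NA.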

EXERCISE 3

Use filter() to choose the rows where subject_race is “white”.

The equivalent base R expression would be police[police$subject_race == "white",]. You can write your code here:

slice

Unlike select(), we can’t use row numbers to index which rows we want with filter. This gives an error:

filter(police, 10)

If we did need to use the row index (row number) to select which rows we want, we can use the slice() function.

slice(police, 10)
slice(police, 10:15)
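For comparison with base R row indexing, here is a short sketch on a made-up tibble (the columns stop_id and outcome are hypothetical) showing that slice() picks rows purely by position.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy <- tibble(stop_id = 101:105, outcome = c("w", "c", "w", "w", "c"))

one_row   <- slice(toy, 2)      # the second row
some_rows <- slice(toy, 2:4)    # rows two through four
base_rows <- toy[2:4, ]         # the base R way of asking for the same rows

some_rows$stop_id
base_rows$stop_id
```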

We don’t usually use slice() in this way when working with dplyr. This is because we ideally want to be working with well-structured data, where we can reorder the rows without losing information. If reordering the rows in the dataset would result in a loss of information (it would mess up your data), then the dataset is missing an important variable – maybe just a sequence index. You should always be able to use a variable to order the data if needed.

Pipe: Chaining Commands Together

So, we can choose rows and choose columns separately; how do we combine these operations? dplyr commands, and other tidyverse commands, can be strung together in a series with the %>% (say/read: pipe) operator. The pipe takes the output of the command on the left and makes that the first input to the command on the right. (If you are familiar with working in a terminal/at the command line, it works like a bash pipe character |.)

This works because the functions all take a data frame as the first input, and they return a data frame as the output.

We can rewrite

select(police, date, time)

as

police %>% select(date, time)

and you’ll often see code formatted so that %>% is at the end of each line, and the following lines that are still part of the same expression are indented:

police %>%
  select(date, time)

The pipe comes from a package called magrittr, which has additional special operators in it that you can use. In RStudio, the keyboard shortcut for %>% is command-shift-M (Mac) or control-shift-M (Windows).

We can use the pipe to string together multiple commands operating on the same data frame:

police %>%
  select(subject_race, subject_sex) %>%
  filter(subject_race == "white")

We would read the %>% in the command above as “then” if reading the code out loud: from police, select subject_race and subject_sex, then filter where subject_race is white.

This works because the dplyr functions take a tibble/data frame as the first argument (input) and return a tibble/data frame as the output. This makes it easy to pass a data frame through multiple operations, changing it one step at a time.
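To make the "first argument" idea concrete, the sketch below writes the same two-step operation as nested function calls and as a pipeline, using a made-up tibble (the columns race, sex, and age are hypothetical).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy <- tibble(
  race = c("white", "black", "white"),
  sex  = c("female", "male", "male"),
  age  = c(30, 41, 25)
)

# Nested calls: read from the inside out
nested <- filter(select(toy, race, sex), race == "white")

# Piped calls: read top to bottom, each result becomes the first argument of the next call
piped <- toy %>%
  select(race, sex) %>%
  filter(race == "white")

all.equal(nested, piped)
```

Both versions do the same work; the pipeline simply puts the steps in the order they happen.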

Order does matter, as the commands are executed in order. So this would give us an error:

police %>%
  select(subject_sex, outcome) %>%
  filter(subject_race == "white")

This fails because subject_race is no longer in the data frame by the time we try to filter with it. We’d have to reverse the order:

police %>%
  filter(subject_race == "white") %>%
  select(subject_sex, outcome)
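The sketch below reproduces the ordering problem on a made-up tibble (the columns race, sex, and outcome are hypothetical) and uses tryCatch() so the error is captured instead of stopping the script. It assumes there is no separate object called race in your workspace; if there were, filter() would find that object instead of raising an error.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy <- tibble(
  race    = c("white", "black", "white"),
  sex     = c("female", "male", "male"),
  outcome = c("warning", "citation", "warning")
)

# Wrong order: race is dropped by select() before filter() needs it
wrong_order <- tryCatch(
  toy %>%
    select(sex, outcome) %>%
    filter(race == "white"),
  error = function(e) "failed: race is not available after select()"
)

# Right order: filter on race first, then drop it
right_order <- toy %>%
  filter(race == "white") %>%
  select(sex, outcome)

wrong_order
right_order
```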

You can use the pipe operator to string together commands outside of the tidyverse as well, and it works with any input and output, not just data frames:

# sum(is.na(police$beat))
is.na(police$beat) %>% sum()

EXERCISE 4

Select the date, time, and outcome (columns) of stops that occur in beat “71” (rows). Make use of the %>% operator.

A close base R equivalent would be: police[police$beat == "71", c("date", "time", "outcome")] (unlike filter(), base R subsetting also returns a row of NAs for each stop where beat is missing).

Hint: remember that a column needs to still be in the data frame if you’re going to use the column to filter.

Note that so far, we haven’t actually changed the police data frame at all. We’ve written expressions to give us output, but we haven’t saved it.

Sometimes we may still want to save the result of some expression, such as after performing a bunch of data cleaning steps. We can assign the output of piped commands as we would with any other expression.

police60201 <- police %>%
  filter(location == "60201") %>%
  select(date, time, beat, type, outcome) 
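A small sketch of the difference between printing a result and saving it, using a made-up tibble (the columns zip and outcome and the name toy_60201 are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy <- tibble(
  zip     = c("60201", "60202", "60201"),
  outcome = c("warning", "citation", "warning")
)

# Without assignment, the result is printed and then discarded
toy %>% filter(zip == "60201")

# With assignment, the result is stored under a new name
toy_60201 <- toy %>%
  filter(zip == "60201") %>%
  select(outcome)

# The original data is unchanged either way
dim(toy)
dim(toy_60201)
```

Saving to a new name, rather than overwriting the original, makes it easy to go back to the full data later.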

EXERCISE 5

Select only vehicle_year and vehicle_make columns for observations where there were contraband_weapons. You can write your code here:

Recap

We learned what tibbles are, the dplyr equivalents of indexing and subsetting a data frame, and the pipe %>% operator.

Next time we’re going to look at some more complicated use cases for select, filter, and slice, as well as learn mutate to create or change variables in our datasets.

Answers to the exercises

Exercise 1

fix_data <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/missing.csv", na = c("", "N/A"))
fix_data
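If you want to check the effect of the na argument without downloading the file, here is a sketch on made-up inline data. The rows below are hypothetical and only mimic the structure of the exercise file; I() (readr 2.0 and later) marks the string as literal data.

```r
library(readr)

# Hypothetical inline CSV: one real "NA" country code and one "N/A" placeholder
csv_text <- I("country_name,country_code\nNamibia,NA\nUnknown,N/A\n")

checked <- read_csv(
  csv_text,
  na = c("", "N/A"),          # only these strings count as missing
  show_col_types = FALSE      # quiet the column specification message
)

checked$country_code          # Namibia keeps the text "NA"; "N/A" becomes missing
is.na(checked$country_code)
```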

Exercise 2

select(police, violation, citation_issued, warning_issued)

Exercise 3

filter(police, subject_race == "white")

Exercise 4

police %>% 
  filter(beat == "71") %>% 
  select(date, time, outcome)

Exercise 5

police %>% 
  filter(contraband_weapons == TRUE) %>% 
  select(vehicle_year, vehicle_make)

You can also write:

police %>% 
  filter(contraband_weapons) %>% 
  select(vehicle_year, vehicle_make)
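The two answers behave the same way even when the logical column has missing values, since filter() drops rows where the condition is NA in both cases. A quick sketch on made-up data (the columns make and weapons are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with a missing value in the logical column
toy <- tibble(
  make    = c("Ford", "Honda", "Toyota", "Kia"),
  weapons = c(TRUE, FALSE, NA, TRUE)
)

explicit <- toy %>% filter(weapons == TRUE) %>% select(make)
bare     <- toy %>% filter(weapons) %>% select(make)

explicit$make
bare$make
```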
