The output of most of the R chunks isn’t included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.

Setup

library(dplyr)
library(readr)

We know {dplyr} functions for working with a single data frame. {dplyr} also has functions for joining two data frames together.

Data

We’re going to use COVID-19 data by county and demographics by county for joining:

# county level data
demo <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_demographics.csv", show_col_types = FALSE)

cases <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_cases.csv", show_col_types = FALSE)

deaths <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_deaths.csv", show_col_types = FALSE)

# state level data
state_stats <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/state_covid.csv", show_col_types = FALSE)

I’ve introduced some missing data for the purposes of this workshop. Get the full COVID-19 case data from https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/. COVID-19 case and death data included here is as of early July 2020. Demographic variables are census estimates for July 2019.
To learn more about COVID-19 and race, see https://covidtracking.com/race; this is the source of the state-level COVID-19 statistics used here.

What’s in the data sets:

demo           # demographics by county
cases          # covid case counts by county
deaths         # covid death counts by county
state_stats.   # covid case and death counts by state and demographics

County FIPS Codes are unique 5-digit codes that represent specific US counties. The first two digits represent the county’s state and the last three digits represent the county. For example “17031” is the FIPS Code for Cook County, Illinois.

Adding Rows and Columns

Before we get to joins, let’s first cover adding rows or columns to a data frame.

{dplyr} includes bind_rows and bind_cols, which are roughly equivalent to base R rbind and cbind, but they are a bit easier to work with.

To demonstrate, let’s create some subsets of the data to work with first:

delaware <- filter(demo, state=="Delaware") %>%
  select(1:6)
hawaii <- filter(demo, state=="Hawaii") %>%
  select(1:4,7:9)
delaware
hawaii

We have some overlap in the column names, but the set of columns doesn’t match exactly. To combine these datasets, we want to bind the rows (each is a county) together:

# just printing, not saving...
bind_rows(delaware, hawaii)

bind_rows includes all columns the exist in either dataset and fills in missing values with NA. It matches columns by full name. If we tried this with rbind, we’d get an error, because the set of columns doesn’t match.

When would you really do this?

Maybe you have separate data files for each year of data, and you need to combine them. Or for each school, or state, or experiment iteration. Any case where the set of variables is similar across multiple data sets that all have the same unit of observation (rows are the same type of thing across data sets and you have similar variables).

For adding columns, we’d need to make sure that both data frames have the same number of rows and they appear in the same order. Generally, you don’t want to bind_cols. If you have some column that identifies observations in each data frame, you’ll want to join the datasets together instead, which is what we’re going to do next. If you are just trying to add a few columns of data, use mutate.

Why use join instead of bind_cols?

While it does happen, it’s dangerous to have multiple data sets with the same set of rows without some ID variable – where they’re just matched by the order of the rows; it’s too easy for the matching between datasets to get messed up.

Joins

The terminology for joining comes from SQL, which is used to interact with databases. And {dplyr} can work directly with SQL databases, translating {dplyr} commands into SQL, running them in the database, and then retrieving the results. That’s for a separate workshop, but if you do want to work with a database, know that you do not necessarily have to export the data in order to work with it in R. And you can write {dplyr} commands instead of SQL commands to work with the data.

Note: the equivalent function for joining in base R is merge().

Background: Keys

Joins work when you have two data frames (called tables in databases, but same idea), and you want to merge them together. Each data set needs to have some type of identifier, a key, that tells you how a row in dataset 1 should be matched into dataset 2.

For example, here, we have three county datasets, all of which have different data (different variables). In our case, each dataset includes an ID variable for each county. These IDs are the same across datasets because all 3 use the FIPS code from the US Census. So even though our datasets have different numbers of rows:

names(cases)
## [1] "countyFIPS" "county"     "state"      "stateFIPS"  "cases"
names(deaths)
## [1] "countyFIPS" "county"     "state"      "stateFIPS"  "deaths"
names(demo)
##  [1] "state"          "county"         "fips"           "total"         
##  [5] "white"          "black"          "native"         "asian"         
##  [9] "two"            "non_white"      "white_prop"     "black_prop"    
## [13] "native_prop"    "asian_prop"     "two_prop"       "non_white_prop"
nrow(cases)
## [1] 2830
nrow(deaths)
## [1] 3145
nrow(demo)
## [1] 3142

We have a key that will link rows in the different datasets together.

For example, look at a county that is in all three data sets:

filter(demo, fips == 1009)
filter(cases, countyFIPS == 1009)
filter(deaths, countyFIPS == 1009)

We want to combine the different variables from all 3 data frames together into a single one. We can’t just bind the columns together though, because the set of counties in each data set is different (some counties are missing from some datasets).

Example (we’ll talk through the code shortly):

cases %>%
  left_join(deaths, by="countyFIPS")

We also have a data set with state level data on COVID deaths. We might want to add this state level info to each county. In that case, you could use the state name to pull the correct state data from the data set for each row in the county data set.

# We'll talk through the code more below
cases %>%
  left_join(state_stats, by=c("state"="State"))

In this second example, each state in the state data set matches to multiple rows (multiple counties) in the county data set – this is OK and expected. You can have 1:1, 1:many, or many:many matches across data sets (although this last one can get messy! so much so, that {dplyr} will warn you when it happens).

There can also be cases where the combination of two different variables is the “key” – where both columns need to match for the data sets to be joined correctly.

There is also a more complicated case where you don’t have clean “key” variables but you still want to join two datasets. This requires fuzzy or approximate matching using variables in the two data sets, which is beyond the scope of what we can cover today. A case where this might happen is when you have country names (not ISO codes), and variants such as DRC vs. Democratic Republic of the Congo vs. Congo, DR (yes, this happens regularly).

Today, we’re assuming there is some clean key to match on where values match exactly.

EXERCISE 1

Import data about degrees awarded at Northwestern. What column(s) can be used as a key to join the datasets together?

Note: this isn’t a code answer - read in the data and take a look at the data frames to see what’s common in them.

bac <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/bac_schools.csv")

masters <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/masters_schools.csv")

doc <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/doc_schools.csv")

Background: Join Types

There are different ways to match up rows in two datasets. They vary based on what happens to rows that don’t match to the other data set (where the key from one data set isn’t present in the other).

https://tavareshugo.github.io/r-intro-tidyverse-gapminder/fig/07-dplyr_joins.svg

An inner join only includes in the result rows that matched in both datasets.

A left join includes all rows from the first (left) data set, filling in missing values where there isn’t a match in the second data set (right).

A right join does the opposite: keeps all rows from the second (right) data set, and fills in missing values where there isn’t a match in the fist.

An outer join keeps all rows from both data sets, filling in missing values where there aren’t matches.

An anti-join helps you find rows in a dataset that do NOT match to another.

Image Source: https://cghlewis.com/blog/joins/

Joins with {dplyr}

Each of the join types is a different function in {dplyr}: inner_join(), left_join(), right_join(), full_join() (the last one is an outer join). You use them all in the same way.

Let’s start with the first example above. I’m going to re-write it in a single line:

left_join(cases, deaths, by="countyFIPS")

The first input, cases, is the “left” table. The second, deaths, is the “right.” The by="countyFIPS" is telling it what key to use to join – in this case, there’s a column named “countyFIPS” in both data frames that should be used to match up rows.

In the output, it adds the suffix “.x” to any columns from the first data set where there’s a column of the same name in the second data set, and uses the suffix “.y” for the reverse. “countyFIPS” doesn’t get a suffix because it’s the variable we told it to join on. “cases” and “deaths” don’t get a suffix because there isn’t a column with those names in the other data set.

Note above and also below, I’m just joining and printing the resulting data frame. You could save the output (the resulting data frame) in a new variable though:

cases_deaths <- left_join(cases, deaths, by="countyFIPS")

Variations on by

If I don’t tell {dplyr} what the key variables are to use to join the tables, it will default to trying to join on any columns with the same name across data frames:

left_join(cases, deaths)

And tell me what it joined by in the output.

But it’s best to be explicit about how you want to join the tables. Just because two columns have the same name, doesn’t mean all of the data is the same! In the case of the cases and deaths data frames, they come from the same source. So all of these columns SHOULD match, but they don’t!

left_join(cases, deaths,
          by="countyFIPS") %>%
  filter(county.x != county.y)  
# use select(county.x, county.y) to see differences

If we matched by countyFIPS and county, we’d miss these matches.

Something to always watch out for! It’s usually best to join using the minimal set of columns needed to identify the correct matches.

With the cases and deaths data frames, they came from the same data source, so the key column has the same name in both data sets. In the demo data frame, the name of the column with the county FIPS code is different though:

names(demo)

To join with that, we use a named vector with by, with the form c("Name in first data set"="Name in second data set"):

left_join(cases, demo, by=c("countyFIPS"="fips"))

Here, where the column names are different, it just keeps one copy of the joining key column, and uses the column name from the first (left) data set.

We can also join on multiple columns, where each pair of columns we specify needs to match. For example, in the cases and deaths data, there are some of the same columns. We can require multiple columns to match by supplying a vector of names:

left_join(cases, deaths, by=c("countyFIPS", "state"))

This gives us a cleaner result without duplicate rows. But make sure that the values in these columns really should all match – if not, you’ll be missing matches.

EXERCISE 2

Join (left_join is fine) the cases data frame to the state_stats data frame using the state column in each (“state” in cases and “State” in state_stats); case of the variable name matters! (“state” is not equal to “State” as column names).

EXERCISE 3

Join the bac, masters, and doc data frames imported above into a single data frame.

You can string together multiple join commands with %>%

Unmatched Rows

A left join keeps all of the rows in the first data set, whether or not they match a row in the second. How do we find rows that didn’t match?

You can use anti_join() to find rows in one (the first) data frame that don’t match to a row in the second:

anti_join(deaths, cases, 
          by=c("countyFIPS")) 

This only returns the rows that don’t match.

Note that we don’t get any columns from the cases data frame because these are the rows from deaths that don’t match to cases. To see what rows in cases don’t match, switch the order:

anti_join(cases, deaths, 
          by=c("countyFIPS")) 

They all match! But remember, there are still rows in deaths that aren’t represented in the result.

If, instead of seeing the non-matching rows separately, you want to see which ones didn’t match in the context of the joined data frame, you can look for missing values in the joined columns.

left_join(deaths, cases, 
          by=c("countyFIPS")) %>%
  filter(is.na(cases))

Beware though, because if there are legitimately missing values in the column you check (here, cases), then you’ll get mixed results here.

EXERCISE 4

Find the rows in deaths that don’t have a match in demo Hint: join with deaths (column “countyFIPS”) and demo (column “fips”).

Beware Missing Values

Missing values: by default, {dplyr} matches missing (NA) as if it is a value. So if we have two data frames, and there are missing values in our key id variables, then they will match, even if there are multiple NAs.

a <- tibble(id=c(1, 2, NA), val=c(10, 20, 30))
b <- tibble(id=c(2, 3, NA, NA), val2 = c("a", "b", "c", "d"))
a
b
left_join(a, b, by="id")

If we want this to not happen, we can set the na_matches argument to “never”

left_join(a, b, by="id", 
          na_matches="never")

Sometimes, R will warn you that there there might be a problem.

c <- tibble(id=c(2, 3, NA, NA), val3 = c("w", "x", "y", "z"))
c

left_join(b, c, by="id")

This isn’t actually related to missing values, but it’s one of the times when this is most likely to happen by mistake. What it means is that there’s more than one row in the first data set that matches to the same row in the second data set AND there are rows in the first data set that each match to multiple rows in the second data set.

This can be legitimate, but it’s something you’d always want to check to confirm that the behavior is correct.

Cleanup

So you’ve joined two data sets, and now you have a bunch of columns with “.x” and “.y” (or other types of repeats). What do you do?

Option 1: select only the columns you really want in the result ahead of doing the join. NOTE: this is one of the very few times I might nest {dplyr} function calls

cases %>%
  select(countyFIPS, county, state, cases) %>%
  left_join(select(demo, fips, total, white, black),
            by=c("countyFIPS"="fips"))

Why nest the select call instead of saving a subset of the data first, and using that? Mostly for situations where the datasets are large, and I don’t want extra copies of the data frame in my environment. In this case, the data sets are small enough that I could do this instead:

demo_sub <- select(demo, fips, total, white, black)

cases %>%
  select(countyFIPS, county, state, cases) %>%
  left_join(demo_sub,
            by=c("countyFIPS"="fips"))

But I have demo_sub hanging around in my environment then (I like to keep my environment clean when possible!).

Option 2: use rename() or select() on the result of the join:

cases %>%
  left_join(demo,
            by=c("countyFIPS"="fips")) %>%
  select(countyFIPS, county=county.x, state=state.x, 
         cases, total, white, black)

EXERCISE 5

Create a data frame with only the following columns: countyFIPS, state, deaths, non_white population

Hint: use the demo and deaths data frames.

Learning More

See the Relational Data chapter in the book R for Data Science: https://r4ds.had.co.nz/relational-data.html

Recap

You now can use different {dplyr} join commands, such as inner_join, outer_join, left_join, right_join, anti_join, to join together your data in a wide variety of ways.

Next session: the {tidyr} functions for reshaping your data to follow the tidy format: pivot_longer, pivot_wider, separate and separate_rows

Answers to the exercises

EXERCISE 1

# View the data in each dataset
View(bac)
View(masters)
View(doc)

# Which columns are common?
c("school", "year") # set of common columns which together can be used as a Key

EXERCISE 2

# check column names
names(cases)
names(state_stats)

# use appropriate join key syntax
left_join(cases, state_stats, by=c("state"="State"))

EXERCISE 3

# check column names
names(bac)
names(masters)
names(doc)

# use appropriate join key syntax
bac %>% 
  left_join(masters, by=c("school", "year")) %>% 
  left_join(doc, by=c("school", "year"))

EXERCISE 4

# check column names
names(deaths)
names(demo)

# use appropriate join key syntax
anti_join(deaths, demo, by=c("countyFIPS"="fips"))

EXERCISE 5

# check column names
names(deaths)
names(demo)

# left join on deaths to automatically select countyFIPS colum
deaths %>% 
  left_join(demo, by=c("countyFIPS"="fips")) %>% 
  select(state=state.x, deaths, non_white) # rename state column while selecting
---
title: 'dplyr: join'
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 2
editor_options:
  chunk_output_type: console
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(eval=FALSE, warning=FALSE, error=FALSE, message=FALSE)
```

*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*

# Setup

```{r, eval=TRUE}
library(dplyr)
library(readr)
```

We know `{dplyr}` functions for working with a single data frame. `{dplyr}` also has functions for joining two data frames together.

# Data

We're going to use COVID-19 data by county and demographics by county for joining:

```{r, eval=TRUE}
# county level data
demo <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_demographics.csv", show_col_types = FALSE)

cases <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_cases.csv", show_col_types = FALSE)

deaths <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_deaths.csv", show_col_types = FALSE)

# state level data
state_stats <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/state_covid.csv", show_col_types = FALSE)
```

I've introduced some missing data for the purposes of this workshop. Get the full COVID-19 case data from <https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/>. COVID-19 case and death data included here is as of early July 2020. Demographic variables are census estimates for July 2019.\
To learn more about COVID-19 and race, see <https://covidtracking.com/race>; this is the source of the state-level COVID-19 statistics used here.

What's in the data sets:

```{r}
demo           # demographics by county
cases          # covid case counts by county
deaths         # covid death counts by county
state_stats.   # covid case and death counts by state and demographics
```

County FIPS Codes are unique 5-digit codes that represent specific US counties. The first two digits represent the county's state and the last three digits represent the county. For example "17031" is the FIPS Code for Cook County, Illinois.

# Adding Rows and Columns

Before we get to joins, let's first cover adding rows or columns to a data frame.

`{dplyr}` includes `bind_rows` and `bind_cols`, which are roughly equivalent to base R `rbind` and `cbind`, but they are a bit easier to work with.

To demonstrate, let's create some subsets of the data to work with first:

```{r, eval=TRUE}
delaware <- filter(demo, state=="Delaware") %>%
  select(1:6)
hawaii <- filter(demo, state=="Hawaii") %>%
  select(1:4,7:9)
```

```{r}
delaware
hawaii
```

We have some overlap in the column names, but the set of columns doesn't match exactly. To combine these datasets, we want to bind the rows (each is a county) together:

```{r, eval=TRUE}
# just printing, not saving...
bind_rows(delaware, hawaii)
```

`bind_rows` includes all columns the exist in either dataset and fills in missing values with `NA`. It matches columns by full name. If we tried this with `rbind`, we'd get an error, because the set of columns doesn't match.

**When would you really do this?**

Maybe you have separate data files for each year of data, and you need to combine them. Or for each school, or state, or experiment iteration. Any case where the set of **variables** is **similar across multiple data sets** that all have the same unit of observation (rows are the same type of thing across data sets and you have similar variables).

For **adding columns**, we'd need to make sure that **both data frames have the same number of rows** and they appear in the same order. Generally, you don't want to `bind_cols`. If you have some column that identifies observations in each data frame, you'll want to join the datasets together instead, which is what we're going to do next. If you are just trying to add a few columns of data, use `mutate`.

**Why use join instead of `bind_cols`?**

While it does happen, it's dangerous to have multiple data sets with the same set of rows without some ID variable -- where they're just matched by the order of the rows; it's too easy for the matching between datasets to get messed up.

# Joins

The terminology for joining comes from SQL, which is used to interact with databases. And `{dplyr}` can work directly with SQL databases, translating `{dplyr}` commands into SQL, running them in the database, and then retrieving the results. That's for a separate workshop, but if you do want to work with a database, know that you do not necessarily have to export the data in order to work with it in R. And you can write `{dplyr}` commands instead of SQL commands to work with the data.

**Note**: the equivalent function for joining in base R is `merge()`.

## Background: Keys

Joins work when you have two data frames (called tables in databases, but same idea), and you want to merge them together. Each data set needs to have some type of identifier, a key, that tells you how a row in dataset 1 should be matched into dataset 2.

For example, here, we have three county datasets, all of which have different data (different variables). In our case, each dataset includes an ID variable for each county. These IDs are the same across datasets because all 3 use the FIPS code from the US Census. So even though our datasets have different numbers of rows:

```{r, eval=TRUE}
names(cases)
names(deaths)
names(demo)

nrow(cases)
nrow(deaths)
nrow(demo)
```

We have a key that will link rows in the different datasets together.

For example, look at a county that is in all three data sets:

```{r, eval=TRUE}
filter(demo, fips == 1009)
filter(cases, countyFIPS == 1009)
filter(deaths, countyFIPS == 1009)
```

We want to combine the different variables from all 3 data frames together into a single one. We can't just bind the columns together though, because the set of counties in each data set is different (some counties are missing from some datasets).

Example (we'll talk through the code shortly):

```{r, eval=TRUE}
cases %>%
  left_join(deaths, by="countyFIPS")
```

We also have a data set with state level data on COVID deaths. We might want to add this state level info to each county. In that case, you could use the state name to pull the correct state data from the data set for each row in the county data set.

```{r, eval=TRUE}
# We'll talk through the code more below
cases %>%
  left_join(state_stats, by=c("state"="State"))
```

In this second example, each state in the state data set matches to multiple rows (multiple counties) in the county data set -- this is OK and expected. You can have 1:1, 1:many, or many:many matches across data sets (although this last one can get messy! so much so, that `{dplyr}` will warn you when it happens).

There can also be cases where the combination of two different variables is the "key" -- where both columns need to match for the data sets to be joined correctly.

There is also a more complicated case where you don't have clean "key" variables but you still want to join two datasets. This requires fuzzy or approximate matching using variables in the two data sets, which is beyond the scope of what we can cover today. A case where this might happen is when you have country names (not ISO codes), and variants such as DRC vs. Democratic Republic of the Congo vs. Congo, DR (yes, this happens regularly).

Today, we're assuming there is some clean key to match on where values match exactly.

### EXERCISE 1

Import data about degrees awarded at Northwestern. What column(s) can be used as a key to join the datasets together?

Note: this isn't a code answer - read in the data and take a look at the data frames to see what's common in them.

```{r}
bac <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/bac_schools.csv")

masters <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/masters_schools.csv")

doc <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/doc_schools.csv")
```

## Background: Join Types

There are different ways to match up rows in two datasets. They vary based on what happens to rows that don't match to the other data set (where the key from one data set isn't present in the other).

<https://tavareshugo.github.io/r-intro-tidyverse-gapminder/fig/07-dplyr_joins.svg>

An **inner** join only includes in the result rows that matched in both datasets.

A **left** join includes all rows from the first (left) data set, filling in missing values where there isn't a match in the second data set (right).

A **right** join does the opposite: keeps all rows from the second (right) data set, and fills in missing values where there isn't a match in the fist.

An **outer** join keeps all rows from both data sets, filling in missing values where there aren't matches.

An **anti**-join helps you find rows in a dataset that do NOT match to another.

![](images/join_types.png)

Image Source: <https://cghlewis.com/blog/joins/>

## Joins with `{dplyr}`

Each of the join types is a different function in `{dplyr}`: `inner_join()`, `left_join()`, `right_join()`, `full_join()` (the last one is an outer join). You use them all in the same way.

Let's start with the first example above. I'm going to re-write it in a single line:

```{r}
left_join(cases, deaths, by="countyFIPS")
```

The first input, cases, is the "left" table. The second, deaths, is the "right." The `by="countyFIPS"` is telling it what key to use to join -- in this case, there's a column named "countyFIPS" in both data frames that should be used to match up rows.

In the output, it adds the suffix ".x" to any columns from the first data set where there's a column of the same name in the second data set, and uses the suffix ".y" for the reverse. "countyFIPS" doesn't get a suffix because it's the variable we told it to join on. "cases" and "deaths" don't get a suffix because there isn't a column with those names in the other data set.

Note above and also below, I'm just joining and printing the resulting data frame. You could save the output (the resulting data frame) in a new variable though:

```{r}
cases_deaths <- left_join(cases, deaths, by="countyFIPS")
```

### Variations on by

If I don't tell `{dplyr}` what the key variables are to use to join the tables, it will default to trying to join on any columns with the same name across data frames:

```{r}
left_join(cases, deaths)
```

And tell me what it joined by in the output.

But it's best to be explicit about how you want to join the tables. Just because two columns have the same name, doesn't mean all of the data is the same! In the case of the `cases` and `deaths` data frames, they come from the same source. So all of these columns *SHOULD* match, but they don't!

```{r}
left_join(cases, deaths,
          by="countyFIPS") %>%
  filter(county.x != county.y)  
# use select(county.x, county.y) to see differences
```

If we matched by countyFIPS and county, we'd miss these matches.

Something to always watch out for! It's usually best to join using the minimal set of columns needed to identify the correct matches.

With the `cases` and `deaths` data frames, they came from the same data source, so the key column has the same name in both data sets. In the `demo` data frame, the name of the column with the county FIPS code is different though:

```{r}
names(demo)
```

To join with that, we use a named vector with `by`, with the form `c("Name in first data set"="Name in second data set")`:

```{r}
left_join(cases, demo, by=c("countyFIPS"="fips"))
```

Here, where the column names are different, it just keeps one copy of the joining key column, and uses the column name from the first (left) data set.

We can also join on multiple columns, where each pair of columns we specify needs to match. For example, in the cases and deaths data, there are some of the same columns. We can require multiple columns to match by supplying a vector of names:

```{r}
left_join(cases, deaths, by=c("countyFIPS", "state"))
```

This gives us a cleaner result without duplicate rows. But make sure that the values in these columns really should all match -- if not, you'll be missing matches.

### EXERCISE 2

Join (`left_join` is fine) the `cases` data frame to the `state_stats` data frame using the state column in each ("state" in `cases` and "State" in `state_stats`); case of the variable name matters! ("state" is not equal to "State" as column names).

```{r}


```

### EXERCISE 3

Join the bac, masters, and doc data frames imported above into a single data frame.

You can string together multiple join commands with %\>%

```{r}

```

## Unmatched Rows

A left join keeps all of the rows in the first data set, whether or not they match a row in the second. How do we find rows that didn't match?

You can use `anti_join()` to find rows in one (the first) data frame that don't match to a row in the second:

```{r}
anti_join(deaths, cases, 
          by=c("countyFIPS")) 
```

This only returns the rows that don't match.

Note that we don't get any columns from the `cases` data frame because these are the rows from `deaths` that don't match to cases. To see what rows in `cases` don't match, switch the order:

```{r}
anti_join(cases, deaths, 
          by=c("countyFIPS")) 
```

They all match! But remember, there are still rows in deaths that aren't represented in the result.

If, instead of seeing the non-matching rows separately, you want to see which ones didn't match in the context of the joined data frame, you can look for missing values in the joined columns.

```{r}
left_join(deaths, cases, 
          by=c("countyFIPS")) %>%
  filter(is.na(cases))
```

Beware though, because if there are legitimately missing values in the column you check (here, cases), then you'll get mixed results here.

### EXERCISE 4

Find the rows in `deaths` that don't have a match in `demo` Hint: join with `deaths` (column "countyFIPS") and `demo` (column "fips").

```{r}


```

## Beware Missing Values

Missing values: by default, `{dplyr}` matches missing (`NA`) as if it is a value. So if we have two data frames, and there are missing values in our key id variables, then they will match, even if there are multiple `NA`s.

```{r}
a <- tibble(id=c(1, 2, NA), val=c(10, 20, 30))
b <- tibble(id=c(2, 3, NA, NA), val2 = c("a", "b", "c", "d"))
a
b
```

```{r}
left_join(a, b, by="id")
```

If we want this to not happen, we can set the `na_matches` argument to "never"

```{r}
left_join(a, b, by="id", 
          na_matches="never")
```

Sometimes, R will warn you that there there might be a problem.

```{r}
c <- tibble(id=c(2, 3, NA, NA), val3 = c("w", "x", "y", "z"))
c

left_join(b, c, by="id")
```

This isn't actually related to missing values, but it's one of the times when this is most likely to happen by mistake. What it means is that there's more than one row in the first data set that matches to the same row in the second data set AND there are rows in the first data set that each match to multiple rows in the second data set.

This can be legitimate, but it's something you'd always want to check to confirm that the behavior is correct.

## Cleanup

So you've joined two data sets, and now you have a bunch of columns with ".x" and ".y" (or other types of repeats). What do you do?

**Option 1**: select only the columns you really want in the result ahead of doing the join. NOTE: this is one of the very few times I might nest `{dplyr}` function calls

```{r}
cases %>%
  select(countyFIPS, county, state, cases) %>%
  left_join(select(demo, fips, total, white, black),
            by=c("countyFIPS"="fips"))
```

Why nest the `select` call instead of saving a subset of the data first, and using that? Mostly for situations where the datasets are large, and I don't want extra copies of the data frame in my environment. In this case, the data sets are small enough that I could do this instead:

```{r}
demo_sub <- select(demo, fips, total, white, black)

cases %>%
  select(countyFIPS, county, state, cases) %>%
  left_join(demo_sub,
            by=c("countyFIPS"="fips"))
```

But I have `demo_sub` hanging around in my environment then (I like to keep my environment clean when possible!).

**Option 2**: use `rename()` or `select()` on the result of the join:

```{r}
cases %>%
  left_join(demo,
            by=c("countyFIPS"="fips")) %>%
  select(countyFIPS, county=county.x, state=state.x, 
         cases, total, white, black)
```

### EXERCISE 5

Create a data frame with only the following columns: countyFIPS, state, deaths, non_white population

Hint: use the `demo` and `deaths` data frames.

```{r}

```

# Learning More

See the Relational Data chapter in the book R for Data Science: <https://r4ds.had.co.nz/relational-data.html>

# Recap

You now can use different `{dplyr}` join commands, such as `inner_join`, `outer_join`, `left_join`, `right_join`, `anti_join`, to join together your data in a wide variety of ways.

Next session: the `{tidyr}` functions for reshaping your data to follow the tidy format: `pivot_longer`, `pivot_wider`, `separate` and `separate_rows`

# Answers to the exercises

### EXERCISE 1

```{r}
# View the data in each dataset
View(bac)
View(masters)
View(doc)

# Which columns are common?
c("school", "year") # set of common columns which together can be used as a Key
```

### EXERCISE 2

```{r}
# check column names
names(cases)
names(state_stats)

# use appropriate join key syntax
left_join(cases, state_stats, by=c("state"="State"))
```

### EXERCISE 3

```{r}
# check column names
names(bac)
names(masters)
names(doc)

# use appropriate join key syntax
bac %>% 
  left_join(masters, by=c("school", "year")) %>% 
  left_join(doc, by=c("school", "year"))
```

### EXERCISE 4

```{r}
# check column names
names(deaths)
names(demo)

# use appropriate join key syntax
anti_join(deaths, demo, by=c("countyFIPS"="fips"))
```

### EXERCISE 5

```{r}
# check column names
names(deaths)
names(demo)

# left join on deaths to automatically select countyFIPS colum
deaths %>% 
  left_join(demo, by=c("countyFIPS"="fips")) %>% 
  select(state=state.x, deaths, non_white) # rename state column while selecting
```
