Note: some output from the code is not included here to keep the file size smaller. Run the code to see the output.
This is an R Markdown document. Follow the link to learn more about R Markdown and the notebook format used during the workshop.
Unlike other sessions, we don’t have exercises today - just a demo of what some of the packages are and how you might use them.
Some packages installed by tidyverse don’t load when you do library(tidyverse). We’ll load them explicitly:
library(tidyverse)
library(lubridate)
library(readxl)
# stringr and forcats already loaded
List the tidyverse packages:
tidyverse_packages()
##  [1] "broom"      "cli"        "crayon"     "dbplyr"     "dplyr"     
##  [6] "forcats"    "ggplot2"    "haven"      "hms"        "httr"      
## [11] "jsonlite"   "lubridate"  "magrittr"   "modelr"     "pillar"    
## [16] "purrr"      "readr"      "readxl"     "reprex"     "rlang"     
## [21] "rstudioapi" "rvest"      "stringr"    "tibble"     "tidyr"     
## [26] "xml2"       "tidyverse"
While there are many utility packages that make working with a particular type of data easier, or solve a particular class of problem, today we’ll look at just 4: readxl, lubridate, forcats, and stringr.
We’ll start with the most straightforward. readxl is for reading directly from Excel files – no need to export to CSV first.
Let’s download the spreadsheet we’ll use first. Normally you’d have the file already on your computer, but for the workshop, this makes it easier for everyone to get it in the right location.
download.file("https://github.com/nuitrcs/r-tidyverse/raw/master/data/session6data.xlsx",
              destfile = "session6data.xlsx")
You should now have a file called session6data.xlsx in your working directory.
Now, we can read it much like we’d read a CSV file. By default, it will read the first sheet in the file, but we can specify the sheet by number or name.
registrations <- read_excel("session6data.xlsx",
                            sheet = "registrations")
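If you don't know what sheets a file contains, excel_sheets() will list them (assuming the file is in your working directory, as above):
excel_sheets("session6data.xlsx")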
If we look at registrations, you can see that, like read_csv(), read_excel() reads dates into a date type directly and doesn't convert character data to factors:
head(registrations)
The first package for dealing with a specific data type we’ll cover is lubridate, which is for working with dates and datetimes (dates plus a time component). It has a series of functions that are named like:
ymd()
ymd_hms()
dmy()
dmy_h()
mdy()
And so on, where y = year, m (in the date part) = month, d = day, h = hour, m (in the time part) = minute, and s = second. The function name specifies which parts of a date or time appear in the text you're converting, and in what order. You can ignore the delimiters entirely: dashes, slashes, other characters, or even no delimiter at all. lubridate can handle most cases:
mdy("1/13/2020","1/13/20", "01132020", "1-13-2020", "Jan 13 2020", "Jan. 13, 2020")
## [1] "2020-01-13" "2020-01-13" "2020-01-13" "2020-01-13" "2020-01-13"
## [6] "2020-01-13"
Without lubridate, you have to specify the format of the datetime object with POSIX standards (see the help page for strftime).
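For comparison, here's the base R equivalent (a minimal sketch; the format codes have to be spelled out by hand):
# base R: the format string must match the input exactly
as.Date("1/13/2020", format = "%m/%d/%Y")
## [1] "2020-01-13"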
There are also functions for guessing the format of a datetime.
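For example, parse_date_time() takes a vector of candidate component orders and guesses which one applies to each value (a small sketch, not from the workshop files):
parse_date_time(c("1/13/2020", "2020-01-13"), orders = c("mdy", "ymd"))
## [1] "2020-01-13 UTC" "2020-01-13 UTC"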
There are likewise functions for extracting the pieces of a date or datetime object:
registrations %>%
  select(date) %>%
  mutate(day = day(date),
         month = month(date),
         monthname = month(date, label=TRUE),
         hour = hour(date),
         wday = wday(date, label=TRUE))
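There are also functions for rounding dates and datetimes up or down to a given unit: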
round_date(today(), "month")
## [1] "2021-02-01"
floor_date(today(), "month")
## [1] "2021-01-01"
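lubridate also makes date arithmetic straightforward with period helpers such as days(), months(), and years() (a small sketch, not tied to the workshop data):
ymd("2020-01-13") + days(7)
## [1] "2020-01-20"
ymd("2020-01-13") + months(3)
## [1] "2020-04-13"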
If you need just times, or additional time functions, see the hms package.
forcats is for working with factors – categorical data. I find it particularly useful in situations where I’m creating data visualizations, either plots or tables, containing categories because it has functions to collapse categories and reorder them. Like lubridate, it doesn’t do anything you can’t do in other ways, but it makes things a bit easier and provides a consistent interface.
First, the package provides a function that creates factors consistently across all systems: as_factor() (note the trend in tidyverse packages of replacing . in function names with _). as.factor(), the base R function, creates factor levels by taking the unique values in the vector and sorting them in order (alphabetically). But the specific order depends on the locale (where you are in the world and the default language for your computer): alphabetical order isn't the same everywhere. This can cause issues when sharing code with collaborators. as_factor() instead creates levels in the order that values appear in the data. So two different vectors with the same set of values could end up with different levels, but the behavior will be consistent across the world.
Long story short: it’s best to just be explicit about factor levels when you make a factor regardless of which function you’re using.
as.factor(c("dog", "cat", "Cat", "mouse", "cat", "dog"))
## [1] dog cat Cat mouse cat dog
## Levels: cat Cat dog mouse
as_factor(c("dog", "cat", "Cat", "mouse", "cat", "dog"))
## [1] dog cat Cat mouse cat dog
## Levels: dog cat Cat mouse
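Being explicit with base R's factor() looks like this (same vector as above):
factor(c("dog", "cat", "Cat", "mouse", "cat", "dog"),
       levels = c("cat", "Cat", "dog", "mouse"))
## [1] dog   cat   Cat   mouse cat   dog  
## Levels: cat Cat dog mouse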
Ok, so getting out of the weeds, what can forcats help with?
What schools do workshop registrants come from?
registrations %>%
  count(school, sort=TRUE)
First, let’s make the missing values a level of the factor, instead of NA. This will let us count them and combine them with other categories:
registrations <- mutate(registrations,
                        school = fct_explicit_na(school, na_level="Missing"))
registrations %>%
  count(school, sort=TRUE)
If I want to present this data, it can be useful to collapse some of the smaller categories. While fct_explicit_na() above made school a factor already, I can start with a normal vector; I don't have to make it a factor first.
I can collapse groups manually into an "Other" level ("Other" is the default level name that is created):
registrations %>%
  mutate(school = fct_other(school, drop=c("Northwestern Medicine", "Missing",
                                           "Central University Administration Unit (eg. NUIT)"))) %>%
  count(school, sort=TRUE)
Or I can set a cut-off of how many times a level needs to appear:
registrations %>%
  mutate(school = fct_lump_min(school, 10)) %>%
  count(school, sort=TRUE)
Or how many levels I want – it will take the most frequent ones and lump the rest:
registrations %>%
  mutate(school = fct_lump_n(school, 6)) %>%
  count(school, sort=TRUE)
There are other options as well.
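For example, fct_lump_prop() lumps levels that make up less than a given proportion of the data (a sketch; the 5% cutoff here is arbitrary):
registrations %>%
  mutate(school = fct_lump_prop(school, 0.05)) %>%
  count(school, sort=TRUE)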
What if I make a plot of my data:
registrations %>%
  ggplot(aes(y=school)) +
  geom_bar(orientation="y")
In addition to the labels being too long, this chart is hard to read because the bars aren't in any particular order. Just counting the data first doesn't fix it: ggplot orders categories by factor level, not by value.
# doesn't fix the ordering
registrations %>%
  count(school, sort=TRUE) %>%
  ggplot(aes(y=school, x=n)) +
  geom_col()
So we reorder the factor levels to reflect the frequency of each:
registrations %>%
  mutate(school=fct_infreq(school)) %>%
  ggplot(aes(y=school)) +
  geom_bar(orientation="y")
Or we could reverse:
registrations %>%
  mutate(school=fct_rev(fct_infreq(school))) %>%
  ggplot(aes(y=school)) +
  geom_bar(orientation="y")
You can reorder the levels by another variable as well, if you need an order other than the count.
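fct_reorder() does this: it reorders the levels by the values of a second variable (a sketch, using the counts computed with count()):
registrations %>%
  count(school) %>%
  mutate(school = fct_reorder(school, n)) %>%
  ggplot(aes(y=school, x=n)) +
  geom_col()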
stringr is for finding, replacing, and extracting strings. Again, it mostly gives a unified interface to functions that exist elsewhere in R but can be difficult to use or remember the syntax for. These functions all take the string to operate on as the first input.
There are functions for converting and cleaning up text data:
# tolower()
str_to_lower("Christina Maimone")
## [1] "christina maimone"
str_to_title("tidyverse workshop series")
## [1] "Tidyverse Workshop Series"
str_to_title("PhD")
## [1] "Phd"
str_trim("  Christina  ")  # remove whitespace at beginning and end
## [1] "Christina"
str_squish("tidyverse  workshop   series ")  # trim, plus collapse interior spaces
## [1] "tidyverse workshop series"
There are functions for padding strings, which would have been useful when we were fixing zip codes in an earlier session:
str_pad("8831", width=5, side="left", pad="0")
## [1] "08831"
Then there are lots of functions to detect or replace substrings in a string. These functions will take either a fixed string for comparison or a regular expression. Regular expressions, or regex, are a way of writing patterns that match text. We can't cover them in this workshop, but they're well worth learning and will save you lots of time if you work with text at all.
These functions default to interpreting the pattern input as a regex. That doesn't make much difference if you're only searching for letters and spaces, but if you need to search for punctuation, you'll likely want to tell stringr the pattern is not a regex.
These functions all take the string to work on first, then the pattern to match/detect, then other options.
str_detect("Christina Maimone", "ma")
## [1] FALSE
# don't use a regex, and don't care about case
str_detect("Christina Maimone", fixed("ma", ignore_case=TRUE))
## [1] TRUE
fixed() is useful to tell stringr to make a literal match. To do a case-insensitive match, you have to use either fixed() or regex() explicitly.
For example, "." has special meaning in regular expressions (it matches any character), so if you want to match a literal period, fixed() is useful:
str_detect(registrations$department, ".")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA
## [166] NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(registrations$department, fixed("."))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA NA NA FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE
Matching to the beginning or end of the string:
str_starts("Christina", "Ch")
## [1] TRUE
str_ends("Christina", "Ch")
## [1] FALSE
You can also just get the elements of the vector that do match:
str_subset(registrations$department,
           regex("pol.+sci.+", ignore_case=TRUE))
## [1] "Political Science" "Political science" "Political Science"
## [4] "Political Science" "Pol. Science" "Political science"
## [7] "Political science" "Political Science" "political science"
regex() is like fixed(), but makes it explicit that the pattern is a regular expression. Pattern strings get passed to regex() by default, but if you want options such as ignoring case, you have to call it explicitly.
Next, replacing: each function has a version for replacing just the first occurrence vs. all occurrences.
First, just deleting:
str_remove("Christina Maimone", "a")
## [1] "Christin Maimone"
str_remove_all("Christina Maimone", "a")
## [1] "Christin Mimone"
Replacing with something new:
str_replace("Christina Maimone", "a", "!")
## [1] "Christin! Maimone"
str_replace_all("Christina Maimone", "a", "!")
## [1] "Christin! M!imone"
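str_replace_all() can also take a named vector to make several replacements in one call (a small sketch):
str_replace_all("Christina Maimone", c("a" = "!", "o" = "?"))
## [1] "Christin! M!im?ne"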
You can split a string, which can get complicated because you get a list back, unless you know how many components to expect:
str_split(registrations$department, " ") %>%
  head() # just print the first 6 elements
## [[1]]
## [1] "neurology"
##
## [[2]]
## [1] "CEE"
##
## [[3]]
## [1] "Gastro"
##
## [[4]]
## [1] "Data" "Science"
##
## [[5]]
## [1] "Medicine"
##
## [[6]]
## [1] "PBC"
# here fixed means a fixed number of components, not a fixed pattern as above
str_split_fixed(registrations$department, " ", 3) %>%
  head(20) # just print the first 20 rows
## [,1] [,2] [,3]
## [1,] "neurology" "" ""
## [2,] "CEE" "" ""
## [3,] "Gastro" "" ""
## [4,] "Data" "Science" ""
## [5,] "Medicine" "" ""
## [6,] "PBC" "" ""
## [7,] "Communication" "Sciences" "and Disorders"
## [8,] "Psychology" "" ""
## [9,] "ChemBE" "" ""
## [10,] "Finance" "" ""
## [11,] "imc" "" ""
## [12,] "Economics" "" ""
## [13,] "Pediatrics" "" ""
## [14,] "Accounting" "Services" ""
## [15,] "MEDS" "" ""
## [16,] "Econ" "" ""
## [17,] "Preventive" "Medicine" ""
## [18,] "Preventative" "Medicine" ""
## [19,] "Communication" "Sciences" "and Disorders"
## [20,] "Communication" "" ""
And then there is pattern extraction and matching, which can also be complicated, especially if you want to extract all matches. First, just the first match:
# any part of the string that matches the pattern - first match only
str_extract(registrations$department,
            regex("sciences?", ignore_case=TRUE))
## [1] NA NA NA "Science" NA NA
## [7] "Sciences" NA NA NA NA NA
## [13] NA NA NA NA NA NA
## [19] "Sciences" NA NA NA NA NA
## [25] NA NA NA NA "Sciences" NA
## [31] NA NA NA NA NA NA
## [37] NA "Sciences" "Science" "Sciences" NA NA
## [43] NA NA NA NA NA "science"
## [49] "science" "Sciences" NA NA "Science" NA
## [55] NA NA NA "science" NA NA
## [61] "Sciences" NA NA "Sciences" NA NA
## [67] NA NA NA NA NA NA
## [73] NA NA NA "Science" "Science" NA
## [79] NA NA NA NA "Sciences" NA
## [85] NA NA NA NA NA NA
## [91] NA NA NA NA NA NA
## [97] NA NA NA NA NA NA
## [103] NA NA "Sciences" NA NA NA
## [109] NA NA "Science" NA NA NA
## [115] "Sciences" "Science" NA "Sciences" NA NA
## [121] NA NA NA NA NA "science"
## [127] "science" "Sciences" NA "Science" NA NA
## [133] NA NA NA "Science" NA NA
## [139] "science" NA "Sciences" NA NA NA
## [145] "Sciences" NA NA NA NA NA
## [151] NA NA NA NA NA NA
## [157] "Sciences" NA NA NA NA "Sciences"
## [163] NA NA NA NA NA NA
## [169] NA NA NA NA NA NA
## [175] "Sciences" NA NA NA NA NA
## [181] "science" NA NA NA NA NA
## [187] NA NA NA NA NA NA
## [193] NA NA NA NA NA NA
## [199] NA NA NA "science" NA NA
## [205] NA NA NA NA NA NA
## [211] NA NA NA NA NA "Sciences"
## [217] NA NA NA NA "science" NA
## [223] NA NA NA NA NA "Sciences"
## [229] NA "Science" NA "sciences" "Sciences" NA
## [235] NA NA NA NA NA NA
## [241] "Sciences" NA "science" "science" "Sciences" NA
## [247] "Science" NA NA NA NA NA
## [253] "science" NA NA NA NA "science"
## [259] NA NA "Sciences" NA NA "Sciences"
## [265] NA NA NA NA "Science"
All matches: here the pattern matches an s followed by one or more other "word characters" (letters, digits, and underscores). You get a list back because there could be any number of matches: one list element for each element of the input vector, and each list element is a character vector of the matches.
str_extract_all(registrations$department,
                regex("s\\w+", ignore_case=TRUE)) %>%
  head(10)
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "stro"
##
## [[4]]
## [1] "Science"
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
##
## [[7]]
## [1] "Sciences" "sorders"
##
## [[8]]
## [1] "sychology"
##
## [[9]]
## character(0)
##
## [[10]]
## character(0)
character(0) indicates no match – it stands for an empty vector of type character.
There are more matching functions as well, such as for capturing groups in regular expressions.
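For example, str_match() handles capture groups: it returns a matrix with the full match in the first column and each parenthesized group in the columns after it (a small sketch, not from the workshop data):
str_match("Maimone, Christina", "(\\w+), (\\w+)")
##      [,1]                 [,2]      [,3]       
## [1,] "Maimone, Christina" "Maimone" "Christina"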
The packages above are a few of the ones I use frequently. But let’s end with a look at what else is in the tidyverse metapackage:
broom: converts the output of statistical modeling functions into tibbles. Useful if you want to run a lot of regression models, t-tests, etc. and save all of the output in a standardized format.
modelr: helps connect a dplyr workflow with statistical modeling functions such as lm() and predict(). I haven't seen it used much; tidy modeling packages are being developed in a separate set of packages instead: see https://www.tidymodels.org/
hms: mentioned above, like lubridate but for time.
jsonlite: for working with JSON data, which is a common result of API calls
xml2: for working with XML data
dbplyr: facilitates using dplyr functions directly with SQL databases, instead of having to write SQL queries yourself.
haven: for reading in data sets from other statistical programs such as Stata and SAS.
httr: for making web requests and working with URLs; used by rvest (below, for web scraping), or can be useful for calling APIs yourself
rvest: for working with html files; we have a webscraping workshop with this package later this summer
magrittr: the origin of %>%; the package includes additional pipe-like operators as well.
purrr: supports functional programming, a particular approach to structuring code. What you're most likely to use is the map family of functions, which does what the apply family does in base R; like other tidyverse packages, purrr provides a consistent interface and naming conventions to make it easier to figure out which function you need and how to use it (see the short sketch after this list).
reprex: a small package to help you make reproducible examples from your code (minimal commands someone else can run to reproduce what you did) so you can post your code on help forums or in GitHub repos to get help, report a bug, etc.
Packages that are primarily infrastructure for other tidyverse packages - you aren’t likely to use them directly unless you’re developing new packages: cli, crayon, pillar, rlang, rstudioapi
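As a quick illustration of purrr (a minimal sketch, unrelated to the workshop data): map_dbl() applies a function to each element of a list and returns a numeric vector:
# take the mean of each list element; map_dbl() guarantees a numeric vector back
map_dbl(list(1:3, 4:6, 7:9), mean)
## [1] 2 5 8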
Above are the packages that are installed as part of the “tidyverse” package, but there are many more packages within the tidyverse universe that work in similar ways and interact nicely with dplyr workflows.