An essential reference for ggplot2 is the Data Visualization with ggplot2 Cheat Sheet. It’s useful to keep it handy.

This workshop is organized around several example plots instead of a systematic overview of ggplot2.

Load the packages

If you don’t have them installed yet, then first install some packages:

install.packages(c("tidyverse", "ggthemes", "babynames"))

Then load the ones we’ll be using:

library(tidyverse)
library(ggthemes)

Caveats

This notebook is focused on getting you acquainted with ggplot2, not teaching good data visualization practices. Some of the plots created below are not necessarily the best way to visualize the example data sets.

Also, I’ve included the steps necessary to manipulate the original data into the form needed for plotting because this is a significant, but often ignored, part of learning how to make plots. The data manipulation steps aren’t covered in great detail, but the code is provided.

Data

Load the data we’ll be working with

library(babynames)
data(babynames)
babynames
## # A tibble: 1,858,689 x 5
##     year   sex      name     n       prop
##    <dbl> <chr>     <chr> <int>      <dbl>
##  1  1880     F      Mary  7065 0.07238433
##  2  1880     F      Anna  2604 0.02667923
##  3  1880     F      Emma  2003 0.02052170
##  4  1880     F Elizabeth  1939 0.01986599
##  5  1880     F    Minnie  1746 0.01788861
##  6  1880     F  Margaret  1578 0.01616737
##  7  1880     F       Ida  1472 0.01508135
##  8  1880     F     Alice  1414 0.01448711
##  9  1880     F    Bertha  1320 0.01352404
## 10  1880     F     Sarah  1288 0.01319618
## # ... with 1,858,679 more rows

Ok, so what’s a tibble? It’s a fancy data.frame that is part of the tidyverse (dplyr, ggplot2, etc.). It makes data.frames a little nicer to work with, for example by not printing all rows by default. You can use it like a data.frame.
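As a small illustration of “use it like a data.frame”, the sketch below builds a tibble from the first three rows of the printout above. The object names toy_tbl and toy_df are made up for this example.

```r
library(tibble)

# Three rows copied from the babynames printout above
toy_tbl <- tibble(
  name = c("Mary", "Anna", "Emma"),
  n    = c(7065L, 2604L, 2003L)
)

# A tibble is still a data.frame, so the usual tools work on it
is.data.frame(toy_tbl)
toy_tbl$name
toy_tbl[toy_tbl$n > 2500, ]

# Convert to a plain data.frame if some function insists on one
toy_df <- as.data.frame(toy_tbl)
class(toy_df)
```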

Example 1

Let’s explore the proportion of names that begin with each letter

Get the Data to Plot

First, manipulate the data (the dplyr way). Add a new column that is the first initial of the name, then make a new data set that sums the counts and proportions of names with that initial for each year, by sex.

babynames <- mutate(babynames, initial=substr(name, 1, 1))
initial_props <- group_by(babynames, year, sex, initial) %>% 
  summarize(count=sum(n), totalprop=sum(prop)) %>%
  ungroup()

Without dplyr, if you prefer:

babynames$initial <- substr(babynames$name, 1, 1)
initial_props <- aggregate(babynames[,c("n","prop")], by=list(babynames$year, babynames$sex, babynames$initial), sum)
names(initial_props)<- c("year", "sex", "initial", "count", "totalprop")
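To see what these steps produce, here is a rough sketch on a tiny made-up data set with the same columns as babynames. The toy values and the names toy and toy_props are assumptions for illustration only; it uses just base R.

```r
# Hypothetical toy data shaped like babynames (values are made up)
toy <- data.frame(
  year = c(2000, 2000, 2000, 2000),
  sex  = c("F", "F", "F", "M"),
  name = c("Anna", "Alice", "Beth", "Adam"),
  n    = c(10L, 30L, 60L, 100L),
  prop = c(0.1, 0.3, 0.6, 1.0),
  stringsAsFactors = FALSE
)

# First letter of each name
toy$initial <- substr(toy$name, 1, 1)

# Sum counts and proportions within each year/sex/initial group
toy_props <- aggregate(
  toy[, c("n", "prop")],
  by = list(year = toy$year, sex = toy$sex, initial = toy$initial),
  FUN = sum
)
names(toy_props)[names(toy_props) == "n"] <- "count"
names(toy_props)[names(toy_props) == "prop"] <- "totalprop"

# Anna and Alice collapse into a single F/A row
toy_props
```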

Simple Plot

Ok, make a simple plot first. Let’s look at girls with letter A names:

initial_props %>% filter(sex=="F", initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year)) + geom_line()

Or, without dplyr

ggplot(initial_props[initial_props$sex=="F"&initial_props$initial=="A",], 
       aes(y=totalprop, x=year)) + geom_line()

Multiple lines

Add in boys too - we want two different color lines, so we’ll use the aes parameter color:

initial_props %>% filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year, color=sex)) + geom_line()

Or, without dplyr

ggplot(initial_props[initial_props$initial=="A",], 
       aes(y=totalprop, x=year, color=sex)) + geom_line()

Or instead of color, we could change line type:

initial_props %>% filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year, linetype=sex)) + geom_line()

The aesthetics we can use are determined by the geom, and are listed on the geom’s help page. We can also make this relationship clearer by specifying the aesthetic mappings in the geom call instead of in ggplot.

initial_props %>% filter(initial=="A") %>% 
  ggplot() + 
  geom_line(aes(y=totalprop, x=year, linetype=sex))

initial_props %>% filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year)) + 
  geom_line(aes(linetype=sex))
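Where the mapping goes starts to matter once a plot has more than one layer. The sketch below uses a small made-up data frame (toy, shaped like initial_props) and adds a geom_point layer, which is not part of the workshop plots, to show the difference.

```r
library(ggplot2)

# Hypothetical toy data shaped like initial_props (values are made up)
toy <- data.frame(
  year      = rep(c(1990, 2000, 2010), times = 2),
  sex       = rep(c("F", "M"), each = 3),
  totalprop = c(0.10, 0.12, 0.15, 0.08, 0.09, 0.11)
)

# Mappings given in ggplot() are inherited by every layer:
# both the lines and the points are colored by sex
p_inherited <- ggplot(toy, aes(x = year, y = totalprop, color = sex)) +
  geom_line() +
  geom_point()

# A mapping given inside one geom applies to that layer only:
# the lines are colored by sex, the points keep their default color
p_layer_only <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line(aes(color = sex)) +
  geom_point()
```

Print either object (for example p_layer_only) to draw it.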

Aesthetics vs. Styling

If, instead of mapping data to characteristics of a plot, we just want to change the colors (or other appearance), we have options for that outside of the aes() call:

initial_props %>% filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year, linetype=sex)) + 
  geom_line(color="red", size=1)
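A common slip is to put a fixed color inside aes(). The sketch below, on made-up toy data, contrasts the two; layer_data() is a ggplot2 helper that returns the data a layer will actually draw.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  year      = rep(c(1990, 2000, 2010), times = 2),
  sex       = rep(c("F", "M"), each = 3),
  totalprop = c(0.10, 0.12, 0.15, 0.08, 0.09, 0.11)
)

# Setting: color is a fixed style outside aes(), so the lines are red
p_set <- ggplot(toy, aes(x = year, y = totalprop, group = sex)) +
  geom_line(color = "red")

# Mapping: inside aes() the string "red" is treated as data, so the
# lines get the color scale's default color and a legend entry "red"
p_map <- ggplot(toy, aes(x = year, y = totalprop, group = sex)) +
  geom_line(aes(color = "red"))

# Compare the colors each layer ends up with
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```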

Facets

Instead of just one letter, we could make a plot for each letter using facets:

ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
  geom_line() +
  facet_wrap(~initial)

Note that above, we split the plot creation commands on multiple lines. If you do this, make sure to end the line with a + or split within a () so that R doesn’t think you’re done with the line.

For facets, the options are facet_wrap, which automatically determines the number of rows and columns and fills the spots with the plots, or facet_grid which can be used to specify values to spread over rows or columns. Facets use the formula specification, which is y~x. So variables to the left of ~ specify layout down the y-axis (so appearing as rows), while variables to the right specify layout along the x axis (so appearing as columns). For facet_wrap you always just put the variable on the right of ~. You can change the default layout with other options to the function. You can use facets with multiple variables.

You can see that by default all of the plots share the same axis ranges. This lets you compare easily across plots, but you can change it with the scales argument of the facet function.
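The facet options described above can be sketched as follows. The toy data frame is made up for illustration; scales, nrow and the rows ~ columns formula are standard arguments of facet_wrap and facet_grid.

```r
library(ggplot2)

# Hypothetical toy data: 3 years x 2 sexes x 2 initials (values are made up)
toy <- expand.grid(
  year    = c(1990, 2000, 2010),
  sex     = c("F", "M"),
  initial = c("A", "B"),
  stringsAsFactors = FALSE
)
toy$totalprop <- c(0.10, 0.12, 0.15, 0.08, 0.09, 0.11,
                   0.010, 0.020, 0.015, 0.020, 0.010, 0.012)

base <- ggplot(toy, aes(x = year, y = totalprop, color = sex)) + geom_line()

# Let each panel pick its own y range instead of sharing one
p_free <- base + facet_wrap(~initial, scales = "free_y")

# Override the automatic layout: stack the panels in two rows
p_rows <- base + facet_wrap(~initial, nrow = 2)

# facet_grid: sex down the rows, initial across the columns
p_grid <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line() +
  facet_grid(sex ~ initial)
```

Free scales make small panels readable, at the cost of no longer being able to compare heights across panels directly.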

Cleanup/Formatting

Ok, now some plot cleanup:

Let’s rotate the year labels so that we can read them more easily. The layout of the labels is part of the plot theme.

ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
  geom_line() + 
  facet_wrap(~initial) +
  theme(axis.text.x = element_text(angle = -90, vjust=0.5))

And make better labels:

ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
  geom_line() + 
  facet_wrap(~initial) +
  theme(axis.text.x = element_text(angle = -90, vjust=0.5)) +
  labs(title="Baby Names First Initial", x="Year", y="Proportion of Babies")

The legend’s key labels (here F and M) are changed through the color scale. You can set the x and y axis names through their scales too, instead of using labs (which is a convenience):

ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
  geom_line() + 
  facet_wrap(~initial) +
  ggtitle("Baby Names First Initial") +
  scale_x_continuous(name="Year") +
  scale_y_continuous(name="Proportion of Babies") +
  scale_color_discrete(labels=c("Female","Male"), name="Sex") +
  theme(axis.text.x = element_text(angle = -90, vjust=.5)) 

Our y-axis is a little crowded, and we don’t need all of those gridlines:

ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
  geom_line() + 
  facet_wrap(~initial) +
  ggtitle("Baby Names First Initial") +
  scale_x_continuous(name="Year") +
  scale_y_continuous(name="Proportion of Babies", 
                     breaks=c(0,.1,.2)) +
  scale_color_discrete(labels=c("Female","Male"), name="Sex") +
  theme(axis.text.x = element_text(angle = -90, vjust=.5),
        panel.grid.minor = element_blank()) 

Example 2

Let’s look at unisex names. For reference: https://fivethirtyeight.com/features/there-are-922-unisex-names-in-america-is-yours-one-of-them/ and http://flowingdata.com/2013/09/25/the-most-unisex-names-in-us-history/

Data

We’ll just look among the 1000 most common names by total count since 1950. We’ll define unisex names as ones with roughly equal numbers of men and women.

topnames<-babynames %>% 
  # this section determines the top names
  filter(year>=1950) %>% 
  group_by(name) %>% 
  summarize(total=sum(n)) %>%
  arrange(desc(total)) %>%
  head(1000) %>%
  # this section adds back in details for those names
  inner_join(babynames) %>%
  filter(year >= 1950) %>%
  # make labels more convenient
  mutate(sex=recode(sex, "F"="Female", "M"="Male")) %>%
  # compute totals by sex for each year for each name
  group_by(name, sex, total) %>%
  summarize(sextotal=sum(n)) %>%
  # make a column for each sex (wide data)
  spread(key=sex, value=sextotal) %>%
  # compute ratio between male and female
  mutate(ratio=(Male-Female)/total)
## Joining, by = "name"

Note the use of spread from tidyr above. This is important because to make a scatterplot, we need to have our x and y values in different variables. We can’t easily separate x and y values by group (such as sex).
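Here is a minimal sketch of the long-to-wide step on made-up data (the names and counts are hypothetical). It also shows pivot_wider, the newer tidyr function for the same job, assuming tidyr 1.0.0 or later is installed.

```r
library(tidyr)

# Hypothetical long data: one row per name and sex (values are made up)
long <- data.frame(
  name     = c("Casey", "Casey", "Riley", "Riley"),
  sex      = c("Female", "Male", "Female", "Male"),
  sextotal = c(40L, 60L, 55L, 45L),
  stringsAsFactors = FALSE
)

# Wide data: one row per name, one column per sex
wide <- spread(long, key = sex, value = sextotal)
wide

# Same reshaping with the newer tidyr interface
wide2 <- pivot_wider(long, names_from = sex, values_from = sextotal)
```

With Female and Male in separate columns, each can be mapped to its own axis.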

Without dplyr:

totalcounts<-sort(tapply(babynames$n[babynames$year>=1950], babynames$name[babynames$year>=1950], sum), decreasing=TRUE)[1:1000]
topnames<-data.frame(name=names(totalcounts), total=totalcounts)
topnames<-merge(topnames, babynames)
topnames<-topnames[topnames$year>=1950, ]
topnames<-aggregate(topnames$n, by=list(topnames$name, topnames$sex, topnames$total), sum)
names(topnames)<-c("name","sex","total","sextotal")
topnames$sex<-ifelse(topnames$sex=="M","Male","Female")
topnames<-spread(topnames, key=sex, value=sextotal)
topnames$ratio<-(topnames$Male-topnames$Female)/topnames$total

Scatterplot to Find Unisex Names

Let’s use a plot to find unisex names. We’ll plot the number of boys with the name vs. the number of girls and look for names near the diagonal. The counts were already aggregated by sex in the data step above, so each name appears as one point.

ggplot(topnames, aes(x=Male, y=Female)) + geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

We get a warning message. This is because one name has either no recorded males or no recorded females, so one of its counts is missing. Let’s check:

topnames %>% 
  filter(is.na(Male) | is.na(Female))
## # A tibble: 1 x 5
## # Groups:   name [1]
##       name total Female  Male ratio
##      <chr> <int>  <int> <int> <dbl>
## 1 Penelope 37666  37666    NA    NA

Ok, back to the plot. It looks like there might be a few unisex names, but none of the most common names (makes sense). We could try a log scale:

ggplot(topnames, aes(x=Male, y=Female)) + geom_point() +
  scale_x_log10() +
  scale_y_log10()
## Warning: Removed 1 rows containing missing values (geom_point).

Well that’s kind of interesting. The missing chunk in the bottom left is because the data set doesn’t have names where there were fewer than 5 people. Another approach: limit the range on the axes to just look at the bottom left corner:

ggplot(topnames, aes(x=Male, y=Female)) + geom_point() + 
  lims(x=c(0,250000), y=c(0,250000))
## Warning: Removed 180 rows containing missing values (geom_point).

The warning appears because the limits we set on the x and y axes exclude some points.
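If you would rather zoom in without discarding data, coord_cartesian is an alternative that the workshop plots don’t use. A minimal sketch on made-up counts:

```r
library(ggplot2)

# Hypothetical toy counts; the third point lies outside the range we look at
toy <- data.frame(Male = c(10, 50, 400), Female = c(20, 40, 300))

# lims() sets scale limits: out-of-range points are dropped
# (with a warning) when the plot is drawn
p_lims <- ggplot(toy, aes(x = Male, y = Female)) +
  geom_point() +
  lims(x = c(0, 100), y = c(0, 100))

# coord_cartesian() only zooms the view: no data is removed
p_zoom <- ggplot(toy, aes(x = Male, y = Female)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 100))
```

For a plain scatterplot the two look the same. The difference matters when a layer computes something from the data, such as geom_smooth, because dropped points no longer contribute.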

Let’s add in a diagonal line for reference:

ggplot(topnames, aes(x=Male, y=Female)) + geom_point() + 
  lims(x=c(0,250000), y=c(0,250000)) +
  geom_abline(slope=1, intercept=0)
## Warning: Removed 180 rows containing missing values (geom_point).

Color the points by how male or female they are, and let’s make them a little bigger:

ggplot(topnames, aes(x=Male, y=Female, color=ratio)) + 
  geom_point(size=2) + 
  lims(x=c(0,250000), y=c(0,250000)) +
  geom_abline(slope=1, intercept=0, color="gray60") + 
  scale_colour_gradient(low = "pink", high = "blue", name="Sex Split", 
                        breaks=c(.9,0,-.9), labels=c("Male","Neutral","Female")) 
## Warning: Removed 180 rows containing missing values (geom_point).

Now, for the points we care about, let’s label them. We could label all of them, but we really just want to label the points near the line:

ggplot(topnames, aes(x=Male, y=Female, color=ratio)) + 
  geom_point(size=2) + 
  lims(x=c(0,150000), y=c(0,150000)) +
  geom_abline(slope=1, intercept=0, color="gray60") + 
  scale_colour_gradient(low = "pink", high = "blue", name="Sex Split", 
                        breaks=c(.9,0,-.9), labels=c("Male","Neutral","Female")) +
  geom_text(aes(label=ifelse(abs(ratio)< .2, as.character(name), '')),
            hjust=-.25, vjust=0.5, color="gray10", fontface = "bold", size=3) 
## Warning: Removed 316 rows containing missing values (geom_point).
## Warning: Removed 316 rows containing missing values (geom_text).

Not bad, but a little cluttered. Useful for an exploratory plot. Probably not something we’d want to use for presentation.

Example 3

Continue as above, looking at unisex names, but make a bar chart.

We need the Male and Female columns back together in one, because we’re no longer plotting them against each other – instead, we’re grouping again. Use gather to undo what we did with spread.

topnames2<-gather(topnames, key="sex",value="sextotal", Male, Female)

We only want to look at the names that score the lowest on ratio (absolute value). We already have the counts calculated, so we pass stat="identity" to geom_bar so that it doesn’t try to do the counting itself.

topnames2 %>% filter(abs(ratio)<.3 & !is.na(ratio)) %>%
  ggplot(aes(x=name, y=sextotal, group=sex, fill=sex)) + 
  geom_bar(stat="identity") +
  labs(title="Popular Unisex Names", x="", y="Count")
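ggplot2 also provides geom_col as a shorthand for geom_bar(stat="identity"). The sketch below, on made-up counts, checks that the two produce the same stacked bars.

```r
library(ggplot2)

# Hypothetical pre-computed counts (values are made up)
toy <- data.frame(
  name     = c("Casey", "Casey", "Riley", "Riley"),
  sex      = c("Female", "Male", "Female", "Male"),
  sextotal = c(40, 60, 55, 45)
)

# Bar heights taken directly from the sextotal column
p_bar <- ggplot(toy, aes(x = name, y = sextotal, fill = sex)) +
  geom_bar(stat = "identity")

# geom_col() does the same without the stat argument
p_col <- ggplot(toy, aes(x = name, y = sextotal, fill = sex)) +
  geom_col()

# Both layers compute the same bar tops
all.equal(layer_data(p_bar)$ymax, layer_data(p_col)$ymax)
```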

Plot proportions instead of counts. And let’s use slightly different colors (there are different ways to specify colors; below we use hex codes, see https://en.wikipedia.org/wiki/Web_colors):

topnames2 %>% filter(abs(ratio)<.3 & !is.na(ratio)) %>%
  ggplot(aes(x=name, y=sextotal, group=sex, fill=sex)) + 
  geom_bar(stat="identity", position="fill") +
  labs(title="Popular Unisex Names", x="", y="Proportion") + 
  geom_hline(yintercept=.5) +
  scale_fill_manual(values=c("#ff33cc","#1e90ff")) 

Lose the gray background with the minimal theme, and we don’t need vertical grid lines for a bar chart:

ggplot(topnames2[abs(topnames2$ratio)<.3 & !is.na(topnames2$ratio),], aes(x=name, y=sextotal, group=sex, fill=sex)) + 
  geom_bar(stat="identity") +
  labs(title="Popular Unisex Names", x="", y="Count") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank()) 

Note that the order of theme_minimal and theme matters. The latter takes precedence over the former: theme_minimal sets a complete theme, so it would undo a theme call placed before it.
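The ordering effect can be sketched with a small made-up bar chart; the object names p_kept and p_lost are just for illustration.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(name = c("Casey", "Riley"), sextotal = c(100, 80))

base <- ggplot(toy, aes(x = name, y = sextotal)) +
  geom_bar(stat = "identity")

# Complete theme first, tweak second: the vertical gridlines stay removed
p_kept <- base +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())

# Tweak first, complete theme second: theme_minimal() replaces the
# earlier setting, so the vertical gridlines come back
p_lost <- base +
  theme(panel.grid.major.x = element_blank()) +
  theme_minimal()
```

Printing p_kept and p_lost side by side shows the difference.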

Themes

We used theme_minimal above to style the plot. This is one of the themes built into ggplot2. There are other themes available in the ggthemes package. First, a base plot without themes:

initial_props %>% 
  filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year, color=sex)) + 
  geom_line() +
  ggtitle("Baby Names Starting with A") 

Then, the FiveThirtyEight theme:

initial_props %>% 
  filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year, color=sex)) + 
  geom_line() +
  ggtitle("Baby Names Starting with A") +
  theme_fivethirtyeight()

With the Stata theme (if you want to fool your audience?)

initial_props %>% 
  filter(initial=="A") %>% 
  ggplot(aes(y=totalprop, x=year, color=sex)) + 
  geom_line() +
  ggtitle("Baby Names Starting with A") +
  theme_stata()

See a list of available themes with examples at https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html

Saving Plots

ggsave will save the last plot made. Or, you can save the plot as an object with a name and send it to ggsave. The file extension determines the image type.

ggsave("mybarchart.pdf")

or

mybarchart<- ggplot(topnames2[abs(topnames2$ratio)<.3 & !is.na(topnames2$ratio),], aes(x=name, y=sextotal, group=sex, fill=sex)) + 
  geom_bar(stat="identity") +
  labs(title="Popular Unisex Names", x="", y="Count") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank()) 
ggsave("mybarchart.png", mybarchart)
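ggsave also accepts the output size. The sketch below saves a made-up plot to a temporary file; the file name, the toy data and the 6 x 4 inch size are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy plot (values are made up)
toy <- data.frame(name = c("Casey", "Riley"), sextotal = c(100, 80))
p <- ggplot(toy, aes(x = name, y = sextotal)) +
  geom_bar(stat = "identity")

# Write to a temporary directory so nothing in the project is overwritten
out_file <- file.path(tempdir(), "toy_barchart.pdf")

# width and height are in the given units; for raster formats such as
# PNG, the dpi argument controls the resolution
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```

Setting the size explicitly keeps the saved figure the same regardless of the current plot window’s dimensions.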