An essential reference for ggplot is the Data Visualization with ggplot2 Cheat Sheet. It’s useful to have this handy.
This workshop is organized around several example plots instead of a systematic overview of ggplot2.
If you don’t have them installed yet, then first install some packages:
install.packages(c("tidyverse", "ggthemes", "babynames"))
Then load the ones we’ll be using:
library(tidyverse)
library(ggthemes)
This notebook is focused on getting you acquainted with ggplot2, not teaching good data visualization practices. Some of the plots created below are not necessarily the best way to visualize the example data sets.
Also, I’ve included the steps necessary to manipulate the original data into the form needed for plotting because this is a significant, but often ignored, part of learning how to make plots. The data manipulation steps aren’t covered in great detail, but the code is provided.
Load the data we’ll be working with
library(babynames)
data(babynames)
babynames
## # A tibble: 1,858,689 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.07238433
## 2 1880 F Anna 2604 0.02667923
## 3 1880 F Emma 2003 0.02052170
## 4 1880 F Elizabeth 1939 0.01986599
## 5 1880 F Minnie 1746 0.01788861
## 6 1880 F Margaret 1578 0.01616737
## 7 1880 F Ida 1472 0.01508135
## 8 1880 F Alice 1414 0.01448711
## 9 1880 F Bertha 1320 0.01352404
## 10 1880 F Sarah 1288 0.01319618
## # ... with 1,858,679 more rows
Ok, so what’s a tibble? It’s a fancy data.frame that is part of the tidyverse (e.g. dplyr, ggplot2, etc.). It makes data.frames a little nicer, like not printing all rows by default. You can use it like a data.frame.
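To make the "use it like a data.frame" point concrete, here is a small sketch. The toy_df data frame and its values are made up for illustration and are not part of the workshop data.

```r
library(tibble)

# Hypothetical toy data, not taken from babynames
toy_df <- data.frame(name = c("Mary", "Anna", "Emma"),
                     n = c(7065, 2604, 2003),
                     stringsAsFactors = FALSE)
toy_tbl <- as_tibble(toy_df)

# A tibble is still a data.frame, so the usual base R tools work on it
is.data.frame(toy_tbl)
toy_tbl$n
toy_tbl[toy_tbl$n > 2500, ]
nrow(toy_tbl)
```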
Let’s explore the proportion of names that begin with each letter
First, manipulate the data (the dplyr way). Add a new column that is the first initial of the name, then make a new data set that computes the count and proportion of names each year with that initial, by sex.
babynames <- mutate(babynames, initial=substr(name, 1, 1))
initial_props <- group_by(babynames, year, sex, initial) %>%
summarize(count=sum(n), totalprop=sum(prop)) %>%
ungroup()
Without dplyr, if you prefer:
babynames$initial <- substr(babynames$name, 1, 1)
initial_props <- aggregate(babynames[,c("n","prop")], by=list(babynames$year, babynames$sex, babynames$initial), sum)
names(initial_props)<- c("year", "sex", "initial", "count", "totalprop")
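If you want to convince yourself that the two approaches agree, a quick check on a tiny data set can help. This is only a sketch: the toy data frame below is invented, with the same column names as babynames.

```r
library(dplyr)

# Hypothetical toy data with the same columns as babynames
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001),
  sex  = c("F", "F", "M", "F"),
  name = c("Anna", "Amy", "Adam", "Beth"),
  n    = c(10, 5, 8, 12),
  prop = c(0.10, 0.05, 0.08, 0.11),
  stringsAsFactors = FALSE
)
toy$initial <- substr(toy$name, 1, 1)

# dplyr version
via_dplyr <- group_by(toy, year, sex, initial) %>%
  summarize(count = sum(n), totalprop = sum(prop)) %>%
  ungroup()

# base R version
via_base <- aggregate(toy[, c("n", "prop")],
                      by = list(toy$year, toy$sex, toy$initial), sum)
names(via_base) <- c("year", "sex", "initial", "count", "totalprop")

# Sort both the same way before comparing, since row order can differ
via_dplyr <- arrange(via_dplyr, year, sex, initial)
via_base  <- arrange(via_base, year, sex, initial)
all.equal(via_dplyr$count, via_base$count)
all.equal(via_dplyr$totalprop, via_base$totalprop)
```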
Ok, make a simple plot first. Let’s look at girls with letter A names:
initial_props %>% filter(sex=="F", initial=="A") %>%
ggplot(aes(y=totalprop, x=year)) + geom_line()
Or, without dplyr
ggplot(initial_props[initial_props$sex=="F"&initial_props$initial=="A",],
aes(y=totalprop, x=year)) + geom_line()
Add in boys too - we want two different color lines, so we’ll use the aes parameter color:
initial_props %>% filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year, color=sex)) + geom_line()
Or, without dplyr
ggplot(initial_props[initial_props$initial=="A",],
aes(y=totalprop, x=year, color=sex)) + geom_line()
Or instead of color, we could change line type:
initial_props %>% filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year, linetype=sex)) + geom_line()
The aesthetics we can use are determined by the geom, and are listed on the geom’s help pages. We can also be clearer about this relationship and specify the aesthetic mappings with the geom call, instead of ggplot.
initial_props %>% filter(initial=="A") %>%
ggplot() +
geom_line(aes(y=totalprop, x=year, linetype=sex))
initial_props %>% filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year)) +
geom_line(aes(linetype=sex))
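Where the mappings go starts to matter once a plot has more than one layer. As a rough sketch (the toy data frame is made up), mappings given in ggplot() are shared by every layer, while mappings given in a geom apply only to that geom:

```r
library(ggplot2)

# Hypothetical toy data standing in for a single initial
toy <- data.frame(
  year = rep(2000:2004, times = 2),
  sex = rep(c("F", "M"), each = 5),
  totalprop = c(0.10, 0.11, 0.12, 0.11, 0.10, 0.08, 0.08, 0.09, 0.10, 0.10)
)

# x and y are set once in ggplot() and shared by both layers;
# linetype applies only to the lines, shape only to the points
p <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line(aes(linetype = sex)) +
  geom_point(aes(shape = sex))
p
```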
If, instead of mapping data to characteristics of a plot, we just want to change the colors (or other appearance), we have options for that outside of the aes() call:
initial_props %>% filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year, linetype=sex)) +
# in ggplot2 3.4.0 and later, use linewidth=1 for lines; size still works but warns
geom_line(color="red", size=1)
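A common slip is putting a fixed color inside aes(). The sketch below, on made-up toy data, contrasts the two. Inside aes(), "red" is treated as a data value rather than a color name.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(year = 2000:2004,
                  totalprop = c(0.10, 0.11, 0.12, 0.11, 0.10))

# Setting: color is a fixed appearance, given outside aes()
p_set <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line(color = "red")

# Mapping: inside aes(), "red" is a one-level category, so ggplot2
# picks a color from its default scale and adds a legend
p_map <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line(aes(color = "red"))

# layer_data() shows the colors ggplot2 actually computed for each layer
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```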
Instead of just one letter, we could make a plot for each letter using facets:
ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
geom_line() +
facet_wrap(~initial)
Note that above, we split the plot creation commands across multiple lines. If you do this, make sure to end each line with a + or split within a () so that R doesn’t think you’re done with the command.
For facets, the options are facet_wrap, which automatically determines the number of rows and columns and fills the spots with the plots, or facet_grid, which can be used to specify values to spread over rows or columns. Facets use the formula specification, which is y~x. So variables to the left of ~ specify layout down the y-axis (so appearing as rows), while variables to the right specify layout along the x-axis (so appearing as columns). For facet_wrap you always just put the variable on the right of ~. You can change the default layout with other options to the function. You can use facets with multiple variables.
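Here is a small sketch of the rows ~ columns idea, using an invented toy data set rather than the full babynames data:

```r
library(ggplot2)

# Hypothetical toy data: two sexes, two initials, five years
toy <- expand.grid(year = 2000:2004, sex = c("F", "M"),
                   initial = c("A", "B"), stringsAsFactors = FALSE)
toy$totalprop <- seq(0.05, 0.24, length.out = nrow(toy))

# sex on the left of ~ gives rows; initial on the right gives columns
p_grid <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line() +
  facet_grid(sex ~ initial)

# facet_wrap with the layout changed to a single column of panels
p_wrap <- ggplot(toy, aes(x = year, y = totalprop, color = sex)) +
  geom_line() +
  facet_wrap(~initial, ncol = 1)
```

Printing p_grid gives a 2 by 2 grid of panels, and p_wrap gives two panels stacked vertically.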
You can see that by default all of the plots have the same axes – the same range. This lets you compare easily across plots, but you can change this with the scales argument to the facet function (for example, scales="free_y").
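As a sketch of what changing the axes looks like, the toy data below (made up for illustration) has one common initial and one rare one. With a shared y-axis the rare one is flattened against zero, and freeing the y scale gives each panel its own range.

```r
library(ggplot2)

# Hypothetical toy data: a common initial and a rare one
toy <- data.frame(
  year = rep(2000:2004, times = 2),
  initial = rep(c("A", "Z"), each = 5),
  totalprop = c(0.10, 0.11, 0.12, 0.11, 0.10,
                0.001, 0.002, 0.002, 0.003, 0.002)
)

# Default: every panel shares the same y range
p_fixed <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line() +
  facet_wrap(~initial)

# Each panel gets its own y range; harder to compare across panels
p_free <- ggplot(toy, aes(x = year, y = totalprop)) +
  geom_line() +
  facet_wrap(~initial, scales = "free_y")
```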
Ok, now some plot cleanup:
Let’s rotate the year labels so that we can read them more easily. The layout of the labels is part of the plot theme.
ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
geom_line() +
facet_wrap(~initial) +
theme(axis.text.x = element_text(angle = -90, vjust=0.5))
And make better labels:
ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
geom_line() +
facet_wrap(~initial) +
theme(axis.text.x = element_text(angle = -90, vjust=0.5)) +
labs(title="Baby Names First Initial", x="Year", y="Proportion of Babies")
The legend label has to be changed by reference to the color scale. You can change the x and y labels that way too instead of using labs (which is a convenience):
ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
geom_line() +
facet_wrap(~initial) +
ggtitle("Baby Names First Initial") +
scale_x_continuous(name="Year") +
scale_y_continuous(name="Proportion of Babies") +
scale_color_discrete(labels=c("Female","Male"), name="Sex") +
theme(axis.text.x = element_text(angle = -90, vjust=.5))
Our y-axis is a little crowded, and we don’t need all of those gridlines:
ggplot(initial_props, aes(y=totalprop, x=year, color=sex)) +
geom_line() +
facet_wrap(~initial) +
ggtitle("Baby Names First Initial") +
scale_x_continuous(name="Year") +
scale_y_continuous(name="Proportion of Babies",
breaks=c(0,.1,.2)) +
scale_color_discrete(labels=c("Female","Male"), name="Sex") +
theme(axis.text.x = element_text(angle = -90, vjust=.5),
panel.grid.minor = element_blank())
Let’s look at unisex names. For reference: https://fivethirtyeight.com/features/there-are-922-unisex-names-in-america-is-yours-one-of-them/ and http://flowingdata.com/2013/09/25/the-most-unisex-names-in-us-history/
We’ll just look among the 1000 most common names by total count since 1950. We’ll define unisex names as ones with roughly equal numbers of men and women.
topnames<-babynames %>%
# this section determines the top names
filter(year>=1950) %>%
group_by(name) %>%
summarize(total=sum(n)) %>%
arrange(desc(total)) %>%
head(1000) %>%
# this section adds back in details for those names
inner_join(babynames) %>%
filter(year >= 1950) %>%
# make labels more convenient
mutate(sex=recode(sex, "F"="Female", "M"="Male")) %>%
# compute totals by sex for each name (summed over all years since 1950)
group_by(name, sex, total) %>%
summarize(sextotal=sum(n)) %>%
# make a column for each sex (wide data)
spread(key=sex, value=sextotal) %>%
# compute ratio between male and female
mutate(ratio=(Male-Female)/total)
## Joining, by = "name"
Note the use of spread from tidyr above. This is important because to make a scatterplot, we need to have our x and y values in different variables. We can’t easily separate x and y values by group (such as sex).
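To see what spread does in isolation, here is a sketch on a tiny invented data set. It also shows pivot_wider, which newer versions of tidyr offer for the same reshaping. The names and counts are made up.

```r
library(tidyr)

# Hypothetical long data: one row per name and sex
toy_long <- data.frame(
  name = c("Casey", "Casey", "Riley", "Riley"),
  sex = c("Female", "Male", "Female", "Male"),
  sextotal = c(40, 60, 55, 45),
  stringsAsFactors = FALSE
)

# Wide data: one row per name, one column per sex
toy_wide <- spread(toy_long, key = sex, value = sextotal)
toy_wide

# The same reshaping with the newer tidyr function
toy_wide2 <- pivot_wider(toy_long, names_from = sex, values_from = sextotal)
```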
Without dplyr:
totalcounts<-sort(tapply(babynames$n[babynames$year>=1950], babynames$name[babynames$year>=1950], sum), decreasing=TRUE)[1:1000]
topnames<-data.frame(name=names(totalcounts), total=totalcounts)
topnames<-merge(topnames, babynames)
topnames<-topnames[topnames$year>=1950, ]
topnames<-aggregate(topnames$n, by=list(topnames$name, topnames$sex, topnames$total), sum)
names(topnames)<-c("name","sex","total","sextotal")
topnames$sex<-ifelse(topnames$sex=="M","Male","Female")
topnames<-spread(topnames, key=sex, value=sextotal)
topnames$ratio<-(topnames$Male-topnames$Female)/topnames$total
Let’s use a plot to find unisex names. We’ll plot the number of boys with the name vs. the number of girls and look for names near the diagonal. The counts were already aggregated by sex above, so each name is a single point.
ggplot(topnames, aes(x=Male, y=Female)) + geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
We get a warning message. This is because one name has either no males or no females recorded. Let’s check:
topnames %>%
filter(is.na(Male) | is.na(Female))
## # A tibble: 1 x 5
## # Groups: name [1]
## name total Female Male ratio
## <chr> <int> <int> <int> <dbl>
## 1 Penelope 37666 37666 NA NA
Ok, back to the plot. It looks like there might be a few unisex names, but none of the most common names (makes sense). We could try a log scale:
ggplot(topnames, aes(x=Male, y=Female)) + geom_point() +
scale_x_log10() +
scale_y_log10()
## Warning: Removed 1 rows containing missing values (geom_point).
Well that’s kind of interesting. The missing chunk in the bottom left is because the data set doesn’t have names where there were fewer than 5 people. Another approach: limit the range on the axes to just look at the bottom left corner:
ggplot(topnames, aes(x=Male, y=Female)) + geom_point() +
lims(x=c(0,250000), y=c(0,250000))
## Warning: Removed 180 rows containing missing values (geom_point).
The warning appears because the limits we set on the x and y axes exclude some points.
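If you only want to zoom in without dropping any data, coord_cartesian is an alternative to lims. The sketch below uses invented toy values to contrast the two. For a scatterplot the visible result is the same, but the distinction matters for layers that compute something from the data, such as smoothers.

```r
library(ggplot2)

# Hypothetical toy data: two small names and two large ones
toy <- data.frame(Male = c(10, 50, 200, 5000),
                  Female = c(20, 40, 6000, 100))

# lims() sets scale limits: values outside the range are treated as
# missing, which is what triggers the "Removed ... rows" warning
p_lims <- ggplot(toy, aes(x = Male, y = Female)) + geom_point() +
  lims(x = c(0, 250), y = c(0, 250))

# coord_cartesian() zooms the view but keeps every row in the data
p_zoom <- ggplot(toy, aes(x = Male, y = Female)) + geom_point() +
  coord_cartesian(xlim = c(0, 250), ylim = c(0, 250))
```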
Let’s add in a diagonal line for reference:
ggplot(topnames, aes(x=Male, y=Female)) + geom_point() +
lims(x=c(0,250000), y=c(0,250000)) +
geom_abline(slope=1, intercept=0)
## Warning: Removed 180 rows containing missing values (geom_point).
Color points by how male or female they are, and let’s make them a little bigger:
ggplot(topnames, aes(x=Male, y=Female, color=ratio)) +
geom_point(size=2) +
lims(x=c(0,250000), y=c(0,250000)) +
geom_abline(slope=1, intercept=0, color="gray60") +
scale_colour_gradient(low = "pink", high = "blue", name="Sex Split",
breaks=c(.9,0,-.9), labels=c("Male","Neutral","Female"))
## Warning: Removed 180 rows containing missing values (geom_point).
Now, for the points we care about, let’s label them. We could label all of them, but we really just want to label the points near the line:
ggplot(topnames, aes(x=Male, y=Female, color=ratio)) +
geom_point(size=2) +
lims(x=c(0,150000), y=c(0,150000)) +
geom_abline(slope=1, intercept=0, color="gray60") +
scale_colour_gradient(low = "pink", high = "blue", name="Sex Split",
breaks=c(.9,0,-.9), labels=c("Male","Neutral","Female")) +
geom_text(aes(label=ifelse(abs(ratio)< .2, as.character(name), '')),
hjust=-.25, vjust=0.5, color="gray10", fontface = "bold", size=3)
## Warning: Removed 316 rows containing missing values (geom_point).
## Warning: Removed 316 rows containing missing values (geom_text).
Not bad, but a little cluttered. Useful for an exploratory plot. Probably not something we’d want to use for presentation.
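Another way to label only some points, instead of the ifelse trick, is to hand geom_text its own filtered data set. This is just a sketch of that alternative on made-up toy data; near_line is a hypothetical helper name.

```r
library(ggplot2)

# Hypothetical toy data: two roughly unisex names, two that are not
toy <- data.frame(
  name = c("Casey", "Riley", "James", "Mary"),
  Male = c(60, 45, 100, 1),
  Female = c(40, 55, 1, 100),
  stringsAsFactors = FALSE
)
toy$ratio <- (toy$Male - toy$Female) / (toy$Male + toy$Female)

# Keep only the names close to the diagonal for labeling
near_line <- toy[abs(toy$ratio) < 0.3, ]

p <- ggplot(toy, aes(x = Male, y = Female)) +
  geom_point(size = 2) +
  geom_abline(slope = 1, intercept = 0, color = "gray60") +
  # this layer uses its own data, so only the filtered names get text
  geom_text(data = near_line, aes(label = name),
            hjust = -0.25, vjust = 0.5, size = 3)
```

Because the text layer only receives the filtered rows, no empty labels are drawn for the other points.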
Continue as above, looking at unisex names, but make a bar chart.
We need the Male and Female columns back together in one, because we’re no longer plotting them against each other – instead, we’re grouping again. Use gather to undo what we did with spread.
topnames2<-gather(topnames, key="sex",value="sextotal", Male, Female)
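Here is what gather does on a tiny invented example, alongside pivot_longer, the newer tidyr function for the same reshaping. The names and counts are made up.

```r
library(tidyr)

# Hypothetical wide data: one row per name, one column per sex
toy_wide <- data.frame(name = c("Casey", "Riley"),
                       Female = c(40, 55),
                       Male = c(60, 45),
                       stringsAsFactors = FALSE)

# Long data: one row per name and sex
toy_long <- gather(toy_wide, key = "sex", value = "sextotal", Male, Female)
toy_long

# The same reshaping with the newer tidyr function
toy_long2 <- pivot_longer(toy_wide, cols = c(Male, Female),
                          names_to = "sex", values_to = "sextotal")
```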
We only want to look at the names that score the lowest on ratio (absolute value). And we already have the counts calculated, so we tell geom_bar that so it doesn’t try to do the counting itself.
topnames2 %>% filter(abs(ratio)<.3 & !is.na(ratio)) %>%
ggplot(aes(x=name, y=sextotal, group=sex, fill=sex)) +
geom_bar(stat="identity") +
labs(title="Popular Unisex Names", x="", y="Count")
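ggplot2 also provides geom_col, which is a shortcut for a bar chart whose heights are already in the data. A sketch on invented toy data, showing that the two calls build the same bars:

```r
library(ggplot2)

# Hypothetical toy data with counts already computed
toy <- data.frame(
  name = rep(c("Casey", "Riley"), each = 2),
  sex = rep(c("Female", "Male"), times = 2),
  sextotal = c(40, 60, 55, 45)
)

# Tell geom_bar not to count rows, but to use y as given
p_bar <- ggplot(toy, aes(x = name, y = sextotal, fill = sex)) +
  geom_bar(stat = "identity")

# geom_col uses the y values directly by default
p_col <- ggplot(toy, aes(x = name, y = sextotal, fill = sex)) +
  geom_col()

# The computed bar tops match
all.equal(layer_data(p_bar)$ymax, layer_data(p_col)$ymax)
```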
Plot proportions instead of counts. And let’s use slightly different colors. There are different ways to specify colors; below we use hex codes (see https://en.wikipedia.org/wiki/Web_colors):
topnames2 %>% filter(abs(ratio)<.3 & !is.na(ratio)) %>%
ggplot(aes(x=name, y=sextotal, group=sex, fill=sex)) +
geom_bar(stat="identity", position="fill") +
labs(title="Popular Unisex Names", x="", y="Proportion") +
geom_hline(yintercept=.5) +
scale_fill_manual(values=c("#ff33cc","#1e90ff"))
Lose the gray background with the minimal theme, and we don’t need vertical grid lines (the ones rising from the x-axis) for a bar chart:
ggplot(topnames2[abs(topnames2$ratio)<.3 & !is.na(topnames2$ratio),], aes(x=name, y=sextotal, group=sex, fill=sex)) +
geom_bar(stat="identity") +
labs(title="Popular Unisex Names", x="", y="Count") +
theme_minimal() +
theme(panel.grid.major.x = element_blank())
Note that the order of theme_minimal and theme matters. Whichever comes later takes precedence, so theme adjustments need to come after theme_minimal.
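A quick way to see the ordering effect is to build the plot both ways and look at the resulting theme settings. This is a sketch on made-up toy data; p_kept and p_lost are just illustrative names.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(name = c("Casey", "Riley"), sextotal = c(100, 100))
base <- ggplot(toy, aes(x = name, y = sextotal)) +
  geom_bar(stat = "identity")

# theme() after theme_minimal(): the grid line tweak survives
p_kept <- base + theme_minimal() +
  theme(panel.grid.major.x = element_blank())

# theme_minimal() after theme(): the complete theme replaces the tweak
p_lost <- base + theme(panel.grid.major.x = element_blank()) +
  theme_minimal()

# Compare the stored setting in each plot's theme
identical(p_kept$theme$panel.grid.major.x, element_blank())
identical(p_lost$theme$panel.grid.major.x, element_blank())
```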
We used theme_minimal above to style the plot. This is one of the themes built into ggplot2. There are other themes available in the ggthemes package. First, a base plot without themes:
initial_props %>%
filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year, color=sex)) +
geom_line() +
ggtitle("Baby Names Starting with A")
Then, the FiveThirtyEight theme:
initial_props %>%
filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year, color=sex)) +
geom_line() +
ggtitle("Baby Names Starting with A") +
theme_fivethirtyeight()
With the Stata theme (if you want to fool your audience?)
initial_props %>%
filter(initial=="A") %>%
ggplot(aes(y=totalprop, x=year, color=sex)) +
geom_line() +
ggtitle("Baby Names Starting with A") +
theme_stata()
See a list of available themes with examples at https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html
ggsave will save the last plot made. Or, you can save the plot as an object with a name and send it to ggsave. The file extension determines the image type.
ggsave("mybarchart.pdf")
or
mybarchart<- ggplot(topnames2[abs(topnames2$ratio)<.3 & !is.na(topnames2$ratio),], aes(x=name, y=sextotal, group=sex, fill=sex)) +
geom_bar(stat="identity") +
labs(title="Popular Unisex Names", x="", y="Count") +
theme_minimal() +
theme(panel.grid.major.x = element_blank())
ggsave("mybarchart.png", mybarchart)
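ggsave also takes arguments for the size of the output, which is usually worth setting explicitly. A minimal sketch, using a made-up toy plot and a temporary file so nothing in your working directory is overwritten:

```r
library(ggplot2)

# Hypothetical toy plot
toy <- data.frame(name = c("Casey", "Riley"), sextotal = c(100, 100))
p <- ggplot(toy, aes(x = name, y = sextotal)) +
  geom_bar(stat = "identity")

# Write to a temporary location; the .pdf extension picks the format
out_file <- file.path(tempdir(), "toy_barchart.pdf")
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
file.exists(out_file)
```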