Setup

These exercises use the gapminder data you may have read in during previous exercises. If not, you can read in the file or get the data from a package.

Option 1: Read data file

library(readr)  # read_csv() comes from the readr package
gapminder <- read_csv("https://goo.gl/dWrc9m")
# or
gapminder <- read_csv("data/gapminder5.csv")

Option 2: Install and Use Package

install.packages("gapminder")
library(gapminder)

The object we want from this package is gapminder. Loading the package makes the gapminder data frame available, but it can help to explicitly load it into the workspace:

data(gapminder)

Now gapminder should show up in the Environment tab in the upper right of RStudio. See the help for gapminder for a description of the data.

Hint

If, while you’re doing any of these exercises, you mess up your data frame, you can always reload the data and try again. That’s part of the reason you always want a complete script of what you do to your data: so you can re-run it all when you make a mistake.

Note

Also note that many of these exercises are a significant step up in difficulty from previous parts. Many require multiple steps or lines of code to complete. Think through the steps you want to take, then try to figure out how to do each step in R.


Exercise: Explore the Structure of the Dataset
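A few base-R calls cover most structure exploration; a sketch, assuming gapminder has been loaded by one of the options above:

```r
# Quick structure checks on a loaded gapminder data frame
str(gapminder)      # variable names, types, and a preview of values
dim(gapminder)      # number of rows and columns
summary(gapminder)  # per-variable summaries
head(gapminder)     # first few rows
```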

Exercise: Selecting and Filtering

Note: for cases where a tibble is the output, it’s ok if just the first rows show.

Exercise: Transforming Data

# not all of them, but these are the ones in gapminder
northamerica <- c("Canada", "Costa Rica", "Cuba", "Dominican Republic", 
                "El Salvador", "Guatemala", "Haiti", "Honduras",
                "Jamaica", "Mexico", "Nicaragua", "Panama",
                "Trinidad and Tobago", "United States")
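One common transformation with a lookup vector like this is a membership flag via %in%. A minimal sketch on toy data, with an abbreviated version of the vector and a stand-in for gapminder:

```r
# Abbreviated version of the northamerica vector, just for illustration
northamerica <- c("Canada", "Mexico", "United States")

toy <- data.frame(country = c("Canada", "Chile", "Mexico"),
                  stringsAsFactors = FALSE)
# TRUE where the country appears in the lookup vector
toy$in_north_america <- toy$country %in% northamerica
toy$in_north_america  # TRUE FALSE TRUE
```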

Exercise: Grouping and Summarizing

Calculate the average life expectancy per country. Which had the longest life expectancy and which had the shortest life expectancy?
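A sketch of the grouped-summary pattern with aggregate() on toy data; with the real data, replace toy with gapminder:

```r
toy <- data.frame(country = c("A", "A", "B", "B"),
                  lifeExp = c(50, 60, 70, 80))
# Mean life expectancy per country
avg <- aggregate(lifeExp ~ country, data = toy, FUN = mean)
avg[which.max(avg$lifeExp), ]  # longest life expectancy
avg[which.min(avg$lifeExp), ]  # shortest life expectancy
```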

Exercise: Dates, Aggregating

Working with data from the Chicago Park District on park reservations:

parks <- read.csv("data/chicagoparks.csv",
                  stringsAsFactors = FALSE)

Look at the first few rows of parks. Then convert the date columns to date objects with lubridate.

Use the month() and year() functions to make new variables for the parks data based on reservation start date.
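A sketch of the lubridate workflow; the column name reservation_start is hypothetical (check names(parks)), and you should pick ymd()/mdy()/dmy() to match how the dates are written in the file:

```r
library(lubridate)

# Parse one date string; mdy() assumes month/day/year order
d <- mdy("3/14/2016")
month(d)  # 3
year(d)   # 2016
# With the real data, something like:
# parks$start_month <- month(parks$reservation_start)
```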

Which month has the most reservations?

Which park has the most reservations?

Challenge: What is the most frequent combination of park and event type?
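table() plus sort() handles these counting questions; a sketch on a toy month vector standing in for the parks data:

```r
m <- c("Jun", "Jun", "Jul", "Aug", "Jun")  # stand-in for a month column
sort(table(m), decreasing = TRUE)[1]       # most common value and its count
# table() also accepts two vectors, which covers the park/event-type
# combination challenge: table(parks$park, parks$event_type)
```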

Exercise: Sorting, Adding and Removing Columns

We’re going to use data from the World Cup.

Prior to 2002, the following variables weren’t recorded consistently: Shots on goal, Shots wide, Free kicks, Offside, Corners.

worldcup <- read.csv("data/worldcup.csv", na.strings = c("", "-"),
                     stringsAsFactors = FALSE)

Set observations before 2002 for the variables that weren’t consistently coded to NA.
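Logical row indexing plus a vector of column names covers this step; a sketch on toy data (the column names here are illustrative; check names(worldcup)):

```r
toy <- data.frame(Year    = c(1998, 2006),
                  Corners = c(5, 7),
                  Offside = c(2, 3))
vars <- c("Corners", "Offside")   # the inconsistently recorded columns
# Blank out those columns for rows before 2002
toy[toy$Year < 2002, vars] <- NA
toy
```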

Make a new variable total shots that adds the two shot columns.

Make a new variable for shot accuracy that is the number of goals for divided by the total number of shots.

Strip the white space (hint: trimws()) from the Team and ISO code variables.

Drop the ISO code column.

Make a new variable with total goals scored by country. Sort it to find out which country has scored the most goals.

Sort the whole World Cup data frame by shot accuracy, so that the more accurate teams (one row per team and year) are at the top.
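order() is the base-R tool for sorting a data frame by a column; a sketch on toy data:

```r
toy <- data.frame(team     = c("A", "B", "C"),
                  accuracy = c(0.20, 0.50, 0.35))
# Reorder the rows so the highest accuracy comes first
toy[order(toy$accuracy, decreasing = TRUE), ]
```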

Exercise: Recoding/Replacing Values

Working with data on Chipotle online orders:

chipotle <- read.csv("data/chipotle.tsv", stringsAsFactors = FALSE, sep="\t",
                     na.strings = c("NULL"))
DT::datatable(chipotle, rownames = FALSE)

Convert the item_total_price column to a number so you can work with it.

  1. Replace the "$" with an empty string "". You can use the sub() function for this, or stringr::str_replace_all(). For the “pattern” argument in either function, use "\\$" as the pattern, because $ is a special character in string patterns. Alternatively, use substr() to exclude the first character in the string (nchar() is a function to get the number of characters in a string).

  2. Use the as.numeric function to convert the character data to a number.
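The two steps above can be sketched on a small character vector:

```r
x <- c("$10.98", "$1.09")

# Steps 1 and 2 with sub(): escape the $ because it is special in patterns
as.numeric(sub("\\$", "", x))       # 10.98  1.09

# Same result with substr(), dropping the first character
as.numeric(substr(x, 2, nchar(x)))  # 10.98  1.09
```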

Exercise: Aggregating

Using the transformed chipotle data from above, compute the following, using aggregate() where appropriate:

Exercise: Apply Functions

Load data on alcohol consumption by country:

alcohol <- read.csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv",
                    stringsAsFactors = FALSE,
                    na.strings = c("")) # NOT NA, because that is the code for North America

Use apply functions to help you find:

Exercise: t-test, correlation

Using the alcohol data and some of the variables you computed above, do North American countries drink less beer on average than South American countries?
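t.test() compares the two group means; a sketch with made-up beer_servings values for each region (with the real data, subset on the region variable you computed):

```r
na_beer <- c(240, 190, 158, 122)  # toy North American values
sa_beer <- c(333, 245, 130, 261)  # toy South American values

# alternative = "less": is the North American mean lower?
t.test(na_beer, sa_beer, alternative = "less")
```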

What are the pairwise correlations between consumption of beer, wine, and spirits?
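Passing a data frame of numeric columns to cor() gives all pairwise correlations at once; a sketch on toy data (with the real data, select the beer, wine, and spirits columns of alcohol):

```r
toy <- data.frame(beer    = c(100, 200, 300, 150),
                  wine    = c(50, 120, 180, 90),
                  spirits = c(80, 60, 200, 100))
cor(toy)  # 3 x 3 matrix of pairwise correlations
```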

Challenge Exercise: Create indicator variables

Following the example in the code in exercises/exerciseA.R, make indicator variables for all of the different ingredients in the choice_description column of the chipotle data. Use a linear regression model to predict item_total_price (normalize by quantity first) from the item type and the guacamole indicator, for burritos and bowls. How much does guacamole cost?

Then check this by computing the average price per item according to whether or not guacamole was included. Use the aggregate() function.

Challenge Exercise: Complex Group Summary

Find all countries where life expectancy increased by at least 60% from 1952 to 2007.

Hint: you might find the merge function useful. One way to approach this would be to make a data frame with the 1952 data, another with the 2007 data, and then merge them together. There are, however, other ways to proceed.
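The merge approach from the hint, sketched on toy data (country A grows from 30 to 50, a 67% increase, so it qualifies; country B does not):

```r
toy <- data.frame(country = rep(c("A", "B"), times = 2),
                  year    = rep(c(1952, 2007), each = 2),
                  lifeExp = c(30, 40, 50, 44))

early <- toy[toy$year == 1952, c("country", "lifeExp")]
late  <- toy[toy$year == 2007, c("country", "lifeExp")]
# suffixes distinguish the two lifeExp columns after the merge
both  <- merge(early, late, by = "country",
               suffixes = c("_1952", "_2007"))
both$country[both$lifeExp_2007 >= 1.6 * both$lifeExp_1952]  # "A"
```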

Challenge Exercise: Drops in Population

Find all countries that experienced a drop in population at any point in the timespan of the data.

Hint: look at the diff function. Note that diff will result in one fewer values in the result than in the original, so you’ll need to address that.
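A sketch of diff() on a toy population series, including the padding needed to line the result back up with the original rows:

```r
pop <- c(100, 120, 115, 130)  # toy population series for one country
diff(pop)                     # 20 -5 15: one fewer value than pop
any(diff(pop) < 0)            # TRUE: this country had a drop
# To align with the original rows, pad the front: c(NA, diff(pop))
```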