Setup

These exercises use the gapminder data you may have read in during previous exercises. If not, you can read in the file or get the data from a package.

Option 1: Read data file

library(readr)  # read_csv() comes from the readr package
gapminder <- read_csv("https://goo.gl/dWrc9m")
# or
gapminder <- read_csv("data/gapminder5.csv")

Option 2: Install and Use Package

install.packages("gapminder")
library(gapminder)

The object we want from this package is gapminder. Loading the package makes the gapminder data frame available, but it can help to explicitly load it into the workspace:

data(gapminder)

Now gapminder should show up in the Environment tab in the upper right of RStudio. See the help for gapminder for a description of the data.

Hint

If, while you’re doing any of these exercises, you mess up your data frame, you can always reload the data and try again. That’s part of the reason you always want a complete script of what you do to your data: so you can re-run it all when you make a mistake.

Note

Also note that many of these exercises are a significant step up in difficulty from previous parts. Many require multiple steps or lines of code to complete. Think through the steps you want to take, then try to figure out how to do each step in R.


Exercise: Explore the Structure of the Dataset
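A few base-R calls cover most structure exploration; a sketch, assuming gapminder has been loaded by one of the options above:

```r
# Quick structure checks on a loaded gapminder data frame
str(gapminder)      # variable names, types, and a preview of values
dim(gapminder)      # number of rows and columns
summary(gapminder)  # per-variable summaries
head(gapminder)     # first few rows
```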

Exercise: Selecting and Filtering

Note: for cases where a tibble is the output, it’s ok if just the first rows show.

Exercise: Transforming Data

# not all of them, but these are the ones in gapminder
northamerica <- c("Canada", "Costa Rica", "Cuba", "Dominican Republic", 
                "El Salvador", "Guatemala", "Haiti", "Honduras",
                "Jamaica", "Mexico", "Nicaragua", "Panama",
                "Trinidad and Tobago", "United States")
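One common transformation with a lookup vector like this is a membership flag via %in%. A minimal sketch on toy data, with an abbreviated version of the vector and a stand-in for gapminder:

```r
# Abbreviated version of the northamerica vector, just for illustration
northamerica <- c("Canada", "Mexico", "United States")

toy <- data.frame(country = c("Canada", "Chile", "Mexico"),
                  stringsAsFactors = FALSE)
# TRUE where the country appears in the lookup vector
toy$in_north_america <- toy$country %in% northamerica
toy$in_north_america  # TRUE FALSE TRUE
```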

Exercise: Grouping and Summarizing

Calculate the average life expectancy per country. Which had the longest life expectancy and which had the shortest life expectancy?
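A sketch of the grouped-summary pattern with aggregate() on toy data; with the real data, replace toy with gapminder:

```r
toy <- data.frame(country = c("A", "A", "B", "B"),
                  lifeExp = c(50, 60, 70, 80))
# Mean life expectancy per country
avg <- aggregate(lifeExp ~ country, data = toy, FUN = mean)
avg[which.max(avg$lifeExp), ]  # longest life expectancy
avg[which.min(avg$lifeExp), ]  # shortest life expectancy
```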

Exercise: Dates, Aggregating

Working with data from the Chicago Park District on park reservations:

parks <- read.csv("data/chicagoparks.csv",
                  stringsAsFactors = FALSE)

Look at the first few rows of parks. Then convert the date columns to date objects with lubridate.

Use the month() and year() functions to make new variables for the parks data based on reservation start date.
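A sketch of the lubridate workflow; the column name reservation_start is hypothetical (check names(parks)), and you should pick ymd()/mdy()/dmy() to match how the dates are written in the file:

```r
library(lubridate)

# Parse one date string; mdy() assumes month/day/year order
d <- mdy("3/14/2016")
month(d)  # 3
year(d)   # 2016
# With the real data, something like:
# parks$start_month <- month(parks$reservation_start)
```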

Which month has the most reservations?

Which park has the most reservations?

Challenge: What is the most frequent combination of park and event type?
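table() plus sort() handles these counting questions; a sketch on a toy month vector standing in for the parks data:

```r
m <- c("Jun", "Jun", "Jul", "Aug", "Jun")  # stand-in for a month column
sort(table(m), decreasing = TRUE)[1]       # most common value and its count
# table() also accepts two vectors, which covers the park/event-type
# combination challenge: table(parks$park, parks$event_type)
```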

Exercise: Sorting, Adding and Removing Columns

We’re going to use data from the World Cup.

Prior to 2002, the following variables weren’t recorded consistently: Shots on goal, Shots wide, Free kicks, Offside, Corners.

worldcup <- read.csv("data/worldcup.csv", na.strings = c("", "-"),
                     stringsAsFactors = FALSE)

Set observations before 2002 for the variables that weren’t consistently coded to NA.
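Logical row indexing plus a vector of column names covers this step; a sketch on toy data (the column names here are illustrative; check names(worldcup)):

```r
toy <- data.frame(Year    = c(1998, 2006),
                  Corners = c(5, 7),
                  Offside = c(2, 3))
vars <- c("Corners", "Offside")   # the inconsistently recorded columns
# Blank out those columns for rows before 2002
toy[toy$Year < 2002, vars] <- NA
toy
```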

Make a new variable total shots that adds the two shot columns.

Make a new variable for shot accuracy that is the number of goals for divided by the total number of shots.

Strip the white space (hint: trimws()) from the Team and ISO code variables.

Drop the ISO code column.

Make a new variable with total goals scored by country. Sort it to find out which country has scored the most goals.

Sort the whole World Cup data frame by shot accuracy, so that the more accurate teams (one row per team and year) are at the top.
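order() is the base-R tool for sorting a data frame by a column; a sketch on toy data:

```r
toy <- data.frame(team     = c("A", "B", "C"),
                  accuracy = c(0.20, 0.50, 0.35))
# Reorder the rows so the highest accuracy comes first
toy[order(toy$accuracy, decreasing = TRUE), ]
```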

Exercise: Recoding/Replacing Values

Working with data on Chipotle online orders:

chipotle <- read.csv("data/chipotle.tsv", stringsAsFactors = FALSE, sep="\t",
                     na.strings = c("NULL"))
DT::datatable(chipotle, rownames = FALSE)

Convert the item_total_price column to a number so you can work with it.

  1. Replace the "$" with an empty string "". You can use the sub() function for this, or stringr::str_replace_all(). For the “pattern” argument in either function, use "\\$" as the pattern, because $ is a special character in string patterns. Alternatively, use substr() to exclude the first character in the string (nchar() is a function to get the number of characters in a string).

  2. Use the as.numeric function to convert the character data to a number.
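The two steps above can be sketched on a small character vector:

```r
x <- c("$10.98", "$1.09")

# Steps 1 and 2 with sub(): escape the $ because it is special in patterns
as.numeric(sub("\\$", "", x))       # 10.98  1.09

# Same result with substr(), dropping the first character
as.numeric(substr(x, 2, nchar(x)))  # 10.98  1.09
```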

Exercise: Aggregating

Using the transformed chipotle data from above, compute the following, using aggregate() where appropriate:

Exercise: Apply Functions

Load data on alcohol consumption by country:

alcohol <- read.csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv",
                    stringsAsFactors = FALSE,
                    na.strings = c("")) # NOT NA, because that is the code for North America

Use apply functions to help you find:

Exercise: t-test, correlation

Using the alcohol data and some of the variables you computed above, do North American countries drink less beer on average than South American countries?
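t.test() compares the two group means; a sketch with made-up beer_servings values for each region (with the real data, subset on the region variable you computed):

```r
na_beer <- c(240, 190, 158, 122)  # toy North American values
sa_beer <- c(333, 245, 130, 261)  # toy South American values

# alternative = "less": is the North American mean lower?
t.test(na_beer, sa_beer, alternative = "less")
```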

What are the pairwise correlations between consumption of beer, wine, and spirits?
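Passing a data frame of numeric columns to cor() gives all pairwise correlations at once; a sketch on toy data (with the real data, select the beer, wine, and spirits columns of alcohol):

```r
toy <- data.frame(beer    = c(100, 200, 300, 150),
                  wine    = c(50, 120, 180, 90),
                  spirits = c(80, 60, 200, 100))
cor(toy)  # 3 x 3 matrix of pairwise correlations
```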

Challenge Exercise: Create indicator variables

Following the example in the code in exercises/exerciseA.R, make indicator variables for all of the different ingredients in the choice_description column of the chipotle data. Use a linear regression model to predict item_total_price (normalize by quantity first) from the item type and the guacamole indicator, for burritos and bowls. How much does guacamole cost?

Then check this by computing the average price per item according to whether or not guacamole was included. Use the aggregate() function.

Challenge Exercise: Complex Group Summary

Find all countries where life expectancy increased by at least 60% from 1952 to 2007.

Hint: you might find the merge function useful. One way to approach this would be to make a data frame with the 1952 data, another with the 2007 data, and then merge them together. There are, however, other ways to proceed.
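The merge approach from the hint, sketched on toy data (country A grows from 30 to 50, a 67% increase, so it qualifies; country B does not):

```r
toy <- data.frame(country = rep(c("A", "B"), times = 2),
                  year    = rep(c(1952, 2007), each = 2),
                  lifeExp = c(30, 40, 50, 44))

early <- toy[toy$year == 1952, c("country", "lifeExp")]
late  <- toy[toy$year == 2007, c("country", "lifeExp")]
# suffixes distinguish the two lifeExp columns after the merge
both  <- merge(early, late, by = "country",
               suffixes = c("_1952", "_2007"))
both$country[both$lifeExp_2007 >= 1.6 * both$lifeExp_1952]  # "A"
```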

Challenge Exercise: Drops in Population

Find all countries that experienced a drop in population at any point in the timespan of the data.

Hint: look at the diff function. Note that diff will result in one fewer values in the result than in the original, so you’ll need to address that.
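A sketch of diff() on a toy population series, including the padding needed to line the result back up with the original rows:

```r
pop <- c(100, 120, 115, 130)  # toy population series for one country
diff(pop)                     # 20 -5 15: one fewer value than pop
any(diff(pop) < 0)            # TRUE: this country had a drop
# To align with the original rows, pad the front: c(NA, diff(pop))
```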