These exercises use the gapminder data you may have read in during previous exercises. If not, you can read in the file, or you could get the data from a package.
library(readr)  # read_csv() comes from the readr package
gapminder <- read_csv("https://goo.gl/dWrc9m")
# or
gapminder <- read_csv("data/gapminder5.csv")
install.packages("gapminder")
library(gapminder)
The object we want from this package is gapminder. The gapminder data frame is available for you to use, but it can help to explicitly load it into the workspace:
data(gapminder)
Now gapminder should show up in the Environment tab in the upper right of RStudio. See the help for gapminder for a description of the data.
If, while you’re doing any of these exercises, you mess up your data frame, you can always reload the data and try again. That’s part of the reason you always want a complete script of what you do to your data – so you can re-run it all when you make a mistake.
Also note that many of these exercises are a significant step up in difficulty from previous parts. Many require multiple steps or lines of code to complete. Think through the steps you want to take, then try to figure out how to do each step in R.
names(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
sum(is.na(gapminder)) # count any missing values in entire data set
## [1] 0
length(unique(gapminder$country)) # number of countries
## [1] 142
length(levels(gapminder$country)) # alt. to get num countries, since it's a factor
## [1] 142
You could also look in the Environment tab in the upper right window in RStudio to see the number of levels for country.
To find the continent for the US, we need to see how the US appears in the data. The easiest option is probably to print the names of all of the countries and look at them, since there aren’t that many.
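One quick way to scan the country names; a sketch assuming gapminder is loaded as above:

```r
# print the distinct country names and look for how the US is coded
sort(unique(as.character(gapminder$country)))
```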
# continent for US
gapminder$continent[gapminder$country=="United States"][1] ## [1] because we just need 1 value
## [1] Americas
## Levels: Africa Americas Asia Europe Oceania
# number of countries per year
tapply(gapminder$country, gapminder$year, length)
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 142 142 142 142 142 142 142 142 142 142 142 142
Note: where a tibble is the output, it’s OK if only the first rows show.
gapminder[gapminder$lifeExp > 80,]
## # A tibble: 21 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 2002 80.4 19546792 30688.
## 2 Australia Oceania 2007 81.2 20434176 34435.
## 3 Canada Americas 2007 80.7 33390141 36319.
## 4 France Europe 2007 80.7 61083916 30470.
## 5 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 7 Iceland Europe 2002 80.5 288030 31163.
## 8 Iceland Europe 2007 81.8 301931 36181.
## 9 Israel Asia 2007 80.7 6426679 25523.
## 10 Italy Europe 2002 80.2 57926999 27968.
## # ... with 11 more rows
gapminder[gapminder$country=='Kenya' & gapminder$year < 1970,
c("pop", "gdpPercap")]
## # A tibble: 4 x 2
## pop gdpPercap
## <int> <dbl>
## 1 6464046 854.
## 2 7454779 944.
## 3 8678557 897.
## 4 10191512 1057.
gapminder[gapminder$lifeExp == max(gapminder$lifeExp),]
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
Make a new variable in gapminder that is the population in millions. Recode continent so that the Americas are split into North America and South America. Make a new data frame gapminder_max that has only one observation per country, with the maximum value that each country has had over time for population, life expectancy, and GDP per capita. Hint: use aggregate, and note that the first argument (the data) needs to contain only the columns you want to compute a summary measure on.
# not all of them, but these are the ones in gapminder
northamerica <- c("Canada", "Costa Rica", "Cuba", "Dominican Republic",
"El Salvador", "Guatemala", "Haiti", "Honduras",
"Jamaica", "Mexico", "Nicaragua", "Panama",
"Trinidad and Tobago", "United States")
gapminder$popmillion <- gapminder$pop/1000000
gapminder$continent <- as.character(gapminder$continent)
gapminder$continent[gapminder$country %in% northamerica] <-
"North America"
gapminder$continent[gapminder$continent == "Americas"] <-
"South America"
gapminder$continent <- factor(gapminder$continent)
gapminder_max <- aggregate(gapminder[,c("pop","lifeExp","gdpPercap")],
by=list(country=gapminder$country,
continent=gapminder$continent), # to keep continent in the data
max)
Calculate the average life expectancy per country. Which had the longest life expectancy and which had the shortest life expectancy?
# aggregate will keep country name, whereas tapply wouldn't
meanLifeExp <- aggregate(gapminder$lifeExp,
list(country=gapminder$country),
mean)
names(meanLifeExp)
## [1] "country" "x"
meanLifeExp[meanLifeExp$x == max(meanLifeExp$x),]
## country x
## 58 Iceland 76.51142
meanLifeExp[meanLifeExp$x == min(meanLifeExp$x),]
## country x
## 113 Sierra Leone 36.76917
# or, with special functions
meanLifeExp[which.max(meanLifeExp$x),]
## country x
## 58 Iceland 76.51142
meanLifeExp[which.min(meanLifeExp$x),]
## country x
## 113 Sierra Leone 36.76917
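As the comment above notes, tapply drops the data frame structure, but it returns a named vector, so the country can still be recovered through names(); a sketch assuming gapminder is loaded:

```r
# tapply gives a named vector of means; names() holds the countries
meanLE <- tapply(gapminder$lifeExp, gapminder$country, mean)
names(meanLE)[which.max(meanLE)]  # longest average life expectancy
names(meanLE)[which.min(meanLE)]  # shortest average life expectancy
```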
Working with data from the Chicago Park District on park reservations:
parks <- read.csv("data/chicagoparks.csv",
stringsAsFactors = FALSE)
Look at the first few rows of parks. Then convert the date columns to date objects with lubridate.
Use the month() and year() functions to make new variables for the parks data based on reservation start date.
Which month has the most reservations?
Which park has the most reservations?
Challenge: What is the most frequent combination of park and event type?
head(parks)
## Requestor Organization Park.Number
## 1 Mike Gianfortune Kelly Park <NA> 260
## 2 Timothy Gibson Mather Park <NA> 241
## 3 Michael Brown Clarendon Park <NA> 1002
## 4 Jennifer Pabst Norwood Park <NA> 141
## 5 Lisa Ragucci Winnemac Park <NA> 486
## 6 Dan Ryan Wentworth Park <NA> 238
## Park.Facility.Name Reservation.Start.Date
## 1 Kelly Natatorium within Kelly School 01/01/2015
## 2 Mather Natatorium within Mather School 01/01/2015
## 3 Clarendon Pool within Uplift School 01/01/2015
## 4 Norwood Pool within Taft School 01/01/2015
## 5 Winnemac Pool within Amundsen School 01/01/2015
## 6 Wentworth Pool within Kennedy School 01/01/2015
## Reservation.End.Date Event.Type
## 1 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 2 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 3 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 4 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 5 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 6 01/01/2015 Administrative Reservation - Internal Hold PARKS
## Event.Description Permit.Status
## 1 CPS Pool Closure Approved
## 2 Mather CPS Pool Closure Approved
## 3 Uplift CPS Pool Closure Approved
## 4 Taft High School CPS Pool Closure #4478 Approved
## 5 CPS Pool Closure Approved
## 6 Kennedy Pool CPS Pool Closure Approved
library(lubridate)
parks$Reservation.Start.Date <- mdy(parks$Reservation.Start.Date)
parks$Reservation.End.Date <- mdy(parks$Reservation.End.Date)
parks$month <- month(parks$Reservation.Start.Date)
parks$year <- year(parks$Reservation.Start.Date)
sort(table(parks$month))
##
## 2 12 1 3 11 4 5 10 9 6 8 7
## 818 844 868 1508 1608 2270 3388 3622 4594 6387 8742 13722
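If numeric month codes are hard to read, lubridate’s month() can return labelled months instead; a sketch assuming parks is loaded as above:

```r
# label = TRUE returns an ordered factor of month names
parks$month_name <- month(parks$Reservation.Start.Date, label = TRUE)
sort(table(parks$month_name))
```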
head(sort(table(parks$Park.Facility.Name), decreasing = TRUE))
##
## Lakeshore East Park
## 2141
## Lincoln Harbor - Diversey Event Space
## 1101
## Garfield Conservatory All Show Rooms
## 951
## Lincoln Picnic Grove 16
## 719
## 31st St. Harbor
## 611
## Lincoln Harbor - Belmont Event Space
## 586
# challenge: there are other ways to do this
park_type <- aggregate(parks[,1], by=list(park=parks$Park.Facility.Name,
event=parks$Event.Type),
length)
head(park_type[order(park_type$x, decreasing=TRUE),])
## park
## 1305 Lakeshore East Park
## 2077 Garfield Conservatory All Show Rooms
## 1313 Lincoln Harbor - Diversey Event Space
## 1265 31st St. Harbor
## 1314 Lincoln LaSalle/Clark/Stockton
## 485 Berger Game Room
## event x
## 1305 Permit - Athletic Training 2105
## 2077 Permit - Media: Non-Commercial Photography 921
## 1313 Permit - Athletic Training 802
## 1265 Permit - Athletic Training 562
## 1314 Permit - Athletic Training 553
## 485 Administrative Reservation - Park Legacy Items 538
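Another way to tackle the challenge is a cross-tabulation; a sketch assuming parks is loaded as above:

```r
# cross-tabulate park by event type, then locate the largest cell
tab <- table(parks$Park.Facility.Name, parks$Event.Type)
which(tab == max(tab), arr.ind = TRUE)  # row/column of the most frequent combination
max(tab)                                # its count
```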
We’re going to use data from the World Cup.
Prior to 2002, the following variables weren’t recorded consistently: Shots on goal, Shots wide, Free kicks, Offside, Corners.
worldcup <- read.csv("data/worldcup.csv",na.strings = c("", "-"),
stringsAsFactors = FALSE)
Set observations before 2002 for the variables that weren’t consistently coded to NA.
Make a new variable total shots that adds the two shot columns.
Make a new variable for shot accuracy that is the number of goals for divided by the total number of shots.
Strip the white space (hint: trimws) from the Team and ISO code variables.
Drop the ISO code column.
Make a new variable with total goals scored by country. Sort it to find out which country has scored the most goals.
Sort the whole world cup data frame by shot accuracy, so that more accurate teams (team, year) are at the top.
worldcup$Shots.on.goal[worldcup$Year < 2002] <- NA
worldcup$Shots.wide[worldcup$Year < 2002] <- NA
worldcup$Free.kicks[worldcup$Year < 2002] <- NA
worldcup$Offside[worldcup$Year < 2002] <- NA
worldcup$Corners[worldcup$Year < 2002] <- NA
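Repeating the same assignment five times invites typos; a loop over the affected column names gives the same result. A sketch assuming worldcup is loaded as above:

```r
# same result as the five lines above, written as a loop
inconsistent <- c("Shots.on.goal", "Shots.wide", "Free.kicks",
                  "Offside", "Corners")
for (v in inconsistent) {
  worldcup[worldcup$Year < 2002, v] <- NA
}
```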
worldcup$totalshots <- worldcup$Shots.on.goal + worldcup$Shots.wide
worldcup$shot_accuracy <- worldcup$Goals.for/worldcup$totalshots
worldcup$Team <- trimws(worldcup$Team)
worldcup$ISO.code <- trimws(worldcup$ISO.code)
worldcup <- worldcup[,-3]
country_goals <- tapply(worldcup$Goals.for, worldcup$Team, sum)
sort(country_goals)
## Canada China PR Dutch East Indies
## 0 0 0
## Greece Trinidad and Tobago Zaire
## 0 0 0
## Angola Bolivia El Salvador
## 1 1 1
## Iraq Israel Togo
## 1 1 1
## Haiti Honduras Kuwait
## 2 2 2
## New Zealand Serbia and Montenegro Slovenia
## 2 2 2
## United Arab Emirates Czech Republic Egypt
## 2 3 3
## Jamaica Ghana Wales
## 3 4 4
## Australia Côte d'Ivoire Cuba
## 5 5 5
## German DR Korea DPR Ukraine
## 5 5 5
## Algeria Iran Ecuador
## 6 6 7
## Norway Senegal Japan
## 7 7 8
## South Africa Tunisia Saudi Arabia
## 8 8 9
## Republic of Ireland Russia Costa Rica
## 10 11 12
## Morocco Northern Ireland Colombia
## 12 13 14
## Nigeria Cameroon Croatia
## 14 15 15
## Peru Turkey Bulgaria
## 19 20 22
## Korea Republic Denmark Scotland
## 22 24 25
## Paraguay USA Romania
## 27 27 30
## Chile Portugal Switzerland
## 31 32 37
## Austria Czechoslovakia Poland
## 43 44 44
## Belgium Mexico Soviet Union
## 46 48 53
## Germany Netherlands Yugoslavia
## 59 59 60
## Uruguay England Sweden
## 65 74 74
## Spain Hungary France
## 80 87 95
## Argentina Italy Germany FR
## 113 122 131
## Brazil
## 201
head(worldcup[order(worldcup$shot_accuracy, decreasing = TRUE),])
## Year Team Goals.for Goals.against Penalties Matches Shots.on.goal
## 298 2002 Brazil 18 4 2 7 54
## 332 2006 Argentina 11 3 0 5 30
## 307 2002 Portugal 6 4 0 3 17
## 313 2002 Denmark 5 5 1 4 15
## 301 2002 Turkey 10 6 0 7 32
## 349 2006 Tunisia 3 6 0 3 8
## Shots.wide Free.kicks Offside Corners Won Drawn Lost totalshots
## 298 39 0 26 34 7 0 0 93
## 332 27 3 24 28 3 2 0 57
## 307 16 0 7 12 1 0 2 33
## 313 14 0 9 19 2 1 1 29
## 301 30 0 21 39 4 1 2 62
## 349 11 3 12 7 0 1 2 19
## shot_accuracy
## 298 0.1935484
## 332 0.1929825
## 307 0.1818182
## 313 0.1724138
## 301 0.1612903
## 349 0.1578947
Working with data on Chipotle online orders:
chipotle <- read.csv("data/chipotle.tsv", stringsAsFactors = FALSE, sep="\t",
na.strings = c("NULL"))
DT::datatable(chipotle, rownames = FALSE)
Convert the item_total_price column to a number so you can work with it.
Replace the "$" with an empty string "". You can use the sub() function for this, or stringr::str_replace_all(). For the “pattern” argument in either function, use "\\$" as the pattern, because $ is a special character in string patterns. Or you could use substr() to exclude the first character in the string (nchar() is a function to get the number of characters in a string).
Use the as.numeric function to convert the character data to a number.
chipotle$item_total_price <- sub("\\$", "", chipotle$item_total_price)
chipotle$item_total_price <- as.numeric(chipotle$item_total_price)
Alternatives:
chipotle$item_total_price <- stringr::str_replace_all(chipotle$item_total_price, "\\$", "")
chipotle$item_total_price <- as.numeric(chipotle$item_total_price)
chipotle$item_total_price <- substr(chipotle$item_total_price, 2, nchar(chipotle$item_total_price))
chipotle$item_total_price <- as.numeric(chipotle$item_total_price)
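If the readr package is available (it was used for read_csv earlier), parse_number() strips the "$" and converts in one step; a sketch applied to the original character column:

```r
# parse_number() drops non-numeric prefixes like "$" and returns a double
chipotle$item_total_price <- readr::parse_number(chipotle$item_total_price)
```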
Using the transformed chipotle data from above, compute the following (use aggregate where appropriate). Hint: if you use tapply(), you may want to reference the names of the vector.
total_per_order <- aggregate(chipotle$item_total_price, by=list(chipotle$order_id), sum)
max(total_per_order$x)
total_per_order[which.max(total_per_order$x),]
chipotle[chipotle$order_id == total_per_order[which.max(total_per_order$x), 1], ]
items_per_order <- aggregate(chipotle$quantity, by=list(chipotle$order_id), sum)
items_per_order[which.max(items_per_order$x),]
chipotle[chipotle$order_id == items_per_order[which.max(items_per_order$x), 1], ]
Load data on alcohol consumption by country:
alcohol <- read.csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv",
stringsAsFactors = FALSE,
na.strings = c("")) # NOT NA, because that is the code for North America
Use apply functions to help you find:
avg_beer <- tapply(alcohol$beer_servings, alcohol$continent, mean)
sort(avg_beer)
## AS AF OC NA SA EU
## 37.04545 61.47170 89.68750 145.43478 175.08333 193.77778
avg_wine <- tapply(alcohol$wine_servings, alcohol$continent, mean)
sort(avg_wine)
## AS AF NA OC SA EU
## 9.068182 16.264151 24.521739 35.625000 62.416667 142.222222
alcohol$favorite <- apply(alcohol[, 2:4], 1, which.max)
alcohol$favorite <- c("beer", "spirits", "wine")[alcohol$favorite] # there are other ways to do this recoding
table(alcohol$favorite)
##
## beer spirits wine
## 114 59 20
# Challenge
library(raster)
favorite_type <- tapply(alcohol$favorite, alcohol$continent, modal)
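If you would rather avoid the raster dependency, a small helper computes the modal value in base R; a sketch assuming alcohol is loaded as above (stat_mode is a made-up helper name):

```r
# most frequent value in a vector (first value wins on ties)
stat_mode <- function(x) names(which.max(table(x)))
favorite_type <- tapply(alcohol$favorite, alcohol$continent, stat_mode)
```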
Using the alcohol data and some of the variables you computed above, do North American countries drink less beer on average than South American countries?
What are the pairwise correlations between consumption of beer, wine, and spirits?
t.test(alcohol$beer_servings[alcohol$continent=="NA"],
alcohol$beer_servings[alcohol$continent=="SA"])
##
## Welch Two Sample t-test
##
## data: alcohol$beer_servings[alcohol$continent == "NA"] and alcohol$beer_servings[alcohol$continent == "SA"]
## t = -1.1809, df = 26.682, p-value = 0.2481
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -81.19221 21.89510
## sample estimates:
## mean of x mean of y
## 145.4348 175.0833
cor(alcohol[,2:4])
## beer_servings spirit_servings wine_servings
## beer_servings 1.0000000 0.4588189 0.5271717
## spirit_servings 0.4588189 1.0000000 0.1947970
## wine_servings 0.5271717 0.1947970 1.0000000
Following the example in the code in exercises/exerciseA.R, make indicator variables for all of the different ingredients in the choice_description column of the chipotle data. Use a linear regression model to predict item_total_price (normalize by quantity first) from the item type and an indicator for guacamole, for burritos and bowls only. How much does guacamole cost?
Then check this by getting the average price per item according to whether or not guacamole was included. Use the aggregate function.
There are a few ingredients that seem to be duplicates of each other with slightly different names, but we’re going to ignore that for now.
This is just one answer. There are other ways to do this, especially if you use packages and functions we didn’t cover.
library(qdapTools)
ingredients <- mtabulate(strsplit(chipotle$choice_description, ", "))
chipotle <- cbind(chipotle, ingredients)
chipotle$burrito_or_bowl <- stringr::str_detect(chipotle$item_name, "Burrito|Bowl")
chipotle$single_price <- chipotle$item_total_price/chipotle$quantity
# regression
lm(single_price ~ item_name + Guacamole, data=chipotle, subset=chipotle$burrito_or_bowl)
##
## Call:
## lm(formula = single_price ~ item_name + Guacamole, data = chipotle,
## subset = chipotle$burrito_or_bowl)
##
## Coefficients:
## (Intercept) item_nameBarbacoa Burrito
## 9.153200 0.012563
## item_nameBarbacoa Salad Bowl item_nameBowl
## 0.222933 -3.017067
## item_nameBurrito item_nameCarnitas Bowl
## -2.595778 -0.027133
## item_nameCarnitas Burrito item_nameCarnitas Salad Bowl
## 0.008716 0.218311
## item_nameChicken Bowl item_nameChicken Burrito
## -0.477320 -0.500021
## item_nameChicken Salad Bowl item_nameSteak Bowl
## -0.416058 0.001179
## item_nameSteak Burrito item_nameSteak Salad Bowl
## -0.032119 0.218630
## item_nameVeggie Bowl item_nameVeggie Burrito
## -0.479530 -0.534843
## item_nameVeggie Salad Bowl Guacamole
## -0.418608 2.527734
# aggregate
avg_costs <- aggregate(chipotle$single_price[chipotle$burrito_or_bowl],
by=list(item=chipotle$item_name[chipotle$burrito_or_bowl],
guac=chipotle$Guacamole[chipotle$burrito_or_bowl]), mean)
avg_costs <- avg_costs[order(avg_costs$item, avg_costs$guac),]
guac_diff <- avg_costs[seq(2, nrow(avg_costs), 2),3]-avg_costs[seq(1, nrow(avg_costs), 2),3]
mean(guac_diff)
## [1] 2.234194
Find all countries where life expectancy increased by at least 60% from 1952 to 2007.
Hint: you might find the merge function useful. One way to approach this would be to make a data frame with the 1952 data, another with the 2007 data, and then merge them together. There are, however, other ways to proceed.
# one way, there are others
y1952 <- gapminder[gapminder$year==1952, c("country","lifeExp")]
y2007 <- gapminder[gapminder$year==2007, c("country","lifeExp")]
bothyears <- merge(y1952, y2007, by="country")
bothyears$pctIncrease <- (bothyears$lifeExp.y - bothyears$lifeExp.x)/bothyears$lifeExp.x
bothyears[bothyears$pctIncrease >= .6,]
## country lifeExp.x lifeExp.y pctIncrease
## 3 Algeria 43.077 72.301 0.6784131
## 9 Bangladesh 37.484 64.062 0.7090492
## 12 Bolivia 40.414 65.554 0.6220617
## 17 Burkina Faso 31.975 52.295 0.6354965
## 25 China 44.000 72.961 0.6582045
## 27 Comoros 40.715 65.152 0.6001965
## 39 Egypt 41.893 71.338 0.7028621
## 42 Eritrea 35.928 58.040 0.6154531
## 47 Gambia 30.000 59.448 0.9816000
## 51 Guatemala 42.023 70.259 0.6719178
## 52 Guinea 33.609 56.007 0.6664286
## 54 Haiti 37.579 60.916 0.6210117
## 55 Honduras 41.912 70.198 0.6748902
## 59 India 37.373 64.698 0.7311428
## 60 Indonesia 37.468 70.650 0.8856091
## 68 Jordan 43.158 72.535 0.6806849
## 71 Korea, Rep. 47.453 78.623 0.6568605
## 76 Libya 42.723 73.952 0.7309646
## 77 Madagascar 36.681 59.443 0.6205392
## 80 Mali 33.685 54.467 0.6169512
## 86 Morocco 42.873 71.164 0.6598792
## 88 Myanmar 36.319 62.069 0.7089953
## 90 Nepal 36.157 63.785 0.7641121
## 93 Nicaragua 42.314 72.899 0.7228104
## 97 Oman 37.578 75.640 1.0128799
## 101 Peru 43.902 71.421 0.6268279
## 110 Saudi Arabia 39.875 72.777 0.8251285
## 111 Senegal 37.278 63.062 0.6916680
## 125 Syria 45.883 74.143 0.6159144
## 131 Tunisia 44.600 73.923 0.6574664
## 132 Turkey 43.585 71.777 0.6468280
## 138 Vietnam 40.412 74.249 0.8373008
## 139 West Bank and Gaza 43.160 73.422 0.7011585
## 140 Yemen, Rep. 32.548 62.698 0.9263242
Find all countries that experienced a drop in population at any point in the timespan of the data.
Hint: look at the diff function. Note that diff will result in one fewer values in the result than in the original, so you’ll need to address that.
gapminder <- gapminder[order(gapminder$country, gapminder$year),] # make sure in order for diffs
gapminder$popdiff <- c(NA, diff(gapminder$pop, lag=1))
# We lagged across countries: get rid of first (1952) observation for each country,
# which doesn't have anything previous to compare to
gapminder$popdiff[gapminder$year == 1952] <- NA
gapminder[!is.na(gapminder$popdiff) & gapminder$popdiff < 0,]
## # A tibble: 46 x 8
## country continent year lifeExp pop gdpPercap popmillion popdiff
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <int>
## 1 Afghanistan Asia 1982 39.9 1.29e7 978. 12.9 -2.00e6
## 2 Bosnia and… Europe 1992 72.2 4.26e6 2547. 4.26 -8.30e4
## 3 Bosnia and… Europe 1997 73.2 3.61e6 4766. 3.61 -6.49e5
## 4 Bulgaria Europe 1992 71.2 8.66e6 6303. 8.66 -3.13e5
## 5 Bulgaria Europe 1997 70.3 8.07e6 5970. 8.07 -5.92e5
## 6 Bulgaria Europe 2002 72.1 7.66e6 7697. 7.66 -4.04e5
## 7 Bulgaria Europe 2007 73.0 7.32e6 10681. 7.32 -3.39e5
## 8 Cambodia Asia 1977 31.2 6.98e6 525. 6.98 -4.72e5
## 9 Croatia Europe 1997 73.7 4.44e6 9876. 4.44 -4.94e4
## 10 Czech Repu… Europe 1997 74.0 1.03e7 16049. 10.3 -1.50e4
## # ... with 36 more rows
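An alternative that avoids the lag bookkeeping entirely: compute the year-over-year differences within each country. A sketch assuming gapminder is sorted by country and year as above:

```r
# TRUE for each country whose population ever fell between consecutive years
dropped <- tapply(gapminder$pop, gapminder$country, function(p) any(diff(p) < 0))
names(dropped)[dropped]  # countries with at least one population drop
```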