These exercises use the gapminder data you may have read in during previous exercises. If not, you can read in the file, or you could get the data from a package.
library(readr)  # read_csv() comes from the readr package
gapminder <- read_csv("https://goo.gl/dWrc9m")
# or
gapminder <- read_csv("data/gapminder5.csv")
install.packages("gapminder")
library(gapminder)
The object we want from this package is gapminder. The gapminder data frame is available for you to use, but it can help to explicitly load it into the workspace:
data(gapminder)
Now gapminder should show up in the Environment tab in the upper right of RStudio. See the help for gapminder for a description of the data.
If, while you’re doing any of these exercises, you mess up your data frame, you can always reload the data and try again. That’s part of the reason you always want a complete script of what you do to your data – so you can re-run it all when you make a mistake.
Also note that many of these exercises are a significant step up in difficulty from previous parts. Many require multiple steps or lines of code to complete. Think through the steps you want to take, then try to figure out how to do each step in R.
names(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
sum(is.na(gapminder)) # count any missing values in entire data set
## [1] 0
length(unique(gapminder$country)) # number of countries
## [1] 142
length(levels(gapminder$country)) # alt. to get num countries, since it's a factor
## [1] 142
You could also look in the Environment tab in the upper right window in RStudio to see the number of levels for country.
To find the continent for the US, we need to see how the US appears in the data. The easiest option is probably to print the names of all of the countries and look at them, since there aren’t that many.
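One quick way to scan the country names; a sketch assuming gapminder is loaded as above:

```r
# print the distinct country names and look for how the US is coded
sort(unique(as.character(gapminder$country)))
```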
# continent for US
gapminder$continent[gapminder$country=="United States"][1] ## [1] because we just need 1 value
## [1] Americas
## Levels: Africa Americas Asia Europe Oceania
# number of countries per year
tapply(gapminder$country, gapminder$year, length)
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 142 142 142 142 142 142 142 142 142 142 142 142
Note: where a tibble is the output, it’s OK if only the first rows show.
gapminder[gapminder$lifeExp > 80,]
## # A tibble: 21 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 2002 80.4 19546792 30688.
## 2 Australia Oceania 2007 81.2 20434176 34435.
## 3 Canada Americas 2007 80.7 33390141 36319.
## 4 France Europe 2007 80.7 61083916 30470.
## 5 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 7 Iceland Europe 2002 80.5 288030 31163.
## 8 Iceland Europe 2007 81.8 301931 36181.
## 9 Israel Asia 2007 80.7 6426679 25523.
## 10 Italy Europe 2002 80.2 57926999 27968.
## # ... with 11 more rows
gapminder[gapminder$country=='Kenya' & gapminder$year < 1970,
c("pop", "gdpPercap")]
## # A tibble: 4 x 2
## pop gdpPercap
## <int> <dbl>
## 1 6464046 854.
## 2 7454779 944.
## 3 8678557 897.
## 4 10191512 1057.
gapminder[gapminder$lifeExp == max(gapminder$lifeExp),]
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
Make a new variable in gapminder that is the population in millions. Recode continent so that the Americas are split into North America and South America. Make a new data frame gapminder_max that has only one observation per country, with the maximum value that each country has had over time for population, life expectancy, and GDP per capita. Hint: use aggregate, and note that the first argument (the data) needs to contain only the columns you want to compute a summary measure on.
# not all of them, but these are the ones in gapminder
northamerica <- c("Canada", "Costa Rica", "Cuba", "Dominican Republic",
"El Salvador", "Guatemala", "Haiti", "Honduras",
"Jamaica", "Mexico", "Nicaragua", "Panama",
"Trinidad and Tobago", "United States")
gapminder$popmillion <- gapminder$pop/1000000
gapminder$continent <- as.character(gapminder$continent)
gapminder$continent[gapminder$country %in% northamerica] <-
"North America"
gapminder$continent[gapminder$continent == "Americas"] <-
"South America"
gapminder$continent <- factor(gapminder$continent)
gapminder_max <- aggregate(gapminder[,c("pop","lifeExp","gdpPercap")],
by=list(country=gapminder$country,
continent=gapminder$continent), # to keep continent in the data
max)
Calculate the average life expectancy per country. Which had the longest life expectancy and which had the shortest life expectancy?
# aggregate will keep country name, whereas tapply wouldn't
meanLifeExp <- aggregate(gapminder$lifeExp,
list(country=gapminder$country),
mean)
names(meanLifeExp)
## [1] "country" "x"
meanLifeExp[meanLifeExp$x == max(meanLifeExp$x),]
## country x
## 58 Iceland 76.51142
meanLifeExp[meanLifeExp$x == min(meanLifeExp$x),]
## country x
## 113 Sierra Leone 36.76917
# or, with special functions
meanLifeExp[which.max(meanLifeExp$x),]
## country x
## 58 Iceland 76.51142
meanLifeExp[which.min(meanLifeExp$x),]
## country x
## 113 Sierra Leone 36.76917
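As the comment above notes, tapply drops the data frame structure, but it returns a named vector, so the country can still be recovered through names(); a sketch assuming gapminder is loaded:

```r
# tapply gives a named vector of means; names() holds the countries
meanLE <- tapply(gapminder$lifeExp, gapminder$country, mean)
names(meanLE)[which.max(meanLE)]  # longest average life expectancy
names(meanLE)[which.min(meanLE)]  # shortest average life expectancy
```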
Working with data from the Chicago Park District on park reservations:
parks <- read.csv("data/chicagoparks.csv",
stringsAsFactors = FALSE)
Look at the first few rows of parks. Then convert the date columns to date objects with lubridate.
Use the month() and year() functions to make new variables for the parks data based on reservation start date.
Which month has the most reservations?
Which park has the most reservations?
Challenge: What is the most frequent combination of park and event type?
head(parks)
## Requestor Organization Park.Number
## 1 Mike Gianfortune Kelly Park <NA> 260
## 2 Timothy Gibson Mather Park <NA> 241
## 3 Michael Brown Clarendon Park <NA> 1002
## 4 Jennifer Pabst Norwood Park <NA> 141
## 5 Lisa Ragucci Winnemac Park <NA> 486
## 6 Dan Ryan Wentworth Park <NA> 238
## Park.Facility.Name Reservation.Start.Date
## 1 Kelly Natatorium within Kelly School 01/01/2015
## 2 Mather Natatorium within Mather School 01/01/2015
## 3 Clarendon Pool within Uplift School 01/01/2015
## 4 Norwood Pool within Taft School 01/01/2015
## 5 Winnemac Pool within Amundsen School 01/01/2015
## 6 Wentworth Pool within Kennedy School 01/01/2015
## Reservation.End.Date Event.Type
## 1 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 2 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 3 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 4 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 5 01/01/2015 Administrative Reservation - Internal Hold PARKS
## 6 01/01/2015 Administrative Reservation - Internal Hold PARKS
## Event.Description Permit.Status
## 1 CPS Pool Closure Approved
## 2 Mather CPS Pool Closure Approved
## 3 Uplift CPS Pool Closure Approved
## 4 Taft High School CPS Pool Closure #4478 Approved
## 5 CPS Pool Closure Approved
## 6 Kennedy Pool CPS Pool Closure Approved
library(lubridate)
parks$Reservation.Start.Date <- mdy(parks$Reservation.Start.Date)
parks$Reservation.End.Date <- mdy(parks$Reservation.End.Date)
parks$month <- month(parks$Reservation.Start.Date)
parks$year <- year(parks$Reservation.Start.Date)
sort(table(parks$month))
##
## 2 12 1 3 11 4 5 10 9 6 8 7
## 818 844 868 1508 1608 2270 3388 3622 4594 6387 8742 13722
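If numeric month codes are hard to read, lubridate’s month() can return labelled months instead; a sketch assuming parks is loaded as above:

```r
# label = TRUE returns an ordered factor of month names
parks$month_name <- month(parks$Reservation.Start.Date, label = TRUE)
sort(table(parks$month_name))
```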
head(sort(table(parks$Park.Facility.Name), decreasing = TRUE))
##
## Lakeshore East Park
## 2141
## Lincoln Harbor - Diversey Event Space
## 1101
## Garfield Conservatory All Show Rooms
## 951
## Lincoln Picnic Grove 16
## 719
## 31st St. Harbor
## 611
## Lincoln Harbor - Belmont Event Space
## 586
# challenge: there are other ways to do this
park_type <- aggregate(parks[,1], by=list(park=parks$Park.Facility.Name,
event=parks$Event.Type),
length)
head(park_type[order(park_type$x, decreasing=TRUE),])
## park
## 1305 Lakeshore East Park
## 2077 Garfield Conservatory All Show Rooms
## 1313 Lincoln Harbor - Diversey Event Space
## 1265 31st St. Harbor
## 1314 Lincoln LaSalle/Clark/Stockton
## 485 Berger Game Room
## event x
## 1305 Permit - Athletic Training 2105
## 2077 Permit - Media: Non-Commercial Photography 921
## 1313 Permit - Athletic Training 802
## 1265 Permit - Athletic Training 562
## 1314 Permit - Athletic Training 553
## 485 Administrative Reservation - Park Legacy Items 538
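Another way to tackle the challenge is a cross-tabulation; a sketch assuming parks is loaded as above:

```r
# cross-tabulate park by event type, then locate the largest cell
tab <- table(parks$Park.Facility.Name, parks$Event.Type)
which(tab == max(tab), arr.ind = TRUE)  # row/column of the most frequent combination
max(tab)                                # its count
```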
We’re going to use data from the World Cup.
Prior to 2002, the following variables weren’t recorded consistently: Shots on goal, Shots wide, Free kicks, Offside, Corners.
worldcup <- read.csv("data/worldcup.csv",na.strings = c("", "-"),
stringsAsFactors = FALSE)
Set observations before 2002 for the variables that weren’t consistently coded to NA.
Make a new variable total shots that adds the two shot columns.
Make a new variable for shot accuracy that is the number of goals for divided by the total number of shots.
Strip the white space (hint: trimws) from the Team and ISO code variables.
Drop the ISO code column.
Make a new variable with total goals scored by country. Sort it to find out which country has scored the most goals.
Sort the whole world cup data frame by shot accuracy, so that more accurate teams (team, year) are at the top.
worldcup$Shots.on.goal[worldcup$Year < 2002] <- NA
worldcup$Shots.wide[worldcup$Year < 2002] <- NA
worldcup$Free.kicks[worldcup$Year < 2002] <- NA
worldcup$Offside[worldcup$Year < 2002] <- NA
worldcup$Corners[worldcup$Year < 2002] <- NA
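Repeating the same assignment five times invites typos; a loop over the affected column names gives the same result. A sketch assuming worldcup is loaded as above:

```r
# same result as the five lines above, written as a loop
inconsistent <- c("Shots.on.goal", "Shots.wide", "Free.kicks",
                  "Offside", "Corners")
for (v in inconsistent) {
  worldcup[worldcup$Year < 2002, v] <- NA
}
```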
worldcup$totalshots <- worldcup$Shots.on.goal + worldcup$Shots.wide
worldcup$shot_accuracy <- worldcup$Goals.for/worldcup$totalshots
worldcup$Team <- trimws(worldcup$Team)
worldcup$ISO.code <- trimws(worldcup$ISO.code)
worldcup <- worldcup[,-3]
country_goals <- tapply(worldcup$Goals.for, worldcup$Team, sum)
sort(country_goals)
## Canada China PR Dutch East Indies
## 0 0 0
## Greece Trinidad and Tobago Zaire
## 0 0 0
## Angola Bolivia El Salvador
## 1 1 1
## Iraq Israel Togo
## 1 1 1
## Haiti Honduras Kuwait
## 2 2 2
## New Zealand Serbia and Montenegro Slovenia
## 2 2 2
## United Arab Emirates Czech Republic Egypt
## 2 3 3
## Jamaica Ghana Wales
## 3 4 4
## Australia Côte d'Ivoire Cuba
## 5 5 5
## German DR Korea DPR Ukraine
## 5 5 5
## Algeria Iran Ecuador
## 6 6 7
## Norway Senegal Japan
## 7 7 8
## South Africa Tunisia Saudi Arabia
## 8 8 9
## Republic of Ireland Russia Costa Rica
## 10 11 12
## Morocco Northern Ireland Colombia
## 12 13 14
## Nigeria Cameroon Croatia
## 14 15 15
## Peru Turkey Bulgaria
## 19 20 22
## Korea Republic Denmark Scotland
## 22 24 25
## Paraguay USA Romania
## 27 27 30
## Chile Portugal Switzerland
## 31 32 37
## Austria Czechoslovakia Poland
## 43 44 44
## Belgium Mexico Soviet Union
## 46 48 53
## Germany Netherlands Yugoslavia
## 59 59 60
## Uruguay England Sweden
## 65 74 74
## Spain Hungary France
## 80 87 95
## Argentina Italy Germany FR
## 113 122 131
## Brazil
## 201
head(worldcup[order(worldcup$shot_accuracy, decreasing = TRUE),])
## Year Team Goals.for Goals.against Penalties Matches Shots.on.goal
## 298 2002 Brazil 18 4 2 7 54
## 332 2006 Argentina 11 3 0 5 30
## 307 2002 Portugal 6 4 0 3 17
## 313 2002 Denmark 5 5 1 4 15
## 301 2002 Turkey 10 6 0 7 32
## 349 2006 Tunisia 3 6 0 3 8
## Shots.wide Free.kicks Offside Corners Won Drawn Lost totalshots
## 298 39 0 26 34 7 0 0 93
## 332 27 3 24 28 3 2 0 57
## 307 16 0 7 12 1 0 2 33
## 313 14 0 9 19 2 1 1 29
## 301 30 0 21 39 4 1 2 62
## 349 11 3 12 7 0 1 2 19
## shot_accuracy
## 298 0.1935484
## 332 0.1929825
## 307 0.1818182
## 313 0.1724138
## 301 0.1612903
## 349 0.1578947
Working with data on Chipotle online orders:
chipotle <- read.csv("data/chipotle.tsv", stringsAsFactors = FALSE, sep="\t",
na.strings = c("NULL"))
DT::datatable(chipotle, rownames = FALSE)
Convert the item_total_price column to a number so you can work with it.
Replace the "$" with an empty string "". You can use the sub() function for this, or stringr::str_replace_all(). For the “pattern” argument in either function, use "\\$" as the pattern, because $ is a special character in string patterns. Or you could use substr() to exclude the first character in the string (nchar() is a function to get the number of characters in a string).
Use the as.numeric function to convert the character data to a number.
chipotle$item_total_price <- sub("\\$", "", chipotle$item_total_price)
chipotle$item_total_price <- as.numeric(chipotle$item_total_price)
Alternatives:
chipotle$item_total_price <- stringr::str_replace_all(chipotle$item_total_price, "\\$", "")
chipotle$item_total_price <- as.numeric(chipotle$item_total_price)
chipotle$item_total_price <- substr(chipotle$item_total_price, 2, nchar(chipotle$item_total_price))
chipotle$item_total_price <- as.numeric(chipotle$item_total_price)
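If the readr package is available (it was used for read_csv earlier), parse_number() strips the "$" and converts in one step; a sketch applied to the original character column:

```r
# parse_number() drops non-numeric prefixes like "$" and returns a double
chipotle$item_total_price <- readr::parse_number(chipotle$item_total_price)
```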
Using the transformed chipotle data from above, compute the following (use aggregate where appropriate). Hint: if you use tapply(), you may want to reference the names of the vector.
total_per_order <- aggregate(chipotle$item_total_price, by=list(chipotle$order_id), sum)
max(total_per_order$x)
total_per_order[which.max(total_per_order$x),]
chipotle[chipotle$order_id == total_per_order[which.max(total_per_order$x), 1], ]
items_per_order <- aggregate(chipotle$quantity, by=list(chipotle$order_id), sum)
items_per_order[which.max(items_per_order$x),]
chipotle[chipotle$order_id == items_per_order[which.max(items_per_order$x), 1], ]
Load data on alcohol consumption by country:
alcohol <- read.csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv",
stringsAsFactors = FALSE,
na.strings = c("")) # NOT NA, because that is the code for North America
Use apply functions to help you find:
avg_beer <- tapply(alcohol$beer_servings, alcohol$continent, mean)
sort(avg_beer)
## AS AF OC NA SA EU
## 37.04545 61.47170 89.68750 145.43478 175.08333 193.77778
avg_wine <- tapply(alcohol$wine_servings, alcohol$continent, mean)
sort(avg_wine)
## AS AF NA OC SA EU
## 9.068182 16.264151 24.521739 35.625000 62.416667 142.222222
alcohol$favorite <- apply(alcohol[, 2:4], 1, which.max)
alcohol$favorite <- c("beer", "spirits", "wine")[alcohol$favorite] # there are other ways to do this recoding
table(alcohol$favorite)
##
## beer spirits wine
## 114 59 20
# Challenge
library(raster)
favorite_type <- tapply(alcohol$favorite, alcohol$continent, modal)
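If you would rather avoid the raster dependency, a small helper computes the modal value in base R; a sketch assuming alcohol is loaded as above (stat_mode is a made-up helper name):

```r
# most frequent value in a vector (first value wins on ties)
stat_mode <- function(x) names(which.max(table(x)))
favorite_type <- tapply(alcohol$favorite, alcohol$continent, stat_mode)
```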
Using the alcohol data and some of the variables you computed above, do North American countries drink less beer on average than South American countries?
What are the pairwise correlations between consumption of beer, wine, and spirits?
t.test(alcohol$beer_servings[alcohol$continent=="NA"],
alcohol$beer_servings[alcohol$continent=="SA"])
##
## Welch Two Sample t-test
##
## data: alcohol$beer_servings[alcohol$continent == "NA"] and alcohol$beer_servings[alcohol$continent == "SA"]
## t = -1.1809, df = 26.682, p-value = 0.2481
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -81.19221 21.89510
## sample estimates:
## mean of x mean of y
## 145.4348 175.0833
cor(alcohol[,2:4])
## beer_servings spirit_servings wine_servings
## beer_servings 1.0000000 0.4588189 0.5271717
## spirit_servings 0.4588189 1.0000000 0.1947970
## wine_servings 0.5271717 0.1947970 1.0000000
Following the example in the code in exercises/exerciseA.R, make indicator variables for all of the different ingredients in the choice_description column of the chipotle data. Use a linear regression model to predict item_total_price (normalize by quantity first) from the item type and an indicator for guacamole, for burritos and bowls only. How much does guacamole cost?
Then check this by getting the average price per item according to whether or not guacamole was included. Use the aggregate function.
There are a few ingredients that seem to be duplicates of each other with slightly different names, but we’re going to ignore that for now.
This is just one answer. There are other ways to do this, especially if you use packages and functions we didn’t cover.
library(qdapTools)
ingredients <- mtabulate(strsplit(chipotle$choice_description, ", "))
chipotle <- cbind(chipotle, ingredients)
chipotle$burrito_or_bowl <- stringr::str_detect(chipotle$item_name, "Burrito|Bowl")
chipotle$single_price <- chipotle$item_total_price/chipotle$quantity
# regression
lm(single_price ~ item_name + Guacamole, data=chipotle, subset=chipotle$burrito_or_bowl)
##
## Call:
## lm(formula = single_price ~ item_name + Guacamole, data = chipotle,
## subset = chipotle$burrito_or_bowl)
##
## Coefficients:
## (Intercept) item_nameBarbacoa Burrito
## 9.153200 0.012563
## item_nameBarbacoa Salad Bowl item_nameBowl
## 0.222933 -3.017067
## item_nameBurrito item_nameCarnitas Bowl
## -2.595778 -0.027133
## item_nameCarnitas Burrito item_nameCarnitas Salad Bowl
## 0.008716 0.218311
## item_nameChicken Bowl item_nameChicken Burrito
## -0.477320 -0.500021
## item_nameChicken Salad Bowl item_nameSteak Bowl
## -0.416058 0.001179
## item_nameSteak Burrito item_nameSteak Salad Bowl
## -0.032119 0.218630
## item_nameVeggie Bowl item_nameVeggie Burrito
## -0.479530 -0.534843
## item_nameVeggie Salad Bowl Guacamole
## -0.418608 2.527734
# aggregate
avg_costs <- aggregate(chipotle$single_price[chipotle$burrito_or_bowl],
by=list(item=chipotle$item_name[chipotle$burrito_or_bowl],
guac=chipotle$Guacamole[chipotle$burrito_or_bowl]), mean)
avg_costs <- avg_costs[order(avg_costs$item, avg_costs$guac),]
guac_diff <- avg_costs[seq(2, nrow(avg_costs), 2),3]-avg_costs[seq(1, nrow(avg_costs), 2),3]
mean(guac_diff)
## [1] 2.234194
Find all countries where life expectancy increased by at least 60% from 1952 to 2007.
Hint: you might find the merge function useful. One way to approach this would be to make a data frame with the 1952 data, another with the 2007 data, and then merge them together. There are, however, other ways to proceed.
# one way, there are others
y1952 <- gapminder[gapminder$year==1952, c("country","lifeExp")]
y2007 <- gapminder[gapminder$year==2007, c("country","lifeExp")]
bothyears <- merge(y1952, y2007, by="country")
bothyears$pctIncrease <- (bothyears$lifeExp.y - bothyears$lifeExp.x)/bothyears$lifeExp.x
bothyears[bothyears$pctIncrease >= .6,]
## country lifeExp.x lifeExp.y pctIncrease
## 3 Algeria 43.077 72.301 0.6784131
## 9 Bangladesh 37.484 64.062 0.7090492
## 12 Bolivia 40.414 65.554 0.6220617
## 17 Burkina Faso 31.975 52.295 0.6354965
## 25 China 44.000 72.961 0.6582045
## 27 Comoros 40.715 65.152 0.6001965
## 39 Egypt 41.893 71.338 0.7028621
## 42 Eritrea 35.928 58.040 0.6154531
## 47 Gambia 30.000 59.448 0.9816000
## 51 Guatemala 42.023 70.259 0.6719178
## 52 Guinea 33.609 56.007 0.6664286
## 54 Haiti 37.579 60.916 0.6210117
## 55 Honduras 41.912 70.198 0.6748902
## 59 India 37.373 64.698 0.7311428
## 60 Indonesia 37.468 70.650 0.8856091
## 68 Jordan 43.158 72.535 0.6806849
## 71 Korea, Rep. 47.453 78.623 0.6568605
## 76 Libya 42.723 73.952 0.7309646
## 77 Madagascar 36.681 59.443 0.6205392
## 80 Mali 33.685 54.467 0.6169512
## 86 Morocco 42.873 71.164 0.6598792
## 88 Myanmar 36.319 62.069 0.7089953
## 90 Nepal 36.157 63.785 0.7641121
## 93 Nicaragua 42.314 72.899 0.7228104
## 97 Oman 37.578 75.640 1.0128799
## 101 Peru 43.902 71.421 0.6268279
## 110 Saudi Arabia 39.875 72.777 0.8251285
## 111 Senegal 37.278 63.062 0.6916680
## 125 Syria 45.883 74.143 0.6159144
## 131 Tunisia 44.600 73.923 0.6574664
## 132 Turkey 43.585 71.777 0.6468280
## 138 Vietnam 40.412 74.249 0.8373008
## 139 West Bank and Gaza 43.160 73.422 0.7011585
## 140 Yemen, Rep. 32.548 62.698 0.9263242
Find all countries that experienced a drop in population at any point in the timespan of the data.
Hint: look at the diff function. Note that diff will result in one fewer values in the result than in the original, so you’ll need to address that.
gapminder <- gapminder[order(gapminder$country, gapminder$year),] # make sure in order for diffs
gapminder$popdiff <- c(NA, diff(gapminder$pop, lag=1))
# We lagged across countries: get rid of first (1952) observation for each country,
# which doesn't have anything previous to compare to
gapminder$popdiff[gapminder$year == 1952] <- NA
gapminder[!is.na(gapminder$popdiff) & gapminder$popdiff < 0,]
## # A tibble: 46 x 8
## country continent year lifeExp pop gdpPercap popmillion popdiff
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <int>
## 1 Afghanistan Asia 1982 39.9 1.29e7 978. 12.9 -2.00e6
## 2 Bosnia and… Europe 1992 72.2 4.26e6 2547. 4.26 -8.30e4
## 3 Bosnia and… Europe 1997 73.2 3.61e6 4766. 3.61 -6.49e5
## 4 Bulgaria Europe 1992 71.2 8.66e6 6303. 8.66 -3.13e5
## 5 Bulgaria Europe 1997 70.3 8.07e6 5970. 8.07 -5.92e5
## 6 Bulgaria Europe 2002 72.1 7.66e6 7697. 7.66 -4.04e5
## 7 Bulgaria Europe 2007 73.0 7.32e6 10681. 7.32 -3.39e5
## 8 Cambodia Asia 1977 31.2 6.98e6 525. 6.98 -4.72e5
## 9 Croatia Europe 1997 73.7 4.44e6 9876. 4.44 -4.94e4
## 10 Czech Repu… Europe 1997 74.0 1.03e7 16049. 10.3 -1.50e4
## # ... with 36 more rows
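An alternative that avoids the lag bookkeeping entirely: compute the year-over-year differences within each country. A sketch assuming gapminder is sorted by country and year as above:

```r
# TRUE for each country whose population ever fell between consecutive years
dropped <- tapply(gapminder$pop, gapminder$country, function(p) any(diff(p) < 0))
names(dropped)[dropped]  # countries with at least one population drop
```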