Data Frames

We’re going to work with data on penguins from https://github.com/allisonhorst/palmerpenguins. It’s been packaged into an R package.

install.packages("palmerpenguins")

library(palmerpenguins)

Manual Creation

Here is how we could manually create a data frame (we will rarely need to do this):

x <- data.frame(month=month.name,
                index=1:12,
                days=c(31,28,31,30,31,30,31,31,30,31,30,31))
x

Data Shape and Names

Use the penguin data

penguins

What is penguins?

class(penguins)

## [1] "tbl_df"     "tbl"        "data.frame"

“tbl_df” is a tibble data frame. These behave a little bit differently from normal data frames. You’ll see tibbles instead of data frames within the tidyverse set of packages (and those packages that work within that framework).

The biggest difference is that tibbles always give you a tibble back when subsetting with [], while regular data frames sometimes give you a vector.

View the first few rows

head(penguins)

Or use the viewer:

View(penguins)

Clicking on the name of the data frame in the environment tab will also open the viewer.

Dimensions of the data frame

dim(penguins)

## [1] 341   9

What will the length of a data frame be?

length(penguins)

## [1] 9

The number of columns. This is because a data frame is technically a list of vectors (lists are a different type). Don’t use length() with a data frame; use ncol() instead:

ncol(penguins)

## [1] 9

nrow(penguins)

## [1] 341
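To see why length() counts columns, it can help to check that a data frame really is stored as a list. This is a minimal sketch using a small made-up data frame (toy is a hypothetical name, not part of the workshop data):

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(id = 1:4, score = c(2.5, 3.1, 4.8, 1.9))

is.list(toy)     # a data frame is stored as a list of equal-length columns
length(toy)      # counts list elements, i.e. columns
ncol(toy)        # same number, but the intent is clearer
nrow(toy)        # number of rows
length(toy$id)   # the length of one column equals the number of rows
```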

What are the variable (column) names?

names(penguins)

## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"              "bill_length_cm"

Rows have names too

rownames(penguins)

##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21" 
##  [22] "22"  "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42" 
##  [43] "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63" 
##  [64] "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
##  [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99"  "100" "101" "102" "103" "104" "105"
## [106] "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
## [127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146" "147"
## [148] "148" "149" "150" "151" "152" "153" "154" "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168"
## [169] "169" "170" "171" "172" "173" "174" "175" "176" "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187" "188" "189"
## [190] "190" "191" "192" "193" "194" "195" "196" "197" "198" "199" "200" "201" "202" "203" "204" "205" "206" "207" "208" "209" "210"
## [211] "211" "212" "213" "214" "215" "216" "217" "218" "219" "220" "221" "222" "223" "224" "225" "226" "227" "228" "229" "230" "231"
## [232] "232" "233" "234" "235" "236" "237" "238" "239" "240" "241" "242" "243" "244" "245" "246" "247" "248" "249" "250" "251" "252"
## [253] "253" "254" "255" "256" "257" "258" "259" "260" "261" "262" "263" "264" "265" "266" "267" "268" "269" "270" "271" "272" "273"
## [274] "274" "275" "276" "277" "278" "279" "280" "281" "282" "283" "284" "285" "286" "287" "288" "289" "290" "291" "292" "293" "294"
## [295] "295" "296" "297" "298" "299" "300" "301" "302" "303" "304" "305" "306" "307" "308" "309" "310" "311" "312" "313" "314" "315"
## [316] "316" "317" "318" "319" "320" "321" "322" "323" "324" "325" "326" "327" "328" "329" "330" "331" "332" "333" "334" "335" "336"
## [337] "337" "338" "339" "340" "341"

These were generated by default. Generally, don’t use row names.

What are the column types?

str(penguins)

## tibble [341 × 9] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:341] 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42 37.8 37.8 ...
##  $ bill_depth_mm    : num [1:341] 17.4 18 19.3 20.6 17.8 19.6 18.1 20.2 17.1 17.3 ...
##  $ flipper_length_mm: int [1:341] 186 195 193 190 181 195 193 190 186 180 ...
##  $ body_mass_g      : int [1:341] 3800 3250 3450 3650 3625 4675 3475 4250 3300 3700 ...
##  $ sex              : Factor w/ 2 levels "female","male": 1 1 1 2 1 2 NA NA NA NA ...
##  $ year             : int [1:341] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ bill_length_cm   : num [1:341] 3.95 4.03 3.67 3.93 3.89 3.92 3.41 4.2 3.78 3.78 ...

Quick summary of every column

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g       sex           year     
##  Adelie   :150   Biscoe   :167   Min.   :32.10   Min.   :13.10   Min.   :172       Min.   :2700   female:165   Min.   :2007  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.30   1st Qu.:15.60   1st Qu.:190       1st Qu.:3550   male  :167   1st Qu.:2007  
##  gentoo   :123   Torgersen: 50   Median :44.50   Median :17.30   Median :197       Median :4050   NA's  :  9   Median :2008  
##                                  Mean   :43.94   Mean   :17.15   Mean   :201       Mean   :4203                Mean   :2008  
##                                  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213       3rd Qu.:4750                3rd Qu.:2009  
##                                  Max.   :59.60   Max.   :21.50   Max.   :231       Max.   :6300                Max.   :2009  
##  bill_length_cm 
##  Min.   :3.210  
##  1st Qu.:3.930  
##  Median :4.450  
##  Mean   :4.394  
##  3rd Qu.:4.850  
##  Max.   :5.960

EXERCISE

First run the code above that creates the x data frame (it’s near line 30).

Using the x data frame we created above:

How many rows and columns?
Names of the variables?

Indexing

Select first row

penguins[1,]

Select first two rows

penguins[1:2,]

Select first column

penguins[,1]

This gave us a tibble/data.frame back, but with x:

x[,1]

##  [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"

we get a vector back. This is one difference between tibbles and regular data frames.
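If you want a regular data frame to behave like a tibble here, the drop argument of [] controls the simplification. A small sketch, assuming a made-up base data frame called toy (hypothetical, not the workshop’s x):

```r
# toy is a hypothetical base data frame (not a tibble)
toy <- data.frame(month = c("January", "February"), days = c(31, 28))

as_vector <- toy[, 1]                # one column: simplified to a vector
as_frame  <- toy[, 1, drop = FALSE]  # drop = FALSE keeps the data frame

class(as_vector)
class(as_frame)
```

Asking for drop = FALSE is mostly useful in code that should always get a data frame back, no matter how many columns are selected.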

We can select rows and columns at the same time:

penguins[1:2, 4:5]

EXERCISE

Using x created above: select rows 2 through 5, and columns 1 through 2 from x

More Indexing

If we want rows or columns that aren’t next to each other, we can use a vector of indices.

x[c(1, 3), ]
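The same idea works for rows and columns together. A short sketch with a made-up data frame (toy and its columns are hypothetical):

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(a = 1:5,
                  b = c(2.5, 3.5, 4.5, 5.5, 6.5),
                  c = c(10, 20, 30, 40, 50))

# rows 1, 3, and 5; columns 1 and 3
picked <- toy[c(1, 3, 5), c(1, 3)]
picked
```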

Reference columns by name with $ notation (no quotes on names)

names(penguins)

## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"              "bill_length_cm"

penguins$species

##   [1] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [13] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [25] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [37] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [49] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [61] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [73] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [85] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [97] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [109] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [121] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [133] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [145] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [157] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [169] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [181] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [193] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [205] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [217] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [229] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [241] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [253] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo   
## [265] gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    gentoo    Chinstrap Chinstrap Chinstrap
## [277] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [289] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [301] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [313] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [325] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## Levels: Adelie Chinstrap gentoo

Note that the $ notation got us a vector back.

penguins$bill_length_mm

##   [1] 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6
##  [26] 40.5 37.9 40.5 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0 39.6 41.1 37.5 36.0 42.3 39.6 40.1
##  [51] 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5 42.8 40.9
##  [76] 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0
## [101] 37.7 37.8 37.9 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1 36.2 37.7 40.2 41.4 35.2 40.6 38.8
## [126] 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8 36.0 41.5
## [151] 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9
## [176] 46.1 44.5 47.8 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3 49.6 50.5 43.6 45.5 50.5 44.9 45.2
## [201] 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2 46.5 46.4
## [226] 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1
## [251] 48.5 55.9 47.2 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8 47.2 46.8 50.4 45.2 49.9 46.5 50.0
## [276] 51.3 45.4 52.7 45.2 46.1 51.3 46.0 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6 46.7 52.0 50.5
## [301] 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4
## [326] 45.7 50.7 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2

Use names in []: put them in quotes

penguins[,"species"]

Since we used [], we get a tibble/data.frame back this time

penguins[,"bill_length_mm"]

To select multiple columns by name, use a vector of names:

penguins[,c("island", "species")]

EXERCISE

Using the x data frame we created above:

Select the days column using $ notation
Select the month and days columns by name using []

Boolean Selection

If we have a boolean vector (TRUE and FALSE values) that is the same length as the number of rows or columns, we can use it to select from the data frame as we did with vectors.

penguins[penguins$bill_length_mm < 34,]

Note the rows of NA.

If we don’t want the missing rows included:

sum(is.na(penguins$bill_length_mm))  # how many missing?

## [1] 0

penguins[is.na(penguins$bill_length_mm), ]  # which rows have missing values

# select rows where bill_length_mm is not missing and < 34
penguins[!is.na(penguins$bill_length_mm) & penguins$bill_length_mm < 34,]
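To see where rows of NA come from, here is a small sketch with a made-up data frame that has one missing value (toy, id, and len are hypothetical names). It also shows which(), a base R function that is one alternative to the explicit is.na() check:

```r
# toy is a hypothetical data frame with one missing value in len
toy <- data.frame(id = 1:4, len = c(33, NA, 40, 31))

cond <- toy$len < 34                                # TRUE, NA, FALSE, TRUE
with_na  <- toy[cond, ]                             # the NA produces an all-NA row
explicit <- toy[!is.na(toy$len) & toy$len < 34, ]   # explicit check for missing values
by_which <- toy[which(cond), ]                      # which() returns only the TRUE positions

with_na
by_which
```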

If I forget the , in the []:

penguins[penguins$bill_length_mm < 34]

It tries to index the columns instead, and our vector is too long.

Multiple conditions

penguins[penguins$bill_length_mm < 34 & penguins$bill_depth_mm < 16,]

penguins[penguins$bill_length_mm < 34 | penguins$bill_length_mm > 58,]
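It can be easier to read multiple conditions if each one is built as its own logical vector first. A minimal sketch with made-up values (toy, short, and shallow are hypothetical names):

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(len = c(33, 35, 59, 32), depth = c(15, 17, 18, 19))

short   <- toy$len < 34     # TRUE where the bill is short
shallow <- toy$depth < 16   # TRUE where the bill is shallow

both   <- toy[short & shallow, ]        # & : rows meeting both conditions
either <- toy[short | toy$len > 58, ]   # | : rows meeting at least one condition

both
either
```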

EXERCISE

Using x created above: select rows from x with 31 days. Remember: == for testing for equality.

Renaming Columns

Normally we don’t need to do this, but I want to save the current names first so we can restore them later.

oldnames <- names(penguins)  # save so we can reset later

Then I can change the name of the first variable with:

names(penguins)[1] <- "boo"  # change the name of the first column
names(penguins)

## [1] "boo"               "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"              "bill_length_cm"

names(penguins) <- c("a", "b", "c", "d", "e", "f", "g", "h")  # change all of the column names

## Warning: The `value` argument of `names<-` must have the same length as `x` as of tibble 3.0.0.
## `names` must have length 9, not 8.

## Warning: The `value` argument of `names<-` can't be empty as of tibble 3.0.0.
## Column 9 must be named.

head(penguins)

Put the old names back

names(penguins) <- oldnames
head(penguins)
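Renaming by position breaks if the column order ever changes. One alternative is to look the column up by its current name. A sketch with a made-up data frame (toy and penguin_type are hypothetical names):

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(species = c("Adelie", "Gentoo"), year = c(2007, 2008))

# find the column currently called "species" and give it a new name
names(toy)[names(toy) == "species"] <- "penguin_type"
names(toy)
```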

Working with Variables

max(penguins$bill_length_mm)

## [1] 59.6

max(penguins$bill_length_mm, na.rm=TRUE)

## [1] 59.6

Now to get the observations (rows) for the penguins with that max value:

penguins[penguins$bill_length_mm == max(penguins$bill_length_mm, na.rm=TRUE),]

Break it out into steps:

max_bill_length <- max(penguins$bill_length_mm, na.rm=TRUE)
penguins[penguins$bill_length_mm == max_bill_length,]

EXERCISE

Using x created above: select the rows from x where days is at its minimum value.

You can also use which.min() or which.max() to get the index location of the first value with the minimum or maximum value respectively (if there’s more than one you only get one):

x[which.min(x$days), ]
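When there are ties, comparing against min() or max() finds every matching position instead of just the first. A small sketch with made-up values (days here is a hypothetical vector, not the column in x):

```r
# hypothetical values with a tied minimum
days <- c(31, 28, 31, 30, 28)

first_min <- which.min(days)           # position of the first minimum only
all_min   <- which(days == min(days))  # positions of every minimum

first_min
all_min
```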

Making new variables

We can add new variables to the data frame by naming them with the $ notation:

penguins$bill_length_cm <- penguins$bill_length_mm / 10  # make new variable
names(penguins)  # check to see that it was added

## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"              "bill_length_cm"

penguins[, c("bill_length_cm", "bill_length_mm")]  # select the two vars to view them
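A new variable can also be computed from more than one existing column, and [[ ]] with a quoted name works the same way as $. A sketch with made-up measurements (toy and bill_ratio are hypothetical names):

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(bill_length_mm = c(39.1, 46.5),
                  bill_depth_mm  = c(18.7, 17.9))

toy$bill_ratio <- toy$bill_length_mm / toy$bill_depth_mm  # built from two columns
toy[["bill_length_cm"]] <- toy$bill_length_mm / 10        # [[ ]] takes the name in quotes

names(toy)
```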

EXERCISE

Using x created above: make a new variable as part of x, called weeks, that is the number of days divided by 7

Missing values

Any missing values? They show up in the summary() output:

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g       sex           year     
##  Adelie   :150   Biscoe   :167   Min.   :32.10   Min.   :13.10   Min.   :172       Min.   :2700   female:165   Min.   :2007  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.30   1st Qu.:15.60   1st Qu.:190       1st Qu.:3550   male  :167   1st Qu.:2007  
##  gentoo   :123   Torgersen: 50   Median :44.50   Median :17.30   Median :197       Median :4050   NA's  :  9   Median :2008  
##                                  Mean   :43.94   Mean   :17.15   Mean   :201       Mean   :4203                Mean   :2008  
##                                  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213       3rd Qu.:4750                3rd Qu.:2009  
##                                  Max.   :59.60   Max.   :21.50   Max.   :231       Max.   :6300                Max.   :2009  
##  bill_length_cm 
##  Min.   :3.210  
##  1st Qu.:3.930  
##  Median :4.450  
##  Mean   :4.394  
##  3rd Qu.:4.850  
##  Max.   :5.960

Look at the rows where body_mass_g is missing:

is.na(penguins$body_mass_g)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [22] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [106] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [127] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [148] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [190] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [211] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [274] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [295] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [316] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE

penguins[is.na(penguins$body_mass_g),]

Remove rows where body_mass_g is missing by selecting rows where the value is not missing:

penguins[!is.na(penguins$body_mass_g),]

penguins <- penguins[!is.na(penguins$body_mass_g),]

Remove by index instead:

penguins <- penguins[-1,]  # removes the first row of the data
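Here is a short sketch of a few ways to drop rows, using a made-up data frame (toy is hypothetical). complete.cases() is a base R function that flags rows with no missing values in any column:

```r
# toy is a hypothetical data frame with missing values in two columns
toy <- data.frame(id   = 1:4,
                  mass = c(3750, NA, 3250, 4100),
                  sex  = c("male", "female", NA, "female"))

mass_known <- toy[!is.na(toy$mass), ]     # drop rows missing one specific column
all_known  <- toy[complete.cases(toy), ]  # drop rows missing a value in any column
trimmed    <- toy[-c(1, 2), ]             # drop several rows by index at once
```

Checking nrow() before and after is a quick way to confirm how many rows were removed.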

EXERCISE

Using x created above: remove the 2nd row of x:

Replacing Values

table(penguins$species)

## 
##    Adelie Chinstrap    gentoo 
##       149        68       123

species is a factor; let’s un-factor it first so we can change values – this is an extra step related to it being a factor

penguins$species <- as.character(penguins$species)

Here are the general steps we’d do if we didn’t have a factor

penguins$species[penguins$species == "Gentoo"]  # which observations do we want to replace?

## character(0)

penguins$species[penguins$species == "Gentoo"] <- "gentoo"  # set the value

table(penguins$species)

## 
##    Adelie Chinstrap    gentoo 
##       149        68       123

We could re-factor it again if we wanted it to stay a factor:

penguins$species <- factor(penguins$species)
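If you would rather keep the variable as a factor the whole time, one option is to rename the level itself. A minimal sketch with a made-up factor (this species vector is hypothetical, not the penguins column):

```r
# hypothetical factor used only for illustration
species <- factor(c("Adelie", "Gentoo", "Gentoo", "Chinstrap"))

# rename the level; every value with that level changes with it
levels(species)[levels(species) == "Gentoo"] <- "gentoo"

table(species)
```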

EXERCISE

Using x created above: replace the month value “December” with the value “Dec” instead

Sort and Order

We’re going to talk about sorting and ordering with a vector first.

y <- c(30, 60, 20, 30)
y

## [1] 30 60 20 30

sort() puts the values in order (doesn’t change the vector, returns a copy)

sort(y)

## [1] 20 30 30 60

Remember, it didn’t change y; sort() returned a copy.

y

## [1] 30 60 20 30

sort(y, decreasing=TRUE)

## [1] 60 30 30 20

y

## [1] 30 60 20 30

order() tells you what order to put the elements in to sort them:

order(y)

## [1] 3 1 4 2

You can use the output of order() to sort:

y[order(y)]

## [1] 20 30 30 60

order() also allows decreasing:

y[order(y, decreasing=TRUE)]

## [1] 60 30 30 20

Why do we care? Because we need this if we want to sort a data frame, not just a vector. sort() only takes a vector, that is, one column of a data frame. If we want to sort by one or more columns, we need to use order() to figure out what order to put the rows in, because we want entire rows re-ordered together.

We can sort a single column by itself

sort(penguins$bill_length_mm)

##   [1] 32.1 33.1 33.5 34.0 34.1 34.4 34.5 34.6 34.6 35.0 35.0 35.1 35.2 35.3 35.5 35.5 35.6 35.7 35.7 35.7 35.9 35.9 36.0 36.0 36.0
##  [26] 36.0 36.2 36.2 36.2 36.3 36.4 36.4 36.5 36.5 36.6 36.6 36.7 36.7 36.8 36.9 37.0 37.0 37.2 37.2 37.3 37.3 37.3 37.5 37.5 37.6
##  [51] 37.6 37.6 37.7 37.7 37.7 37.8 37.8 37.8 37.8 37.8 37.9 37.9 38.1 38.1 38.1 38.1 38.2 38.2 38.3 38.5 38.6 38.6 38.6 38.7 38.8
##  [76] 38.8 38.8 38.9 38.9 39.0 39.0 39.0 39.2 39.2 39.2 39.3 39.5 39.5 39.6 39.6 39.6 39.6 39.6 39.7 39.7 39.7 39.7 39.8 40.1 40.2
## [101] 40.2 40.2 40.3 40.3 40.5 40.5 40.6 40.6 40.6 40.6 40.7 40.8 40.8 40.9 40.9 40.9 40.9 41.0 41.1 41.1 41.1 41.1 41.1 41.1 41.1
## [126] 41.3 41.3 41.4 41.4 41.5 41.5 41.6 41.7 41.8 42.0 42.0 42.0 42.1 42.2 42.2 42.3 42.4 42.5 42.5 42.5 42.6 42.7 42.7 42.8 42.8
## [151] 42.9 42.9 43.1 43.2 43.2 43.2 43.2 43.3 43.3 43.4 43.5 43.5 43.5 43.6 43.8 44.0 44.1 44.1 44.4 44.5 44.5 44.5 44.9 44.9 45.0
## [176] 45.1 45.1 45.1 45.2 45.2 45.2 45.2 45.2 45.2 45.3 45.3 45.4 45.4 45.5 45.5 45.5 45.5 45.5 45.6 45.6 45.7 45.7 45.7 45.8 45.8
## [201] 45.8 45.9 46.0 46.0 46.1 46.1 46.1 46.2 46.2 46.2 46.2 46.2 46.3 46.4 46.4 46.4 46.4 46.5 46.5 46.5 46.5 46.5 46.6 46.6 46.7
## [226] 46.7 46.8 46.8 46.8 46.8 46.9 46.9 47.0 47.2 47.2 47.3 47.3 47.4 47.5 47.5 47.5 47.5 47.6 47.6 47.7 47.8 48.1 48.1 48.2 48.2
## [251] 48.4 48.4 48.4 48.5 48.5 48.5 48.6 48.7 48.7 48.7 48.8 49.0 49.0 49.0 49.1 49.1 49.1 49.2 49.2 49.3 49.3 49.4 49.5 49.5 49.5
## [276] 49.6 49.6 49.6 49.7 49.8 49.8 49.8 49.9 50.0 50.0 50.0 50.0 50.0 50.1 50.1 50.2 50.2 50.2 50.3 50.4 50.4 50.5 50.5 50.5 50.5
## [301] 50.5 50.6 50.7 50.7 50.8 50.8 50.8 50.8 50.9 50.9 51.0 51.1 51.1 51.3 51.3 51.3 51.3 51.4 51.5 51.5 51.7 51.9 52.0 52.0 52.0
## [326] 52.1 52.2 52.2 52.5 52.7 52.8 53.4 53.5 54.2 54.3 55.1 55.8 55.9 58.0 59.6

But if we want rows re-ordered by the value of bill_length_mm, we need to get the order:

order(penguins$bill_length_mm)

##   [1] 140  96  68  90   6  16  52  12  78  50  98  81 122  23  64  72 134  58  88 116  20  66  40  46 146 148  76  92 118  84  32  62
##  [33]  38  56  13 145   2  80 130  85  42 136  29  75  82 115 142  45 131  36  57  60  19 100 119   8   9  18 101 147  26 102  94 106
##  [65] 108 132  21 105  86 128  11 104 114  14  22  34 124   4  87  54 126 143   5  33 144   3  28  30  43  48  70  91 112  69 103 110
##  [97] 137  37  49 120 135 138   1  95  25  27  24  55 123 139 141  39  93  31  74 160 303  99  10  44  61  65  89 117 133  59  83  53
## [129] 121 125 149  63 256  67   7  51 166  77  35 111  47 293  15 305 327 186 113 190  73 181  79 174 129  97 107 208 295 158 258 236
## [161] 218 262 337 195 206 188  41 127 187 176 244 266 198 234 205 171 182 202 199 228 229 271 278 329 192 210 156 276 162 196 207 242
## [193] 316 109 332 212 325 335  71 164 214 287  17 281 150 175 279 168 211 216 264 314 173 221 224 291 301 155 172 204 223 273 200 283
## [225] 157 297 159 255 269 334 247 311 285 252 268 180 254 232 226 238 240 308 154 309 220 177 259 323 178 222 163 185 248 201 250 294
## [257] 225 152 169 189 267 161 313 320 184 230 253 167 292 165 330 246 217 243 300 191 193 338 307 215 261 322 272 151 153 179 233 274
## [289] 203 319 170 331 340 289 209 270 194 197 260 288 299 296 219 326 235 245 318 339 315 317 306 227 249 237 275 280 282 324 263 321
## [321] 284 333 286 298 310 239 241 328 231 277 302 257 312 304 213 265 336 251 290 183

and then use that to re-order the whole data frame

penguins[order(penguins$bill_length_mm), ]

order() will take multiple vectors (columns):

penguins[order(penguins$island,
               penguins$bill_length_mm), ]
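decreasing=TRUE applies to every column passed to order(). To sort one column ascending and another descending, one common approach is to negate the numeric column. A sketch with made-up data (toy is hypothetical):

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(island = c("Dream", "Biscoe", "Dream", "Biscoe"),
                  len    = c(40, 45, 50, 38))

# island ascending, then len descending within each island
# (negating only works because len is numeric)
sorted <- toy[order(toy$island, -toy$len), ]
sorted
```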

EXERCISE

Sort the penguins data frame by flipper_length_mm decreasing

Note: there’s an easier way to sort using the package dplyr - the arrange() function. We aren’t covering the dplyr package this week, but we have additional workshops on the Tidyverse (which includes dplyr).

Coming Up

Today you’ll work more with data frames. You’ll also learn how to read a CSV file into R as a data frame, and go over the parts of an R help page.

There are exercises, but I strongly encourage you to read in one of your own data files and start working with it. If you don’t have a data file, there are some examples linked in the exercises that you could try.

One tip to start: if you get an error message that a file can’t be found when you’re trying to import it:

Check the spelling of the filename for typos
Check your working directory (getwd()) and make sure the path to the file is correct and completely specified given what your working directory is.
Make sure the file is actually in the folder you think it is. I recommend copying or moving any downloaded files into the project/directory for this workshop so you know you really have the file. I’ve seen some problems on Windows computers in particular before where a .zip file isn’t really unzipped - it’s just letting you see inside without actually expanding the contents and creating real files.
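The checks above can be run from the console before trying to import anything. A small sketch, assuming a file called my_data.csv (a hypothetical name; substitute your own):

```r
# "my_data.csv" is a hypothetical file name; substitute your own
path <- "my_data.csv"

getwd()                            # the folder R is currently working in
file.exists(path)                  # TRUE only if the file is found at that path
list.files(pattern = "\\.csv$")    # csv files in the working directory
```

If file.exists() returns FALSE, the output of list.files() usually shows whether the problem is a typo in the name or a file in a different folder.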
