We’re going to work with data on penguins from https://github.com/allisonhorst/palmerpenguins. It’s been packaged into an R package.
install.packages("palmerpenguins")
library(palmerpenguins)
How we could manually create a data frame (we will very rarely do this):
x <- data.frame(month=month.name,
index=1:12,
days=c(31,28,31,30,31,30,31,31,30,31,30,31))
x
Use the penguin data
penguins
What is penguins?
class(penguins)
## [1] "tbl_df" "tbl" "data.frame"
“tbl_df” is a tibble data frame. These behave a little bit differently from normal data frames. You’ll see tibbles instead of data frames within the tidyverse set of packages (and those packages that work within that framework).
The biggest difference is that tibbles give you a tibble back when subsetting with [], while data frames sometimes give you a vector.
View the first few rows
head(penguins)
Or use the viewer:
View(penguins)
Clicking on the name of the data frame in the environment tab will also open the viewer.
Dimensions of the data frame
dim(penguins)
## [1] 341 9
What will the length of a data frame be?
length(penguins)
## [1] 9
The number of columns. This is because a data frame is technically a list of vectors (lists are a different type). Don’t use length() with a data frame; use ncol() instead:
ncol(penguins)
## [1] 9
nrow(penguins)
## [1] 341
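To see why length() reports columns, here is a minimal sketch using a small made-up data frame (demo is a hypothetical name, not part of the workshop data). It only assumes base R.

```r
# A small data frame, rebuilt here so the block runs on its own
demo <- data.frame(month = month.name[1:3],
                   days = c(31, 28, 31))

is.list(demo)      # TRUE: a data frame is stored as a list of columns
length(demo)       # 2, the number of columns
ncol(demo)         # 2
nrow(demo)         # 3
length(demo$days)  # 3, the length of one column equals the number of rows
```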
What are the variable names? The columns
names(penguins)
## [1] "species" "island" "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year" "bill_length_cm"
Rows have names too
rownames(penguins)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21"
## [22] "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" "61" "62" "63"
## [64] "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99" "100" "101" "102" "103" "104" "105"
## [106] "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
## [127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146" "147"
## [148] "148" "149" "150" "151" "152" "153" "154" "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168"
## [169] "169" "170" "171" "172" "173" "174" "175" "176" "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187" "188" "189"
## [190] "190" "191" "192" "193" "194" "195" "196" "197" "198" "199" "200" "201" "202" "203" "204" "205" "206" "207" "208" "209" "210"
## [211] "211" "212" "213" "214" "215" "216" "217" "218" "219" "220" "221" "222" "223" "224" "225" "226" "227" "228" "229" "230" "231"
## [232] "232" "233" "234" "235" "236" "237" "238" "239" "240" "241" "242" "243" "244" "245" "246" "247" "248" "249" "250" "251" "252"
## [253] "253" "254" "255" "256" "257" "258" "259" "260" "261" "262" "263" "264" "265" "266" "267" "268" "269" "270" "271" "272" "273"
## [274] "274" "275" "276" "277" "278" "279" "280" "281" "282" "283" "284" "285" "286" "287" "288" "289" "290" "291" "292" "293" "294"
## [295] "295" "296" "297" "298" "299" "300" "301" "302" "303" "304" "305" "306" "307" "308" "309" "310" "311" "312" "313" "314" "315"
## [316] "316" "317" "318" "319" "320" "321" "322" "323" "324" "325" "326" "327" "328" "329" "330" "331" "332" "333" "334" "335" "336"
## [337] "337" "338" "339" "340" "341"
These were generated by default. Generally don’t use row names
What are the column types?
str(penguins)
## tibble [341 × 9] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:341] 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42 37.8 37.8 ...
## $ bill_depth_mm : num [1:341] 17.4 18 19.3 20.6 17.8 19.6 18.1 20.2 17.1 17.3 ...
## $ flipper_length_mm: int [1:341] 186 195 193 190 181 195 193 190 186 180 ...
## $ body_mass_g : int [1:341] 3800 3250 3450 3650 3625 4675 3475 4250 3300 3700 ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 1 2 1 2 NA NA NA NA ...
## $ year : int [1:341] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ bill_length_cm : num [1:341] 3.95 4.03 3.67 3.93 3.89 3.92 3.41 4.2 3.78 3.78 ...
Quick summary of every column
summary(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## Adelie :150 Biscoe :167 Min. :32.10 Min. :13.10 Min. :172 Min. :2700 female:165 Min. :2007
## Chinstrap: 68 Dream :124 1st Qu.:39.30 1st Qu.:15.60 1st Qu.:190 1st Qu.:3550 male :167 1st Qu.:2007
## gentoo :123 Torgersen: 50 Median :44.50 Median :17.30 Median :197 Median :4050 NA's : 9 Median :2008
## Mean :43.94 Mean :17.15 Mean :201 Mean :4203 Mean :2008
## 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213 3rd Qu.:4750 3rd Qu.:2009
## Max. :59.60 Max. :21.50 Max. :231 Max. :6300 Max. :2009
## bill_length_cm
## Min. :3.210
## 1st Qu.:3.930
## Median :4.450
## Mean :4.394
## 3rd Qu.:4.850
## Max. :5.960
Run the code above that creates the x data frame first - it’s near line 30.
Using the x data frame we created above:
Select first row
penguins[1,]
Select first two rows
penguins[1:2,]
Select first column
penguins[,1]
This gave us a tibble/data.frame back, but with x:
x[,1]
## [1] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October"
## [11] "November" "December"
we get a vector back. This is one difference between tibbles and regular data frames.
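If you want a regular data frame to behave like a tibble here, base R's drop = FALSE argument keeps the result as a data frame. A minimal sketch, using a hypothetical demo data frame rather than x or penguins:

```r
demo <- data.frame(month = month.name[1:3], days = c(31, 28, 31))

one_col_vec <- demo[, "days"]                # a plain data frame drops to a vector
one_col_df  <- demo[, "days", drop = FALSE]  # drop = FALSE keeps it a data frame

is.data.frame(one_col_vec)  # FALSE
is.data.frame(one_col_df)   # TRUE
```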
We can select rows and columns at the same time:
penguins[1:2, 4:5]
Using x created above: select rows 2 through 5, and columns 1 through 2 from x
If we want rows or columns that aren’t next to each other, we can use a vector of indices.
x[c(1, 3), ]
Reference columns by name with $ notation (no quotes on names)
names(penguins)
## [1] "species" "island" "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year" "bill_length_cm"
penguins$species
## [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [13] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [25] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [37] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [49] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [61] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [73] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [85] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [97] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [109] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [121] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [133] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [145] Adelie Adelie Adelie Adelie Adelie Adelie gentoo gentoo gentoo gentoo gentoo gentoo
## [157] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [169] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [181] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [193] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [205] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [217] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [229] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [241] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [253] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo
## [265] gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo gentoo Chinstrap Chinstrap Chinstrap
## [277] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [289] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [301] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [313] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [325] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## Levels: Adelie Chinstrap gentoo
Note that the $ notation got us a vector back.
penguins$bill_length_mm
## [1] 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6
## [26] 40.5 37.9 40.5 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0 39.6 41.1 37.5 36.0 42.3 39.6 40.1
## [51] 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5 42.8 40.9
## [76] 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0
## [101] 37.7 37.8 37.9 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1 36.2 37.7 40.2 41.4 35.2 40.6 38.8
## [126] 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8 36.0 41.5
## [151] 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9
## [176] 46.1 44.5 47.8 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3 49.6 50.5 43.6 45.5 50.5 44.9 45.2
## [201] 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2 46.5 46.4
## [226] 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1
## [251] 48.5 55.9 47.2 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8 47.2 46.8 50.4 45.2 49.9 46.5 50.0
## [276] 51.3 45.4 52.7 45.2 46.1 51.3 46.0 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6 46.7 52.0 50.5
## [301] 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4
## [326] 45.7 50.7 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2
Use names in []: put them in quotes
penguins[,"species"]
Since we used [], we get a tibble/data.frame back this time
penguins[,"bill_length_mm"]
Multiple columns by name, need to use a vector of names:
penguins[,c("island", "species")]
Using the x data frame we created above:
$ notation
[]
If we have a boolean vector (TRUE and FALSE values) that is the same length as the number of rows or columns, we can use it to select from the data frame as we did with vectors.
penguins[penguins$bill_length_mm < 34,]
Note the rows of NA.
If we don’t want the missing rows included:
sum(is.na(penguins$bill_length_mm)) # how many missing?
## [1] 0
penguins[is.na(penguins$bill_length_mm), ] # which rows have missing values
# select rows where bill_length_mm is not missing and < 34
penguins[!is.na(penguins$bill_length_mm) & penguins$bill_length_mm < 34,]
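Here is a small sketch of where those NA rows come from, using a made-up data frame (demo and its columns are hypothetical). It also shows which() as one alternative, since which() returns only the positions that are TRUE.

```r
demo <- data.frame(id = 1:4, bill = c(33.1, NA, 45.0, 32.5))

keep <- demo$bill < 34
keep                   # TRUE NA FALSE TRUE: comparing NA gives NA

with_na_row <- demo[keep, ]         # the NA in keep yields a row of all NA
no_na_row   <- demo[which(keep), ]  # which() skips NA and FALSE

# Same rows as which(), spelled out the way the text does it
explicit <- demo[!is.na(demo$bill) & demo$bill < 34, ]
```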
If I forget the , in the []:
penguins[penguins$bill_length_mm < 34]
It tries to index the columns instead, and our vector is too long.
Multiple conditions
penguins[penguins$bill_length_mm < 34 & penguins$bill_depth_mm < 16,]
penguins[penguins$bill_length_mm < 34 | penguins$bill_length_mm > 58,]
Using x created above: select rows from x with 31 days. Remember: use == to test for equality.
Normally we don’t need to do this, but I want to save the current names first so we can restore them later:
oldnames <- names(penguins) # save so we can reset later
Then I can change the name of the first variable with:
names(penguins)[1] <- "boo" # change the name of the first column
names(penguins)
## [1] "boo" "island" "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year" "bill_length_cm"
names(penguins) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i") # change all of the column names; supply one name per column (9 here)
head(penguins)
Put the old names back
names(penguins) <- oldnames
head(penguins)
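Renaming by position breaks if the column order changes. A minimal sketch of renaming by matching on the current name instead, using a hypothetical demo data frame:

```r
demo <- data.frame(species = c("Adelie", "Gentoo"), year = c(2007, 2008))

oldnames <- names(demo)                         # save so we can reset later
names(demo)[names(demo) == "species"] <- "boo"  # match on the name, not the position
renamed <- names(demo)                          # "boo" "year"

names(demo) <- oldnames                         # put the old names back
```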
max(penguins$bill_length_mm)
## [1] 59.6
max(penguins$bill_length_mm, na.rm=TRUE)
## [1] 59.6
Now to get the observations (rows) for the penguins with that max value:
penguins[penguins$bill_length_mm == max(penguins$bill_length_mm, na.rm=TRUE),]
Break it out into steps:
max_bill_length <- max(penguins$bill_length_mm, na.rm=TRUE)
penguins[penguins$bill_length_mm == max_bill_length,]
Using x created above: select the rows from x where days is at its minimum value.
You can also use which.min() or which.max() to get the index location of the first value with the minimum or maximum value respectively (if there’s more than one you only get one):
x[which.min(x$days), ]
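The difference matters when there are ties. A short sketch with a made-up data frame (demo is hypothetical) where two rows share the minimum:

```r
demo <- data.frame(month = c("April", "June", "July"), days = c(30, 30, 31))

first_min <- demo[which.min(demo$days), ]         # one row: only the first minimum
all_min   <- demo[demo$days == min(demo$days), ]  # two rows: every row at the minimum
```

If you need every tied row, compare against min() or max(); if any one row will do, which.min() or which.max() is shorter.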
We can add new variables to the data frame by naming them with the $ notation:
penguins$bill_length_cm <- penguins$bill_length_mm / 10 # make new variable
names(penguins) # check to see that it was added
## [1] "species" "island" "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year" "bill_length_cm"
penguins[, c("bill_length_cm", "bill_length_mm")] # select the two vars to view them
Using x created above: make a new variable as part of x, called weeks, that is the number of days divided by 7
Any missing values? They show up in the summary() output:
summary(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## Adelie :150 Biscoe :167 Min. :32.10 Min. :13.10 Min. :172 Min. :2700 female:165 Min. :2007
## Chinstrap: 68 Dream :124 1st Qu.:39.30 1st Qu.:15.60 1st Qu.:190 1st Qu.:3550 male :167 1st Qu.:2007
## gentoo :123 Torgersen: 50 Median :44.50 Median :17.30 Median :197 Median :4050 NA's : 9 Median :2008
## Mean :43.94 Mean :17.15 Mean :201 Mean :4203 Mean :2008
## 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213 3rd Qu.:4750 3rd Qu.:2009
## Max. :59.60 Max. :21.50 Max. :231 Max. :6300 Max. :2009
## bill_length_cm
## Min. :3.210
## 1st Qu.:3.930
## Median :4.450
## Mean :4.394
## 3rd Qu.:4.850
## Max. :5.960
Look at the rows where body_mass_g is missing:
is.na(penguins$body_mass_g)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [22] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [106] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [127] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [148] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [190] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [211] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [274] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [295] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [316] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE
penguins[is.na(penguins$body_mass_g),]
Remove rows where body_mass_g is missing by selecting rows where the value is not missing:
penguins[!is.na(penguins$body_mass_g),]
penguins <- penguins[!is.na(penguins$body_mass_g),]
Remove by index instead:
penguins <- penguins[-1,] # removes the first row of the data
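A small sketch of both approaches on a made-up data frame (demo and its columns are hypothetical). It also shows that a negative vector removes several rows at once, and that nothing changes until the result is assigned.

```r
demo <- data.frame(id = 1:5, mass = c(3800, NA, 3450, 3650, NA))

no_missing <- demo[!is.na(demo$mass), ]  # keep rows where mass is present
no_first   <- demo[-1, ]                 # drop row 1
no_1_and_3 <- demo[-c(1, 3), ]           # drop several rows with a negative vector

nrow(demo)  # still 5: demo itself is unchanged because we never assigned back to it
```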
Using x created above: remove the 2nd row of x.
table(penguins$species)
##
## Adelie Chinstrap gentoo
## 149 68 123
species is a factor; let’s un-factor it first so we can change values. This is an extra step related to it being a factor.
penguins$species <- as.character(penguins$species)
Here are the general steps we’d do if we didn’t have a factor
penguins$species[penguins$species == "Gentoo"] # which observations do we want to replace?
## character(0)
penguins$species[penguins$species == "Gentoo"] <- "gentoo" # set the value
table(penguins$species)
##
## Adelie Chinstrap gentoo
## 149 68 123
We could re-factor it again if we wanted it to stay a factor:
penguins$species <- factor(penguins$species)
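The extra step is needed because assigning a value that is not already one of a factor's levels produces NA with a warning. Here is a sketch on a made-up factor (the species vector below is hypothetical, not the penguins column), showing the character round trip next to one alternative: renaming the level directly with levels().

```r
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Gentoo"))

# Option 1 (as in the text): convert to character, replace, convert back
as_chr <- as.character(species)
as_chr[as_chr == "Gentoo"] <- "gentoo"
via_character <- factor(as_chr)

# Option 2: rename the level itself, which updates every matching value
via_levels <- species
levels(via_levels)[levels(via_levels) == "Gentoo"] <- "gentoo"

table(via_levels)
```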
Using x created above: replace the month value “December” with the value “Dec” instead
We’re going to talk about sorting and ordering with a vector first.
y <- c(30, 60, 20, 30)
y
## [1] 30 60 20 30
sort() puts the values in order (doesn’t change the vector, returns a copy)
sort(y)
## [1] 20 30 30 60
Remember, it didn’t change y; sort() returned a copy.
y
## [1] 30 60 20 30
sort(y, decreasing=TRUE)
## [1] 60 30 30 20
y
## [1] 30 60 20 30
order() tells you what order to put the elements in to sort them:
order(y)
## [1] 3 1 4 2
You can use the output of order() to sort:
y[order(y)]
## [1] 20 30 30 60
order() also allows decreasing:
y[order(y, decreasing=TRUE)]
## [1] 60 30 30 20
Why do we care? Because we need this if we want to sort a data frame, not just a vector. sort() only takes a vector, that is, one column of a data frame. If we want to sort by one or more columns, we need to use order() to figure out what order to put the rows in, because we’d want entire rows re-ordered together.
We can sort a single column by itself
sort(penguins$bill_length_mm)
## [1] 32.1 33.1 33.5 34.0 34.1 34.4 34.5 34.6 34.6 35.0 35.0 35.1 35.2 35.3 35.5 35.5 35.6 35.7 35.7 35.7 35.9 35.9 36.0 36.0 36.0
## [26] 36.0 36.2 36.2 36.2 36.3 36.4 36.4 36.5 36.5 36.6 36.6 36.7 36.7 36.8 36.9 37.0 37.0 37.2 37.2 37.3 37.3 37.3 37.5 37.5 37.6
## [51] 37.6 37.6 37.7 37.7 37.7 37.8 37.8 37.8 37.8 37.8 37.9 37.9 38.1 38.1 38.1 38.1 38.2 38.2 38.3 38.5 38.6 38.6 38.6 38.7 38.8
## [76] 38.8 38.8 38.9 38.9 39.0 39.0 39.0 39.2 39.2 39.2 39.3 39.5 39.5 39.6 39.6 39.6 39.6 39.6 39.7 39.7 39.7 39.7 39.8 40.1 40.2
## [101] 40.2 40.2 40.3 40.3 40.5 40.5 40.6 40.6 40.6 40.6 40.7 40.8 40.8 40.9 40.9 40.9 40.9 41.0 41.1 41.1 41.1 41.1 41.1 41.1 41.1
## [126] 41.3 41.3 41.4 41.4 41.5 41.5 41.6 41.7 41.8 42.0 42.0 42.0 42.1 42.2 42.2 42.3 42.4 42.5 42.5 42.5 42.6 42.7 42.7 42.8 42.8
## [151] 42.9 42.9 43.1 43.2 43.2 43.2 43.2 43.3 43.3 43.4 43.5 43.5 43.5 43.6 43.8 44.0 44.1 44.1 44.4 44.5 44.5 44.5 44.9 44.9 45.0
## [176] 45.1 45.1 45.1 45.2 45.2 45.2 45.2 45.2 45.2 45.3 45.3 45.4 45.4 45.5 45.5 45.5 45.5 45.5 45.6 45.6 45.7 45.7 45.7 45.8 45.8
## [201] 45.8 45.9 46.0 46.0 46.1 46.1 46.1 46.2 46.2 46.2 46.2 46.2 46.3 46.4 46.4 46.4 46.4 46.5 46.5 46.5 46.5 46.5 46.6 46.6 46.7
## [226] 46.7 46.8 46.8 46.8 46.8 46.9 46.9 47.0 47.2 47.2 47.3 47.3 47.4 47.5 47.5 47.5 47.5 47.6 47.6 47.7 47.8 48.1 48.1 48.2 48.2
## [251] 48.4 48.4 48.4 48.5 48.5 48.5 48.6 48.7 48.7 48.7 48.8 49.0 49.0 49.0 49.1 49.1 49.1 49.2 49.2 49.3 49.3 49.4 49.5 49.5 49.5
## [276] 49.6 49.6 49.6 49.7 49.8 49.8 49.8 49.9 50.0 50.0 50.0 50.0 50.0 50.1 50.1 50.2 50.2 50.2 50.3 50.4 50.4 50.5 50.5 50.5 50.5
## [301] 50.5 50.6 50.7 50.7 50.8 50.8 50.8 50.8 50.9 50.9 51.0 51.1 51.1 51.3 51.3 51.3 51.3 51.4 51.5 51.5 51.7 51.9 52.0 52.0 52.0
## [326] 52.1 52.2 52.2 52.5 52.7 52.8 53.4 53.5 54.2 54.3 55.1 55.8 55.9 58.0 59.6
But if we want rows re-ordered by the value of bill_length_mm, we need to get the order:
order(penguins$bill_length_mm)
## [1] 140 96 68 90 6 16 52 12 78 50 98 81 122 23 64 72 134 58 88 116 20 66 40 46 146 148 76 92 118 84 32 62
## [33] 38 56 13 145 2 80 130 85 42 136 29 75 82 115 142 45 131 36 57 60 19 100 119 8 9 18 101 147 26 102 94 106
## [65] 108 132 21 105 86 128 11 104 114 14 22 34 124 4 87 54 126 143 5 33 144 3 28 30 43 48 70 91 112 69 103 110
## [97] 137 37 49 120 135 138 1 95 25 27 24 55 123 139 141 39 93 31 74 160 303 99 10 44 61 65 89 117 133 59 83 53
## [129] 121 125 149 63 256 67 7 51 166 77 35 111 47 293 15 305 327 186 113 190 73 181 79 174 129 97 107 208 295 158 258 236
## [161] 218 262 337 195 206 188 41 127 187 176 244 266 198 234 205 171 182 202 199 228 229 271 278 329 192 210 156 276 162 196 207 242
## [193] 316 109 332 212 325 335 71 164 214 287 17 281 150 175 279 168 211 216 264 314 173 221 224 291 301 155 172 204 223 273 200 283
## [225] 157 297 159 255 269 334 247 311 285 252 268 180 254 232 226 238 240 308 154 309 220 177 259 323 178 222 163 185 248 201 250 294
## [257] 225 152 169 189 267 161 313 320 184 230 253 167 292 165 330 246 217 243 300 191 193 338 307 215 261 322 272 151 153 179 233 274
## [289] 203 319 170 331 340 289 209 270 194 197 260 288 299 296 219 326 235 245 318 339 315 317 306 227 249 237 275 280 282 324 263 321
## [321] 284 333 286 298 310 239 241 328 231 277 302 257 312 304 213 265 336 251 290 183
and then use that to re-order the whole data frame
penguins[order(penguins$bill_length_mm), ]
order() will take multiple vectors (columns):
penguins[order(penguins$island,
penguins$bill_length_mm), ]
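With multiple columns, ties in the first column are broken by the second. A minimal sketch on a made-up data frame (demo is hypothetical); negating a numeric column is one common way to sort that column from largest to smallest while the other stays ascending:

```r
demo <- data.frame(island = c("Dream", "Biscoe", "Dream", "Biscoe"),
                   bill   = c(40.1, 38.2, 36.5, 45.0))

# island A to Z, then bill from largest to smallest within each island
sorted <- demo[order(demo$island, -demo$bill), ]
sorted
```

The row names in the result show the original positions, which is a quick way to confirm whole rows moved together.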
Sort the penguins data frame by flipper_length_mm in decreasing order.
Note: there’s an easier way to sort using the package dplyr - the arrange() function. We aren’t covering the dplyr package this week, but we have additional workshops on the Tidyverse (which includes dplyr).
Today you’ll work more with data frames. You’ll also learn how to read a CSV file into an R data frame, and go over the parts of an R help page.
There are exercises, but I strongly encourage you to read in one of your own data files and start working with it. If you don’t have a data file, there are some examples linked in the exercises that you could try.
One tip to start: if you get an error message that a file can’t be found when you’re trying to import it:
Check the spelling of the filename for typos
Check your working directory (getwd()) and make sure the path to the file is correct and completely specified given what your working directory is.
Make sure the file is actually in the folder you think it is. I recommend copying or moving any downloaded files into the project/directory for this workshop so you know you really have the file. I’ve seen some problems on Windows computers in particular before where a .zip file isn’t really unzipped - it’s just letting you see inside without actually expanding the contents and creating real files.
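The checks above can be sketched in code. This is only an illustration: the file name and its contents are made up, and the file is written to a temporary folder so the example has something to find.

```r
# Hypothetical file: write a tiny CSV to a temporary folder
csv_path <- file.path(tempdir(), "example_penguins.csv")
write.csv(data.frame(id = 1:2, mass = c(3800, 3250)), csv_path, row.names = FALSE)

getwd()                # where R resolves relative paths from
list.files(tempdir())  # what is actually in the folder
file.exists(csv_path)  # TRUE only if the path points at a real file

# Only try to import once the file is confirmed to exist
if (file.exists(csv_path)) {
  imported <- read.csv(csv_path)
}
```

With your own data, replace csv_path with the path to your file and run file.exists() on it before importing.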