Run this cell today (we ignored it other days, but need to run it today):
# have our notebook use the project directory as the working directory;
# defaults to the directory of the notebook if we don't have this
knitr::opts_knit$set(root.dir = normalizePath("../"))
On previous days we covered things you’ll likely do regularly in R. Today, we’re covering programming topics that you likely won’t use as often. However, the concepts today can help you make your scripts less repetitive, which will help reduce errors and make your life easier.
One thing that is common about today is that we use {}
(curly braces) to group multiple lines of R code together in a block. Code that is in {}
is a unit that is executed together.
RStudio will indent lines of code within {}
for you, but this is just formatting – it doesn’t affect how the code executes.
Unlike the previous days, I’m not going to break for exercises during the session today. There are exercise files in the repository that you can work on after the session.
When do I write functions in R?
tapply()
or aggregate()
that takes another function, but I need to do multiple steps to the dataFunctions can take inputs, and they can return values, but they don’t have to do either:
say_hello <- function() {
print("Hello!")
}
say_hello()
## [1] "Hello!"
Add input:
say_hello <- function(name) {
paste("Hello", name)
}
say_hello("Henry")
## [1] "Hello Henry"
We can be explicit about what the return value (output) of a function is with return()
, but in R, functions will by default return the value of the last command:
x <- say_hello("Christina")
x
## [1] "Hello Christina"
Here’s a more realistic example. I want to make the same plot for multiple variables. The only thing that changes is the labels and which variable I’m plotting. I could copy and paste the same code over and over again (sometimes this makes sense). But then if I change my mind about the styling, I have to change the code in multiple places. So I can use a function:
my_plot <- function(data, my_label) {
hist(data,
col = "pink",
breaks = 20,
border = FALSE, # no line around bars
main = "Cars Dataset",
xlab = my_label) # get the value from input
}
# use the built in mtcars data
my_plot(mtcars$mpg, "MPG")
my_plot(mtcars$disp, "Displacement")
my_plot(mtcars$wt, "Weight")
Write a function that returns (gives as output) the sum of two input values. Call it to test it.
Challenge: make one of the input values optional by setting a default value for it. We didn’t talk about this above. For a hint, take a look at the help page for read.csv
and see how the arguments with default values are specified.
There’s much more to learn about functions, but you can practice on your own later today and ask questions on the discussion boards.
As you learn more about functions, you’ll want to pay attention to the scope of variables: if you use x
inside your function, where does R look for x
? And if you change the value of x
, what happens? You can also set default values for your inputs like many of the R functions you’ve already used have.
A for loop lets us to the same thing with each element in a vector: for each element in the vector, do…
for (variable in vector) {
# do stuff
}
variable
is any name you want to give your loop variable – usually something short; vector
is any vector.
With a for loop, we often have some variable we define before the loop to hold our result, run the loop, and then use the result after. But you could also use a loop to change some existing data in place.
There are a few different ways we might use a for loop.
You may want to use the values of a vector directly:
letter_sample <- sample(letters, 50, replace=TRUE) # letters is a built-in vector
letter_sample
## [1] "s" "e" "m" "e" "h" "b" "u" "i" "e" "b" "u" "u" "j" "a" "u" "c" "e" "y" "u" "q" "k" "k" "c" "h" "q" "y" "r" "t" "m" "c" "m" "y"
## [33] "j" "m" "x" "y" "e" "q" "v" "x" "l" "k" "a" "y" "d" "k" "a" "u" "a" "v"
result <- ''
for (letter in letter_sample) {
print(letter) # so we can see what's happening
if (letter %in% c('a', 'e', 'i', 'o', 'u')) {
result <- paste0(result, letter) # concatenate the letter onto the end of the result
}
}
## [1] "s"
## [1] "e"
## [1] "m"
## [1] "e"
## [1] "h"
## [1] "b"
## [1] "u"
## [1] "i"
## [1] "e"
## [1] "b"
## [1] "u"
## [1] "u"
## [1] "j"
## [1] "a"
## [1] "u"
## [1] "c"
## [1] "e"
## [1] "y"
## [1] "u"
## [1] "q"
## [1] "k"
## [1] "k"
## [1] "c"
## [1] "h"
## [1] "q"
## [1] "y"
## [1] "r"
## [1] "t"
## [1] "m"
## [1] "c"
## [1] "m"
## [1] "y"
## [1] "j"
## [1] "m"
## [1] "x"
## [1] "y"
## [1] "e"
## [1] "q"
## [1] "v"
## [1] "x"
## [1] "l"
## [1] "k"
## [1] "a"
## [1] "y"
## [1] "d"
## [1] "k"
## [1] "a"
## [1] "u"
## [1] "a"
## [1] "v"
result
## [1] "eeuieuuaueueaaua"
You may want to use a counter variable in a for loop. Here, we’re not using the values in the vector or the variable – it’s just telling us how many times to repeat the loop.
result <- 0
for (i in 1:20) {
print(i)
if (runif(1) > .5) { # random draw from a uniform distribution (0 to 1)
result <- result + 1 # keep track of how many times the value was > .5
}
print(paste0("Result: ", result))
}
## [1] 1
## [1] "Result: 1"
## [1] 2
## [1] "Result: 2"
## [1] 3
## [1] "Result: 3"
## [1] 4
## [1] "Result: 3"
## [1] 5
## [1] "Result: 3"
## [1] 6
## [1] "Result: 4"
## [1] 7
## [1] "Result: 5"
## [1] 8
## [1] "Result: 5"
## [1] 9
## [1] "Result: 6"
## [1] 10
## [1] "Result: 7"
## [1] 11
## [1] "Result: 8"
## [1] 12
## [1] "Result: 8"
## [1] 13
## [1] "Result: 8"
## [1] 14
## [1] "Result: 9"
## [1] 15
## [1] "Result: 9"
## [1] 16
## [1] "Result: 10"
## [1] 17
## [1] "Result: 11"
## [1] 18
## [1] "Result: 11"
## [1] 19
## [1] "Result: 12"
## [1] 20
## [1] "Result: 13"
result # you'll get different results each time
## [1] 13
We could use the counter variable to fill each element of a result vector one at a time.
result <- rep(NA, 20) # make the vector of the length we need ahead of time
for (i in 1:length(result)) {
print(i)
result[i] <- i^2 # replace each value in result one by one as i increments
print(result)
}
## [1] 1
## [1] 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 2
## [1] 1 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 3
## [1] 1 4 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 4
## [1] 1 4 9 16 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 5
## [1] 1 4 9 16 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 6
## [1] 1 4 9 16 25 36 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 7
## [1] 1 4 9 16 25 36 49 NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 8
## [1] 1 4 9 16 25 36 49 64 NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 9
## [1] 1 4 9 16 25 36 49 64 81 NA NA NA NA NA NA NA NA NA NA NA
## [1] 10
## [1] 1 4 9 16 25 36 49 64 81 100 NA NA NA NA NA NA NA NA NA NA
## [1] 11
## [1] 1 4 9 16 25 36 49 64 81 100 121 NA NA NA NA NA NA NA NA NA
## [1] 12
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 NA NA NA NA NA NA NA NA
## [1] 13
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 NA NA NA NA NA NA NA
## [1] 14
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 NA NA NA NA NA NA
## [1] 15
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 NA NA NA NA NA
## [1] 16
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 NA NA NA NA
## [1] 17
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 NA NA NA
## [1] 18
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 NA NA
## [1] 19
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 NA
## [1] 20
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 400
result
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 400
Normally we’d do something more complicated than squaring a number – it’s just an example of the format.
Using the built in mtcars
dataset, use a for loop to print each car’s name and the mpg
. As a starting point, to do this for the car in the 3rd row, you could do: paste0(rownames(mtcars)[3], ": ", mtcars$mpg[3])
. Inside a loop, you’ll need to use print
as well to see the results: print(paste0(rownames(mtcars)[3], ": ", mtcars$mpg[3]))
Hint: you can get the number of rows in the data with nrow(mtcars)
.
Challenge: instead of printing output for each iteration of your loop, save your pasted strings in a vector, and then print the vector when you’re done.
Challenge 2: do the same thing without a loop. No trick here, just a reminder that many functions are vectorized.
There are functions (e.g. the apply()
family of functions and the purrr
library) that will iterate over elements like a loop does. And many functions will accept vectors directly, so you don’t need to explicitly loop over the elements of a vector.
For example, with the simple for loop examples above, we don’t really need to use a for loop:
paste(letter_sample[letter_sample %in% c('a', 'e', 'i', 'o', 'u')], collapse="")
sum(runif(20) > .5)
(1:20)^2
But sometimes for loops can just be easier to wrap your head around or make your code clearer.
Here’s a more realistic example.
The data/pdb
directory has multiple data files. Here’s what one looks like:
COMPND Ammonia
AUTHOR DAVE WOODCOCK 97 10 31
ATOM 1 N 1 0.257 -0.363 0.000
ATOM 2 H 1 0.257 0.727 0.000
ATOM 3 H 1 0.771 -0.727 0.890
ATOM 4 H 1 0.771 -0.727 -0.890
TER 5 1
END
We want to process each of these files. We want the data from the lines that start with ATOM, and we need the molecule name from the first line.
We need two loops:
Let’s start with the first loop. There’s a function that will list all of the files in a directory (optionally that match a pattern):
list.files("data/pdb", # where the files are
"\\.pdb$", # optional pattern (regular expression) to match filenames against
full.names=TRUE) # return the path to the file, not just the name
## [1] "data/pdb/aldrin.pdb" "data/pdb/ammonia.pdb" "data/pdb/ascorbic-acid.pdb" "data/pdb/benzaldehyde.pdb"
## [5] "data/pdb/camphene.pdb" "data/pdb/cholesterol.pdb" "data/pdb/cinnamaldehyde.pdb" "data/pdb/citronellal.pdb"
## [9] "data/pdb/codeine.pdb" "data/pdb/cubane.pdb" "data/pdb/cyclobutane.pdb" "data/pdb/cyclohexanol.pdb"
## [13] "data/pdb/cyclopropane.pdb" "data/pdb/ethane.pdb" "data/pdb/ethanol.pdb" "data/pdb/glycol.pdb"
## [17] "data/pdb/heme.pdb" "data/pdb/lactose.pdb" "data/pdb/lsd.pdb" "data/pdb/maltose.pdb"
## [21] "data/pdb/menthol.pdb" "data/pdb/methane.pdb" "data/pdb/methanol.pdb" "data/pdb/mint.pdb"
## [25] "data/pdb/morphine.pdb" "data/pdb/mustard.pdb" "data/pdb/nerol.pdb" "data/pdb/norethindrone.pdb"
## [29] "data/pdb/octane.pdb" "data/pdb/pentane.pdb" "data/pdb/piperine.pdb" "data/pdb/propane.pdb"
## [33] "data/pdb/pyridoxal.pdb" "data/pdb/quinine.pdb" "data/pdb/strychnine.pdb" "data/pdb/styrene.pdb"
## [37] "data/pdb/sucrose.pdb" "data/pdb/testosterone.pdb" "data/pdb/thiamine.pdb" "data/pdb/tnt.pdb"
## [41] "data/pdb/tyrian-purple.pdb" "data/pdb/vanillin.pdb" "data/pdb/vinyl-chloride.pdb" "data/pdb/vitamin-a.pdb"
This gives us a vector we can loop through:
for (file in list.files("data/pdb", "\\.pdb$", full.names=TRUE)) {
print(file) # just print for now...
}
## [1] "data/pdb/aldrin.pdb"
## [1] "data/pdb/ammonia.pdb"
## [1] "data/pdb/ascorbic-acid.pdb"
## [1] "data/pdb/benzaldehyde.pdb"
## [1] "data/pdb/camphene.pdb"
## [1] "data/pdb/cholesterol.pdb"
## [1] "data/pdb/cinnamaldehyde.pdb"
## [1] "data/pdb/citronellal.pdb"
## [1] "data/pdb/codeine.pdb"
## [1] "data/pdb/cubane.pdb"
## [1] "data/pdb/cyclobutane.pdb"
## [1] "data/pdb/cyclohexanol.pdb"
## [1] "data/pdb/cyclopropane.pdb"
## [1] "data/pdb/ethane.pdb"
## [1] "data/pdb/ethanol.pdb"
## [1] "data/pdb/glycol.pdb"
## [1] "data/pdb/heme.pdb"
## [1] "data/pdb/lactose.pdb"
## [1] "data/pdb/lsd.pdb"
## [1] "data/pdb/maltose.pdb"
## [1] "data/pdb/menthol.pdb"
## [1] "data/pdb/methane.pdb"
## [1] "data/pdb/methanol.pdb"
## [1] "data/pdb/mint.pdb"
## [1] "data/pdb/morphine.pdb"
## [1] "data/pdb/mustard.pdb"
## [1] "data/pdb/nerol.pdb"
## [1] "data/pdb/norethindrone.pdb"
## [1] "data/pdb/octane.pdb"
## [1] "data/pdb/pentane.pdb"
## [1] "data/pdb/piperine.pdb"
## [1] "data/pdb/propane.pdb"
## [1] "data/pdb/pyridoxal.pdb"
## [1] "data/pdb/quinine.pdb"
## [1] "data/pdb/strychnine.pdb"
## [1] "data/pdb/styrene.pdb"
## [1] "data/pdb/sucrose.pdb"
## [1] "data/pdb/testosterone.pdb"
## [1] "data/pdb/thiamine.pdb"
## [1] "data/pdb/tnt.pdb"
## [1] "data/pdb/tyrian-purple.pdb"
## [1] "data/pdb/vanillin.pdb"
## [1] "data/pdb/vinyl-chloride.pdb"
## [1] "data/pdb/vitamin-a.pdb"
With the name, we can open each file. But these files don’t have a standard format (like CSV) to read in, so we’ll read in each line as text with readLines()
.
readLines("data/pdb/ammonia.pdb")
## [1] "COMPND Ammonia" "AUTHOR DAVE WOODCOCK 97 10 31"
## [3] "ATOM 1 N 1 0.257 -0.363 0.000" "ATOM 2 H 1 0.257 0.727 0.000"
## [5] "ATOM 3 H 1 0.771 -0.727 0.890" "ATOM 4 H 1 0.771 -0.727 -0.890"
## [7] "TER 5 1" "END"
Now loop through lines within our loop to read the files (we call this nested loops – one is nested inside the other):
filelist <- list.files("data/pdb", "\\.pdb$", full.names=TRUE)
for (file in filelist[1:3]) { # just loop through a few for now to keep the output small
for (line in readLines(file)) { # file comes from the loop above
print(line) # just print the line for now -- we'll process them later
}
}
## [1] "COMPND ALDRIN"
## [1] "AUTHOR DAVE WOODCOCK 97 08 06"
## [1] "ATOM 1 C 1 -0.888 0.654 -1.753"
## [1] "ATOM 2 C 1 0.276 1.404 -1.135"
## [1] "ATOM 3 C 1 1.381 0.453 -0.730"
## [1] "ATOM 4 C 1 0.786 -0.400 0.372"
## [1] "ATOM 5 C 1 -0.640 0.143 0.522"
## [1] "ATOM 6 C 1 -1.451 -0.124 -0.733"
## [1] "ATOM 7 CL 1 -1.479 -0.191 2.029"
## [1] "ATOM 8 CL 1 -2.844 -1.216 -0.888"
## [1] "ATOM 9 CL 1 -1.450 0.743 -3.432"
## [1] "ATOM 10 CL 1 0.737 2.870 -1.987"
## [1] "ATOM 11 C 1 -0.369 1.633 0.244"
## [1] "ATOM 12 CL 1 -1.860 2.599 0.158"
## [1] "ATOM 13 CL 1 0.735 2.363 1.410"
## [1] "ATOM 14 C 1 1.037 -1.812 -0.091"
## [1] "ATOM 15 C 1 0.994 -1.782 -1.608"
## [1] "ATOM 16 C 1 1.937 -0.617 -1.697"
## [1] "ATOM 17 C 1 2.584 -2.004 0.033"
## [1] "ATOM 18 C 1 3.113 -1.214 -1.046"
## [1] "ATOM 19 H 1 2.231 1.006 -0.328"
## [1] "ATOM 20 H 1 1.320 -0.204 1.301"
## [1] "ATOM 21 H 1 0.442 -2.597 0.378"
## [1] "ATOM 22 H 1 0.002 -1.636 -2.012"
## [1] "ATOM 23 H 1 1.441 -2.681 -2.039"
## [1] "ATOM 24 H 1 2.104 -0.212 -2.687"
## [1] "ATOM 25 H 1 2.985 -2.347 0.999"
## [1] "ATOM 26 H 1 4.160 -0.911 -0.708"
## [1] "TER 27 1"
## [1] "END"
## [1] "COMPND Ammonia"
## [1] "AUTHOR DAVE WOODCOCK 97 10 31"
## [1] "ATOM 1 N 1 0.257 -0.363 0.000"
## [1] "ATOM 2 H 1 0.257 0.727 0.000"
## [1] "ATOM 3 H 1 0.771 -0.727 0.890"
## [1] "ATOM 4 H 1 0.771 -0.727 -0.890"
## [1] "TER 5 1"
## [1] "END"
## [1] "COMPND ASCORBIC ACID, VITAMIN C"
## [1] "AUTHOR DAVE WOODCOCK 96 05 24"
## [1] "ATOM 1 C 1 0.182 0.119 -0.645 1.00 0.00"
## [1] "ATOM 2 C 1 0.591 -1.187 0.091 1.00 0.00"
## [1] "ATOM 3 C 1 1.908 -1.150 0.401 1.00 0.00"
## [1] "ATOM 4 C 1 2.407 0.104 0.031 1.00 0.00"
## [1] "ATOM 5 O 1 1.453 0.978 -0.535 1.00 0.00"
## [1] "ATOM 6 O 1 3.677 0.478 0.175 1.00 0.00"
## [1] "ATOM 7 O 1 2.621 -2.174 1.011 1.00 0.00"
## [1] "ATOM 8 O 1 -0.324 -2.236 0.293 1.00 0.00"
## [1] "ATOM 9 H 1 -0.030 -0.160 -1.688 1.00 0.00"
## [1] "ATOM 10 C 1 -1.084 0.762 0.207 1.00 0.00"
## [1] "ATOM 11 H 1 -1.959 0.100 0.315 1.00 0.00"
## [1] "ATOM 12 O 1 -0.591 1.195 1.590 1.00 0.00"
## [1] "ATOM 13 C 1 -1.585 2.076 -0.581 1.00 0.00"
## [1] "ATOM 14 O 1 -2.771 2.432 0.067 1.00 0.00"
## [1] "ATOM 15 H 1 1.946 -2.997 1.192 1.00 0.00"
## [1] "ATOM 16 H 1 -0.747 -2.497 -0.666 1.00 0.00"
## [1] "ATOM 17 H 1 0.118 2.004 1.470 1.00 0.00"
## [1] "ATOM 18 H 1 -1.839 1.784 -1.606 1.00 0.00"
## [1] "ATOM 19 H 1 -0.831 2.856 -0.721 1.00 0.00"
## [1] "ATOM 20 H 1 -2.572 2.906 1.016 1.00 0.00"
## [1] "TER 21 1"
## [1] "END"
There are two ways to have the logic of if-then-else in R.
If-else statements evaluate conditional statements that produce a single TRUE or FALSE value – not vectors. This is like if-else statements you might have used in other programming languages.
ifelse()
takes a vector of TRUE and FALSE values and then returns a vector where you have a different value for TRUE than for FALSE.
When have I used an if-else statement (#1 above) in my R code?
I use ifelse()
regularly when recoding data.
Let’s talk about ifelse() first. Format is:
ifelse(vector of TRUE/FALSE, value to return if TRUE, value to return if FALSE)
ifelse(c(FALSE, TRUE), "a", "b")
## [1] "b" "a"
my_measure <- 1:20
my_measure %% 5 == 0
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
ifelse(my_measure %% 5 == 0, my_measure, NA)
## [1] NA NA NA NA 5 NA NA NA NA 10 NA NA NA NA 15 NA NA NA NA 20
If I wanted to change my_measure, I could do
my_measure <- ifelse(my_measure %% 5 == 0, my_measure, NA)
instead of using ifelse
, I could have written the last line above as:
my_measure[my_measure %% 5 != 0] <- NA
But I prefer ifelse
- I find it easier to read and follow, but that’s personal preference. One advantage ifelse
has though is that I can use the expression without changing my data. With the second option above (without ifelse
), I’m having to save the result because I’m only changing a subset – ifelse
preserves the full vector.
Use ifelse
to recode values of the bird vector that aren’t birds to NA
.
Hint: you’ll need to remember how to use some type of logical condition – some combination of |
, !
==
%in%
or similar operators.
birds <- c("duck", "penguin", "bear", "cow")
ifelse(_____, # test condition
_____, # value if TRUE
_____) # value if FALSE
Unlike ifelse()
, if-else statements can only process a single TRUE or FALSE value. You’ll get a warning message if you try to use a vector (but not an error!)
The basic syntax is:
if (condition) {
# do something
}
and you can add on else:
if (condition) {
# do something if true
} else {
# do something else bc false
}
or another if:
if (condition) {
# do something
} else if (condition2) {
# do something else
} else {
# do a third thing
}
Example
car_index <- 5
if (mtcars$mpg[car_index] > 25) {
print("It's efficient")
} else if (mtcars$mpg[car_index] > 20) {
print("It's OK")
} else {
print("mpg could be better")
}
## [1] "mpg could be better"
Write an if else statement that will tell you whether a number (stored in the variable my_num
) is positive, negative, or zero 0.
my_num <- 15
Let’s use if else to process the pdb data files.
If the line in the file starts with “COMPND”, we want to get the molecule name. If it starts with “ATOM” we want to add the data to our data set with the compound name. Other lines we want to ignore.
How can we detect these lines?
line1 <- "COMPND Ammonia"
line2 <- "ATOM 1 N 1 0.257 -0.363 0.000"
startsWith(line1, "COMPND")
startsWith(line1, "ATOM")
startsWith(line2, "ATOM")
output <- c() # to hold data
filelist <- list.files("data/pdb", "\\.pdb$", full.names=TRUE)
for (file in filelist[1:3]) { # just loop through 3 for now to keep the output small
molecule_name <- NA # to keep track of the name once we find it to add to each line
for (line in readLines(file)) {
if (startsWith(line, "COMPND")) {
# Remove "COMPND" from the line and white spaces from beginning and end of what's left
molecule_name <- trimws(sub("COMPND", "", line))
} else if (startsWith(line, "ATOM")) {
# split line into components on the spaces
processed <- unlist(strsplit(line, " +")) # " +" is a regular expression for 1 or more spaces in a row
# some files have extra data at the end of the line, so just keep the values we want
processed <- processed[1:7]
# replace "ATOM" with molecule name instead
processed[1] <- molecule_name # was set via a previous iteration of the loop
# add to output as a tab separated string (there are commas in some names) -
# how we'd write it to file
output <- c(output, paste(processed, collapse="\t"))
}
}
}
head(output)
## [1] "ALDRIN\t1\tC\t1\t-0.888\t0.654\t-1.753" "ALDRIN\t2\tC\t1\t0.276\t1.404\t-1.135" "ALDRIN\t3\tC\t1\t1.381\t0.453\t-0.730"
## [4] "ALDRIN\t4\tC\t1\t0.786\t-0.400\t0.372" "ALDRIN\t5\tC\t1\t-0.640\t0.143\t0.522" "ALDRIN\t6\tC\t1\t-1.451\t-0.124\t-0.733"
This is a vector of character data. How do we get it into a data frame? This is a bit tricky. We could write the data to file, and then read it back in:
writeLines(output, "data/combined_pdb.tsv") # writeLines because we just have a vector of character data
output <- read.csv("data/combined_pdb.tsv", sep="\t")
But it turns out we don’t have to write it to file first if we don’t want to (this is the tricky bit):
output <- read.csv(text=output, # read from output instead of a file
col.names=c("molecule", "v1", "v2", "v3", "v4", "v5", "v6"),
sep="\t") # we used tabs instead of commas
output
I didn’t just sit down and write this script straight through. I built it up piece by piece, ran it, found things that were wrong, and then fixed them. And I kept doing this over and over.
I ran into issues:
read.csv()
on text I already had within R – I googled to see how to parse a string of CSV data into a data frame and found this solution.As with other days, there are a few programming exercises in the repository for you to work on. Unlike other days, these concepts can be hard to apply to your own data and code until a need for them arises. If you get through the programming material, I recommend returning to one of your own data sets and continuing with your own work in R. I’m happy to answer any R questions at all this afternoon.
After today, next steps are for you to keep working and learning. We’re here to help you. I’ll monitor the discussion forum on Canvas for another week. After that, reach out with a consultation request if we can be of help.
We teach other R workshops throughout the year - keep a look out for them.
I’ll send out follow-up links with everything via email.