Setup

Run this cell today (we ignored it other days, but need to run it today):

# have our notebook use the project directory as the working directory;
# defaults to the directory of the notebook if we don't have this
knitr::opts_knit$set(root.dir = normalizePath("../")) 

Intro

On previous days we covered things you’ll likely do regularly in R. Today, we’re covering programming topics that you likely won’t use as often. However, the concepts today can help you make your scripts less repetitive, which will help reduce errors and make your life easier.

One thing that is common about today is that we use {} (curly braces) to group multiple lines of R code together in a block. Code that is in {} is a unit that is executed together.

RStudio will indent lines of code within {} for you, but this is just formatting – it doesn’t affect how the code executes.

Unlike the previous days, I’m not going to break for exercises during the session today. There are exercise files in the repository that you can work on after the session.

Functions

When do I write functions in R?

Functions can take inputs, and they can return values, but they don’t have to do either:

say_hello <- function() {
  print("Hello!")
}
say_hello()
## [1] "Hello!"

Add input:

say_hello <- function(name) {
  paste("Hello", name)
}

say_hello("Henry")
## [1] "Hello Henry"

We can be explicit about what the return value (output) of a function is with return(), but in R, functions will by default return the value of the last command:

x <- say_hello("Christina")

x
## [1] "Hello Christina"

Here’s a more realistic example. I want to make the same plot for multiple variables. The only thing that changes is the labels and which variable I’m plotting. I could copy and paste the same code over and over again (sometimes this makes sense). But then if I change my mind about the styling, I have to change the code in multiple places. So I can use a function:

my_plot <- function(data, my_label) {
    hist(data, 
         col = "pink",
         breaks = 20,
         border = FALSE,  # no line around bars
         main = "Cars Dataset", 
         xlab = my_label)  # get the value from input
}

# use the built in mtcars data
my_plot(mtcars$mpg, "MPG")
my_plot(mtcars$disp, "Displacement")
my_plot(mtcars$wt, "Weight")

EXERCISE

Write a function that returns (gives as output) the sum of two input values. Call it to test it.

Challenge: make one of the input values optional by setting a default value for it. We didn’t talk about this above. For a hint, take a look at the help page for read.csv and see how the arguments with default values are specified.

There’s much more to learn about functions, but you can practice on your own later today and ask questions on the discussion boards.

As you learn more about functions, you’ll want to pay attention to the scope of variables: if you use x inside your function, where does R look for x? And if you change the value of x, what happens? You can also set default values for your inputs like many of the R functions you’ve already used have.

For loops

A for loop lets us to the same thing with each element in a vector: for each element in the vector, do…

for (variable in vector) {
  # do stuff
}

variable is any name you want to give your loop variable – usually something short; vector is any vector.

With a for loop, we often have some variable we define before the loop to hold our result, run the loop, and then use the result after. But you could also use a loop to change some existing data in place.

There are a few different ways we might use a for loop.

Example 1

You may want to use the values of a vector directly:

letter_sample <- sample(letters, 50, replace=TRUE) # letters is a built-in vector
letter_sample
##  [1] "s" "e" "m" "e" "h" "b" "u" "i" "e" "b" "u" "u" "j" "a" "u" "c" "e" "y" "u" "q" "k" "k" "c" "h" "q" "y" "r" "t" "m" "c" "m" "y"
## [33] "j" "m" "x" "y" "e" "q" "v" "x" "l" "k" "a" "y" "d" "k" "a" "u" "a" "v"
result <- ''
for (letter in letter_sample) {  
  print(letter)   # so we can see what's happening
  if (letter %in% c('a', 'e', 'i', 'o', 'u')) {
    result <- paste0(result, letter)  # concatenate the letter onto the end of the result 
  }
}
## [1] "s"
## [1] "e"
## [1] "m"
## [1] "e"
## [1] "h"
## [1] "b"
## [1] "u"
## [1] "i"
## [1] "e"
## [1] "b"
## [1] "u"
## [1] "u"
## [1] "j"
## [1] "a"
## [1] "u"
## [1] "c"
## [1] "e"
## [1] "y"
## [1] "u"
## [1] "q"
## [1] "k"
## [1] "k"
## [1] "c"
## [1] "h"
## [1] "q"
## [1] "y"
## [1] "r"
## [1] "t"
## [1] "m"
## [1] "c"
## [1] "m"
## [1] "y"
## [1] "j"
## [1] "m"
## [1] "x"
## [1] "y"
## [1] "e"
## [1] "q"
## [1] "v"
## [1] "x"
## [1] "l"
## [1] "k"
## [1] "a"
## [1] "y"
## [1] "d"
## [1] "k"
## [1] "a"
## [1] "u"
## [1] "a"
## [1] "v"
result  
## [1] "eeuieuuaueueaaua"

Example 2

You may want to use a counter variable in a for loop. Here, we’re not using the values in the vector or the variable – it’s just telling us how many times to repeat the loop.

result <- 0
for (i in 1:20) {
  print(i)
  if (runif(1) > .5) {  # random draw from a uniform distribution (0 to 1)
    result <- result + 1  # keep track of how many times the value was > .5
  }
  print(paste0("Result: ", result))
}
## [1] 1
## [1] "Result: 1"
## [1] 2
## [1] "Result: 2"
## [1] 3
## [1] "Result: 3"
## [1] 4
## [1] "Result: 3"
## [1] 5
## [1] "Result: 3"
## [1] 6
## [1] "Result: 4"
## [1] 7
## [1] "Result: 5"
## [1] 8
## [1] "Result: 5"
## [1] 9
## [1] "Result: 6"
## [1] 10
## [1] "Result: 7"
## [1] 11
## [1] "Result: 8"
## [1] 12
## [1] "Result: 8"
## [1] 13
## [1] "Result: 8"
## [1] 14
## [1] "Result: 9"
## [1] 15
## [1] "Result: 9"
## [1] 16
## [1] "Result: 10"
## [1] 17
## [1] "Result: 11"
## [1] 18
## [1] "Result: 11"
## [1] 19
## [1] "Result: 12"
## [1] 20
## [1] "Result: 13"
result  # you'll get different results each time
## [1] 13

Example 3

We could use the counter variable to fill each element of a result vector one at a time.

result <- rep(NA, 20)  # make the vector of the length we need ahead of time

for (i in 1:length(result)) {
  print(i)
  result[i] <- i^2   # replace each value in result one by one as i increments
  print(result)
}
## [1] 1
##  [1]  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 2
##  [1]  1  4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 3
##  [1]  1  4  9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 4
##  [1]  1  4  9 16 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 5
##  [1]  1  4  9 16 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 6
##  [1]  1  4  9 16 25 36 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 7
##  [1]  1  4  9 16 25 36 49 NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 8
##  [1]  1  4  9 16 25 36 49 64 NA NA NA NA NA NA NA NA NA NA NA NA
## [1] 9
##  [1]  1  4  9 16 25 36 49 64 81 NA NA NA NA NA NA NA NA NA NA NA
## [1] 10
##  [1]   1   4   9  16  25  36  49  64  81 100  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
## [1] 11
##  [1]   1   4   9  16  25  36  49  64  81 100 121  NA  NA  NA  NA  NA  NA  NA  NA  NA
## [1] 12
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144  NA  NA  NA  NA  NA  NA  NA  NA
## [1] 13
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169  NA  NA  NA  NA  NA  NA  NA
## [1] 14
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196  NA  NA  NA  NA  NA  NA
## [1] 15
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225  NA  NA  NA  NA  NA
## [1] 16
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256  NA  NA  NA  NA
## [1] 17
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289  NA  NA  NA
## [1] 18
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324  NA  NA
## [1] 19
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324 361  NA
## [1] 20
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324 361 400
result
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324 361 400

Normally we’d do something more complicated than squaring a number – it’s just an example of the format.

EXERCISE

Using the built in mtcars dataset, use a for loop to print each car’s name and the mpg. As a starting point, to do this for the car in the 3rd row, you could do: paste0(rownames(mtcars)[3], ": ", mtcars$mpg[3]). Inside a loop, you’ll need to use print as well to see the results: print(paste0(rownames(mtcars)[3], ": ", mtcars$mpg[3]))

Hint: you can get the number of rows in the data with nrow(mtcars).

Challenge: instead of printing output for each iteration of your loop, save your pasted strings in a vector, and then print the vector when you’re done.

Challenge 2: do the same thing without a loop. No trick here, just a reminder that many functions are vectorized.

Alternatives

There are functions (e.g. the apply() family of functions and the purrr library) that will iterate over elements like a loop does. And many functions will accept vectors directly, so you don’t need to explicitly loop over the elements of a vector.

For example, with the simple for loop examples above, we don’t really need to use a for loop:

paste(letter_sample[letter_sample %in% c('a', 'e', 'i', 'o', 'u')], collapse="")

sum(runif(20) > .5)

(1:20)^2

But sometimes for loops can just be easier to wrap your head around or make your code clearer.

Process some files

Here’s a more realistic example.

The data/pdb directory has multiple data files. Here’s what one looks like:

COMPND      Ammonia
AUTHOR      DAVE WOODCOCK  97 10 31
ATOM      1  N           1       0.257  -0.363   0.000
ATOM      2  H           1       0.257   0.727   0.000
ATOM      3  H           1       0.771  -0.727   0.890
ATOM      4  H           1       0.771  -0.727  -0.890
TER       5              1
END

We want to process each of these files. We want the data from the lines that start with ATOM, and we need the molecule name from the first line.

We need two loops:

  1. Loop through all of the files to process each one.
  2. Loop through the lines in each file.

Let’s start with the first loop. There’s a function that will list all of the files in a directory (optionally that match a pattern):

list.files("data/pdb",   # where the files are
           "\\.pdb$",   # optional pattern (regular expression) to match filenames against
           full.names=TRUE)  # return the path to the file, not just the name
##  [1] "data/pdb/aldrin.pdb"         "data/pdb/ammonia.pdb"        "data/pdb/ascorbic-acid.pdb"  "data/pdb/benzaldehyde.pdb"  
##  [5] "data/pdb/camphene.pdb"       "data/pdb/cholesterol.pdb"    "data/pdb/cinnamaldehyde.pdb" "data/pdb/citronellal.pdb"   
##  [9] "data/pdb/codeine.pdb"        "data/pdb/cubane.pdb"         "data/pdb/cyclobutane.pdb"    "data/pdb/cyclohexanol.pdb"  
## [13] "data/pdb/cyclopropane.pdb"   "data/pdb/ethane.pdb"         "data/pdb/ethanol.pdb"        "data/pdb/glycol.pdb"        
## [17] "data/pdb/heme.pdb"           "data/pdb/lactose.pdb"        "data/pdb/lsd.pdb"            "data/pdb/maltose.pdb"       
## [21] "data/pdb/menthol.pdb"        "data/pdb/methane.pdb"        "data/pdb/methanol.pdb"       "data/pdb/mint.pdb"          
## [25] "data/pdb/morphine.pdb"       "data/pdb/mustard.pdb"        "data/pdb/nerol.pdb"          "data/pdb/norethindrone.pdb" 
## [29] "data/pdb/octane.pdb"         "data/pdb/pentane.pdb"        "data/pdb/piperine.pdb"       "data/pdb/propane.pdb"       
## [33] "data/pdb/pyridoxal.pdb"      "data/pdb/quinine.pdb"        "data/pdb/strychnine.pdb"     "data/pdb/styrene.pdb"       
## [37] "data/pdb/sucrose.pdb"        "data/pdb/testosterone.pdb"   "data/pdb/thiamine.pdb"       "data/pdb/tnt.pdb"           
## [41] "data/pdb/tyrian-purple.pdb"  "data/pdb/vanillin.pdb"       "data/pdb/vinyl-chloride.pdb" "data/pdb/vitamin-a.pdb"

This gives us a vector we can loop through:

for (file in list.files("data/pdb", "\\.pdb$", full.names=TRUE)) {
  print(file)  # just print for now...
}
## [1] "data/pdb/aldrin.pdb"
## [1] "data/pdb/ammonia.pdb"
## [1] "data/pdb/ascorbic-acid.pdb"
## [1] "data/pdb/benzaldehyde.pdb"
## [1] "data/pdb/camphene.pdb"
## [1] "data/pdb/cholesterol.pdb"
## [1] "data/pdb/cinnamaldehyde.pdb"
## [1] "data/pdb/citronellal.pdb"
## [1] "data/pdb/codeine.pdb"
## [1] "data/pdb/cubane.pdb"
## [1] "data/pdb/cyclobutane.pdb"
## [1] "data/pdb/cyclohexanol.pdb"
## [1] "data/pdb/cyclopropane.pdb"
## [1] "data/pdb/ethane.pdb"
## [1] "data/pdb/ethanol.pdb"
## [1] "data/pdb/glycol.pdb"
## [1] "data/pdb/heme.pdb"
## [1] "data/pdb/lactose.pdb"
## [1] "data/pdb/lsd.pdb"
## [1] "data/pdb/maltose.pdb"
## [1] "data/pdb/menthol.pdb"
## [1] "data/pdb/methane.pdb"
## [1] "data/pdb/methanol.pdb"
## [1] "data/pdb/mint.pdb"
## [1] "data/pdb/morphine.pdb"
## [1] "data/pdb/mustard.pdb"
## [1] "data/pdb/nerol.pdb"
## [1] "data/pdb/norethindrone.pdb"
## [1] "data/pdb/octane.pdb"
## [1] "data/pdb/pentane.pdb"
## [1] "data/pdb/piperine.pdb"
## [1] "data/pdb/propane.pdb"
## [1] "data/pdb/pyridoxal.pdb"
## [1] "data/pdb/quinine.pdb"
## [1] "data/pdb/strychnine.pdb"
## [1] "data/pdb/styrene.pdb"
## [1] "data/pdb/sucrose.pdb"
## [1] "data/pdb/testosterone.pdb"
## [1] "data/pdb/thiamine.pdb"
## [1] "data/pdb/tnt.pdb"
## [1] "data/pdb/tyrian-purple.pdb"
## [1] "data/pdb/vanillin.pdb"
## [1] "data/pdb/vinyl-chloride.pdb"
## [1] "data/pdb/vitamin-a.pdb"

With the name, we can open each file. But these files don’t have a standard format (like CSV) to read in, so we’ll read in each line as text with readLines().

readLines("data/pdb/ammonia.pdb")
## [1] "COMPND      Ammonia"                                    "AUTHOR      DAVE WOODCOCK  97 10 31"                   
## [3] "ATOM      1  N           1       0.257  -0.363   0.000" "ATOM      2  H           1       0.257   0.727   0.000"
## [5] "ATOM      3  H           1       0.771  -0.727   0.890" "ATOM      4  H           1       0.771  -0.727  -0.890"
## [7] "TER       5              1"                             "END"

Now loop through lines within our loop to read the files (we call this nested loops – one is nested inside the other):

filelist <-  list.files("data/pdb", "\\.pdb$", full.names=TRUE)

for (file in filelist[1:3]) {  # just loop through a few for now to keep the output small
  
  for (line in readLines(file)) {  # file comes from the loop above
    print(line)   # just print the line for now -- we'll process them later
  }
  
}
## [1] "COMPND      ALDRIN"
## [1] "AUTHOR      DAVE WOODCOCK  97 08 06"
## [1] "ATOM      1  C           1      -0.888   0.654  -1.753"
## [1] "ATOM      2  C           1       0.276   1.404  -1.135"
## [1] "ATOM      3  C           1       1.381   0.453  -0.730"
## [1] "ATOM      4  C           1       0.786  -0.400   0.372"
## [1] "ATOM      5  C           1      -0.640   0.143   0.522"
## [1] "ATOM      6  C           1      -1.451  -0.124  -0.733"
## [1] "ATOM      7 CL           1      -1.479  -0.191   2.029"
## [1] "ATOM      8 CL           1      -2.844  -1.216  -0.888"
## [1] "ATOM      9 CL           1      -1.450   0.743  -3.432"
## [1] "ATOM     10 CL           1       0.737   2.870  -1.987"
## [1] "ATOM     11  C           1      -0.369   1.633   0.244"
## [1] "ATOM     12 CL           1      -1.860   2.599   0.158"
## [1] "ATOM     13 CL           1       0.735   2.363   1.410"
## [1] "ATOM     14  C           1       1.037  -1.812  -0.091"
## [1] "ATOM     15  C           1       0.994  -1.782  -1.608"
## [1] "ATOM     16  C           1       1.937  -0.617  -1.697"
## [1] "ATOM     17  C           1       2.584  -2.004   0.033"
## [1] "ATOM     18  C           1       3.113  -1.214  -1.046"
## [1] "ATOM     19  H           1       2.231   1.006  -0.328"
## [1] "ATOM     20  H           1       1.320  -0.204   1.301"
## [1] "ATOM     21  H           1       0.442  -2.597   0.378"
## [1] "ATOM     22  H           1       0.002  -1.636  -2.012"
## [1] "ATOM     23  H           1       1.441  -2.681  -2.039"
## [1] "ATOM     24  H           1       2.104  -0.212  -2.687"
## [1] "ATOM     25  H           1       2.985  -2.347   0.999"
## [1] "ATOM     26  H           1       4.160  -0.911  -0.708"
## [1] "TER      27              1"
## [1] "END"
## [1] "COMPND      Ammonia"
## [1] "AUTHOR      DAVE WOODCOCK  97 10 31"
## [1] "ATOM      1  N           1       0.257  -0.363   0.000"
## [1] "ATOM      2  H           1       0.257   0.727   0.000"
## [1] "ATOM      3  H           1       0.771  -0.727   0.890"
## [1] "ATOM      4  H           1       0.771  -0.727  -0.890"
## [1] "TER       5              1"
## [1] "END"
## [1] "COMPND      ASCORBIC ACID, VITAMIN C"
## [1] "AUTHOR      DAVE WOODCOCK  96 05 24"
## [1] "ATOM      1  C           1       0.182   0.119  -0.645  1.00  0.00"
## [1] "ATOM      2  C           1       0.591  -1.187   0.091  1.00  0.00"
## [1] "ATOM      3  C           1       1.908  -1.150   0.401  1.00  0.00"
## [1] "ATOM      4  C           1       2.407   0.104   0.031  1.00  0.00"
## [1] "ATOM      5  O           1       1.453   0.978  -0.535  1.00  0.00"
## [1] "ATOM      6  O           1       3.677   0.478   0.175  1.00  0.00"
## [1] "ATOM      7  O           1       2.621  -2.174   1.011  1.00  0.00"
## [1] "ATOM      8  O           1      -0.324  -2.236   0.293  1.00  0.00"
## [1] "ATOM      9  H           1      -0.030  -0.160  -1.688  1.00  0.00"
## [1] "ATOM     10  C           1      -1.084   0.762   0.207  1.00  0.00"
## [1] "ATOM     11  H           1      -1.959   0.100   0.315  1.00  0.00"
## [1] "ATOM     12  O           1      -0.591   1.195   1.590  1.00  0.00"
## [1] "ATOM     13  C           1      -1.585   2.076  -0.581  1.00  0.00"
## [1] "ATOM     14  O           1      -2.771   2.432   0.067  1.00  0.00"
## [1] "ATOM     15  H           1       1.946  -2.997   1.192  1.00  0.00"
## [1] "ATOM     16  H           1      -0.747  -2.497  -0.666  1.00  0.00"
## [1] "ATOM     17  H           1       0.118   2.004   1.470  1.00  0.00"
## [1] "ATOM     18  H           1      -1.839   1.784  -1.606  1.00  0.00"
## [1] "ATOM     19  H           1      -0.831   2.856  -0.721  1.00  0.00"
## [1] "ATOM     20  H           1      -2.572   2.906   1.016  1.00  0.00"
## [1] "TER      21              1"
## [1] "END"

If/else and ifelse()

There are two ways to have the logic of if-then-else in R.

  1. If-else statements evaluate conditional statements that produce a single TRUE or FALSE value – not vectors. This is like if-else statements you might have used in other programming languages.

  2. ifelse() takes a vector of TRUE and FALSE values and then returns a vector where you have a different value for TRUE than for FALSE.

When have I used an if-else statement (#1 above) in my R code?

I use ifelse() regularly when recoding data.

ifelse()

Let’s talk about ifelse() first. Format is:

ifelse(vector of TRUE/FALSE, value to return if TRUE, value to return if FALSE)
ifelse(c(FALSE, TRUE), "a", "b")
## [1] "b" "a"
my_measure <- 1:20

my_measure %% 5 == 0
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
ifelse(my_measure %% 5 == 0, my_measure, NA)
##  [1] NA NA NA NA  5 NA NA NA NA 10 NA NA NA NA 15 NA NA NA NA 20

If I wanted to change my_measure, I could do

my_measure <- ifelse(my_measure %% 5 == 0, my_measure, NA)

instead of using ifelse, I could have written the last line above as:

my_measure[my_measure %% 5 != 0] <- NA

But I prefer ifelse - I find it easier to read and follow, but that’s personal preference. One advantage ifelse has though is that I can use the expression without changing my data. With the second option above (without ifelse), I’m having to save the result because I’m only changing a subset – ifelse preserves the full vector.

EXERCISE

Use ifelse to recode values of the bird vector that aren’t birds to NA.

Hint: you’ll need to remember how to use some type of logical condition – some combination of |, ! == %in% or similar operators.

birds <- c("duck", "penguin", "bear", "cow")

ifelse(_____,  # test condition
       _____,  # value if TRUE
       _____)  # value if FALSE

if else

Unlike ifelse(), if-else statements can only process a single TRUE or FALSE value. You’ll get a warning message if you try to use a vector (but not an error!)

The basic syntax is:

if (condition) {
  # do something 
}

and you can add on else:

if (condition) {
  # do something if true
} else {
  # do something else bc false
}

or another if:

if (condition) {
  # do something
} else if (condition2) {
  # do something else
} else {
  # do a third thing
}

Example

car_index <- 5

if (mtcars$mpg[car_index] > 25) {
  print("It's efficient")
} else if (mtcars$mpg[car_index] > 20) {
  print("It's OK")
} else {
  print("mpg could be better")
}
## [1] "mpg could be better"

EXERCISE

Write an if else statement that will tell you whether a number (stored in the variable my_num) is positive, negative, or zero 0.

my_num <- 15

Process Files with if else

Let’s use if else to process the pdb data files.

If the line in the file starts with “COMPND”, we want to get the molecule name. If it starts with “ATOM” we want to add the data to our data set with the compound name. Other lines we want to ignore.

How can we detect these lines?

line1 <- "COMPND      Ammonia"
line2 <- "ATOM      1  N           1       0.257  -0.363   0.000"

startsWith(line1, "COMPND")
startsWith(line1, "ATOM")
startsWith(line2, "ATOM")
output <- c()  # to hold data

filelist <-  list.files("data/pdb", "\\.pdb$", full.names=TRUE)

for (file in filelist[1:3]) {  # just loop through 3 for now to keep the output small
  molecule_name <- NA  # to keep track of the name once we find it to add to each line
  for (line in readLines(file)) { 
    if (startsWith(line, "COMPND")) {
      # Remove "COMPND" from the line and white spaces from beginning and end of what's left
      molecule_name <- trimws(sub("COMPND", "", line))
    } else if (startsWith(line, "ATOM")) {
      # split line into components on the spaces
      processed <- unlist(strsplit(line, " +"))  # " +" is a regular expression for 1 or more spaces in a row
      
      # some files have extra data at the end of the line, so just keep the values we want
      processed <- processed[1:7]
      
      # replace "ATOM" with molecule name instead
      processed[1] <- molecule_name  # was set via a previous iteration of the loop
  
      # add to output as a tab separated string (there are commas in some names) - 
      # how we'd write it to file 
      output <- c(output, paste(processed, collapse="\t"))
    }
  }
}
head(output)
## [1] "ALDRIN\t1\tC\t1\t-0.888\t0.654\t-1.753"  "ALDRIN\t2\tC\t1\t0.276\t1.404\t-1.135"   "ALDRIN\t3\tC\t1\t1.381\t0.453\t-0.730"  
## [4] "ALDRIN\t4\tC\t1\t0.786\t-0.400\t0.372"   "ALDRIN\t5\tC\t1\t-0.640\t0.143\t0.522"   "ALDRIN\t6\tC\t1\t-1.451\t-0.124\t-0.733"

This is a vector of character data. How do we get it into a data frame? This is a bit tricky. We could write the data to file, and then read it back in:

writeLines(output, "data/combined_pdb.tsv")  # writeLines because we just have a vector of character data
output <- read.csv("data/combined_pdb.tsv", sep="\t")

But it turns out we don’t have to write it to file first if we don’t want to (this is the tricky bit):

output <- read.csv(text=output,  # read from output instead of a file
                   col.names=c("molecule", "v1", "v2", "v3", "v4", "v5", "v6"),
                   sep="\t")  # we used tabs instead of commas
output

Note

I didn’t just sit down and write this script straight through. I built it up piece by piece, ran it, found things that were wrong, and then fixed them. And I kept doing this over and over.

I ran into issues:

  • I originally wrote this in a stand alone script, so the path to the data files was correct. But R notebooks (this type of file) assume that your working directory is the directory with the notebook file in it, so I had to change that at the top of the file so the paths would work correctly.
  • How to store the rows after I processed them? I couldn’t make an empty data frame to fill in because I didn’t know the number of rows I’d need. I could bind rows together, but that requires starting with a NULL data frame, which made the code more complicated to explain here.
  • I found inconsistencies in the data files like commas in some molecule names and extra columns of data.
  • I had never used read.csv() on text I already had within R – I googled to see how to parse a string of CSV data into a data frame and found this solution.

Coming Up

As with other days, there are a few programming exercises in the repository for you to work on. Unlike other days, these concepts can be hard to apply to your own data and code until a need for them arises. If you get through the programming material, I recommend returning to one of your own data sets and continuing with your own work in R. I’m happy to answer any R questions at all this afternoon.

After today, next steps are for you to keep working and learning. We’re here to help you. I’ll monitor the discussion forum on Canvas for another week. After that, reach out with a consultation request if we can be of help.

We teach other R workshops throughout the year - keep a look out for them.

I’ll send out follow-up links with everything via email.