Overall Note

R is generally better suited to selecting data from databases than to creating database tables or entering data into them, but there are functions for all of these operations.

Connection

The DBI package provides the core functionality for connecting R to database servers. Separate packages then implement DBI's interface for each specific database system; for PostgreSQL, one such package is RPostgres.

if(!'RPostgres' %in% installed.packages()){
  install.packages("RPostgres")
}
library(RPostgres)

We connect with a function call like the following.

Note: this code was generated on my local machine connected to a local copy of the database. Your connection details will be different. Note I also have permissions to modify this database.

con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")

We will need a connection like this for any of the methods of connecting below. You can have multiple connection objects to different databases – just call the variables something different.

Note that the above example doesn’t have a username or password because the database is local and doesn’t require one. If you’re using a remote database, you don’t want to hard-code your credentials into your code. There are ways to set environment variables or, for PostgreSQL specifically, to use a .pgpass file to store this information. But RStudio also gives you the option to pop up a box to enter the information:

con2 <- dbConnect(RPostgres::Postgres(), 
                  host="localhost", 
                  dbname="dvdrental",
                  user=rstudioapi::askForPassword("Database username"),
                  password=rstudioapi::askForPassword("Database password"))
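Alternatively, here is a sketch of the environment-variable approach. It assumes you have set PGUSER and PGPASSWORD yourself (e.g. in your ~/.Renviron file); nothing prompts for them.

```r
# Read credentials from environment variables instead of prompting.
# PGUSER and PGPASSWORD are assumed to be set, e.g. in ~/.Renviron.
con3 <- dbConnect(RPostgres::Postgres(),
                  host="localhost",
                  dbname="dvdrental",
                  user=Sys.getenv("PGUSER"),
                  password=Sys.getenv("PGPASSWORD"))
```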

Using DBI

We can use the basic functions in the DBI library:

Get Database Information

Note that the following db-prefixed functions are exported from the DBI and RPostgres namespaces.

dbListTables(con)
##  [1] "film_actor"                 "address"                   
##  [3] "city"                       "actor"                     
##  [5] "film_category"              "inventory"                 
##  [7] "actor_info"                 "category"                  
##  [9] "country"                    "customer"                  
## [11] "customer_list"              "film_list"                 
## [13] "language"                   "rental"                    
## [15] "nicer_but_slower_film_list" "staff"                     
## [17] "sales_by_film_category"     "store"                     
## [19] "payment"                    "sales_by_store"            
## [21] "staff_list"                 "film"                      
## [23] "measure"
dbListFields(con, "actor")
## [1] "actor_id"    "first_name"  "last_name"   "last_update"

Execute Queries

actor_subset <- dbGetQuery(con, "select * from actor where actor_id > 50")
head(actor_subset)
##   actor_id first_name last_name         last_update
## 1       51       Gary   Phoenix 2013-05-26 14:47:57
## 2       52     Carmen      Hunt 2013-05-26 14:47:57
## 3       53       Mena    Temple 2013-05-26 14:47:57
## 4       54   Penelope   Pinkett 2013-05-26 14:47:57
## 5       55        Fay    Kilmer 2013-05-26 14:47:57
## 6       56        Dan    Harris 2013-05-26 14:47:57

Note that we don’t need a ; at the end of the query.

If we want an entire table, there’s a function for that:

actor <- dbReadTable(con, "actor")
head(actor)
##   actor_id first_name last_name         last_update
## 1        1   Penelope   Guiness 2013-05-26 14:47:57
## 2        2       Nick  Wahlberg 2013-05-26 14:47:57
## 3        3         Ed     Chase 2013-05-26 14:47:57
## 4        6      Bette Nicholson 2013-05-26 14:47:57
## 5        7      Grace    Mostel 2013-05-26 14:47:57
## 6        8    Matthew Johansson 2013-05-26 14:47:57

If you want part of your query to be determined by a variable – especially if it’s a variable supplied or defined as input or by a user (not you) – you should guard against SQL injection (someone trying to attack your database by tricking your code into running malicious SQL statements) by using a parameterized query:

# YES
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
dbBind(myquery, list(4))
dbFetch(myquery)
##   actor_id first_name last_name         last_update
## 1        4       Jenn     Davis 2019-08-12 19:24:02

The $1 stands in for a value you’ll substitute in. For multiple, you can use $2, etc. (The $1 notation is for PostgreSQL – other types of databases use ? or other symbols.)
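For example, here is a sketch of a query with two parameters against the same actor table; values are bound in order as a list:

```r
# Two placeholders: $1 and $2 are filled in by position.
myquery2 <- dbSendQuery(con,
  "select * from actor where actor_id > $1 and actor_id < $2")
dbBind(myquery2, list(50, 60))
dbFetch(myquery2)
dbClearResult(myquery2)
```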

When you’re done with the results from a prepared query, clear the result:

dbClearResult(myquery)

By contrast, pasting a user-controlled value directly into the query string leaves you open to SQL injection:

# NO! - at least not if the variable can be manipulated by a user
dbSendQuery(con, paste0("select * from actor where actor_id=", myvar))

For more, see http://db.rstudio.com/best-practices/run-queries-safely/.

Parameterizing statements also lets you reuse them with different values.
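For example, you can bind new values to the same prepared query and fetch again. A sketch:

```r
# Prepare once, then bind and fetch repeatedly with different values.
myquery <- dbSendQuery(con,
  "select first_name, last_name from actor where actor_id = $1")
for (id in c(1, 2, 3)) {
  dbBind(myquery, list(id))
  print(dbFetch(myquery))
}
dbClearResult(myquery)
```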

You may also need to use dbClearResult() after other calls to dbSendQuery() that return a result (select statements, table creations, inserts, updates) – but not after dbGetQuery(). Otherwise you may get warning messages like:

Warning message: In result_create(conn@ptr, statement) : Closing open result set, cancelling previous query

These warnings are harmless, but they can get annoying.

Modifying a Database

If you’re not a superuser on the dvdrental database, just try connecting to a database you can modify. Then the basic function is dbSendQuery for any command you want to execute where you aren’t retrieving results.

Note that by default, statements take effect immediately - they are not in a transaction that you need to commit. To use transactions, see below.

res <- dbSendQuery(con, statement="update actor set first_name='Jenn' where actor_id=4")
print(res)  # contains info on result of update
dbClearResult(res)  # prevent warning messages

To create a table, you can pass dbWriteTable a data frame:

mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
dbWriteTable(con, "mynewtable", mytbl)

Or you could specify the table with SQL and execute it with dbSendQuery, but this can get cumbersome.
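For completeness, here's a sketch of the SQL route. dbExecute() runs a statement and clears the result for you:

```r
# Create and populate a small table with plain SQL.
dbExecute(con, "create table mynewtable (number integer, letter varchar(1))")
dbExecute(con, "insert into mynewtable (number, letter) values (1, 'A'), (2, 'B')")
```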

To remove a table:

dbRemoveTable(con, "mynewtable")

Transactions

There are also methods for managing transactions if you need them: dbBegin, dbRollback, and dbCommit. Transactions are key when you need to be sure that a sequence of SQL commands (e.g. UPDATE, CREATE, DROP, DELETE) executes correctly before the changes are made permanent (i.e. “committed”).

dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbRollback(con)
dbGetQuery(con, "SELECT * FROM mynewtable")

The above will produce an error:

Error in result_create(conn@ptr, statement) : 
  Failed to prepare query: ERROR:  relation "mynewtable" does not exist
LINE 1: SELECT * FROM mynewtable

because the transaction was rolled back, not committed.
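For comparison, committing the transaction makes the table permanent:

```r
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbCommit(con)                                # changes are now permanent
dbGetQuery(con, "SELECT * FROM mynewtable")  # succeeds this time
dbRemoveTable(con, "mynewtable")             # clean up
```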

Close Connection

Connections will get closed when you quit R, but it’s good practice to explicitly close them.

dbDisconnect(con)

Use dplyr

For more complete info, see the RStudio databases site.

needToInstall <- c("tidyverse")
needToInstall <- needToInstall[which(!needToInstall %in% installed.packages())]
if(length(needToInstall) > 0){
  sapply(needToInstall, install.packages)
}
library(tidyverse)

First, connect as normal:

con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")

Get a reference to a table:

actortbl <- tbl(con, "actor")

If we look at this object, it doesn’t have data in it:

str(actortbl)
## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
##   .. .. ..@ ptr     :<externalptr> 
##   .. .. ..@ bigint  : chr "integer64"
##   .. .. ..@ typnames:'data.frame':   446 obs. of  2 variables:
##   .. .. .. ..$ oid    : int [1:446] 16 17 18 19 20 21 22 23 24 25 ...
##   .. .. .. ..$ typname: chr [1:446] "bool" "bytea" "char" "name" ...
##   ..$ disco: NULL
##   ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
##  $ ops:List of 2
##   ..$ x   : 'ident' chr "actor"
##   ..$ vars: chr [1:4] "actor_id" "first_name" "last_name" "last_update"
##   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##  - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

It just has connection information. dplyr will try to perform operations within the database where it can, instead of pulling all of the data into R.

Yet you can print the object and see observations:

actortbl
## # Source:   table<actor> [?? x 4]
## # Database: postgres [christina@localhost:5432/dvdrental]
##    actor_id first_name last_name last_update        
##       <int> <chr>      <chr>     <dttm>             
##  1        1 Penelope   Guiness   2013-05-26 14:47:57
##  2        2 Nick       Wahlberg  2013-05-26 14:47:57
##  3        3 Ed         Chase     2013-05-26 14:47:57
##  4        6 Bette      Nicholson 2013-05-26 14:47:57
##  5        7 Grace      Mostel    2013-05-26 14:47:57
##  6        8 Matthew    Johansson 2013-05-26 14:47:57
##  7        9 Joe        Swank     2013-05-26 14:47:57
##  8       10 Christian  Gable     2013-05-26 14:47:57
##  9       11 Zero       Cage      2013-05-26 14:47:57
## 10       12 Karl       Berry     2013-05-26 14:47:57
## # … with more rows

It retrieves rows as needed, and also gives you a nice display in notebooks (a special kind of RMarkdown file) – output looks a little different in the console and in RMarkdown files like this.

You can use dplyr verbs to work with the table objects from the database, as if they were data frames (or tibbles).

actortbl %>% 
  select(actor_id, first_name, last_name) %>% 
  filter(actor_id > 150)
## # Source:   lazy query [?? x 3]
## # Database: postgres [christina@localhost:5432/dvdrental]
##    actor_id first_name last_name
##       <int> <chr>      <chr>    
##  1      151 Geoffrey   Heston   
##  2      152 Ben        Harris   
##  3      153 Minnie     Kilmer   
##  4      154 Meryl      Gibson   
##  5      155 Ian        Tandy    
##  6      156 Fay        Wood     
##  7      157 Greta      Malden   
##  8      158 Vivien     Basinger 
##  9      159 Laura      Brody    
## 10      160 Chris      Depp     
## # … with more rows

The above generates and executes the SQL needed to get the result: dplyr translates select() and filter() into a SELECT statement with the appropriate WHERE clause.
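You can check the translation yourself with show_query():

```r
actortbl %>% 
  select(actor_id, first_name, last_name) %>% 
  filter(actor_id > 150) %>% 
  show_query()
```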

rentaltbl <- tbl(con, "rental")
rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n())
## # Source:   lazy query [?? x 2]
## # Database: postgres [christina@localhost:5432/dvdrental]
##    customer_id count     
##          <int> <integr64>
##  1          87 30        
##  2         184 23        
##  3         477 22        
##  4         273 35        
##  5         550 32        
##  6         394 22        
##  7          51 33        
##  8         272 20        
##  9          70 18        
## 10         190 27        
## # … with more rows

What does the above correspond to as a SQL query? select customer_id, count(*) from rental group by customer_id;

rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n()) %>% 
  show_query()
## <SQL>
## SELECT "customer_id", COUNT(*) AS "count"
## FROM "rental"
## GROUP BY "customer_id"

You can use collect() to pull down all of the data (telling dplyr to stop being lazy).

# First, without collecting
df1 <- rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n()) 
df1
## # Source:   lazy query [?? x 2]
## # Database: postgres [christina@localhost:5432/dvdrental]
##    customer_id count     
##          <int> <integr64>
##  1          87 30        
##  2         184 23        
##  3         477 22        
##  4         273 35        
##  5         550 32        
##  6         394 22        
##  7          51 33        
##  8         272 20        
##  9          70 18        
## 10         190 27        
## # … with more rows

Looks OK, except:

df1[1,]

Gives you:

Error in df1[1, ] : incorrect number of dimensions

It’s the wrong dimensions because df1 isn’t actually a data.frame:

str(df1)
## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
##   .. .. ..@ ptr     :<externalptr> 
##   .. .. ..@ bigint  : chr "integer64"
##   .. .. ..@ typnames:'data.frame':   446 obs. of  2 variables:
##   .. .. .. ..$ oid    : int [1:446] 16 17 18 19 20 21 22 23 24 25 ...
##   .. .. .. ..$ typname: chr [1:446] "bool" "bytea" "char" "name" ...
##   ..$ disco: NULL
##   ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
##  $ ops:List of 4
##   ..$ name: chr "summarise"
##   ..$ x   :List of 4
##   .. ..$ name: chr "group_by"
##   .. ..$ x   :List of 2
##   .. .. ..$ x   : 'ident' chr "rental"
##   .. .. ..$ vars: chr [1:7] "rental_id" "rental_date" "inventory_id" "customer_id" ...
##   .. .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##   .. ..$ dots:List of 1
##   .. .. ..$ customer_id: symbol customer_id
##   .. ..$ args:List of 1
##   .. .. ..$ add: logi FALSE
##   .. ..- attr(*, "class")= chr [1:3] "op_group_by" "op_single" "op"
##   ..$ dots:List of 1
##   .. ..$ count: language ~n()
##   .. .. ..- attr(*, ".Environment")=<environment: 0x7fa71e16dce8> 
##   ..$ args: list()
##   ..- attr(*, "class")= chr [1:3] "op_summarise" "op_single" "op"
##  - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

It is telling us we need to collect the data first to actually pull it into R.

# Then with collecting
df2 <- rentaltbl %>% 
  group_by(customer_id) %>% 
  summarize(count=n()) %>% 
  collect()
df2
## # A tibble: 599 x 2
##    customer_id count     
##          <int> <integr64>
##  1          87 30        
##  2         184 23        
##  3         477 22        
##  4         273 35        
##  5         550 32        
##  6         394 22        
##  7          51 33        
##  8         272 20        
##  9          70 18        
## 10         190 27        
## # … with 589 more rows
df2[1,]
## # A tibble: 1 x 2
##   customer_id count     
##         <int> <integr64>
## 1          87 30

You can also use dplyr’s commands to join:

custtbl <- tbl(con, "customer")
addrtbl <- tbl(con, "address")
custtbl %>% 
  inner_join(addrtbl, by="address_id") %>%
  filter(postal_code == '52137') %>%
  select(first_name, last_name, postal_code)
## # Source:   lazy query [?? x 3]
## # Database: postgres [christina@localhost:5432/dvdrental]
##   first_name last_name postal_code
##   <chr>      <chr>     <chr>      
## 1 James      Gannon    52137      
## 2 Freddie    Duggan    52137
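As before, show_query() reveals the SQL that the join generates:

```r
custtbl %>% 
  inner_join(addrtbl, by="address_id") %>%
  filter(postal_code == '52137') %>%
  select(first_name, last_name, postal_code) %>%
  show_query()
```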

You could create a table with copy_to (if you have the correct permissions):

mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
copy_to(con, mytbl, "mynewtable")

By default, it creates a temporary table, but this is a setting you can change, and you can also specify which columns to index on the table.
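For example, a sketch of creating a permanent, indexed table (temporary and indexes are arguments to copy_to()):

```r
copy_to(con, mytbl, "mynewtable",
        temporary = FALSE,         # make it a permanent table
        indexes = list("number"),  # index the number column
        overwrite = TRUE)
```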

Disconnect like we normally do:

dbDisconnect(con)

R Markdown

R Markdown lets you execute SQL queries directly. You first set up a DBI connection like above, and then, instead of having R chunks of code, you can have SQL chunks of code:

```{r}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
```{sql, connection=con}
select * from actor where actor_id > 75;
```
```{r}
dbDisconnect(con)
```

Here is the above, actually executed in RMarkdown:

library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
select * from actor where actor_id > 75;
Displaying records 1 - 10
 actor_id  first_name  last_name    last_update
       76  Angelina    Astaire      2013-05-26 14:47:57
       77  Cary        Mcconaughey  2013-05-26 14:47:57
       78  Groucho     Sinatra      2013-05-26 14:47:57
       79  Mae         Hoffman      2013-05-26 14:47:57
       80  Ralph       Cruz         2013-05-26 14:47:57
       81  Scarlett    Damon        2013-05-26 14:47:57
       82  Woody       Jolie        2013-05-26 14:47:57
       83  Ben         Willis       2013-05-26 14:47:57
       84  James       Pitt         2013-05-26 14:47:57
       85  Minnie      Zellweger    2013-05-26 14:47:57
dbDisconnect(con)
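SQL chunks can also store their result in an R data frame via the output.var chunk option, so later R chunks can use it (older_actors is just an illustrative variable name):

```{sql, connection=con, output.var="older_actors"}
select * from actor where actor_id > 75;
```
```{r}
head(older_actors)
```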

For more details, see knitr Language Engines: SQL.

PL/R

Database administrators can install functionality in a PostgreSQL database to allow you to write R functions directly in the database, and then call them with normal SQL queries. This is done with PL/R. Enabling this functionality can be risky, because R potentially gives users access to files on the database server, so database admins are usually conservative about allowing PL/R. But it can be very useful in production systems: you can use it to generate reports, apply statistical methods, and even create plots.

We aren’t covering PL/R (or even writing SQL functions more generally), but it’s good to know this functionality exists if you’re ever working with a large production system.