R is generally better suited to selecting data from databases than to creating database tables or entering data into a database, but there are functions for all of these operations. The DBI package provides the core functionality for connecting R to database servers. Other packages then implement DBI's interface for each specific variety of SQL; a package for PostgreSQL is RPostgres.
if(!'RPostgres' %in% installed.packages()){
install.packages("RPostgres")
}
library(RPostgres)
We connect with a function call like the following.
Note: this code was generated on my local machine connected to a local copy of the database. Your connection details will be different. Note I also have permissions to modify this database.
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
We will need a connection like this for any of the methods of connecting below. You can have multiple connection objects to different databases – just call the variables something different.
Note that the above example doesn't have a username or password because the database is local and doesn't require one. If you're using a remote database, you don't want to hard-code your credentials into your code. There are ways to set environment variables or, for PostgreSQL specifically, to use a .pgpass file to store this information. But RStudio also gives you the option to pop up a box to enter the information:
con2 <- dbConnect(RPostgres::Postgres(),
host="localhost",
dbname="dvdrental",
user=rstudioapi::askForPassword("Database username"),
password=rstudioapi::askForPassword("Database password"))
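As a sketch of the environment-variable approach mentioned above: you could store the credentials in your ~/.Renviron file and read them with Sys.getenv(). The variable names DVDRENTAL_USER and DVDRENTAL_PASS here are made up for this example, not a convention.

```r
# Hypothetical variable names; set them in ~/.Renviron, e.g.:
# DVDRENTAL_USER=myusername
# DVDRENTAL_PASS=mysecretpassword
con3 <- dbConnect(RPostgres::Postgres(),
                  host = "localhost",
                  dbname = "dvdrental",
                  user = Sys.getenv("DVDRENTAL_USER"),
                  password = Sys.getenv("DVDRENTAL_PASS"))
```

This keeps credentials out of your script and out of version control.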
We can use the basic functions in the DBI library. Note that the following db-prefixed functions are exported from the DBI and RPostgres namespaces.
dbListTables(con)
## [1] "film_actor" "address"
## [3] "city" "actor"
## [5] "film_category" "inventory"
## [7] "actor_info" "category"
## [9] "country" "customer"
## [11] "customer_list" "film_list"
## [13] "language" "rental"
## [15] "nicer_but_slower_film_list" "staff"
## [17] "sales_by_film_category" "store"
## [19] "payment" "sales_by_store"
## [21] "staff_list" "film"
## [23] "measure"
dbListFields(con, "actor")
## [1] "actor_id" "first_name" "last_name" "last_update"
actor_subset <- dbGetQuery(con, "select * from actor where actor_id > 50")
head(actor_subset)
## actor_id first_name last_name last_update
## 1 51 Gary Phoenix 2013-05-26 14:47:57
## 2 52 Carmen Hunt 2013-05-26 14:47:57
## 3 53 Mena Temple 2013-05-26 14:47:57
## 4 54 Penelope Pinkett 2013-05-26 14:47:57
## 5 55 Fay Kilmer 2013-05-26 14:47:57
## 6 56 Dan Harris 2013-05-26 14:47:57
Note that we don't need a ; at the end of the query.
If we want an entire table, there’s a function for that:
actor <- dbReadTable(con, "actor")
head(actor)
## actor_id first_name last_name last_update
## 1 1 Penelope Guiness 2013-05-26 14:47:57
## 2 2 Nick Wahlberg 2013-05-26 14:47:57
## 3 3 Ed Chase 2013-05-26 14:47:57
## 4 6 Bette Nicholson 2013-05-26 14:47:57
## 5 7 Grace Mostel 2013-05-26 14:47:57
## 6 8 Matthew Johansson 2013-05-26 14:47:57
If you want part of your query to be determined by a variable – especially if it’s a variable supplied or defined as input or by a user (not you) – you should guard against SQL injection (someone trying to attack your database by tricking your code into running malicious SQL statements) by using a parameterized query:
# YES
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
dbBind(myquery, list(4))
dbFetch(myquery)
## actor_id first_name last_name last_update
## 1 4 Jenn Davis 2019-08-12 19:24:02
The $1 stands in for a value you'll substitute in. For multiple parameters, you can use $2, $3, etc. (The $1 notation is specific to PostgreSQL; other types of databases use ? or other placeholder symbols.)
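For example, a sketch of a query with two parameters against the same actor table (the bound values here are arbitrary):

```r
# $1 and $2 are filled in the order the values appear in the list
myquery2 <- dbSendQuery(con,
    "select * from actor where actor_id > $1 and last_name = $2")
dbBind(myquery2, list(50, "Kilmer"))
dbFetch(myquery2)
dbClearResult(myquery2)
```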
When you’re done with the results from a prepared query, clear the result:
dbClearResult(myquery)
# NO! - at least not if the variable can be manipulated by a user
dbSendQuery(con, paste0("select * from actor where actor_id=", myvar))
For more, see http://db.rstudio.com/best-practices/run-queries-safely/.
Parameterizing statements also lets you reuse them with different values.
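A sketch of that reuse: you can call dbBind() and dbFetch() repeatedly on the same prepared query with different values before clearing it.

```r
# Prepare once, bind and fetch several times
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
for (id in c(4, 5, 6)) {
  dbBind(myquery, list(id))
  print(dbFetch(myquery))
}
dbClearResult(myquery)
```

This avoids re-parsing the statement for every value.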
You may also need to call dbClearResult() after other calls to dbSendQuery() that return a result (select statements, table creations, inserts, updates), but not after dbGetQuery(). Otherwise you may get warning messages like:
Warning message: In result_create(conn@ptr, statement) : Closing open result set, cancelling previous query
These warnings are harmless, but they can get annoying.
If you're not a superuser on the dvdrental database, try connecting to a database you can modify instead. The basic function is then dbSendQuery, for any command you want to execute where you aren't retrieving results.
Note that by default, statements take effect immediately - they are not in a transaction that you need to commit. To use transactions, see below.
res <- dbSendQuery(con, statement="update actor set first_name='Jenn' where actor_id=4")
print(res) # contains info on result of update
dbClearResult(res) # prevent warning messages
To create a table, you can give dbWriteTable a data frame:
mytbl <-data.frame(number=1:10 , letter=LETTERS[1:10])
dbWriteTable(con, "mynewtable", mytbl)
or you could specify the table with SQL and execute it with dbSendQuery, but this can get cumbersome.
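For illustration, a sketch of the SQL route; the table name mysqltable and the column types are made up for this example:

```r
# Create the table schema by hand...
res <- dbSendQuery(con,
    "create table mysqltable (number integer, letter varchar(1))")
dbClearResult(res)
# ...then insert rows explicitly
res <- dbSendQuery(con, "insert into mysqltable values (1, 'A'), (2, 'B')")
dbClearResult(res)
dbRemoveTable(con, "mysqltable")  # clean up
```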
To remove a table:
dbRemoveTable(con, "mynewtable")
There are also methods for managing transactions if you need them: dbBegin, dbRollback, and dbCommit. Transactions are key when you need to be sure that a sequence of SQL commands (e.g. UPDATE, CREATE, DROP, DELETE, etc.) executes correctly before the changes are made permanent (i.e. "committed").
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbRollback(con)
dbGetQuery(con, "SELECT * FROM mynewtable")
The above will produce an error:
Error in result_create(conn@ptr, statement) :
Failed to prepare query: ERROR: relation "mynewtable" does not exist
LINE 1: SELECT * FROM mynewtable
because the transaction was rolled back, not committed.
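If the commands all succeed, the counterpart is dbCommit, which makes the changes permanent. A sketch, reusing mytbl from above:

```r
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbCommit(con)                                # changes are now permanent
dbGetQuery(con, "SELECT * FROM mynewtable")  # no error this time
dbRemoveTable(con, "mynewtable")             # clean up
```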
Connections will get closed when you quit R, but it’s good practice to explicitly close them.
dbDisconnect(con)
dplyr
For more complete info, see the RStudio databases site.
needToInstall <- c("tidyverse")
needToInstall <- needToInstall[which(!needToInstall %in% installed.packages())]
if(length(needToInstall) > 0){
sapply(needToInstall, install.packages)
}
library(tidyverse)
First, connect like normal:
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
Get a reference to a table:
actortbl <- tbl(con, "actor")
If we look at this object, it doesn’t have data in it:
str(actortbl)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ typnames:'data.frame': 446 obs. of 2 variables:
## .. .. .. ..$ oid : int [1:446] 16 17 18 19 20 21 22 23 24 25 ...
## .. .. .. ..$ typname: chr [1:446] "bool" "bytea" "char" "name" ...
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "actor"
## ..$ vars: chr [1:4] "actor_id" "first_name" "last_name" "last_update"
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
It just has connection information. dplyr will try to perform operations within the database where it can, instead of pulling all of the data into R.
Yet you can print the object and see observations:
actortbl
## # Source: table<actor> [?? x 4]
## # Database: postgres [christina@localhost:5432/dvdrental]
## actor_id first_name last_name last_update
## <int> <chr> <chr> <dttm>
## 1 1 Penelope Guiness 2013-05-26 14:47:57
## 2 2 Nick Wahlberg 2013-05-26 14:47:57
## 3 3 Ed Chase 2013-05-26 14:47:57
## 4 6 Bette Nicholson 2013-05-26 14:47:57
## 5 7 Grace Mostel 2013-05-26 14:47:57
## 6 8 Matthew Johansson 2013-05-26 14:47:57
## 7 9 Joe Swank 2013-05-26 14:47:57
## 8 10 Christian Gable 2013-05-26 14:47:57
## 9 11 Zero Cage 2013-05-26 14:47:57
## 10 12 Karl Berry 2013-05-26 14:47:57
## # … with more rows
It retrieves rows as needed, and also gives you a nice display in notebooks (a special kind of RMarkdown file); output looks a little different in the console and in RMarkdown files like this.
You can use dplyr verbs to work with the table objects from the database, as if they were data frames (or tibbles).
actortbl %>%
select(actor_id, first_name, last_name) %>%
filter(actor_id > 150)
## # Source: lazy query [?? x 3]
## # Database: postgres [christina@localhost:5432/dvdrental]
## actor_id first_name last_name
## <int> <chr> <chr>
## 1 151 Geoffrey Heston
## 2 152 Ben Harris
## 3 153 Minnie Kilmer
## 4 154 Meryl Gibson
## 5 155 Ian Tandy
## 6 156 Fay Wood
## 7 157 Greta Malden
## 8 158 Vivien Basinger
## 9 159 Laura Brody
## 10 160 Chris Depp
## # … with more rows
The above generates and executes the SQL needed to get the result: it turns the filter into a select statement with the appropriate where clause.
rentaltbl <- tbl(con, "rental")
rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n())
## # Source: lazy query [?? x 2]
## # Database: postgres [christina@localhost:5432/dvdrental]
## customer_id count
## <int> <integr64>
## 1 87 30
## 2 184 23
## 3 477 22
## 4 273 35
## 5 550 32
## 6 394 22
## 7 51 33
## 8 272 20
## 9 70 18
## 10 190 27
## # … with more rows
What does the above correspond to as a SQL query? select customer_id, count(*) from rental group by customer_id;
rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n()) %>%
show_query()
## <SQL>
## SELECT "customer_id", COUNT(*) AS "count"
## FROM "rental"
## GROUP BY "customer_id"
You can use collect to pull down all of the data (that is, to tell dplyr to stop being lazy).
# First, without collecting
df1 <- rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n())
df1
## # Source: lazy query [?? x 2]
## # Database: postgres [christina@localhost:5432/dvdrental]
## customer_id count
## <int> <integr64>
## 1 87 30
## 2 184 23
## 3 477 22
## 4 273 35
## 5 550 32
## 6 394 22
## 7 51 33
## 8 272 20
## 9 70 18
## 10 190 27
## # … with more rows
Looks OK, except:
df1[1,]
Gives you:
Error in df1[1, ] : incorrect number of dimensions
It's the wrong dimensions because df1 isn't actually a data.frame:
str(df1)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ typnames:'data.frame': 446 obs. of 2 variables:
## .. .. .. ..$ oid : int [1:446] 16 17 18 19 20 21 22 23 24 25 ...
## .. .. .. ..$ typname: chr [1:446] "bool" "bytea" "char" "name" ...
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 4
## ..$ name: chr "summarise"
## ..$ x :List of 4
## .. ..$ name: chr "group_by"
## .. ..$ x :List of 2
## .. .. ..$ x : 'ident' chr "rental"
## .. .. ..$ vars: chr [1:7] "rental_id" "rental_date" "inventory_id" "customer_id" ...
## .. .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## .. ..$ dots:List of 1
## .. .. ..$ customer_id: symbol customer_id
## .. ..$ args:List of 1
## .. .. ..$ add: logi FALSE
## .. ..- attr(*, "class")= chr [1:3] "op_group_by" "op_single" "op"
## ..$ dots:List of 1
## .. ..$ count: language ~n()
## .. .. ..- attr(*, ".Environment")=<environment: 0x7fa71e16dce8>
## ..$ args: list()
## ..- attr(*, "class")= chr [1:3] "op_summarise" "op_single" "op"
## - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
It is telling us we need to collect the data first to actually pull it into R.
# Then with collecting
df2 <- rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n()) %>%
collect()
df2
## # A tibble: 599 x 2
## customer_id count
## <int> <integr64>
## 1 87 30
## 2 184 23
## 3 477 22
## 4 273 35
## 5 550 32
## 6 394 22
## 7 51 33
## 8 272 20
## 9 70 18
## 10 190 27
## # … with 589 more rows
df2[1,]
## # A tibble: 1 x 2
## customer_id count
## <int> <integr64>
## 1 87 30
You can also use dplyr's commands to join:
custtbl <- tbl(con, "customer")
addrtbl <- tbl(con, "address")
custtbl %>%
inner_join(addrtbl, by="address_id") %>%
filter(postal_code == '52137') %>%
select(first_name, last_name, postal_code)
## # Source: lazy query [?? x 3]
## # Database: postgres [christina@localhost:5432/dvdrental]
## first_name last_name postal_code
## <chr> <chr> <chr>
## 1 James Gannon 52137
## 2 Freddie Duggan 52137
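To see the SQL that dplyr generates for the join, you can again use show_query():

```r
custtbl %>%
  inner_join(addrtbl, by = "address_id") %>%
  filter(postal_code == '52137') %>%
  select(first_name, last_name, postal_code) %>%
  show_query()
```

The generated statement joins customer and address on address_id with the where clause applied in the database.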
You could create a table with copy_to (if you have the correct permissions):
mytbl <-data.frame(number=1:10 , letter=LETTERS[1:10])
copy_to(con, mytbl, "mynewtable")
By default, it creates a temporary table. But this is a setting you can change, and you can also specify what columns to index on the table.
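A sketch of those settings: temporary = FALSE makes the table permanent, and indexes names columns to index (assuming you have create permissions; overwrite = TRUE replaces the temporary table created above).

```r
copy_to(con, mytbl, "mynewtable",
        overwrite = TRUE,
        temporary = FALSE,
        indexes = list("number"))
```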
Disconnect like we normally do:
dbDisconnect(con)
R Markdown lets you execute SQL queries directly. You first set up a DBI connection like above, and then, instead of R chunks of code, you can have SQL chunks of code:
```{r}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
```{sql, connection=con}
select * from actor where actor_id > 75;
```
```{r}
dbDisconnect(con)
```
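SQL chunks can also capture their result into an R data frame via the output.var chunk option, so later R chunks can work with it (the name older_actors here is made up for this example):

```{sql, connection=con, output.var="older_actors"}
select * from actor where actor_id > 75;
```

```{r}
head(older_actors)
```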
Here is the above, actually executed in RMarkdown:
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
select * from actor where actor_id > 75;
actor_id | first_name | last_name | last_update |
---|---|---|---|
76 | Angelina | Astaire | 2013-05-26 14:47:57 |
77 | Cary | Mcconaughey | 2013-05-26 14:47:57 |
78 | Groucho | Sinatra | 2013-05-26 14:47:57 |
79 | Mae | Hoffman | 2013-05-26 14:47:57 |
80 | Ralph | Cruz | 2013-05-26 14:47:57 |
81 | Scarlett | Damon | 2013-05-26 14:47:57 |
82 | Woody | Jolie | 2013-05-26 14:47:57 |
83 | Ben | Willis | 2013-05-26 14:47:57 |
84 | James | Pitt | 2013-05-26 14:47:57 |
85 | Minnie | Zellweger | 2013-05-26 14:47:57 |
dbDisconnect(con)
For more details, see knitr Language Engines: SQL.
Database administrators can install functionality in a PostgreSQL database to allow you to write R functions directly in the database and then call them with normal SQL queries. This is done with PL/R. Enabling this functionality can be risky, because R potentially gives users access to files on the database server, so database admins are usually conservative about allowing PL/R on a system. But it can be very useful in production systems: you can use it to generate reports, run statistical computations, and even create plots.
We aren’t covering PL/R (or even writing SQL functions more generally), but it’s good to know this functionality exists if you’re ever working with a large production system.