How to write tidy SQL queries in R
Most of us have to interact with databases nowadays, and SQL is by far the most common language used. However, working with SQL in R can be messy. If your queries are complex, you have to code them up as text strings, which is error-prone and hard to format. And when you want your SQL queries to contain variables, you are forced into substitution or pasting, which is a little hacky.
Ideally you want to be able to work with your database using tidy principles, taking advantage of the wonders of the tidyverse, and preferably without downloading the entire database into your session first. This is where the magic of dbplyr comes in. dbplyr acts as a SQL translator, allowing you to work with databases using tidyverse verbs.
Basic principles of working in dbplyr
To work in dbplyr, you set up your database connection in the same way as you normally would in your R session; let's call it myconn. You then set up database objects in your R session using dbplyr::in_schema(). This takes two arguments: first, the schema you want to access in your database connection, and second, the table you are interested in within that schema. Here's an example of how to set one up:

catdata <- dplyr::tbl(
  myconn,
  dbplyr::in_schema("ANIMAL_ANALYSTS", "CAT_TABLE")
)
Now catdata is a database object. The command above connects to the database and downloads a bare minimum of information on fields, data types, etc. — enough to allow manipulation of the object without physical download of the data.
You can now manipulate catdata in the same way as you would manipulate other tables in R. For example:

weight_by_age <- catdata %>%
  dplyr::group_by(AGE) %>%
  dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE))
All these manipulations occur without physical download of the data, by translating your code into SQL in the background. Since data download is often the most time consuming step, this allows you to think about how much work you want to get done on the server before you pull the data.
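You can verify this laziness yourself without a database server. Here is a minimal, self-contained sketch using an in-memory SQLite database via the RSQLite package (the table name and data here are invented for illustration):

```r
library(DBI)     # database connections
library(dplyr)   # the verbs that dbplyr translates
library(dbplyr)  # provides the SQL translation backend

# Throwaway in-memory database with a small example table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
cats <- data.frame(AGE = c(1, 1, 2), WT = c(3.2, 3.8, 4.5))
DBI::dbWriteTable(con, "CAT_TABLE", cats)

catdata <- dplyr::tbl(con, "CAT_TABLE")

# Nothing is downloaded yet; this only builds a query plan
weight_by_age <- catdata %>%
  dplyr::group_by(AGE) %>%
  dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE))

# Inspect the SQL that dbplyr will send to the server
weight_by_age %>% dplyr::show_query()

# Only now is the query executed and the result pulled into R
result <- weight_by_age %>% dplyr::collect()
DBI::dbDisconnect(con)
```

dplyr::show_query() is a handy companion here: it prints the SQL that has been compiled in the background, which is useful both for debugging and for learning what your pipeline translates to.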
When you are ready to pull the data, you just use dplyr::collect(). This sends the SQL query compiled in the background to the database and executes it. For example:

weight_by_age %>%
  dplyr::rename(`Age of Cat` = AGE, `Average Weight` = AVG_WT) %>%
  dplyr::collect()
More complex SQL operations in dbplyr
dbplyr is highly flexible, and I have yet to find a SQL query that I could not rewrite tidily using dbplyr.
Joins work by using dplyr's join functions on database objects, for example:

fullcatdata <- dplyr::left_join(
    catregistrationdetails,
    catdata,
    by = "SERIAL_NO"
  ) %>%
  dplyr::left_join(cathealthrecord, by = "SERIAL_NO")
New columns can be added to the data using dplyr::mutate(), and can even be used for more complex joins. For example, if your cat serial number has a "CAT-" prefix in one table but not the other:

fullcatdata <- catregistrationdetails %>%
  dplyr::mutate(SERIAL_NO = paste0("CAT-", SERIAL_NO)) %>%
  dplyr::left_join(catdata, by = "SERIAL_NO")
dbplyr cleverly translates R functions into SQL equivalents. You can see what it does using the dbplyr::translate_sql() function. For example:

dbplyr::translate_sql(substr(NAME, -3, -1))
#> <SQL> substr("NAME", -3, 3)
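Note that in recent versions of dbplyr, translate_sql() also takes a con argument naming the backend to translate for; dbplyr::simulate_dbi() supplies a generic one for experimenting offline. A quick sketch (the column names are illustrative):

```r
library(dbplyr)

# A simulated generic backend -- no real database needed
con <- simulate_dbi()

# R's substr() is translated to SQL's substr()
translate_sql(substr(NAME, -3, -1), con = con)

# paste0() is translated to SQL string concatenation
translate_sql(paste0("CAT-", SERIAL_NO), con = con)

# Functions dbplyr doesn't know about are passed through verbatim,
# so you can call database-specific functions directly
translate_sql(REGEXP_LIKE(NAME, "^Mr"), con = con)
```

The pass-through behaviour in the last line is particularly useful: it means you can mix tidyverse verbs with your database's native functions in the same pipeline.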
I find dbplyr also allows me to code more easily in reactive environments. If you were building a Shiny app that calculates the average weight of cats for a given input input$age, you could write:

weight <- reactive({
  catdata %>%
    dplyr::filter(AGE == !!input$age) %>%
    dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE)) %>%
    dplyr::collect()
})

(The !! unquotes input$age so that its current value, rather than the expression itself, is embedded in the generated SQL.)
These are just some of the many ways that dbplyr helps you work more tidily with SQL. I highly recommend it.
For more information, see the dbplyr package documentation.
Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.