How to write tidy SQL queries in R

阿新 • Published: 2018-12-28

Most of us have to interact with databases nowadays, and SQL is by far the most common language for doing so. However, working with SQL in R can be messy. Complex queries have to be written as text strings, which is error-prone and hard to format. And when you want to put variables inside your SQL queries, you are forced to substitute or paste values into the string, which is a little bit hacky.

Ideally you want to be able to work with your database using tidy principles, taking advantage of the wonders of the tidyverse, and preferably without downloading the entire database into your session first. This is where the magic of dbplyr comes in.

dbplyr acts as a SQL translator and allows you to work with databases using tidyverse verbs. So now you can pipe to your heart's content. If you haven't been using it yet, I would get onto it now. Writing tidy database queries has many advantages. You can understand your work more easily when you come back to it after a while, you can comment more clearly, and it also forces you to think about the most efficient structure for your queries.

Basic principles of working in dbplyr

To work in dbplyr, you set up your database connection in the same way as you normally would in your R session; let's call it myconn. You then create database objects in your R session by passing the connection to dplyr::tbl() together with dbplyr::in_schema(). in_schema() takes two arguments: first, the schema you want to access in your database connection, and second, the table you are interested in within that schema. Here's an example of how to set one up:

catdata <- dplyr::tbl(
  myconn,
  dbplyr::in_schema("ANIMAL_ANALYSTS", "CAT_TABLE")
)

Now catdata is a database object. The command above connects to the database and downloads only a bare minimum of information (fields, data types, and so on), which is enough to manipulate the object without physically downloading the data.
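As a minimal sketch of the setup above, the following uses an in-memory SQLite database as a stand-in for a real server. The sample data and the use of SQLite's default "main" schema are assumptions made for illustration; the ANIMAL_ANALYSTS schema from the article is not reproduced here.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for the real server.
myconn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical sample data written to a table named CAT_TABLE.
DBI::dbWriteTable(
  myconn, "CAT_TABLE",
  data.frame(AGE = c(1, 1, 2, 2), WT = c(3.0, 3.4, 4.1, NA))
)

# SQLite's default schema is called "main".
catdata <- dplyr::tbl(myconn, dbplyr::in_schema("main", "CAT_TABLE"))

# The object is a lazy reference to the table, not a local data frame.
is_lazy <- inherits(catdata, "tbl_lazy")
column_names <- colnames(catdata)

DBI::dbDisconnect(myconn)
```

The connection call is the only part that would change for another backend; the dplyr::tbl() line keeps the same shape.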

You can now manipulate catdata in the same way as you would manipulate other tables in R. For example:

weight_by_age <- catdata %>%
  dplyr::group_by(AGE) %>%
  dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE))

All these manipulations occur without physical download of the data, by translating your code into SQL in the background. Since data download is often the most time-consuming step, this allows you to think about how much work you want to get done on the server before you pull the data.
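To see the SQL that is built in the background, dplyr::show_query() prints it and dbplyr::sql_render() returns it as an object. The sketch below assumes an in-memory SQLite table with made-up rows; the exact SQL text will differ between database backends and dbplyr versions.

```r
library(dplyr)

# Assumption: in-memory SQLite with hypothetical sample rows.
myconn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  myconn, "CAT_TABLE",
  data.frame(AGE = c(1, 1, 2), WT = c(3.0, 3.4, 4.1))
)
catdata <- dplyr::tbl(myconn, "CAT_TABLE")

weight_by_age <- catdata %>%
  dplyr::group_by(AGE) %>%
  dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE))

# Print the SQL that would be sent; no rows are downloaded here.
dplyr::show_query(weight_by_age)

# The same SQL as a character string, for example for logging.
generated_sql <- as.character(dbplyr::sql_render(weight_by_age))

DBI::dbDisconnect(myconn)
```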

When you are ready to pull the data, you just use dplyr::collect(). This sends the SQL query that was compiled in the background to the database, executes it, and returns the result to your R session. For example:

weight_by_age %>%
  dplyr::rename(`Age of Cat` = AGE,
                `Average Weight` = AVG_WT) %>%
  dplyr::collect()
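The difference between the lazy object and the collected result can be sketched as follows, again assuming an in-memory SQLite table with hypothetical values. The dplyr::arrange() step is added here only to make the row order predictable.

```r
library(dplyr)

# Assumption: in-memory SQLite with hypothetical sample rows.
myconn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  myconn, "CAT_TABLE",
  data.frame(AGE = c(1, 1, 2, 2), WT = c(3, 5, 6, NA))
)
catdata <- dplyr::tbl(myconn, "CAT_TABLE")

weight_by_age <- catdata %>%
  dplyr::group_by(AGE) %>%
  dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE))

# Still lazy: no rows have been transferred yet.
lazy_before <- inherits(weight_by_age, "tbl_lazy")

# collect() runs the query and returns a local tibble.
local_result <- weight_by_age %>%
  dplyr::arrange(AGE) %>%
  dplyr::collect()

DBI::dbDisconnect(myconn)
```

After collect(), local_result is an ordinary tibble and remains usable once the connection is closed.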

More complex SQL operations in dbplyr

dbplyr is highly flexible, and I have yet to find a SQL query that I could not rewrite tidily using dbplyr.

Joins work by using dplyr's join functions on database objects, for example:

fullcatdata <- dplyr::left_join(
  catregistrationdetails,
  catdata,
  by = "SERIAL_NO"
) %>%
  dplyr::left_join(
    cathealthrecord,
    by = "SERIAL_NO"
  )
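A small runnable sketch of a database-side join, assuming two hypothetical tables in an in-memory SQLite database (the table names REGISTRATION and CAT_TABLE and their contents are invented for illustration):

```r
library(dplyr)

# Assumption: in-memory SQLite with two hypothetical tables.
myconn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  myconn, "REGISTRATION",
  data.frame(SERIAL_NO = c("A1", "A2", "A3"), OWNER = c("Kim", "Lee", "Ito"))
)
DBI::dbWriteTable(
  myconn, "CAT_TABLE",
  data.frame(SERIAL_NO = c("A1", "A2"), AGE = c(2, 5))
)

catregistrationdetails <- dplyr::tbl(myconn, "REGISTRATION")
catdata <- dplyr::tbl(myconn, "CAT_TABLE")

# The join is expressed in dplyr but executed by the database.
joined <- dplyr::left_join(catregistrationdetails, catdata, by = "SERIAL_NO")
join_sql <- as.character(dbplyr::sql_render(joined))

joined_local <- joined %>%
  dplyr::arrange(SERIAL_NO) %>%
  dplyr::collect()

DBI::dbDisconnect(myconn)
```

The cat registered as "A3" has no row in CAT_TABLE, so its AGE comes back as NA, as expected for a left join.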

New columns can be added to the data using dplyr::mutate(), and can even be used for more complex joins. For example, if your cat serial number has a “CAT-” at the beginning in one table but not another:

fullcatdata <- catregistrationdetails %>%
  dplyr::mutate(SERIAL_NO = paste0("CAT-", SERIAL_NO)) %>%
  dplyr::left_join(catdata, by = "SERIAL_NO")
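The prefix fix can be tried end to end with the sketch below. It assumes an in-memory SQLite database and invented serial numbers; how paste0() is rendered in SQL depends on the backend.

```r
library(dplyr)

# Assumption: in-memory SQLite; serial numbers are hypothetical.
myconn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  myconn, "REGISTRATION",
  data.frame(SERIAL_NO = c("001", "002"), OWNER = c("Kim", "Lee"))
)
DBI::dbWriteTable(
  myconn, "CAT_TABLE",
  data.frame(SERIAL_NO = c("CAT-001", "CAT-002"), AGE = c(2, 5))
)

# Add the missing prefix on the database side, then join on the repaired key.
fixed <- dplyr::tbl(myconn, "REGISTRATION") %>%
  dplyr::mutate(SERIAL_NO = paste0("CAT-", SERIAL_NO)) %>%
  dplyr::left_join(dplyr::tbl(myconn, "CAT_TABLE"), by = "SERIAL_NO")

fixed_local <- fixed %>%
  dplyr::arrange(SERIAL_NO) %>%
  dplyr::collect()

DBI::dbDisconnect(myconn)
```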

dbplyr cleverly translates R functions into SQL equivalents. You can see what it does using the dbplyr::translate_sql() function. For example:

dbplyr::translate_sql(substr(NAME, -3, -1))
<SQL> substr("NAME", -3, 3)
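Translations can also be inspected without any database by passing a simulated connection. This is a sketch under the assumption that dbplyr::simulate_dbi() is acceptable as a generic stand-in; the exact SQL text printed (quoting style, function case) varies between dbplyr versions and backends, so no output is shown.

```r
# Assumption: a simulated generic connection, so no database is needed.
con <- dbplyr::simulate_dbi()

# A scalar function translated to its SQL equivalent.
upper_sql <- dbplyr::translate_sql(toupper(NAME), con = con)

# An aggregate; window = FALSE asks for the plain aggregate form
# rather than a window function.
avg_sql <- dbplyr::translate_sql(
  mean(WT, na.rm = TRUE),
  con = con,
  window = FALSE
)

print(upper_sql)
print(avg_sql)
```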

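I find dbplyr also makes it easier to code in reactive environments. Suppose you were building a Shiny app that calculates the average weight of cats for an age supplied as input$age. The average has to be computed with dplyr::summarise() so that it is translated to SQL (base mean() cannot be applied to a database object directly), and !! forces input$age to be evaluated in R before translation:

weight <- reactive({
  catdata %>%
    dplyr::filter(AGE == !!input$age) %>%
    dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE)) %>%
    dplyr::collect()
})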
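The same query logic can be exercised outside Shiny by wrapping it in a plain function. In the sketch below, avg_weight_for_age() is a hypothetical helper, the age argument plays the role of input$age, and the data lives in an assumed in-memory SQLite table.

```r
library(dplyr)

# Hypothetical helper: average weight of cats of a given age,
# computed in the database and returned as a single number.
avg_weight_for_age <- function(cat_tbl, age) {
  cat_tbl %>%
    dplyr::filter(AGE == !!age) %>%  # !! inlines the R value into the query
    dplyr::summarise(AVG_WT = mean(WT, na.rm = TRUE)) %>%
    dplyr::pull(AVG_WT)              # runs the query and extracts the column
}

# Assumption: in-memory SQLite with hypothetical sample rows.
myconn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  myconn, "CAT_TABLE",
  data.frame(AGE = c(1, 1, 2), WT = c(3, 5, 6))
)
catdata <- dplyr::tbl(myconn, "CAT_TABLE")

avg_age_1 <- avg_weight_for_age(catdata, 1)

DBI::dbDisconnect(myconn)
```

Inside a Shiny server function, a reactive could call such a helper with input$age, so only one summary row ever leaves the database.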

These are just some of the many ways that dbplyr helps you work more tidily with SQL. I highly recommend it.

For more information on dbplyr, see the package documentation.

Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.
