Wednesday, August 8, 2012

Manipulating Data Frames Using sqldf - A Brief Overview

By Josh Mills

Introduction

For those who are learning R and who may be well-versed in SQL, the sqldf package provides a mechanism to manipulate R data frames using SQL. Even for experienced R programmers, sqldf can be a useful tool for data manipulation. This site provides a useful introduction to SQL. [SQLCourse.com 2012]

The following packages will be used in this document:
  • sqldf - A package that allows manipulation of R data frames with SQL (as well as connectivity with a limited set of database engines). [Grothendieck 2012]
  • plyr - A useful package for aggregating and summarizing data over multiple subgroups, with more advanced applications. [Wickham 2011]
Load these packages into memory.


Data Sets Used

Highway Data (crashes.csv and roads.csv - Click to Download)

These are fictional data sets containing crash and highway data with the following variables:
Crash data (crashes.csv)
  • Year - The year in which the observation was taken
  • Road - The name of the road being studied
  • N_Crashes - The number of crashes on the road during that year
  • Volume - Average annual daily traffic (AADT) volumes on the road during that year. See this site for a formal definition. [North Carolina Department of Transportation 2012]
Road data (roads.csv)
  • Road - The name of the road being studied
  • District - The administrative district responsible for the road's upkeep and maintenance
  • Length - Length of the road in miles


Joins and Merges with sqldf

Read in and explore the data to get a feel for the data's structure.

setwd("W:/Data Mining and Modeling/Applied Analytics - R Discussion/Related Files")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
head(crashes)
##   Year          Road N_Crashes Volume
## 1 1991 Interstate 65        25  40000
## 2 1992 Interstate 65        37  41000
## 3 1993 Interstate 65        45  45000
## 4 1994 Interstate 65        46  45600
## 5 1995 Interstate 65        46  49000
## 6 1996 Interstate 65        59  51000
tail(crashes)
##     Year           Road N_Crashes Volume
## 105 2007 Interstate 275        32  21900
## 106 2008 Interstate 275        21  21850
## 107 2009 Interstate 275        25  22100
## 108 2010 Interstate 275        24  21500
## 109 2011 Interstate 275        23  20300
## 110 2012 Interstate 275        22  21200
print(roads)
##            Road       District Length
## 1 Interstate 65     Greenfield    262
## 2 Interstate 70      Vincennes    156
## 3         US-36 Crawfordsville    139
## 4         US-40     Greenfield    150
## 5         US-52 Crawfordsville    172

Performing joins is one of the most common operations in SQL. Left joins return all rows in the “left-hand” table - the crash data set in this case, whereas right joins return all rows in the “right-hand” table - the road data set in this case. Inner joins return only rows with matching data for the common variable, and full outer joins return all rows in all data sets, even if there are rows without matches. Currently, sqldf does not support right joins or full outer joins.

It is useful to format SQL statements with spaces and line breaks for readability and to store the query in a character string. The following statement will perform a left join of the crash data set to the road data set based on the common variable Road.

join_string <- "select
                crashes.*
              , roads.District
              , roads.Length
              from crashes
                left join roads
                on crashes.Road = roads.Road"

A new data frame, crashes_join_roads, will be created using the sqldf statement. The sqldf statement, at minimum, requires a character string with the SQL operation to be performed. The stringsAsFactors argument will force categorical variables (like Road and District) to have the class character rather than factor.

crashes_join_roads <- sqldf(join_string,stringsAsFactors = FALSE)
## Loading required package: tcltk
head(crashes_join_roads)
##   Year          Road N_Crashes Volume   District Length
## 1 1991 Interstate 65        25  40000 Greenfield    262
## 2 1992 Interstate 65        37  41000 Greenfield    262
## 3 1993 Interstate 65        45  45000 Greenfield    262
## 4 1994 Interstate 65        46  45600 Greenfield    262
## 5 1995 Interstate 65        46  49000 Greenfield    262
## 6 1996 Interstate 65        59  51000 Greenfield    262
tail(crashes_join_roads)
##     Year           Road N_Crashes Volume District Length
## 105 2007 Interstate 275        32  21900     <NA>     NA
## 106 2008 Interstate 275        21  21850     <NA>     NA
## 107 2009 Interstate 275        25  22100     <NA>     NA
## 108 2010 Interstate 275        24  21500     <NA>     NA
## 109 2011 Interstate 275        23  20300     <NA>     NA
## 110 2012 Interstate 275        22  21200     <NA>     NA

By using an inner join, only matching rows will be kept.

join_string2 <- "select
                crashes.*
              , roads.District
              , roads.Length
              from crashes
                inner join roads
                on crashes.Road = roads.Road"
crashes_join_roads2 <- sqldf(join_string2, stringsAsFactors = FALSE)
head(crashes_join_roads2)
##   Year          Road N_Crashes Volume   District Length
## 1 1991 Interstate 65        25  40000 Greenfield    262
## 2 1992 Interstate 65        37  41000 Greenfield    262
## 3 1993 Interstate 65        45  45000 Greenfield    262
## 4 1994 Interstate 65        46  45600 Greenfield    262
## 5 1995 Interstate 65        46  49000 Greenfield    262
## 6 1996 Interstate 65        59  51000 Greenfield    262
tail(crashes_join_roads2)
##    Year  Road N_Crashes Volume       District Length
## 83 2007 US-36        49  24000 Crawfordsville    139
## 84 2008 US-36        52  24500 Crawfordsville    139
## 85 2009 US-36        55  24700 Crawfordsville    139
## 86 2010 US-36        35  23000 Crawfordsville    139
## 87 2011 US-36        33  21000 Crawfordsville    139
## 88 2012 US-36        31  20500 Crawfordsville    139

The merge statement in base R can perform the equivalent of inner and left joins, as well as right and full outer joins, which are unavailable in sqldf.

crashes_merge_roads <- merge(crashes, roads, by = c("Road"))
head(crashes_merge_roads)
##            Road Year N_Crashes Volume   District Length
## 1 Interstate 65 2000        95  74000 Greenfield    262
## 2 Interstate 65 1997        76  52000 Greenfield    262
## 3 Interstate 65 1998        90  58000 Greenfield    262
## 4 Interstate 65 1999        95  65000 Greenfield    262
## 5 Interstate 65 1991        25  40000 Greenfield    262
## 6 Interstate 65 1992        37  41000 Greenfield    262
tail(crashes_merge_roads)
##     Road Year N_Crashes Volume   District Length
## 83 US-40 2003        94  55200 Greenfield    150
## 84 US-40 2004        25  55300 Greenfield    150
## 85 US-40 2009        67  65000 Greenfield    150
## 86 US-40 2010       102  67000 Greenfield    150
## 87 US-40 2011        87  67500 Greenfield    150
## 88 US-40 2012        32  67500 Greenfield    150
crashes_merge_roads2 <- merge(crashes, roads, by = c("Road"), all.x = TRUE)
head(crashes_merge_roads2)
##             Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994        21  21200     <NA>     NA
## 2 Interstate 275 1995        28  23200     <NA>     NA
## 3 Interstate 275 1996        22  20000     <NA>     NA
## 4 Interstate 275 1997        27  18000     <NA>     NA
## 5 Interstate 275 1998        21  19500     <NA>     NA
## 6 Interstate 275 1999        22  21000     <NA>     NA
tail(crashes_merge_roads2)
##      Road Year N_Crashes Volume   District Length
## 105 US-40 2003        94  55200 Greenfield    150
## 106 US-40 2004        25  55300 Greenfield    150
## 107 US-40 2009        67  65000 Greenfield    150
## 108 US-40 2010       102  67000 Greenfield    150
## 109 US-40 2011        87  67500 Greenfield    150
## 110 US-40 2012        32  67500 Greenfield    150
crashes_merge_roads3 <- merge(crashes, roads, by = c("Road"), all.y = TRUE)
head(crashes_merge_roads3)
##            Road Year N_Crashes Volume   District Length
## 1 Interstate 65 2000        95  74000 Greenfield    262
## 2 Interstate 65 1997        76  52000 Greenfield    262
## 3 Interstate 65 1998        90  58000 Greenfield    262
## 4 Interstate 65 1999        95  65000 Greenfield    262
## 5 Interstate 65 1991        25  40000 Greenfield    262
## 6 Interstate 65 1992        37  41000 Greenfield    262
tail(crashes_merge_roads3)
##     Road Year N_Crashes Volume       District Length
## 84 US-40 2004        25  55300     Greenfield    150
## 85 US-40 2009        67  65000     Greenfield    150
## 86 US-40 2010       102  67000     Greenfield    150
## 87 US-40 2011        87  67500     Greenfield    150
## 88 US-40 2012        32  67500     Greenfield    150
## 89 US-52   NA        NA     NA Crawfordsville    172
crashes_merge_roads4 <- merge(crashes, roads, by = c("Road"), all.x = TRUE, 
    all.y = TRUE)
head(crashes_merge_roads4)
##             Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994        21  21200     <NA>     NA
## 2 Interstate 275 1995        28  23200     <NA>     NA
## 3 Interstate 275 1996        22  20000     <NA>     NA
## 4 Interstate 275 1997        27  18000     <NA>     NA
## 5 Interstate 275 1998        21  19500     <NA>     NA
## 6 Interstate 275 1999        22  21000     <NA>     NA
tail(crashes_merge_roads4)
##      Road Year N_Crashes Volume       District Length
## 106 US-40 2004        25  55300     Greenfield    150
## 107 US-40 2009        67  65000     Greenfield    150
## 108 US-40 2010       102  67000     Greenfield    150
## 109 US-40 2011        87  67500     Greenfield    150
## 110 US-40 2012        32  67500     Greenfield    150
## 111 US-52   NA        NA     NA Crawfordsville    172

Note how the order of the rows in the data frames were rearranged when using the merge statement.

The sqldf statement can process SQLite commands, which include most of the standard syntax used in ANSI SQL, except for some of the join operations outlined previously mentioned. [SQLite.org 2012]

Modifying the inner join query to include a where is the equivalent of combining merge and subset statements.

join_string2 <- "select
                crashes.*
              , roads.District
              , roads.Length
                from crashes
                    inner join roads
                    on crashes.Road = roads.Road
                where crashes.Road = 'US-40'"                
crashes_join_roads4 <- sqldf(join_string2,stringsAsFactors = FALSE)
head(crashes_join_roads4)
##   Year  Road N_Crashes Volume   District Length
## 1 1991 US-40        46  21000 Greenfield    150
## 2 1992 US-40       101  21500 Greenfield    150
## 3 1993 US-40        76  23000 Greenfield    150
## 4 1994 US-40        72  21000 Greenfield    150
## 5 1995 US-40        75  24000 Greenfield    150
## 6 1996 US-40       136  23500 Greenfield    150
tail(crashes_join_roads4)
##    Year  Road N_Crashes Volume   District Length
## 17 2007 US-40        45  59500 Greenfield    150
## 18 2008 US-40        23  61000 Greenfield    150
## 19 2009 US-40        67  65000 Greenfield    150
## 20 2010 US-40       102  67000 Greenfield    150
## 21 2011 US-40        87  67500 Greenfield    150
## 22 2012 US-40        32  67500 Greenfield    150


Aggregation Functions and Limitations of sqldf

Aggregate functions available using SQLite can be used through the use of a group by clause.

group_string <- "select
                  crashes.Road
                 , avg(crashes.N_Crashes) as Mean_Crashes
                 from crashes
                    left join roads
                    on crashes.Road = roads.Road
                 group by 1"
sqldf(group_string)
##             Road Mean_Crashes
## 1 Interstate 275        24.95
## 2  Interstate 65       107.82
## 3  Interstate 70        65.18
## 4          US-36        48.00
## 5          US-40        68.68

The available aggregation functions within SQLite or ANSI SQL are limited, however. While sqldf can make certain data manipulation operations easier, more advanced data manipulation tasks and calculations must be performed in R, such as using Hadley Wickham's plyr package.

ddply(crashes_merge_roads,
      c("Road"),
      function(X) data.frame(Mean_Crashes = mean(X$N_Crashes),
                             Q1_Crashes = quantile(X$N_Crashes, 0.25),
                             Q3_Crashes = quantile(X$N_Crashes, 0.75),
                             Median_Crashes = quantile(X$N_Crashes, 0.50))
      )
##            Road Mean_Crashes Q1_Crashes Q3_Crashes Median_Crashes
## 1 Interstate 65       107.82      63.25     140.25          108.5
## 2 Interstate 70        65.18      52.00      75.50           66.5
## 3         US-36        48.00      42.00      57.25           47.0
## 4         US-40        68.68      45.25      90.75           70.0

In short, the sqldf package can make it easy for SQL users to begin making the transition to R. The package provides a convenient mechanism for data manipulation in R using SQL. While there are limitations to the use of SQL within R, the added convenience provides a useful alternative to using standard R functions. In addition, SQL statements used can easily be modified to function in a large-scale database environment such as Teradata or Netezza.

Download this document in R Markdown format.

References
  • SQLCourse.com (2012). SQLCourse.com®: Interactive Online SQL Training. Link
  • G. Grothendieck (2012). sqldf: Perform SQL Selects on R Data Frames. R package version 0.4-6.4. Link
  • H. Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. Link
  • North Carolina Department of Transportation (2012). Training Material for Traffic Engineering Accident Analysis System (TEAAS). Link
  • SQLite.org (2012). SQL As Understood By SQLite. Link
  • R Development Core Team (2012). R: A language and environment for statistical  computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN  3-900051-07-0. Link
This document was generated primarily using R Markdown and knitr.

8 comments:

  1. I LOVE how you've done it in knitr!

    ReplyDelete
  2. The query shown that does not use sqldf could have. sqldf automatically pulls in the RSQLite.extfuns library when using it with the default sqlite database so all that library's functions including lower_quantile, upper_quantile and median are available out of the box. See http://sqldf.googlecode.com/#Example_15._Use_of_RSQLite.extfuns_library_functions

    ReplyDelete
  3. Any chance you can share the datasets for reproducibility's sake?

    ReplyDelete
    Replies
    1. I've uploaded the files so they can be downloaded - I embedded links within the post. Thanks!

      Delete
  4. Hello

    How would you use it jointly with the xts or zoo packages?
    For example I want to apply the aggregate zoo command to two 10GB time series (I want to use zoo because they have different timing or are non-homogeneous and I cannot just glue them together).

    The problem is that I don't know how to use sqldf (or hadoop or ff...) as input for zoo.
    Regards

    ReplyDelete
    Replies
    1. Hello Skan and thanks for the question.

      I have crafted a simple example that creates a zoo object based on the results returned by a call to sqldf (arbitrarily filtering a generated data.frame). I then aggregate over the zoo object using a coarser grained representation of the time variable (similar to the example provided in the zoo documentation).

      #load required libraries
      library(sqldf)
      library(zoo)

      #create some data
      data=data.frame(val1=rnorm(12, mean=20),t=as.Date("2003-02-01") + c(1, 3, 7, 9, 11, 14, 17, 20, 22, 25, 28, 30) - 1)

      #add a column to filter data with
      data$val2=c(rep("a",6),rep("b",6))

      #add a column with a coarser grained date for use with aggregate
      data$year_qtr=as.yearqtr(data$t)

      #arbitrarily query the data.frame using sqldf
      filter.data=sqldf("select val1, val2, t, year_qtr from data where val2='b'")

      #create a zoo object
      the.zoo=zoo(filter.data$val1,filter.data$t)

      #aggregate using zoo
      aggregate(the.zoo,filter.data$year_qtr,mean)

      A value similar to this:

      2003 Q1
      20.42594 should be returned

      A call to sqldf returns a data.frame so you simply need to make sure you pass in the correct vector arguments to zoo in order to make it work. Hope that helps and good luck with those large files.

      Delete
  5. I use sqldf a lot and I am amazed at how simple and fast it operates on data frames. But I was very much surprised when I found that I couldn't use exponentiation.

    ReplyDelete
    Replies
    1. Thanks for the comment Raffael. Exponentiation is supported via the POWER function.

      Try this example using the built-in dataset warpbreaks:

      data(warpbreaks)
      sqldf("select power(breaks,2) as 'breaks^2' from warpbreaks limit 6")

      sqldf will return a new column named breaks^2 containing the squared value of breaks.

      The functions supported by SQLite are here

      Delete