Introduction
For those who are learning R and who may be well-versed in SQL, thesqldf
package provides a mechanism to manipulate R data frames using SQL. Even for experienced R programmers, sqldf
can be a useful tool for data manipulation. This site provides a useful introduction to SQL. [SQLCourse.com 2012]The following packages will be used in this document:
- sqldf - A package that allows manipulation of R data frames with SQL (as well as connectivity with a limited set of database engines). [Grothendieck 2012]
- plyr - A useful package for aggregating and summarizing data over multiple subgroups, with more advanced applications. [Wickham 2011]
Data Sets Used
Highway Data (crashes.csv and roads.csv - Click to Download)
These are fictional data sets containing crash and highway data with the following variables:Crash data (crashes.csv)
Year
- The year in which the observation was takenRoad
- The name of the road being studiedN_Crashes
- The number of crashes on the road during that yearVolume
- Average annual daily traffic (AADT) volumes on the road during that year. See this site for a formal definition. [North Carolina Department of Transportation 2012]
Road
- The name of the road being studiedDistrict
- The administrative district responsible for the road's upkeep and maintenanceLength
- Length of the road in miles
Joins and Merges with sqldf
Read in and explore the data to get a feel for the data's structure.setwd("W:/Data Mining and Modeling/Applied Analytics - R Discussion/Related Files")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
head(crashes)
## Year Road N_Crashes Volume
## 1 1991 Interstate 65 25 40000
## 2 1992 Interstate 65 37 41000
## 3 1993 Interstate 65 45 45000
## 4 1994 Interstate 65 46 45600
## 5 1995 Interstate 65 46 49000
## 6 1996 Interstate 65 59 51000
tail(crashes)
## Year Road N_Crashes Volume
## 105 2007 Interstate 275 32 21900
## 106 2008 Interstate 275 21 21850
## 107 2009 Interstate 275 25 22100
## 108 2010 Interstate 275 24 21500
## 109 2011 Interstate 275 23 20300
## 110 2012 Interstate 275 22 21200
print(roads)
## Road District Length
## 1 Interstate 65 Greenfield 262
## 2 Interstate 70 Vincennes 156
## 3 US-36 Crawfordsville 139
## 4 US-40 Greenfield 150
## 5 US-52 Crawfordsville 172
Performing joins is one of the most common operations in SQL. Left joins return all rows in the “left-hand” table - the crash data set in this case, whereas right joins return all rows in the “right-hand” table - the road data set in this case. Inner joins return only rows with matching data for the common variable, and full outer joins return all rows in all data sets, even if there are rows without matches. Currently,
sqldf
does not support right joins or full outer joins.It is useful to format SQL statements with spaces and line breaks for readability and to store the query in a character string. The following statement will perform a left join of the crash data set to the road data set based on the common variable
Road
. join_string <- "select
crashes.*
, roads.District
, roads.Length
from crashes
left join roads
on crashes.Road = roads.Road"
A new data frame,
crashes_join_roads
, will be created using the sqldf
statement. The sqldf
statement, at minimum, requires a character string with the SQL operation to be performed. The stringsAsFactors
argument will force categorical variables (like Road
and District
) to have the class character
rather than factor
.crashes_join_roads <- sqldf(join_string,stringsAsFactors = FALSE)
## Loading required package: tcltk
head(crashes_join_roads)
## Year Road N_Crashes Volume District Length
## 1 1991 Interstate 65 25 40000 Greenfield 262
## 2 1992 Interstate 65 37 41000 Greenfield 262
## 3 1993 Interstate 65 45 45000 Greenfield 262
## 4 1994 Interstate 65 46 45600 Greenfield 262
## 5 1995 Interstate 65 46 49000 Greenfield 262
## 6 1996 Interstate 65 59 51000 Greenfield 262
tail(crashes_join_roads)
## Year Road N_Crashes Volume District Length
## 105 2007 Interstate 275 32 21900 <NA> NA
## 106 2008 Interstate 275 21 21850 <NA> NA
## 107 2009 Interstate 275 25 22100 <NA> NA
## 108 2010 Interstate 275 24 21500 <NA> NA
## 109 2011 Interstate 275 23 20300 <NA> NA
## 110 2012 Interstate 275 22 21200 <NA> NA
By using an inner join, only matching rows will be kept.
join_string2 <- "select
crashes.*
, roads.District
, roads.Length
from crashes
inner join roads
on crashes.Road = roads.Road"
crashes_join_roads2 <- sqldf(join_string2, stringsAsFactors = FALSE)
head(crashes_join_roads2)
## Year Road N_Crashes Volume District Length
## 1 1991 Interstate 65 25 40000 Greenfield 262
## 2 1992 Interstate 65 37 41000 Greenfield 262
## 3 1993 Interstate 65 45 45000 Greenfield 262
## 4 1994 Interstate 65 46 45600 Greenfield 262
## 5 1995 Interstate 65 46 49000 Greenfield 262
## 6 1996 Interstate 65 59 51000 Greenfield 262
tail(crashes_join_roads2)
## Year Road N_Crashes Volume District Length
## 83 2007 US-36 49 24000 Crawfordsville 139
## 84 2008 US-36 52 24500 Crawfordsville 139
## 85 2009 US-36 55 24700 Crawfordsville 139
## 86 2010 US-36 35 23000 Crawfordsville 139
## 87 2011 US-36 33 21000 Crawfordsville 139
## 88 2012 US-36 31 20500 Crawfordsville 139
The
merge
statement in base R can perform the equivalent of inner and left joins, as well as right and full outer joins, which are unavailable in sqldf
.crashes_merge_roads <- merge(crashes, roads, by = c("Road"))
head(crashes_merge_roads)
## Road Year N_Crashes Volume District Length
## 1 Interstate 65 2000 95 74000 Greenfield 262
## 2 Interstate 65 1997 76 52000 Greenfield 262
## 3 Interstate 65 1998 90 58000 Greenfield 262
## 4 Interstate 65 1999 95 65000 Greenfield 262
## 5 Interstate 65 1991 25 40000 Greenfield 262
## 6 Interstate 65 1992 37 41000 Greenfield 262
tail(crashes_merge_roads)
## Road Year N_Crashes Volume District Length
## 83 US-40 2003 94 55200 Greenfield 150
## 84 US-40 2004 25 55300 Greenfield 150
## 85 US-40 2009 67 65000 Greenfield 150
## 86 US-40 2010 102 67000 Greenfield 150
## 87 US-40 2011 87 67500 Greenfield 150
## 88 US-40 2012 32 67500 Greenfield 150
crashes_merge_roads2 <- merge(crashes, roads, by = c("Road"), all.x = TRUE)
head(crashes_merge_roads2)
## Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994 21 21200 <NA> NA
## 2 Interstate 275 1995 28 23200 <NA> NA
## 3 Interstate 275 1996 22 20000 <NA> NA
## 4 Interstate 275 1997 27 18000 <NA> NA
## 5 Interstate 275 1998 21 19500 <NA> NA
## 6 Interstate 275 1999 22 21000 <NA> NA
tail(crashes_merge_roads2)
## Road Year N_Crashes Volume District Length
## 105 US-40 2003 94 55200 Greenfield 150
## 106 US-40 2004 25 55300 Greenfield 150
## 107 US-40 2009 67 65000 Greenfield 150
## 108 US-40 2010 102 67000 Greenfield 150
## 109 US-40 2011 87 67500 Greenfield 150
## 110 US-40 2012 32 67500 Greenfield 150
crashes_merge_roads3 <- merge(crashes, roads, by = c("Road"), all.y = TRUE)
head(crashes_merge_roads3)
## Road Year N_Crashes Volume District Length
## 1 Interstate 65 2000 95 74000 Greenfield 262
## 2 Interstate 65 1997 76 52000 Greenfield 262
## 3 Interstate 65 1998 90 58000 Greenfield 262
## 4 Interstate 65 1999 95 65000 Greenfield 262
## 5 Interstate 65 1991 25 40000 Greenfield 262
## 6 Interstate 65 1992 37 41000 Greenfield 262
tail(crashes_merge_roads3)
## Road Year N_Crashes Volume District Length
## 84 US-40 2004 25 55300 Greenfield 150
## 85 US-40 2009 67 65000 Greenfield 150
## 86 US-40 2010 102 67000 Greenfield 150
## 87 US-40 2011 87 67500 Greenfield 150
## 88 US-40 2012 32 67500 Greenfield 150
## 89 US-52 NA NA NA Crawfordsville 172
crashes_merge_roads4 <- merge(crashes, roads, by = c("Road"), all.x = TRUE,
all.y = TRUE)
head(crashes_merge_roads4)
## Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994 21 21200 <NA> NA
## 2 Interstate 275 1995 28 23200 <NA> NA
## 3 Interstate 275 1996 22 20000 <NA> NA
## 4 Interstate 275 1997 27 18000 <NA> NA
## 5 Interstate 275 1998 21 19500 <NA> NA
## 6 Interstate 275 1999 22 21000 <NA> NA
tail(crashes_merge_roads4)
## Road Year N_Crashes Volume District Length
## 106 US-40 2004 25 55300 Greenfield 150
## 107 US-40 2009 67 65000 Greenfield 150
## 108 US-40 2010 102 67000 Greenfield 150
## 109 US-40 2011 87 67500 Greenfield 150
## 110 US-40 2012 32 67500 Greenfield 150
## 111 US-52 NA NA NA Crawfordsville 172
Note how the order of the rows in the data frames were rearranged when using the
merge
statement.The
sqldf
statement can process SQLite commands, which include most of the standard syntax used in ANSI SQL, except for some of the join operations outlined previously mentioned. [SQLite.org 2012]Modifying the inner join query to include a
where
is the equivalent of combining merge
and subset
statements.join_string2 <- "select
crashes.*
, roads.District
, roads.Length
from crashes
inner join roads
on crashes.Road = roads.Road
where crashes.Road = 'US-40'"
crashes_join_roads4 <- sqldf(join_string2,stringsAsFactors = FALSE)
head(crashes_join_roads4)
## Year Road N_Crashes Volume District Length
## 1 1991 US-40 46 21000 Greenfield 150
## 2 1992 US-40 101 21500 Greenfield 150
## 3 1993 US-40 76 23000 Greenfield 150
## 4 1994 US-40 72 21000 Greenfield 150
## 5 1995 US-40 75 24000 Greenfield 150
## 6 1996 US-40 136 23500 Greenfield 150
tail(crashes_join_roads4)
## Year Road N_Crashes Volume District Length
## 17 2007 US-40 45 59500 Greenfield 150
## 18 2008 US-40 23 61000 Greenfield 150
## 19 2009 US-40 67 65000 Greenfield 150
## 20 2010 US-40 102 67000 Greenfield 150
## 21 2011 US-40 87 67500 Greenfield 150
## 22 2012 US-40 32 67500 Greenfield 150
Aggregation Functions and Limitations of sqldf
Aggregate functions available using SQLite can be used through the use of a group by
clause.group_string <- "select
crashes.Road
, avg(crashes.N_Crashes) as Mean_Crashes
from crashes
left join roads
on crashes.Road = roads.Road
group by 1"
sqldf(group_string)
## Road Mean_Crashes
## 1 Interstate 275 24.95
## 2 Interstate 65 107.82
## 3 Interstate 70 65.18
## 4 US-36 48.00
## 5 US-40 68.68
The available aggregation functions within SQLite or ANSI SQL are limited, however. While
sqldf
can make certain data manipulation operations easier, more advanced data manipulation tasks and calculations must be performed in R, such as using Hadley Wickham's plyr
package.ddply(crashes_merge_roads,
c("Road"),
function(X) data.frame(Mean_Crashes = mean(X$N_Crashes),
Q1_Crashes = quantile(X$N_Crashes, 0.25),
Q3_Crashes = quantile(X$N_Crashes, 0.75),
Median_Crashes = quantile(X$N_Crashes, 0.50))
)
## Road Mean_Crashes Q1_Crashes Q3_Crashes Median_Crashes
## 1 Interstate 65 107.82 63.25 140.25 108.5
## 2 Interstate 70 65.18 52.00 75.50 66.5
## 3 US-36 48.00 42.00 57.25 47.0
## 4 US-40 68.68 45.25 90.75 70.0
Download this document in R Markdown format.
References
- SQLCourse.com (2012). SQLCourse.com®: Interactive Online SQL Training. Link
- G. Grothendieck (2012). sqldf: Perform SQL Selects on R Data Frames. R package version 0.4-6.4. Link
- H. Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. Link
- North Carolina Department of Transportation (2012). Training Material for Traffic Engineering Accident Analysis System (TEAAS). Link
- SQLite.org (2012). SQL As Understood By SQLite. Link
- R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Link
I LOVE how you've done it in knitr!
ReplyDeleteThe query shown that does not use sqldf could have. sqldf automatically pulls in the RSQLite.extfuns library when using it with the default sqlite database so all that library's functions including lower_quantile, upper_quantile and median are available out of the box. See http://sqldf.googlecode.com/#Example_15._Use_of_RSQLite.extfuns_library_functions
ReplyDeleteAny chance you can share the datasets for reproducibility's sake?
ReplyDeleteI've uploaded the files so they can be downloaded - I embedded links within the post. Thanks!
DeleteHello
ReplyDeleteHow would you use it jointly with the xts or zoo packages?
For example I want to apply the aggregate zoo command to two 10GB time series (I want to use zoo because they have different timing or are non-homogeneous and I cannot just glue them together).
The problem is that I don't know how to use sqldf (or hadoop or ff...) as input for zoo.
Regards
Hello Skan and thanks for the question.
DeleteI have crafted a simple example that creates a zoo object based on the results returned by a call to sqldf (arbitrarily filtering a generated data.frame). I then aggregate over the zoo object using a coarser grained representation of the time variable (similar to the example provided in the zoo documentation).
#load required libraries
library(sqldf)
library(zoo)
#create some data
data=data.frame(val1=rnorm(12, mean=20),t=as.Date("2003-02-01") + c(1, 3, 7, 9, 11, 14, 17, 20, 22, 25, 28, 30) - 1)
#add a column to filter data with
data$val2=c(rep("a",6),rep("b",6))
#add a column with a coarser grained date for use with aggregate
data$year_qtr=as.yearqtr(data$t)
#arbitrarily query the data.frame using sqldf
filter.data=sqldf("select val1, val2, t, year_qtr from data where val2='b'")
#create a zoo object
the.zoo=zoo(filter.data$val1,filter.data$t)
#aggregate using zoo
aggregate(the.zoo,filter.data$year_qtr,mean)
A value similar to this:
2003 Q1
20.42594 should be returned
A call to sqldf returns a data.frame so you simply need to make sure you pass in the correct vector arguments to zoo in order to make it work. Hope that helps and good luck with those large files.
I use sqldf a lot and I am amazed at how simple and fast it operates on data frames. But I was very much surprised when I found that I couldn't use exponentiation.
ReplyDeleteThanks for the comment Raffael. Exponentiation is supported via the POWER function.
DeleteTry this example using the built-in dataset warpbreaks:
data(warpbreaks)
sqldf("select power(breaks,2) as 'breaks^2' from warpbreaks limit 6")
sqldf will return a new column named breaks^2 containing the squared value of breaks.
The functions supported by SQLite are here