Monday, February 10, 2014

Three ways to call C/C++ from R

By Ben Ogorek


I only recently discovered the fundamental connection between the C and R languages. It was during a Bay Area useR Group meeting, where presenter J.J. Allaire shared two points to motivate his talk on Rcpp. The first explained just how much of modern R really is C and C++. For illustration, he used the librestats language composition analysis of Core R, which might not have been so interesting if it did not extend to R's contributed packages. Allaire's second point emphasized R's roots as an interactive mechanism for calling code from compiled languages like C and Fortran (which you can also hear from S inventor John Chambers). While I didn't understand all of the finer points at the time, I never thought of R in the same way again.

Wanting to learn more about this connection between languages, I dug into the C calling paradigms of R. My own lack of experience was mitigated by adopting a tremendously limited scope: the function f(x)= 2x. In the course of its implementation, I received advice from respected voices, each highlighting important points about the alternatives. In this article, I'll share what I learned as concisely as possible. So if you have a passing familiarity with C and an interest in R, grab a cup of coffee, pop open a terminal, and prepare to explore .C, .Call, and Rcpp's sourceCpp.

The big three

The .C function interface

Of R's native functions, the .C interface is the simplest but also the most limited way to call C from R. Inside a running R session, the .C interface allows objects to be directly accessed in an R session's active memory. Thus, to write a compatible C function, all arguments must be pointers. No matter the nature of your function's return value, it too must be handled using pointers. The C function you will write is effectively a subroutine.

Our function f(x)= 2x, implemented as double_me in the file doubler.c, is shown below.
void double_me(int* x) {
  // Doubles the value at the memory location pointed to by x
  *x = *x + *x;
To compile the C code, run the following line at your terminal:
$ R CMD SHLIB doubler.c 
In an R interactive session, run:
.C("double_me", x = as.integer(5))
[1] 10
Notice that the output of .C is a list with names corresponding to the arguments. While the above code is pure C, adding C++ code (instead of C) is made possible by using the extern wrapper.


The .Call interface is the more fully featured and complex cousin of the .C interface. Unlike .C, .Call requires header files that come standard with every R installation. These header files provide access to a new data type, SEXP. The following code, stored in the file, doubler2.c, illustrates its use.
#include <R.h>
#include <Rdefines.h>
SEXP double_me2(SEXP x) {
  // Doubles the value of the first integer element of the SEXP input
  SEXP result;
  PROTECT(result = NEW_INTEGER(1)); // i.e., a scalar quantity
  INTEGER(result)[0] = INTEGER(x)[0] * 2;
  UNPROTECT(1); // Release the one item that was protected
  return result;
Unlike our experience with the .C interface, double_me2 is a function and does return a value. While that appeals to intuition, no matter what the native input and output types, they must now live in a SEXP object. To code double_me2, you must know that there's an integer in the input x, and extract it as if it were the first item in a C array. For the return value, you must add your integer result to a SEXP object in an equally unnatural way. The PROTECT function must be used to prevent R's automatic garbage collection from destroying all the objects.

As before, use R at the command line to compile doubler2.c:
$ R CMD SHLIB doubler2.c 
Back in the R interactive console, the steps are very similar.
.Call("double_me2", as.integer(5))
[1] 10
Notice now that the output is an integer vector instead of a list.

Rcpp and the sourceCpp function

The .C and .Call examples above owe a debt to Jonathan Callahan's entries 8 and 10 of his Using R series. When the examples started working, I tweeted to share my excitement. An hour later, I saw a familiar face:

Let's check it out.

In terms of the code alone, it's easy to see where Hadley is coming from. It's readable, looks just like standard C++ code, and features data types that make intuitive sense. Our simple function is implemented below, saved in the final static file doubler3.cpp (though, in all humility, it's really just C).
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int double_me3(int x) {
  // takes a numeric input and doubles it
  return 2 * x;
I'll refer you to Hadley's article High performance functions with Rcpp for details on Rcpp, but for now, note the "// [[Rcpp::export]]" comment, necessary before each C/C++ function, and the updated #include statement. Most importantly, notice how the pointers and SEXP objects have been replaced. Just like our original function f(x), double_me3 takes one integer input and returns one integer output.

After installing the Rcpp package, we're back to the console one final time.
[1] 10
With Rcpp, the function is waiting for us in the global environment, without even compiling at the command line. Pretty convenient!


With the disclaimer that I am a C++ novice, I summarize my thoughts on the three options below.

I like the simplicity of the C code written for the .C interface. It doesn't rely on external header files and it is possible to test using a C compiler alone. On the other hand, I don't like that a function has to be morphed into a subroutine that uses pointers.

In the .Call interface, SEXP objects are also pointers, though that is perhaps superfluous. My biggest complaint is the verboseness that was added to our short example. As Jonathan Callahan points out, .Call "requires much more knowledge of R internals [but] is the recommended, modern approach for serious C [and C++] programmers."

After seeing Rccp in action, it's not hard to understand why Hadley sent me directly to Rcpp. The C code looks great, there were fewer steps, and our function was ready for us inside the R global environment.

Perhaps you're wondering if there is any reason not to use Rcpp. According Murray Stokely, Google Software Engineer, it could be risky on a very large project. "Rcpp's heavy reliance on C macros can make it unsafe to use with large code bases," says Stokely. For example, the Rcpp FAQ (Section 3.9) describes an unresolved issue casting 64-bit integer types. The implication, Stokely explains, is that a loss of precision could occur without any errors or warnings. The Rcpp FAQ considers such examples "corner cases," and perhaps the typical user will not have to worry.

Whatever your decision, I wish you the best on your C++ journey!


Thanks to Max Ghenis and Mindy Greenberg for expedited proofreadings. Keep up with ours and other articles featuring the R language on R-bloggers.

Monday, February 3, 2014

NLSdata: an R package for National Longitudinal Surveys

This article was first published on analyze stuff. It has been contributed to Anything but R-bitrary as the third article in its introductory series.

By Ben Ogorek


Alongside interstate highways, national defense, and social security, your tax dollars are used to collect data. Sometimes it’s high profile and relevant, like the census or NSA’s controversial PRISM surveillance program. Other times it’s just high profile, like the three-billion dollar brain dataset that nobody has figured out how to use. Then there are the lower profile data sets, studied by researchers, available to the public, but with enough barriers to discourage the data hobbyist.

In this article, we’ll work with the NLSY97, a National Longitudinal Survey (NLS) that follows approximately 9,000 youths, documenting "the transition from school to work and into adulthood." The core areas of the survey are education and employment, but subject areas also include relationships with parents, marital and fertility histories, dating, life expectations, and criminal behavior. The NLS data sets are rich, complex, and full of insights on the “making” of a human being.

In this article I introduce NLSdata, an R package facilitating the analysis of National Longitudinal Survey data. I'll discuss the NLS Investigator, a web utility offered by the BLS to download a packet, and demonstrate how the NLSdata package helps to extract value from it. This culminates in the analysis of the belief, "I don't need religion to have good values," and how it has changed with age.

The data

For providing National Longitudinal Survey to the public, the Bureau of Labor Statistics provides an online tool, the NLS Investigator. This article will refer to included files within the Investigator subdirectory of the NLSdata package, but information on how to pull additional data from the NLS Investigator is provided in the Appendix.

While you don’t need to visit the NLS Investigator to run the following examples, the files it provides are the NLSdata package’s raison d’etre and deserve a brief discussion. There are two key files: a structured data file and a semi-structured metadata file called a "codebook."

The data file is in a .csv format and can contain thousands of columns. While it is highly structured, it will probably be unintelligible (see the image below for an example).

The codebook is a text file with a .cdb extension. While it is intelligible, it is only semi-structured (below is an excerpt for one of the training variables).

The NLSdata package

The NLSdata package is currently on GitHub. The easiest way to install it is with the devtools package.
After installation, load in the library and read in the codebook and data files in NLSdata’s Investigator folder.
codebook <- system.file("Investigator", "Religion.cdb", package = "NLSdata")
csv.extract <- system.file("Investigator", "Religion.csv", package = "NLSdata")
The first key step is to create an "NLSdata" object using the CreateNLSdata constructor function. You will then supply the filepath for the codebook and the data. For now, I'll note that there are missing data issues out of the scope of this introduction.
nls.obj <- CreateNLSdata(codebook, csv.extract)
[1] "NLSdata"
[1] "metadata" "data" 
The element "data" is a data frame with variable names that correspond to those in the codebook, but with the survey year appended. Factor labels are added for categorical factors and logical factors have been properly converted. The unique key will always be respondent identifier PUBID.1997.
head(nls.obj$data[order(nls.obj$data$PUBID.1997), c(2, 8, 9, 11)])
     PUBID.1997 YSAQ_282A2.2002 YSAQ_282A2.2005 YSAQ_282A2.2008
5203          1           FALSE            TRUE            TRUE
1918          2            TRUE            TRUE           FALSE
6687          3            TRUE              NA              NA
3682          4            TRUE            TRUE            TRUE
1730          5            TRUE           FALSE           FALSE
2838          6            TRUE           FALSE           FALSE
The metadata element is a list that contains information about each variable in the data set, including the original codebook chunk.
[1] "YSAQ_282A2.2005"


[1] "2005"

[1] "S6316800"

 [1] "S63168.00    [YSAQ-282A2]                              Survey Year: 2005"
 [7] "I don't need religion to have good values."
 [9] "UNIVERSE: All"
(The element "chunk" has been modified for presentation.)

Attitudes about religion through time

The YSAQ_282A2 values indicate each respondent's answer to the question, "I don't need religion to have good values," but the wide format makes it awkward. Thus the NLSdata package provides the function CreateTimeSeriesDf to coerce a portion of the data frame into a long format. Under the hood, it’s using the reshape2's melt function.
religion.df <- CreateTimeSeriesDf(nls.obj, "YSAQ_282A2")
head(religion.df[order(religion.df$PUBID.1997), ])
      PUBID.1997 YSAQ_282A2 year
5203           1      FALSE 2002
14187          1       TRUE 2005
23171          1         NA 2006
32155          1       TRUE 2008
1918           2       TRUE 2002
10902          2       TRUE 2005
Looking at the cell counts, we can see something is strange with the year 2006.
(cell.counts <- with(religion.df, table(YSAQ_282A2, year)))
YSAQ_282A2 2002 2005 2006 2008
     FALSE 4014 3306    0 3213
     TRUE  3829 3679    2 3970
Since only two people answered the question that year, I’m choosing to exclude it from the analysis.
religion.df <- religion.df[religion.df$year != 2006, ]
For the purpose of this article, I want both simplicity and protection from obvious confounding. Thus I'll enforce balance over respondents and years via the ThrowAwayDataForBalance function, which keeps records for respondents who answered a given question in every observed year.
religion.df <- ThrowAwayDataForBalance(religion.df, "YSAQ_282A2")
2002 2005 2008 
6013 6013 6013 
1 2 4 5 6 9 
3 3 3 3 3 3
After deletions, the final cell counts are shown below:
(cell.counts <- with(religion.df, table(YSAQ_282A2, year)))
YSAQ_282A2 2002 2005 2008
     FALSE 3077 2882 2682
     TRUE  2936 3131 3331
We can test whether there is evidence of a changing response distribution by a Chi Square test for association.
        Pearson's Chi-squared test

data:  cell.counts
X-squared = 51.9902, df = 2, p-value = 5.134e-12
There is strong evidence that the likelihood of answering "yes" is not constant over time. Recall that these are the same people in all three years.

And here are the actual proportions.
(proportions <- aggregate(religion.df$YSAQ_282A2, 
                          by  = list(year = religion.df$year),
                          FUN = mean, na.rm = TRUE))
  year         x
1 2002 0.4882754
2 2005 0.5207051
3 2008 0.5539664
In the plot below, I chose a range of 0.40 to 0.65 for the agreement proportion. I believe this is a range which, if covered, would represent a meaningful shift in attitudes.
plot(x ~ year, data = proportions, ylim = c(0.4, 0.65), type = "b",
     ylab = "Proportion agreeing with statement",
     main = 'Belief: "I don\'t need religion to have good values"')


The NLSdata package is still very much a work in progress, and I fully expect that certain untested variables available from the NLS Investigator will cause problems. Reports of such occurrences and general feedback is encouraged and appreciated. The NLS data set itself has many interesting variables. With minimal effort, we determined that beliefs about religion and values changed through time. Many more interesting relationships, those involving education, training, employment, relationships, and even religion, wait to be uncovered. I hope that NLSdata makes the search for these relationships easier and more accessible.


  • Thanks to Max Ghenis, for convincing me to analyze real data and not the output of rnorm(100). He also provided multiple suggestions that improved this article.
  • Thanks to Mindy Greenberg, who deserves a title like "Editor in Chief" for her many contributions to style and readability.

Appendix: using the NLS Investigator

Navigate your browser to the NLS Investigator homepage and sign up for an account. Once you've logged in, you'll be able to choose a survey. In this article, I'm working with NLSY97.
  • the required PubID identifier, which serves as a primary key
  • demographic variables (gender, ethnicity)
  • survey metadata (e.g., release version)
There are literally thousands more under the "Variable Search" tab.

The many variables is due to a massive flattening of the data. The primary ways in which this flattening occurs is summarized below:
  • Each survey round (year) gets a new variable name even with the same question text.
  • Questions are tweaked for different subpopulations, called “universes,” and are repeated.
  • Nested lists called “rosters” add multiple columns for collections like jobs (job1, job2, ..., job 9).
The Investigator’s documentation explicitly warns, “Do not presume that two variables with the same or similar titles necessarily have the same (1) universe of respondents or (2) coding categories or (3) time reference period.” After selecting variables, click the tabs “Save / Download” and then “Advanced Download.” Make sure to select the codebook and the comma-delimited data file.

Monday, January 20, 2014

Statistics meets rhetoric: A text analysis of "I Have a Dream" in R

This article was first published on analyze stuff. It has been contributed to Anything but R-bitrary as the second article in its introductory series.

By Max Ghenis

Today, we celebrate the would-be 85th birthday of Martin Luther King, Jr., a man remembered for pioneering the civil rights movement through his courage, moral leadership, and oratory prowess. This post focuses on his most famous speech,
I Have a Dream [YouTube | text] given on the steps of the Lincoln Memorial to over 250,000 supporters of the March on Washington. While many have analyzed the cultural impact of the speech, few have approached it from a natural language processing perspective. I use R’s text analysis packages and other tools to reveal some of the trends in sentiment, flow (syllables, words, and sentences), and ultimately popularity (Google search volume) manifested in the rhetorical masterpiece.


Word clouds are somewhat controversial among data scientists: some see them as overused and cliche, while others find them a useful exploratory tool, particularly for connecting with a less analytical audience. I consider them a fun and useful starting point, so I started off by throwing the speech’s text into Wordle.

R also has a
wordcloud package, though it’s hard to beat Wordle on looks.
# Load raw data, stored at
speech.raw <- paste(scan(url(""), 
                         what="character"), collapse=" ")

wordcloud(speech.raw) # Also takes other arguments like color

Calculating textual metrics

The qdap package provides functions for text analysis, which I use to split sentences, count syllables and words, and estimate sentiment and readability. I also use the data.table package to organize the sentence-level data structure.

# Split into sentences
# qdap's sentSplit is modeled after dialogue data, so person field is needed
speech.df <- data.table(speech=speech.raw, person="MLK")
sentences <- data.table(sentSplit(speech.df, "speech"))
# Add a sentence counter and remove unnecessary variables
sentences[, sentence.num := seq(nrow(sentences))]
sentences[, person := NULL]
sentences[, tot := NULL]
setcolorder(sentences, c("sentence.num", "speech"))

# Syllables per sentence
sentences[, syllables := syllable.sum(speech)]
# Add cumulative syllable count and percent complete as proxy for progression
sentences[, syllables.cumsum := cumsum(syllables)]
sentences[, pct.complete := syllables.cumsum / sum(sentences$syllables)]
sentences[, pct.complete.100 := pct.complete * 100]
qdap’s sentiment analysis is based on a sentence-level formula classifying each word as either positive, negative, neutral, negator or amplifier, per Hu & Liu’s sentiment lexicon. The function also provides a word count.
pol.df <- polarity(sentences$speech)$all
sentences[, words := pol.df$wc]
sentences[, pol := pol.df$polarity]
A scatterplot hints that polarity increases throughout the speech; that is, the sentiment gets more positive.
with(sentences, plot(pct.complete, pol))

Cleaning up the plot and adding a LOESS smoother clarifies this trend, particularly the peak at the end.

my.theme <- 
  theme(plot.background = element_blank(), # Remove background
        panel.grid.major = element_blank(), # Remove gridlines
        panel.grid.minor = element_blank(), # Remove more gridlines
        panel.border = element_blank(), # Remove border
        panel.background = element_blank(), # Remove more background
        axis.ticks = element_blank(), # Remove axis ticks
        axis.text=element_text(size=14), # Enlarge axis text font
        axis.title=element_text(size=16), # Enlarge axis title font
        plot.title=element_text(size=24, hjust=0)) # Enlarge, left-align title

CustomScatterPlot <- function(gg)
  return(gg + geom_point(color="grey60") + # Lighten dots
           stat_smooth(color="royalblue", fill="lightgray", size=1.4) + 
           xlab("Percent complete (by syllable count)") + 
           scale_x_continuous(labels = percent) + my.theme)

CustomScatterPlot(ggplot(sentences, aes(pct.complete, pol)) +
                    ylab("Sentiment (sentence-level polarity)") + 
                    ggtitle("Sentiment of I Have a Dream speech"))

Through the variation, the trendline reveals two troughs (calls to action, if you will) along with the increasing positivity.

Readability tests
are typically based on syllables, words, and sentences in order to approximate the grade level required to comprehend a text. qdap offers several of the most popular formulas, of which I chose the Automated Readability Index.
sentences[, readability := automated_readability_index(speech, sentence.num)
By graphing similarly to the above polarity chart, I show readability to be mostly constant throughout the speech, though it varies within each section. This makes sense, as one generally avoids too many simple or complex sentences in a row.
CustomScatterPlot(ggplot(sentences, aes(pct.complete, readability)) +
                    ylab("Automated Readability Index") +
                    ggtitle("Readability of I Have a Dream speech"))

Scraping Google search hits

Google search results can serve as a useful indicator of public opinion, if you know what to look for. Last week I had the pleasure of meeting Seth Stephens-Davidowitz, a fellow analyst at Google who has used search data to research several topics, such as quantifying the effect of racism on the 2008 presidential election (Obama did worse in states with higher racist query volume). There’s a lot of room for exploring historically difficult topics with this data, so I thought I’d use it to identify the most memorable pieces of I Have a Dream.
Fortunately, I was able to build off of a function from theBioBucket’s blog post to count Google hits for a query.
GoogleHits <- function(query){
  url <- paste0("", gsub(" ", "+", query))
  CAINFO = paste0(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt")
  script <- getURL(url, followlocation=T, cainfo=CAINFO)
  doc <- htmlParse(script)
  res <- xpathSApply(doc, '//*/div[@id="resultStats"]', xmlValue)
  return(as.numeric(gsub("[^0-9]", "", res)))
From there I needed to pass each sentence to the function, stripped of punctuation and grouped in brackets, and with “mlk” added to ensure it related to the speech.
sentences[, google.hits := GoogleHits(paste0("[", gsub("[,;!.]", "", speech), 
                                             "] mlk"))]
A quick plot reveals that there’s a huge difference between the most-quoted sentences and the rest of the speech, particularly the top seven (though really six as one is a duplicate). Do these top sentences align with your expectations?
ggplot(sentences, aes(pct.complete, google.hits / 1e6)) +
  geom_line(color="grey40") + # Lighten dots
  xlab("Percent complete (by syllable count)") + 
  scale_x_continuous(labels = percent) + my.theme +
  ylim(0, max(sentences$google.hits) / 1e6) +
  ylab("Sentence memorability (millions of Google hits)") +
  ggtitle("Memorability of I Have a Dream speech")
head(sentences[order(-google.hits)]$speech, 7)
[1] "free at last!"
[2] "I have a dream today."
[3] "I have a dream today."
[4] "This is our hope."
[5] "And if America is to be a great nation this must become true."
[6] "I say to you today, my friends, so even though we face the difficulties of today and tomorrow, I still have a dream."
[7] "We cannot turn back."
Plotting Google hits on a log scale reduces skew and allows us to work on a ratio scale.
sentences[, := log(google.hits)]

CustomScatterPlot(ggplot(sentences, aes(pct.complete, +
                    ylab("Memorability (log of sentence's Google hits)") +
                    ggtitle("Memorability of I Have a Dream speech"))

What makes a passage memorable? A linear regression approach

With several metrics for each sentence, along with the natural outcome variable of log(Google hits), I ran a linear regression to determine what makes a sentence memorable. I pruned the regressor list using the stepAIC backward selection technique, which minimizes the Akaike Information Criterion and leads to a more parsimonious model. Finally, based on preliminary model results, I added polynomials of readability and excluded word count, syllable count, and syllables per word (readability is largely based on these factors).
library(MASS) # For stepAIC
google.lm <- stepAIC(lm(log(google.hits) ~ poly(readability, 3) + pol +
                          pct.complete.100, data=sentences))
stepAIC returns the optimal model, which can be summarized like any lm object.
lm(formula = log(google.hits) ~ poly(readability, 3) + pct.complete.100, 
    data = sentences)

    Min      1Q  Median      3Q     Max 
-4.2805 -1.1324 -0.3129  1.1361  6.6748 

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            11.444037   0.405247  28.240  < 2e-16 ***
poly(readability, 3)1 -12.670641   1.729159  -7.328 1.75e-10 ***
poly(readability, 3)2   8.187941   1.834658   4.463 2.65e-05 ***
poly(readability, 3)3  -5.681114   1.730662  -3.283  0.00153 ** 
pct.complete.100        0.013366   0.006848   1.952  0.05449 .  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.729 on 79 degrees of freedom
Multiple R-squared:  0.5564, Adjusted R-squared:  0.534 
F-statistic: 24.78 on 4 and 79 DF,  p-value: 2.605e-13
Four significant regressors explained 56% of the variance (R2=0.5564): a third degree polynomial of readability, along with pct.complete.100 (where the sentence was in the speech); polarity was not significant. 

The effect of pct.complete can be calculated by exponentiating the coefficient, since I log-transformed the outcome variable:
This result can be interpreted as the following: a 1% increase in the location of a sentence in the speech was associated with a 1.3% increase in search hits.
Interpreting the effect of readability is not as straightforward, since I included polynomials. Rather than compute an average effect, I graphed predicted Google hits for values of readability's observed range, holding pct.complete.100 at its mean. <- data.frame(readability=seq(min(sentences$readability), 
                                       max(sentences$readability), by=0.1),
                       pct.complete.100=mean(sentences$pct.complete.100))$pred.hits <- predict(google.lm,

ggplot(, aes(readability, pred.hits)) + 
  geom_line(color="royalblue", size=1.4) + 
  xlab("Automated Readability Index") +
  ylab("Predicted memorability (log Google hits)") +
  ggtitle("Predicted memorability ~ readability") +
This cubic relationship indicates that predicted memorability falls considerably until about grade level 10, at which point it levels off (very few passages have readability exceeding 25).


R tools from qdap to ggplot2 have uncovered some of MLK’s brilliance in I Have a Dream:

  • The speech starts and (especially) ends on a positive note, with a positive middle section filled with two troughs to vary the tone.
  • While readability/complexity varies considerably within each small section, the overall level is fairly consistent throughout the speech.
  • Readability and placement were the strongest drivers of memorability (as quantified by Google hits): sentences below grade level 10 were more memorable, as were those occurring later in the speech.

To a degree, these were intuitive findings--the ebb and flow of intensity and sentiment is a powerful rhetorical device. While we may never be able to fully deconstruct the meaning of this speech, techniques explored here can provide brief insight into the genius of MLK and the power of his message.

Thanks for reading, and enjoy your MLK day!


  • Special thanks to Ben Ogorek for guidance on some of the statistics here, and for a thorough review.
  • Special thanks to Mindy Greenberg for reviewing and always pushing my boundaries of conciseness and clarity.
  • Thanks to Josh Kraut for offering a ggplot2 lesson at work, inspiring me to use it here.