Anything but R-bitrary: 2013

Tuesday, December 24, 2013

How fair is White Elephant?

This article was first published on analyze stuff. It has been contributed to Anything but R-bitrary in celebration of its introductory post.

By Max Ghenis
Welcome to analyze stuff! For our first post, I wanted to reflect on the time of year; after all, ‘tis the season for hams and yams, caroling and sledding, and of course gifts! One popular party gift exchange game is the White Elephant, where each person brings a wrapped (typically regifted or otherwise odd-ball) gift, and then picks one in order with the option of “stealing” another unwrapped gift. Relative to other matching problems, for example, the stable marriage problem or the secretary problem, both of which have elegant defined solutions, strikingly little research has been done on the fairness of this game (with perhaps one exception).

As an operations research guy, my tool of choice for these situations tends to be the discrete event simulation. This is a nice approach for understanding systems that are too complex to model closed-form, of which White Elephant is a perfect example. R's combination of statistical functions, random number generators, and encapsulation makes it an ideal language for performing discrete event simulations.

In this article, I use R to create a function representing a single turn, then a single game, and finally an entire simulation of 200,000 games (full code here). Afterward I analyze the results--both with R and interactive charts from Google Sheets--which brings to light a few things about the Yuletide activity (some may surprise you!).

Sample game outcome
Player 1 gets gift 3, 2 gets 6, and so on

The model

As there are many potential variations on the game, let’s start off with the particular rules I implemented:

Each round consists of a new player joining the game, based on a predetermined order
At each round, the player either steals an unwrapped gift or unwraps one herself
Upon someone else’s gift being stolen, the “stealee” makes the same choice of stealing vs. choosing. A round completes when someone unwraps a gift. While this part is consistent, the following rules are subject to variation:

No gift can be stolen more than once per round
No gift can be stolen more than three times total

The game ends when the final gift is unwrapped

I also made some assumptions about the players’ behavior:

Players will steal each round with probability p = (# stealable gifts) / (# stealable + wrapped gifts)
If players choose to steal, they will steal their favorite gift
Players’ preference for each gift is governed by averaging two uniformly--that is, U(0, 1)--distributed factors:

Each gift’s innate utility (players will partially agree that some gifts are better than others)
Preferences differing randomly across players

Gift utilities are completely unknown prior to unwrapping, i.e. I’m assuming players can’t tell the difference between a wrapped fruitcake and a wrapped Ferrari

In each simulation run, a new utility matrix is generated, and the game’s result is stored as the ultimate set of player-gift matches, along with the utilities and ranks each player associates with their gift. I ran 10,000 white elephant simulations for each of twenty different scenarios, representing player counts ranging from 1 to 20 (200,000 simulation runs in total, 2.1M rows for each player outcome). You can see the full simulation result here.

> result
         n.players player gift steals   utility rank run
      1:         1      1    1      0 0.5550705    1   1
      2:         1      1    1      0 0.5971866    1   1
      3:         1      1    1      0 0.4942946    1   1
      4:         1      1    1      0 0.6106313    1   1
      5:         1      1    1      0 0.7207114    1   1
     ---
2099996:        20     16   10      1 0.7030580   10  20
2099997:        20     17   17      0 0.5422361    9  20
2099998:        20     18   18      1 0.6655989    2  20
2099999:        20     19   20      1 0.7376923    4  20
2100000:        20     20   14      2 0.7768777    7  20

Tools

The simulation utilizes a few of my favorite R packages including data.table (used in many of the summaries below), plyr, reshape2, and parallel. Instead of R, I use Google Sheets to create and embed interactive charts easily (this spreadsheet contains all charts). While R is plenty capable of producing sophisticated visualizations, Google Sheets provides an unparalleled interface for making and sharing interactive charts in a familiar format to any analyst.

Results

Let’s begin the analysis by considering the six-player scenario. From here we can plot four metrics against the player order: average utility, average rank, and likelihood to get favorite and least favorite gifts.

current.n <- 6

# Note that result is a data.table, enabling these operations
result[n.players == current.n,
       list(mean.utility = mean(utility),
            mean.rank = mean(rank),
            pct.favorite.gift = sum(rank == 1) / .N,
            pct.least.favorite.gift = sum(rank == current.n) / .N),
       player]

While the first couple players obtain similar utility, the final player does significantly better. Relative to the first five players, they are 3.2% more likely to get their favorite gift, but also (unexpectedly) 0.6% more likely to get their least favorite gift.

Kernel density plots of utility reveal again that the final player has the most right-skewed distribution; their mode is 0.70, while the penultimate player's is 0.59.

This chart shows yet again how similarly the early players perform, particularly the first and second. The notion that the first player is so disadvantaged is so prevalent that allowing them an extra turn has probably become the most common variation. However, this simulation shows that this may worsen the inequity faced by the second player. This also makes sense intuitively: the second player has no advantage over the first since they see only the first gift, whose expected utility equals that of a random wrapped gift.

Varying the number of players

As one might expect, average utility per player increases with the number of players (recall that each gift has a mean utility of 0.5), along with average number of steals per gift:

While the first player gains utility as the player count increases, it’s only slight relative to the final player’s gains:

Ultimately, adding players both reduces inequality (as defined by the Gini coefficient) and efficiency (as defined by utility per turn taken).

Data powering these charts was generated with the following R code:

library(ineq) # For Gini
n.player.summary <-
  result[, list(mean.steals=mean(steals),
                mean.utility=mean(utility),
                mean.utility.first.player=
                mean(ifelse(player == 1, utility, NA), na.rm=T),
                mean.utility.final.player=
                mean(ifelse(player == n.players, utility, NA), na.rm=T),
                utility.per.turn=mean(utility / (steals + 1)),
                gini=ineq(utility, type="Gini")),
         n.players]

Optimizing the player count involves balancing efficiency against total utility and equity (though from personal experience, the more the merrier).

Finally, a linear regression model shows that both the number of players and each player’s turn have statistically significant relationships to utility (R2=0.058), though the order is 24x more important.

summary(lm(utility ~ n.players + player, data=result))

Call:
lm(formula = utility ~ n.players + player, data = result)

Residuals:
     Min       1Q   Median       3Q      Max
-0.70304 -0.12404  0.01915  0.14726  0.47971

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.147e-01  4.156e-04 1238.34   <2e-16 ***
n.players   -4.462e-04  3.309e-05  -13.48   <2e-16 ***
player       1.057e-02  3.309e-05  319.28   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2001 on 2099997 degrees of freedom
Multiple R-squared:  0.05847, Adjusted R-squared:  0.05847
F-statistic: 6.521e+04 on 2 and 2099997 DF,  p-value: < 2.2e-1

Closing thoughts

What I found most interesting here is that while larger parties have better outcomes, most of this benefit goes to the later players. The analysis has also spawned new questions:

Does an extra turn from player 1 affect overall utility, and equity of it? Since average utility rose and inequality fell as players were added, one might expect this extra turn to improve outcomes; however, such a rule may further diminish player 2’s already poor fortune.
How about other variations? For example, some games only prohibit immediate steal-backs, while allowing gifts to be stolen multiple times in the same turn. This could reduce the final player’s advantage, since the current rules mean they typically get their first or second choice.
How else could we evaluate fairness? Metrics like average utility and Gini coefficient tell part of the story; other alternatives could include comparisons to stable-match solutions as proposed by Gale-Shapley, or perhaps an efficiency measure more nuanced than raw utility per turn.
When should players steal? An analysis of optimal stealing strategy could begin with tweaking each player’s stealing probability. However, the more interesting model assumes players try to estimate the utility distribution of all gifts based on those they’ve seen so far. From this distribution, they can weigh the odds that the next gift will surpass their favorite unwrapped one, and decide to steal accordingly. If a player sees ten fruitcakes unwrapped and then a Ferrari, they can be pretty sure the Ferrari is an outlier that won’t be matched in the next gift.

White Elephant's mechanics ended up more complex and fascinating than initially expected, so I’ll be pursuing some of these questions in a follow-up post. Stay tuned for that and more in the new year, and happy gifting!

Max

Acknowledgements

Special thanks to Ben Ogorek for numerous suggestions and an extra set of eyes on the code.
Thanks to Mindy Greenberg for a thorough review (in particular, weaning me off my semicolon addiction).
Thanks to Bo Cowgill for introducing me to literature surrounding matching problems.

Resources

R code [GitHub]
Raw simulation result [Google Fusion Tables]
Spreadsheet with charts [Google Sheets]

Wednesday, March 27, 2013

Build a search engine in 20 minutes or less

…or your money back.

The

Edit 6/25/2016: In addition to a tutorial on basic text processing and information retrieval in R, this article is also a cautionary tale about forgoing modern document generation and version control; the reader will notice some inconsistencies between the output shown in the article vs in the R console.

Setup

We've got a collection of documents:


doc1 <- "Stray cats are running all over the place. I see 10 a day!"
doc2 <- "Cats are killers. They kill billions of animals a year."
doc3 <- "The best food in Columbus, OH is   the North Market."
doc4 <- "Brand A is the best tasting cat food around. Your cat will love it."
doc5 <- "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats."
doc6 <- "The Arnold Classic came to town this weekend. It reminds us to be healthy."
doc7 <- "I have nothing to say. In summary, I have told you nothing."

doc.list <- list(doc1, doc2, doc3, doc4, doc5, doc6, doc7)
N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", c(1:N.docs))

We also have an information need that is expressed via the following text query:


query <- "Healthy cat food"

We're going to use an old method that goes way back to the 1960's. Specifically, we'll implement the vector space model of information retrieval in R. In the process, you'll hopefully learn something about the tm package.

Text mining in R

Fundamentals of the tm package

If you have not installed the tm [1][2] and SnowballC [3] packages, please do so now.

install.packages("tm")
install.packages("SnowballC")

Load the tm package into memory.


library(tm)

In text mining and related fields, a corpus is a collection of texts, often with extensive manual annotation. Not surprisingly, the Corpus class is a fundamental data structure in tm.


my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")

my.corpus <- Corpus(my.docs)
my.corpus


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 8

Above we treated the query like any other document. It is, after all, just another string of text. Queries are not typically known a priori, but in the processing steps that follow, we will pretend like we knew ours in advance to avoid repeating steps.

Standardizing

One of the nice things about the Corpus class is the tm_map function, which cleans and standardizes documents within a Corpus object. Below are some of the transformations.


getTransformations()


[1] "removeNumbers"     "removePunctuation" "removeWords"
[4] "stemDocument"      "stripWhitespace"

First, let's get rid of punctuation.


my.corpus <- tm_map(my.corpus, removePunctuation)
content(my.corpus[[1]])


## Stray cats are running all over the place I see 10 a day

Suppose we don't want to count “cats” and “cat” as two separate words. Then we will use the stemDocument transformation to implement the famous Porter Stemmer algorithm. To use this particular transformation, first load the SnowballC package.


library(SnowballC)
my.corpus <- tm_map(my.corpus, stemDocument)
content(my.corpus[[1]])


## Stray cat are run all over the place I see 10 a day

Finally, remove numbers and any extra white space.


my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, stripWhitespace)
content(my.corpus[[1]])


## stray cat are run all over the place i see a day

We applied all these standardization techniques without much thought. For instance, we sacrificed inflection in favor of fewer words. But at least the transformations make sense on a heuristic level, much like the similarity concepts to follow.

The vector space model

Document similarity

Here's a trick that's been around for a while: represent each document as a vector in \( \mathcal{R}^N \) (with \( N \) as the number of words) and use the angle \( \theta \) between the vectors as a similarity measure. Rank by the similarity of each document to the query and you have a search engine.

One of the simplest things we can do is to count words within documents. This naturally forms a two dimensional structure, the term document matrix, with rows corresponding to the words and the columns corresponding to the documents. As with any matrix, we may think of a term document matrix as a collection of column vectors existing in a space defined by the rows. The query lives in this space as well, though in practice we wouldn't know it beforehand.


term.doc.matrix.stm <- TermDocumentMatrix(my.corpus)
colnames(term.doc.matrix.stm) <- c(names(doc.list), "query")
inspect(term.doc.matrix.stm[0:14, ])


<<TermDocumentMatrix (terms: 14, documents: 8)>>
Non-/sparse entries: 21/91
Sparsity           : 81%
Maximal term length: 8
Weighting          : term frequency (tf)

          Docs
Terms      doc1 doc2 doc3 doc4 doc5 doc6 doc7 query
  all         1    0    0    0    0    0    0     0
  and         0    0    0    0    1    0    0     0
  anim        0    1    0    0    0    0    0     0
  are         1    1    0    0    0    0    0     0
  arnold      0    0    0    0    0    1    0     0
  around      0    0    0    1    0    0    0     0
  best        0    0    1    1    0    0    0     0
  billion     0    1    0    0    0    0    0     0
  brand       0    0    0    1    2    0    0     0
  buy         0    0    0    0    1    0    0     0
  came        0    0    0    0    0    1    0     0
  cat         1    1    0    2    3    0    0     1
  classic     0    0    0    0    0    1    0     0
  columbus    0    0    1    0    0    0    0     0

Sparsity and storage of the term document matrix

The matrices in tm are of type Simple Triplet Matrix where only the triples \( (i, j, value) \) are stored for non-zero values. To work directly with these objects, you may use install the slam [4] package. We bear some extra cost by making the matrix “dense” (i.e., storing all the zeros) below.


term.doc.matrix <- as.matrix(term.doc.matrix.stm)

cat("Dense matrix representation costs", object.size(term.doc.matrix), "bytes.\n", 
    "Simple triplet matrix representation costs", object.size(term.doc.matrix.stm), 
    "bytes.")


## Dense matrix representation costs 6688 bytes.
##  Simple triplet matrix representation costs 5808 bytes.

Variations on a theme

In term.doc.matrix, the dimensions of the document space are simple term frequencies. This is fine, but other heuristics are available. For instance, rather than a linear increase in the term frequency \( tf \), perhaps \( \sqrt(tf) \) or \( \log(tf) \) would provide a more reasonable diminishing returns on word counts within documents.

Rare words can also get a boost. The word “healthy” appears in only one document, whereas “cat” appears in four. A word's document frequency \( df \) is the number of documents that contain it, and a natural choice is to weight words inversely proportional to their \( df \)s. As with term frequency, we may use logarithms or other transformations to achieve the desired effect.

The tm function weightTfIdf offers one variety of tfidf weighting, but below we build our own. Visit the Wikipedia page for the SMART Information Retrieval System for a brief history and a list of popular weighting choices.

Different weighting choices are often made for the query and the documents. For instance, Manning et al.'s worked example [5] uses \( idf \) weighting only for the query.

Choice and implementation

For both the document and query, we choose tfidf weights of \( (1 + \log_2(tf)) \times \log_2(N/df) \), which are defined to be \( 0 \) if \( tf = 0 \). Note that whenever a term does not occur in a specific document, or when it appears in every document, its weight is zero.


get.tf.idf.weights <- function(tf.vec) {
  # Computes tfidf weights from term frequency vector
  n.docs <- length(tf.vec)
  doc.frequency <- length(tf.vec[tf.vec > 0])
  weights <- rep(0, length(tf.vec))
  weights[tf.vec > 0] <- (1 + log2(tf.vec[tf.vec > 0])) * log2(n.docs/doc.frequency)
  return(weights)
}

# For a word appearing in 4 of 6 documents, occurring 1, 2, 3, and 6 times"
get.tf.idf.weights(c(1, 2, 3, 0, 0, 6))


[1] 0.5849625 1.1699250 1.5121061 0.0000000 0.0000000 2.0970686

Using apply, we run the tfidf weighting function on every row of the term document matrix. The document frequency is easily derived from each row by the counting the non-zero entries (not including the query).


tfidf.matrix <- t(apply(term.doc.matrix, 1,
                        FUN = function(row) {get.tf.idf.weights(row)}))
colnames(tfidf.matrix) <- colnames(term.doc.matrix)

tfidf.matrix[0:3, ]

    
Terms  doc1 doc2 doc3 doc4 doc5 doc6 doc7 query
  all     3    0    0    0    0    0    0     0
  and     0    0    0    0    3    0    0     0
  anim    0    3    0    0    0    0    0     0

Dot product geometry

A benefit of being in the vector space \( \mathcal{R}^N \) is the use of its dot product. For vectors \( a \) and \( b \), the geometric definition of the dot product is \( a \cdot b = \vert\vert a\vert\vert \, \vert\vert b \vert \vert \cos \theta \), where \( \vert\vert \cdot \vert \vert \) is the euclidean norm (the root sum of squares) and \( \theta \) is the angle between \( a \) and \( b \).

In fact, we can work directly with the cosine of \( \theta \). For \( \theta \) in the interval \( [-\pi, -\pi] \), the endpoints are orthogonality (totally unrelated documents) and the center, zero, is complete collinearity (maximally similar documents). We can see that the cosine decreases from its maximum value of \( 1.0 \) as the angle departs from zero in either direction.


angle <- seq(-pi, pi, by = pi/16)
plot(cos(angle) ~ angle, type = "b", xlab = "angle in radians",
     main = "Cosine similarity by angle")

plot of chunk unnamed-chunk-16

We may furthermore normalize each column vector in our tfidf matrix so that its norm is one. Now the dot product is \( \cos \theta \).


tfidf.matrix <- scale(tfidf.matrix, center = FALSE,
                      scale = sqrt(colSums(tfidf.matrix^2)))
tfidf.matrix[0:3, ]

    
Terms       doc1      doc2 doc3 doc4      doc5 doc6 doc7 query
  all  0.3625797 0.0000000    0    0 0.0000000    0    0     0
  and  0.0000000 0.0000000    0    0 0.3558476    0    0     0
  anim 0.0000000 0.3923672    0    0 0.0000000    0    0     0

Matrix multiplication: a dot product machine

Treating the query as just another document kept things simple for this article, though in a production system the query will effectively come from a different corpus (see @Lorien's comment below). Now it's time to split them up.


query.vector <- tfidf.matrix[, (N.docs + 1)]
tfidf.matrix <- tfidf.matrix[, 1:N.docs]

With the query vector and the set of document vectors in hand, it is time to go after the cosine similarities. These are simple dot products as our vectors have been normalized to unit length.

Recall that matrix multiplication is really just a sequence of vector dot products. The matrix operation below returns values of \( \cos \theta \) for each document vector and the query vector.


doc.scores <- t(query.vector) %*% tfidf.matrix

With scores in hand, rank the documents by their cosine similarities with the query vector.


results.df <- data.frame(doc = names(doc.list), score = t(doc.scores),
                         text = unlist(doc.list))
results.df <- results.df[order(results.df$score, decreasing = TRUE), ]

The results

How did our search engine do?


options(width = 2000)
print(results.df, row.names = FALSE, right = FALSE, digits = 2)

                                                                     
 doc  score text                                                                      
 doc5 0.267 Buy Brand C cat food for your cat. Brand C makes healthy and happy cats.  
 doc4 0.143 Brand A is the best tasting cat food around. Your cat will love it.       
 doc6 0.132 The Arnold Classic came to town this weekend. It reminds us to be healthy.
 doc3 0.090 The best food in Columbus, OH is   the North Market.                      
 doc2 0.032 Cats are killers. They kill billions of animals a year.                   
 doc1 0.030 Stray cats are running all over the place. I see 10 a day!                
 doc7 0.000 I have nothing to say. In summary, I have told you nothing.

Our “best” document, at least in an intuitive sense, comes out ahead with a score nearly twice as high as its nearest competitor. The second highest ranked document is still about cat food, and the profoundly uninformative document 7 has been ranked dead last.

Discussion

Though tfidf weighting and the vector space model may now be old school, its core concepts are still used in industrial search solutions built using Lucene. In more modern (and statistical) approaches based on probabilistic language modeling, documents are ranked by the probability that their underlying language model produced the query [6]. While there's nothing inherently statistical about the vector space model, a link to probabilistic language modeling has been demonstrated [7].

I hope you've enjoyed exploring the tm package and implementing classic ideas from the information retrieval.

Acknowledgements

The markdown [8] and knitr [9] packages, in conjunction with RStudio's IDE [10], were used to create this document. Thanks to Chris Nicholas and Shannon Terry for their comments and feedback. I first learned about information retrieval in Coursera's Stanford Natural Language Processing course taught by Dan Jurafsky and Christopher Manning. Keep up with ours and other great articles on R-Bloggers .

References

Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.
Kurt Hornik (2013). Snowball: Snowball Stemmers. R package version 0.0-8. http://CRAN.R-project.org/package=Snowball
Kurt Hornik, David Meyer and Christian Buchta (2013). slam: Sparse Lightweight Arrays and Matrices. R package version 0.1-28. http://CRAN.R-project.org/package=slam
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press. 2008. URL: http://www-nlp.stanford.edu/IR-book/
Hugo Zaragoza, Djoerd Hiemstra, and Michael Tipping. “Bayesian extension to the language model for ad hoc information retrieval.” Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2003. URL
Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. No. CMU-CS-96-118. Carnegie-Mellon University of Pittsburgh, PA. Department of Computer Science, 1996.
JJ Allaire, Jeffrey Horner, Vicent Marti and Natacha Porte (2012). markdown: Markdown rendering for R. R package version 0.5.3. http://CRAN.R-project.org/package=markdown
Yihui Xie (2012). knitr: A general-purpose package for dynamic report generation in R. R package version 0.6. http://CRAN.R-project.org/package=knitr
RStudio IDE for Windows. URL http://www.rstudio.com/ide/
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: http://www.R-project.org/.