Wednesday, March 27, 2013

Build a search engine in 20 minutes or less

…or your money back.

The

Edit 6/25/2016: In addition to a tutorial on basic text processing and information retrieval in R, this article is also a cautionary tale about forgoing modern document generation and version control; the reader will notice some inconsistencies between the output shown in the article vs in the R console.

Setup

We've got a collection of documents:


doc1 <- "Stray cats are running all over the place. I see 10 a day!"
doc2 <- "Cats are killers. They kill billions of animals a year."
doc3 <- "The best food in Columbus, OH is   the North Market."
doc4 <- "Brand A is the best tasting cat food around. Your cat will love it."
doc5 <- "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats."
doc6 <- "The Arnold Classic came to town this weekend. It reminds us to be healthy."
doc7 <- "I have nothing to say. In summary, I have told you nothing."

doc.list <- list(doc1, doc2, doc3, doc4, doc5, doc6, doc7)
N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", c(1:N.docs))

We also have an information need that is expressed via the following text query:


query <- "Healthy cat food"

We're going to use an old method that goes way back to the 1960's. Specifically, we'll implement the vector space model of information retrieval in R. In the process, you'll hopefully learn something about the tm package.

Text mining in R

Fundamentals of the tm package

If you have not installed the tm [1][2] and SnowballC [3] packages, please do so now.

install.packages("tm")
install.packages("SnowballC")

Load the tm package into memory.


library(tm)

In text mining and related fields, a corpus is a collection of texts, often with extensive manual annotation. Not surprisingly, the Corpus class is a fundamental data structure in tm.


my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")

my.corpus <- Corpus(my.docs)
my.corpus

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 8

Above we treated the query like any other document. It is, after all, just another string of text. Queries are not typically known a priori, but in the processing steps that follow, we will pretend like we knew ours in advance to avoid repeating steps.

Standardizing

One of the nice things about the Corpus class is the tm_map function, which cleans and standardizes documents within a Corpus object. Below are some of the transformations.


getTransformations()

[1] "removeNumbers"     "removePunctuation" "removeWords"
[4] "stemDocument"      "stripWhitespace"

First, let's get rid of punctuation.


my.corpus <- tm_map(my.corpus, removePunctuation)
content(my.corpus[[1]])

## Stray cats are running all over the place I see 10 a day

Suppose we don't want to count “cats” and “cat” as two separate words. Then we will use the stemDocument transformation to implement the famous Porter Stemmer algorithm. To use this particular transformation, first load the SnowballC package.


library(SnowballC)
my.corpus <- tm_map(my.corpus, stemDocument)
content(my.corpus[[1]])

## Stray cat are run all over the place I see 10 a day

Finally, remove numbers and any extra white space.


my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, stripWhitespace)
content(my.corpus[[1]])

## stray cat are run all over the place i see a day

We applied all these standardization techniques without much thought. For instance, we sacrificed inflection in favor of fewer words. But at least the transformations make sense on a heuristic level, much like the similarity concepts to follow.

The vector space model

Document similarity

Here's a trick that's been around for a while: represent each document as a vector in \( \mathcal{R}^N \) (with \( N \) as the number of words) and use the angle \( \theta \) between the vectors as a similarity measure. Rank by the similarity of each document to the query and you have a search engine.

One of the simplest things we can do is to count words within documents. This naturally forms a two dimensional structure, the term document matrix, with rows corresponding to the words and the columns corresponding to the documents. As with any matrix, we may think of a term document matrix as a collection of column vectors existing in a space defined by the rows. The query lives in this space as well, though in practice we wouldn't know it beforehand.


term.doc.matrix.stm <- TermDocumentMatrix(my.corpus)
colnames(term.doc.matrix.stm) <- c(names(doc.list), "query")
inspect(term.doc.matrix.stm[0:14, ])

<<TermDocumentMatrix (terms: 14, documents: 8)>>
Non-/sparse entries: 21/91
Sparsity           : 81%
Maximal term length: 8
Weighting          : term frequency (tf)

          Docs
Terms      doc1 doc2 doc3 doc4 doc5 doc6 doc7 query
  all         1    0    0    0    0    0    0     0
  and         0    0    0    0    1    0    0     0
  anim        0    1    0    0    0    0    0     0
  are         1    1    0    0    0    0    0     0
  arnold      0    0    0    0    0    1    0     0
  around      0    0    0    1    0    0    0     0
  best        0    0    1    1    0    0    0     0
  billion     0    1    0    0    0    0    0     0
  brand       0    0    0    1    2    0    0     0
  buy         0    0    0    0    1    0    0     0
  came        0    0    0    0    0    1    0     0
  cat         1    1    0    2    3    0    0     1
  classic     0    0    0    0    0    1    0     0
  columbus    0    0    1    0    0    0    0     0

Sparsity and storage of the term document matrix

The matrices in tm are of type Simple Triplet Matrix where only the triples \( (i, j, value) \) are stored for non-zero values. To work directly with these objects, you may use install the slam [4] package. We bear some extra cost by making the matrix “dense” (i.e., storing all the zeros) below.


term.doc.matrix <- as.matrix(term.doc.matrix.stm)

cat("Dense matrix representation costs", object.size(term.doc.matrix), "bytes.\n", 
    "Simple triplet matrix representation costs", object.size(term.doc.matrix.stm), 
    "bytes.")

## Dense matrix representation costs 6688 bytes.
##  Simple triplet matrix representation costs 5808 bytes.

Variations on a theme

In term.doc.matrix, the dimensions of the document space are simple term frequencies. This is fine, but other heuristics are available. For instance, rather than a linear increase in the term frequency \( tf \), perhaps \( \sqrt(tf) \) or \( \log(tf) \) would provide a more reasonable diminishing returns on word counts within documents.

Rare words can also get a boost. The word “healthy” appears in only one document, whereas “cat” appears in four. A word's document frequency \( df \) is the number of documents that contain it, and a natural choice is to weight words inversely proportional to their \( df \)s. As with term frequency, we may use logarithms or other transformations to achieve the desired effect.

The tm function weightTfIdf offers one variety of tfidf weighting, but below we build our own. Visit the Wikipedia page for the SMART Information Retrieval System for a brief history and a list of popular weighting choices.

Different weighting choices are often made for the query and the documents. For instance, Manning et al.'s worked example [5] uses \( idf \) weighting only for the query.

Choice and implementation

For both the document and query, we choose tfidf weights of \( (1 + \log_2(tf)) \times \log_2(N/df) \), which are defined to be \( 0 \) if \( tf = 0 \). Note that whenever a term does not occur in a specific document, or when it appears in every document, its weight is zero.


get.tf.idf.weights <- function(tf.vec) {
  # Computes tfidf weights from term frequency vector
  n.docs <- length(tf.vec)
  doc.frequency <- length(tf.vec[tf.vec > 0])
  weights <- rep(0, length(tf.vec))
  weights[tf.vec > 0] <- (1 + log2(tf.vec[tf.vec > 0])) * log2(n.docs/doc.frequency)
  return(weights)
}

# For a word appearing in 4 of 6 documents, occurring 1, 2, 3, and 6 times"
get.tf.idf.weights(c(1, 2, 3, 0, 0, 6))

[1] 0.5849625 1.1699250 1.5121061 0.0000000 0.0000000 2.0970686

Using apply, we run the tfidf weighting function on every row of the term document matrix. The document frequency is easily derived from each row by the counting the non-zero entries (not including the query).


tfidf.matrix <- t(apply(term.doc.matrix, 1,
                        FUN = function(row) {get.tf.idf.weights(row)}))
colnames(tfidf.matrix) <- colnames(term.doc.matrix)

tfidf.matrix[0:3, ]
    
Terms  doc1 doc2 doc3 doc4 doc5 doc6 doc7 query
  all     3    0    0    0    0    0    0     0
  and     0    0    0    0    3    0    0     0
  anim    0    3    0    0    0    0    0     0

Dot product geometry

A benefit of being in the vector space \( \mathcal{R}^N \) is the use of its dot product. For vectors \( a \) and \( b \), the geometric definition of the dot product is \( a \cdot b = \vert\vert a\vert\vert \, \vert\vert b \vert \vert \cos \theta \), where \( \vert\vert \cdot \vert \vert \) is the euclidean norm (the root sum of squares) and \( \theta \) is the angle between \( a \) and \( b \).

In fact, we can work directly with the cosine of \( \theta \). For \( \theta \) in the interval \( [-\pi, -\pi] \), the endpoints are orthogonality (totally unrelated documents) and the center, zero, is complete collinearity (maximally similar documents). We can see that the cosine decreases from its maximum value of \( 1.0 \) as the angle departs from zero in either direction.


angle <- seq(-pi, pi, by = pi/16)
plot(cos(angle) ~ angle, type = "b", xlab = "angle in radians",
     main = "Cosine similarity by angle")

plot of chunk unnamed-chunk-16

We may furthermore normalize each column vector in our tfidf matrix so that its norm is one. Now the dot product is \( \cos \theta \).


tfidf.matrix <- scale(tfidf.matrix, center = FALSE,
                      scale = sqrt(colSums(tfidf.matrix^2)))
tfidf.matrix[0:3, ]
    
Terms       doc1      doc2 doc3 doc4      doc5 doc6 doc7 query
  all  0.3625797 0.0000000    0    0 0.0000000    0    0     0
  and  0.0000000 0.0000000    0    0 0.3558476    0    0     0
  anim 0.0000000 0.3923672    0    0 0.0000000    0    0     0

Matrix multiplication: a dot product machine

Treating the query as just another document kept things simple for this article, though in a production system the query will effectively come from a different corpus (see @Lorien's comment below). Now it's time to split them up.


query.vector <- tfidf.matrix[, (N.docs + 1)]
tfidf.matrix <- tfidf.matrix[, 1:N.docs]

With the query vector and the set of document vectors in hand, it is time to go after the cosine similarities. These are simple dot products as our vectors have been normalized to unit length.

Recall that matrix multiplication is really just a sequence of vector dot products. The matrix operation below returns values of \( \cos \theta \) for each document vector and the query vector.


doc.scores <- t(query.vector) %*% tfidf.matrix

With scores in hand, rank the documents by their cosine similarities with the query vector.


results.df <- data.frame(doc = names(doc.list), score = t(doc.scores),
                         text = unlist(doc.list))
results.df <- results.df[order(results.df$score, decreasing = TRUE), ]

The results

How did our search engine do?


options(width = 2000)
print(results.df, row.names = FALSE, right = FALSE, digits = 2)
                                                                     
 doc  score text                                                                      
 doc5 0.267 Buy Brand C cat food for your cat. Brand C makes healthy and happy cats.  
 doc4 0.143 Brand A is the best tasting cat food around. Your cat will love it.       
 doc6 0.132 The Arnold Classic came to town this weekend. It reminds us to be healthy.
 doc3 0.090 The best food in Columbus, OH is   the North Market.                      
 doc2 0.032 Cats are killers. They kill billions of animals a year.                   
 doc1 0.030 Stray cats are running all over the place. I see 10 a day!                
 doc7 0.000 I have nothing to say. In summary, I have told you nothing. 

Our “best” document, at least in an intuitive sense, comes out ahead with a score nearly twice as high as its nearest competitor. The second highest ranked document is still about cat food, and the profoundly uninformative document 7 has been ranked dead last.

Discussion

Though tfidf weighting and the vector space model may now be old school, its core concepts are still used in industrial search solutions built using Lucene. In more modern (and statistical) approaches based on probabilistic language modeling, documents are ranked by the probability that their underlying language model produced the query [6]. While there's nothing inherently statistical about the vector space model, a link to probabilistic language modeling has been demonstrated [7].

I hope you've enjoyed exploring the tm package and implementing classic ideas from the information retrieval.

Acknowledgements

The markdown [8] and knitr [9] packages, in conjunction with RStudio's IDE [10], were used to create this document. Thanks to Chris Nicholas and Shannon Terry for their comments and feedback. I first learned about information retrieval in Coursera's Stanford Natural Language Processing course taught by Dan Jurafsky and Christopher Manning. Keep up with ours and other great articles on R-Bloggers .

References

  1. Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm

  2. Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.

  3. Kurt Hornik (2013). Snowball: Snowball Stemmers. R package version 0.0-8. http://CRAN.R-project.org/package=Snowball

  4. Kurt Hornik, David Meyer and Christian Buchta (2013). slam: Sparse Lightweight Arrays and Matrices. R package version 0.1-28. http://CRAN.R-project.org/package=slam

  5. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press. 2008. URL: http://www-nlp.stanford.edu/IR-book/

  6. Hugo Zaragoza, Djoerd Hiemstra, and Michael Tipping. “Bayesian extension to the language model for ad hoc information retrieval.” Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2003. URL

  7. Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. No. CMU-CS-96-118. Carnegie-Mellon University of Pittsburgh, PA. Department of Computer Science, 1996.

  8. JJ Allaire, Jeffrey Horner, Vicent Marti and Natacha Porte (2012). markdown: Markdown rendering for R. R package version 0.5.3. http://CRAN.R-project.org/package=markdown

  9. Yihui Xie (2012). knitr: A general-purpose package for dynamic report generation in R. R package version 0.6. http://CRAN.R-project.org/package=knitr

  10. RStudio IDE for Windows. URL http://www.rstudio.com/ide/

  11. R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: http://www.R-project.org/.

23 comments:

  1. Very interesting post!

    Without the stem step (I have not a jvm installed) I get better results:

    doc5 0.460 Buy Brand C cat...
    doc4 0.377 Brand A is the b...
    doc6 0.150 The Arnold Class...
    doc3 0.095 The best food in...
    doc1 0.000 Stray cats are r...
    doc2 0.000 Cats are killers...
    doc7 0.000 I have nothing t...

    ReplyDelete
    Replies
    1. Ah, something useful was removed by stemming, since the singular "cat" precedes "food" in the two most natural documents. It's interesting how the other two documents involving "cats" now have zero scores. Thanks for sharing!

      Delete
  2. Really cool post! Thanks for sharing!

    ReplyDelete
  3. Perfect guide for building search engine and i will try to learn something from it.

    ReplyDelete
  4. The library Snowball is renamed as SnowballC in R v3.1.
    Mention about that as many people may face this problem.
    With regards,

    ReplyDelete
    Replies
    1. It is also not possible to use tolower in the same manner as done in the example above. This is due to changes in new versions of tm. Read this for more http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument

      Delete
    2. Thanks Aayush. I finally got around to updating.

      Delete
  5. This comment has been removed by the author.

    ReplyDelete
  6. I get the following error when running the apply function. (R 3.2.1)
    > tfidf.matrix <- t(apply(term.doc.matrix, 1, FUN = get.weights.per.term.vec))
    Error in FUN(newX[, i], ...) : object 'tfidf.row' not found



    Thank you, your post was very helpful

    ReplyDelete
    Replies
    1. Hi Leebasky, I've recently updated the article. If you're still interested, the code should work for you now. Sorry for the delay, and glad you still got something out of it.

      Delete
  7. Hello Sir , I am getting following error , can you plz guide me on this
    Error in tfidf.matrix[, (N.docs + 1)] : subscript out of bounds

    ReplyDelete
    Replies
    1. After Executing the command "query.vector <- tfidf.matrix[, (N.docs + 1)]".
      Also N.docs <- 103, i have calculated i dont know where i am wrong , here its correct why subscript out of bound error is giving.I am stuck

      Delete
    2. Got it where i was wrong, here i have't consider the Query command ...So i got the result ..Thank you for the Post really very helpful ..Looking forward for various recommendation engine tutorials ..Thank well in advance .

      Delete
    3. Hi Nabi, I'm glad you got something out of the article. I finally got around to updating it so the code above should work now.

      Delete
  8. Fantastic article Ben. Found a bug: Need to bump N.docs up by 1 for the query to be processed correctly.

    ReplyDelete
  9. Thanks Lorien! About the bug, I'm a little rusty on my own article. Mind checking to see if I added +1 in the right place?

    ReplyDelete
  10. Yes, Ben, that'll work. I did it a bit differently (by bumping ndocs early then taking it down a notch later) but tomatoes tomahtoes :-)

    This is a really terrific article btw: boils down what is unnecessarily complex elsewhere to its essence. And incredibly useful: lots of my clients looking for this kind of functionality lately.

    BTW, in production your query's going to come from a different corpus, so you're going to need something like the code below after you've cleaned it up using identical code to the original corpus creation:

    r1<-rownames(term.doc.matrix)
    r2<-rownames(query.doc.matrix)

    rearranger<-match(r2,r1) # List of new positions for r2 values
    r2<-rearranger[!is.na(rearranger)] # List of the new positions omitting elements that don't get positioned
    valuesToRearrange <- query.doc.matrix[!is.na(rearranger)] # Values to go into those positions
    newcol<-rep(0,nrow(term.doc.matrix)) # Make all 0s to initialize new matrix col

    # Place the values in their positions
    newcol[r2]<-valuesToRearrange

    # Append the query to the matrix
    term.doc.matrix<-cbind(term.doc.matrix, newcol)

    ReplyDelete
  11. Really glad you got something out of it. This morning I simplified the tfidf function so that it works on a term frequency vector alone and thus can be directly applied to the rows of the term frequency matrix. Since that matrix has the query column included, there's no need to add +1 like before. However, to your point, that's not really how it would work in a production system where a new query wouldn't automatically enter the corpus. I referenced your comment in the article and hopefully it gets the message across.

    ReplyDelete
  12. Excellent post. I have a question or two though. What if you have 100k documents, and you can't create a matrix that is 91G in size? Do you use the words in the query to pull all documents that use those words as an initial Boolean type filter? Also, if the documents on average 4-5 words, such as an ontology, and the queries are similar 4-5 words, do you anticipate that this method with fail? I find hundreds of nearly perfect scores between terms that have no common words, in very large ontology. Thanks!

    ReplyDelete
    Replies
    1. Hi Jonathan, glad you enjoyed the article. Though this article had everything stored in memory, I've read that it's more common to break things up into different stages, with the most resource intensive stage (indexing) done offline in a batch environment. I don't myself know how that works, and digging deeper into Apache Lucene would be a place to start (this Quora post looks interesting - https://www.quora.com/What-is-an-intuitive-description-of-how-Lucene-works).

      As far as the ontology-based method you described, I did once try something like that and it did indeed fail. It was using WordNet if I remember correctly and there were just too many nonsense matches. I think the official term for that task is "Query expansion," and you may want to dig into the research there. Another related area that's interesting is "Word Sense Disambiguation"; if you're searching for "Stocks," you probably want information on the financial instrument and not a job description that includes "stocks shelves." I read that is considered an "AI-complete" problem (and thus very difficult).

      Delete
  13. Excellent post here. I am trying to build on this to cluster similar documents. For example, I have 2000+ emails and I want to use cosine similarity to determine which emails are similar to each other (btw, these emails are all on the same over-arching topic so there are similarities here). Am I crazy for thinking I should be able to use cosine similarity do this? Any ideas are appreciated. Thank you.

    ReplyDelete
    Replies
    1. Hi Natasha, it's been a long time since I got a real comment on this article. I'm not an expert in this area (had just taken a Coursera class before writing this article in 2013), but my understanding is that cosine similarity is still quite useful. For recommender systems, Google Research just released a paper (https://dl.acm.org/doi/pdf/10.1145/3383313.3412488) showing how dot-product similarities outperform "learned" neural network similarities in recommender systems. For your task, computing "2000 choose 2" similarities doesn't sound too bad, but if you run into problems you could check out the Locality Sensitive Hashing (LSH) algorithm. Latent Dirichlet allocation is another model worth checking out, though I don't have much experience with it.

      Delete