Comments on Anything but R-bitrary: Build a search engine in 20 minutes or less
(comment feed by Chris Nicholas)

Ben (2020-10-06):
Hi Natasha, it's been a long time since I got a real comment on this article. I'm not an expert in this area (I had just taken a Coursera class before writing this article in 2013), but my understanding is that cosine similarity is still quite useful. For recommender systems, Google Research just released a paper (https://dl.acm.org/doi/pdf/10.1145/3383313.3412488) showing how dot-product similarities outperform "learned" neural-network similarities. For your task, computing "2000 choose 2" similarities doesn't sound too bad, but if you run into problems you could check out the locality-sensitive hashing (LSH) algorithm. Latent Dirichlet allocation is another model worth checking out, though I don't have much experience with it.

Natasha H. (2020-10-06):
Excellent post here. I am trying to build on this to cluster similar documents. For example, I have 2000+ emails and I want to use cosine similarity to determine which emails are similar to each other (these emails are all on the same overarching topic, so there are similarities here). Am I crazy for thinking I should be able to use cosine similarity to do this? Any ideas are appreciated. Thank you.

Ben (2017-01-03):
Hi Jonathan, glad you enjoyed the article. Though this article had everything stored in memory, I've read that it's more common to break things up into different stages, with the most resource-intensive stage (indexing) done offline in a batch environment. I don't myself know how that works; digging deeper into Apache Lucene would be a place to start (this Quora post looks interesting: https://www.quora.com/What-is-an-intuitive-description-of-how-Lucene-works).

As for the ontology-based method you described, I did once try something like that, and it did indeed fail. It was using WordNet, if I remember correctly, and there were just too many nonsense matches. I think the official term for that task is "query expansion," and you may want to dig into the research there. Another related area that's interesting is "word sense disambiguation": if you're searching for "stocks," you probably want information on the financial instrument and not a job description that includes "stocks shelves." I've read that it's considered an "AI-complete" problem (and thus very difficult).

Anonymous (2017-01-02):
Excellent post. I have a question or two, though. What if you have 100k documents and you can't create a matrix that is 91 GB in size? Do you use the words in the query to pull all documents that contain those words, as an initial Boolean-type filter? Also, if the documents average 4-5 words, such as in an ontology, and the queries are similarly 4-5 words, do you anticipate that this method will fail? I find hundreds of nearly perfect scores between terms that have no common words in a very large ontology. Thanks!

Ben (2016-11-06):
Really glad you got something out of it. This morning I simplified the tfidf function so that it works on a term-frequency vector alone and thus can be directly applied to the rows of the term-frequency matrix. Since that matrix has the query column included, there's no need to add +1 like before. However, to your point, that's not really how it would work in a production system, where a new query wouldn't automatically enter the corpus. I referenced your comment in the article, and hopefully it gets the message across.

Lorien (2016-11-04):
Yes, Ben, that'll work. I did it a bit differently (by bumping ndocs early, then taking it down a notch later), but tomatoes, tomahtoes :-)

This is a really terrific article, btw: it boils down to its essence what is unnecessarily complex elsewhere. And it's incredibly useful: lots of my clients have been looking for this kind of functionality lately.

BTW, in production your query is going to come from a different corpus, so you're going to need something like the code below after you've cleaned it up using code identical to the original corpus creation:

    r1 <- rownames(term.doc.matrix)
    r2 <- rownames(query.doc.matrix)

    rearranger <- match(r2, r1)                                # New position for each query term
    r2 <- rearranger[!is.na(rearranger)]                       # New positions, omitting terms that don't get positioned
    valuesToRearrange <- query.doc.matrix[!is.na(rearranger)]  # Values to go into those positions

    newcol <- rep(0, nrow(term.doc.matrix))                    # All 0s to initialize the new matrix column
    newcol[r2] <- valuesToRearrange                            # Place the values in their positions

    term.doc.matrix <- cbind(term.doc.matrix, newcol)          # Append the query to the matrix

Ben (2016-11-04):
Thanks, Lorien! About the bug: I'm a little rusty on my own article. Mind checking to see if I added +1 in the right place?

Lorien (2016-10-29):
Fantastic article, Ben. Found a bug: N.docs needs to be bumped up by 1 for the query to be processed correctly.

Ben (2016-06-25):
Hi Nabi, I'm glad you got something out of the article. I finally got around to updating it, so the code above should work now.

Ben (2016-06-25):
Hi Leebasky, I've recently updated the article. If you're still interested, the code should work for you now. Sorry for the delay, and glad you still got something out of it.

Ben (2016-06-25):
Thanks, Aayush. I finally got around to updating.

Anonymous (2016-06-22):
I see where I was wrong: I hadn't considered the query command, so once I did, I got the result. Thank you for the post, it was really very helpful. Looking forward to various recommendation-engine tutorials. Thanks well in advance.

Anonymous (2016-06-21):
It happens after executing the command query.vector <- tfidf.matrix[, (N.docs + 1)]. Also, I have calculated N.docs <- 103, and I don't know where I am wrong; it looks correct, so why is it giving a subscript-out-of-bounds error? I am stuck.

Anonymous (2016-06-21):
Hello sir, I am getting the following error; can you please guide me on this?

    Error in tfidf.matrix[, (N.docs + 1)] : subscript out of bounds

Leebasky Online (2016-03-02):
I get the following error when running the apply function (R 3.2.1):

    > tfidf.matrix <- t(apply(term.doc.matrix, 1, FUN = get.weights.per.term.vec))
    Error in FUN(newX[, i], ...) : object 'tfidf.row' not found

Thank you, your post was very helpful.

Anonymous (2016-01-18):
It is also not possible to use tolower in the same manner as in the example above, due to changes in newer versions of tm. Read this for more: http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument

Aayush Kumar Singha (2015-08-25):
The Snowball library is renamed SnowballC as of R v3.1. Worth mentioning that, as many people may face this problem. With regards.

Anonymous (2015-05-13):
Perfect guide for building a search engine, and I will try to learn something from it.

Anonymous (2013-04-01):
Really cool post! Thanks for sharing!

Ben (2013-03-27):
Ah, something useful was removed by stemming, since the singular "cat" precedes "food" in the two most natural documents. It's interesting how the other two documents involving "cats" now have zero scores. Thanks for sharing!

Alberto Santini (2013-03-27):
Very interesting post!

Without the stemming step (I don't have a JVM installed), I get better results:

    doc5  0.460  Buy Brand C cat...
    doc4  0.377  Brand A is the b...
    doc6  0.150  The Arnold Class...
    doc3  0.095  The best food in...
    doc1  0.000  Stray cats are r...
    doc2  0.000  Cats are killers...
    doc7  0.000  I have nothing t...
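The scoring scheme this thread keeps returning to (a term-document matrix with the query appended as an extra column, tf-idf weighting of the rows, then cosine similarity against the query column) can be sketched in a few lines. This is a minimal illustration in Python rather than the article's R, using one common tf-idf variant (tf × log(N/df)); the toy corpus and all names are illustrative:

```python
import math
from collections import Counter

# Toy corpus and query; the article does the equivalent in R, with the query
# appended as an extra column of the term-document matrix.
docs = ["stray cats are run over", "cats are killers", "buy brand c cat food"]
query = "cat food"

tokenized = [d.split() for d in docs] + [query.split()]  # query is the last "document"
vocab = sorted({t for toks in tokenized for t in toks})
counts = [Counter(toks) for toks in tokenized]

n_docs = len(docs)  # document frequencies are computed over the real docs only

def tfidf(term, doc_counts):
    """One common tf-idf variant: tf * log(N / df)."""
    df = sum(1 for c in counts[:n_docs] if term in c)
    return doc_counts[term] * math.log(n_docs / df) if df else 0.0

vectors = [[tfidf(t, c) for t in vocab] for c in counts]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query_vec = vectors[-1]  # the appended query column
scores = [cosine(v, query_vec) for v in vectors[:n_docs]]
# Only the document sharing terms with the query scores above zero.
```

The same pairwise cosine computation covers Natasha's clustering question: with roughly 2000 emails, the "2000 choose 2" (about 2 million) pairwise scores are cheap to brute-force, and similar emails can then be grouped by thresholding the score.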
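Lorien's point about production queries coming from a different corpus boils down to vocabulary alignment: the query's term counts must be rearranged onto the corpus's row order, dropping terms the corpus has never seen and zero-filling the rest. A Python sketch of the same match()/cbind() logic, with illustrative names and data:

```python
# Align a query's term counts to the main corpus vocabulary, the same job the
# match()/cbind() R snippet above does. Terms absent from the corpus vocabulary
# are dropped; corpus terms absent from the query get 0.
corpus_vocab = ["brand", "cat", "cats", "food", "killers"]  # rownames(term.doc.matrix)
query_counts = {"cat": 1, "food": 2, "nutrition": 1}        # "nutrition" is unknown to the corpus

query_col = [query_counts.get(term, 0) for term in corpus_vocab]
print(query_col)  # [0, 1, 0, 2, 0] -- "nutrition" is silently discarded
```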