nGram Dictionary

On a recent project, had to deal with searching of tens of thousands of product descriptions, with a need to find substring matches quickly.  The select: statement in Smalltalk works like a SQL table scan – okay for small collections, but becomes seconds+ response time with larger lists.

An effective solution to this is an nGram Dictionary.  Strings of words can be broken up into sets of tri-grams, quad-grams, quint-grams, and so on.

My approach to this is a Dictionary indexed by nGram length, each element containing dictionaries of nGram strings of collections of the string objects to be searched.  Thus, indexing results as such:

3 -> ana -> ('banana')
ban -> ('banana', 'band')
4 -> bana -> ('banana')
band -> ('band')
anan -> ('banana')
5 -> banan -> ('banana')
...

Continue reading nGram Dictionary