nGram Dictionary

On a recent project, had to deal with searching of tens of thousands of product descriptions, with a need to find substring matches quickly.  The select: statement in Smalltalk works like a SQL table scan – okay for small collections, but becomes seconds+ response time with larger lists.

An effective solution to this is an nGram Dictionary.  Strings of words can be broken up into sets of tri-grams, quad-grams, quint-grams, and so on.

My approach to this is a Dictionary indexed by nGram length, each element containing dictionaries of nGram strings of collections of the string objects to be searched.  Thus, indexing results as such:

3 -> ana -> ('banana')
ban -> ('banana', 'band')
4 -> bana -> ('banana')
band -> ('band')
anan -> ('banana')
5 -> banan -> ('banana')
...

To instantiate, just NGramDictionary new on: someCollectionofStrings from: 4 to: 10, for example.  To search, just foo search: ‘someString’.  If someString can be larger than the NGramDictionary size; search: will handle longer string searches by doing a select: scan.

How large to make the nGram Dictionary is a matter of how large and non-unique the string collection may be.  I recommend starting at the low end at 3 or 4 – searches of 1 or 2 character strings will return way too many results with most string collections to be useful.

There’s a few other special cases I kept this in this example code, to meet my particular needs; for example, this is not a strict nGram splicing – spaces, numbers and punctuation marks are excluded and become “word” boundaries.  But, the results are super-fast, enabling, for example, type-ahead select lists on a web-based form.

Object subclass: #NGramDictionary
instanceVariableNames: 'mySmallest myLargest myCollectionofStrings myNGrams'
classVariableNames: ''
poolDictionaries: ''
category: ''

at: aString
"returns strictly those strings that have an indexed nGram.  If the string is larger than
te largest nGram specified in this dictionary, return an empty collection"
|nGram|
nGram := self nGrams at: aString size ifAbsent: [nil].
^nGram notNil
   ifTrue: [nGram at: aString ifAbsent: [OrderedCollection new]]
   ifFalse: [Bag new]

nGramed: aCollectionofStrings n: length
"private.  Return a dictionary of all nGrams or order length for the given collection of strings, excluding nGrams with punctuation or numbers"
|ngram e words |

ngram := Dictionary new.
aCollectionofStrings
do: [:each |
   e := each asLowercase.
   1 to: e size - length + 1 do: [:i |
        words := (e copyFrom: i to: i + length -1) subStrings: '., :;- ()[]!!''&%/#+0123456789'.
        (words size > 0 and: [words first size = length]) "split it, save if still the same length"
       ifTrue:
         [(ngram at: words first ifAbsentPut: [OrderedCollection new]) add: each]
       ]
 ].
^ngram

nGrams
"Private. lazily initialized"
    ^ myNGrams ifNil:
[myNGrams := Dictionary new.
mySmallest to: myLargest do: [ :i |
   myNGrams at: i put: (self nGramed: myCollectionofStrings n: i) ].
   myNGrams yourself ]

on: aCollectionofStrings from: start to: finish
"initialization method"
myCollectionofStrings := aCollectionofStrings.
mySmallest := start.
myLargest := finish

search: aString
"returns a collection of strings that contain aString.  If there's not an nGram available, do a select: , which is slower of course'"
|nGram searchstring|
searchstring := aString asLowercase.
searchstring size <= myLargest
  ifTrue:
     [nGram := self nGrams at: searchstring size.
     ^nGram at: searchstring ifAbsent: [OrderedCollection new]]
  ifFalse:
     [nGram := self nGrams at: myLargest.
     ^(nGram at: (searchstring copyFrom: 1 to: myLargest) ifAbsent: [OrderedCollection new])
select: [: each | each asLowercase includesSubString: searchstring]
     ]

Published by

kevin

I'm the founder of Agoric Source, co-organizer of the Houston Python Meetup, director of technology at Newspaper Subscription Services, LP, technology advisor to InstaFuel, active board member of the Houston Area Model United Nations, and occasional volunteer to the Red Cross (during hurricanes or other local emergencies). I'm first and foremost still a software hacker, but with my economics background and business experience, I serve well as a project or program manager, technical visionary, etc.