software

nGram Dictionary

On a recent project, had to deal with searching of tens of thousands of product descriptions, with a need to find substring matches quickly.  The select: statement in Smalltalk works like a SQL table scan – okay for small collections, but becomes seconds+ response time with larger lists.

An effective solution to this is an nGram Dictionary.  Strings of words can be broken up into sets of tri-grams, quad-grams, quint-grams, and so on.

My approach to this is a Dictionary indexed by nGram length, each element containing dictionaries of nGram strings of collections of the string objects to be searched.  Thus, indexing results as such:

3 -> ana -> ('banana')
ban -> ('banana', 'band')
4 -> bana -> ('banana')
band -> ('band')
anan -> ('banana')
5 -> banan -> ('banana')
...

To instantiate, just NGramDictionary new on: someCollectionofStrings from: 4 to: 10, for example.  To search, just foo search: ‘someString’.  If someString can be larger than the NGramDictionary size; search: will handle longer string searches by doing a select: scan.

How large to make the nGram Dictionary is a matter of how large and non-unique the string collection may be.  I recommend starting at the low end at 3 or 4 – searches of 1 or 2 character strings will return way too many results with most string collections to be useful.

There’s a few other special cases I kept this in this example code, to meet my particular needs; for example, this is not a strict nGram splicing – spaces, numbers and punctuation marks are excluded and become “word” boundaries.  But, the results are super-fast, enabling, for example, type-ahead select lists on a web-based form.

Object subclass: #NGramDictionary
instanceVariableNames: 'mySmallest myLargest myCollectionofStrings myNGrams'
classVariableNames: ''
poolDictionaries: ''
category: ''

at: aString
"returns strictly those strings that have an indexed nGram.  If the string is larger than
te largest nGram specified in this dictionary, return an empty collection"
|nGram|
nGram := self nGrams at: aString size ifAbsent: [nil].
^nGram notNil
   ifTrue: [nGram at: aString ifAbsent: [OrderedCollection new]]
   ifFalse: [Bag new]

nGramed: aCollectionofStrings n: length
"private.  Return a dictionary of all nGrams or order length for the given collection of strings, excluding nGrams with punctuation or numbers"
|ngram e words |

ngram := Dictionary new.
aCollectionofStrings
do: [:each |
   e := each asLowercase.
   1 to: e size - length + 1 do: [:i |
        words := (e copyFrom: i to: i + length -1) subStrings: '., :;- ()[]!!''&%/#+0123456789'.
        (words size > 0 and: [words first size = length]) "split it, save if still the same length"
       ifTrue:
         [(ngram at: words first ifAbsentPut: [OrderedCollection new]) add: each]
       ]
 ].
^ngram

nGrams
"Private. lazily initialized"
    ^ myNGrams ifNil:
[myNGrams := Dictionary new.
mySmallest to: myLargest do: [ :i |
   myNGrams at: i put: (self nGramed: myCollectionofStrings n: i) ].
   myNGrams yourself ]

on: aCollectionofStrings from: start to: finish
"initialization method"
myCollectionofStrings := aCollectionofStrings.
mySmallest := start.
myLargest := finish

search: aString
"returns a collection of strings that contain aString.  If there's not an nGram available, do a select: , which is slower of course'"
|nGram searchstring|
searchstring := aString asLowercase.
searchstring size <= myLargest
  ifTrue:
     [nGram := self nGrams at: searchstring size.
     ^nGram at: searchstring ifAbsent: [OrderedCollection new]]
  ifFalse:
     [nGram := self nGrams at: myLargest.
     ^(nGram at: (searchstring copyFrom: 1 to: myLargest) ifAbsent: [OrderedCollection new])
select: [: each | each asLowercase includesSubString: searchstring]
     ]