On a recent project, had to deal with searching of tens of thousands of product descriptions, with a need to find substring matches quickly. The select: statement in Smalltalk works like a SQL table scan – okay for small collections, but becomes seconds+ response time with larger lists.
An effective solution to this is an nGram Dictionary. Strings of words can be broken up into sets of tri-grams, quad-grams, quint-grams, and so on.
My approach to this is a Dictionary indexed by nGram length, each element containing dictionaries of nGram strings of collections of the string objects to be searched. Thus, indexing results as such:
3 -> ana -> ('banana') ban -> ('banana', 'band') 4 -> bana -> ('banana') band -> ('band') anan -> ('banana') 5 -> banan -> ('banana') ...
To instantiate, just NGramDictionary new on: someCollectionofStrings from: 4 to: 10, for example. To search, just foo search: ‘someString’. If someString can be larger than the NGramDictionary size; search: will handle longer string searches by doing a select: scan.
How large to make the nGram Dictionary is a matter of how large and non-unique the string collection may be. I recommend starting at the low end at 3 or 4 – searches of 1 or 2 character strings will return way too many results with most string collections to be useful.
There’s a few other special cases I kept this in this example code, to meet my particular needs; for example, this is not a strict nGram splicing – spaces, numbers and punctuation marks are excluded and become “word” boundaries. But, the results are super-fast, enabling, for example, type-ahead select lists on a web-based form.
Object subclass: #NGramDictionary instanceVariableNames: 'mySmallest myLargest myCollectionofStrings myNGrams' classVariableNames: '' poolDictionaries: '' category: ''at: aString "returns strictly those strings that have an indexed nGram. If the string is larger than te largest nGram specified in this dictionary, return an empty collection" |nGram| nGram := self nGrams at: aString size ifAbsent: [nil]. ^nGram notNil ifTrue: [nGram at: aString ifAbsent: [OrderedCollection new]] ifFalse: [Bag new] nGramed: aCollectionofStrings n: length "private. Return a dictionary of all nGrams or order length for the given collection of strings, excluding nGrams with punctuation or numbers" |ngram e words | ngram := Dictionary new. aCollectionofStrings do: [:each | e := each asLowercase. 1 to: e size - length + 1 do: [:i | words := (e copyFrom: i to: i + length -1) subStrings: '., :;- ()!!''&%/#+0123456789'. (words size > 0 and: [words first size = length]) "split it, save if still the same length" ifTrue: [(ngram at: words first ifAbsentPut: [OrderedCollection new]) add: each] ] ]. ^ngram nGrams "Private. lazily initialized" ^ myNGrams ifNil: [myNGrams := Dictionary new. mySmallest to: myLargest do: [ :i | myNGrams at: i put: (self nGramed: myCollectionofStrings n: i) ]. myNGrams yourself ] on: aCollectionofStrings from: start to: finish "initialization method" myCollectionofStrings := aCollectionofStrings. mySmallest := start. myLargest := finish search: aString "returns a collection of strings that contain aString. If there's not an nGram available, do a select: , which is slower of course'" |nGram searchstring| searchstring := aString asLowercase. searchstring size <= myLargest ifTrue: [nGram := self nGrams at: searchstring size. ^nGram at: searchstring ifAbsent: [OrderedCollection new]] ifFalse: [nGram := self nGrams at: myLargest. ^(nGram at: (searchstring copyFrom: 1 to: myLargest) ifAbsent: [OrderedCollection new]) select: [: each | each asLowercase includesSubString: searchstring] ]