On a recent project, had to deal with searching of tens of thousands of product descriptions, with a need to find substring matches quickly. The select: statement in Smalltalk works like a SQL table scan – okay for small collections, but becomes seconds+ response time with larger lists.
An effective solution to this is an nGram Dictionary. Strings of words can be broken up into sets of tri-grams, quad-grams, quint-grams, and so on.
My approach to this is a Dictionary indexed by nGram length, each element containing dictionaries of nGram strings of collections of the string objects to be searched. Thus, indexing results as such:
3 -> ana -> ('banana') ban -> ('banana', 'band') 4 -> bana -> ('banana') band -> ('band') anan -> ('banana') 5 -> banan -> ('banana') ...