Dennis Forbes on Pragmatic Software Development   Subscribe to RSS


About the Author
Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development, Linux development, and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 13 years.


Recent Entries


The Feed Bag

 
Monday, September 05 2005

[EDIT: 2005-09-08 - There seems to be a bit of confusion regarding this post, so I should clarify - These are neither unique or unsolvable problems. One could, for instance, achieve both of these tasks via hackery, as I mentioned below. I could of course use triggers and procedural logic and duplicitous columns to store reversed sets and decomposed strings. However that is the wrong solution, and pollutes my database with hacks to get around fundamental limits in Full-Text indexing]

Recently I've come across the need to do partial string searches within very large sets of data (e.g. 10 large columns of both ntext and nvarchar types, in a table contains hundreds of thousands, or even millions, of records).

For example I'd like to look for 0505 within a table, returning rows containing that value in any of the searched columns. For instance a row where one of the values contains REC995850505293. I could do this the bulk force way by doing a LIKE '%0505%' against all of the columns, however that's terribly inefficient, and will bring the largest of servers to their knees with the volumes of data that I'm talking about.

Of course the immediate solution one might imagine would be SQL Server's Full-Text Indexing, or even a third-party tool like Apache's Lucene full-text search. The problem is that both of these search engines can only match from the beginning of a word onwards (or, with a thesaurus, word variants). For instance they can search form REC9958*, returning REC995850505293, but they cannot search for *50505293. Because the index is ordered based upon the beginning of the word, it can only match non-beginning partial words through a full-scan, which is of no help at all.

From a technology perspective this is understandable, however there are a couple of pretty simple improvements to full-text indexing that would greatly improve their usability (albeit at the cost of storage and additional search maintenance processing, but that should be a choice that a user can decide).

  • Allow for the configuration of columns to be stored both forwards and backwards. For instance on the back-end the column REC995850505293 would be stored as REC995850505293, but also as 392505058599CER. The value, of course, is that now if I search for "*05293", it can reverse the search phrase to 39250* and compare it against the reversed set, quickly finding a match. Double the storage space, but a vastly more usable index. As it is you can hack this sort of thing yourself, storing mirrors of all of your fields and then running your queries through a processor to know what to invert, but it's a lot of hackery for something that should be a back-end solution.
  • Allow for the configuration such that all sub-parts of a word are also indexed. e.g. REC995850505293 is also stored as EC995850505293, and C995850505293, and 995850505293, and so on. Clearly the consumption in space would be tremendous, but there are scenarios where this would be extremely valuable. Now a search for *0505* and it can find the partial word matches, linking back to the primary word from there.

If anyone has any ideas of places where I might look for solutions to this sort of problem, please drop me a line.

  SQL 

Reader Comments

Add Comment

Name *:

Email Address:

(your email address is not displayed)
Website:

Comment *:


Dennis Forbes