[EDIT: 2005-09-08 - There seems to be a bit of confusion regarding this post, so I should clarify - These are neither unique or unsolvable problems. One could, for instance, achieve both of these tasks via hackery, as I mentioned below. I could of course use triggers and procedural logic and duplicitous columns to store reversed sets and decomposed strings. However that is the wrong solution, and pollutes my database with hacks to get around fundamental limits in Full-Text indexing]
Recently I've come across the need to do partial string searches within very large sets of data (e.g. 10 large columns of both ntext and nvarchar types, in a table contains hundreds of thousands, or even millions, of records).
For example I'd like to look for 0505 within a table, returning rows containing that value in any of the searched columns. For instance a row where one of the values contains REC995850505293. I could do this the bulk force way by doing a LIKE '%0505%' against all of the columns, however that's terribly inefficient, and will bring the largest of servers to their knees with the volumes of data that I'm talking about.
Of course the immediate solution one might imagine would be SQL Server's Full-Text Indexing, or even a third-party tool like Apache's Lucene full-text search. The problem is that both of these search engines can only match from the beginning of a word onwards (or, with a thesaurus, word variants). For instance they can search form REC9958*, returning REC995850505293, but they cannot search for *50505293. Because the index is ordered based upon the beginning of the word, it can only match non-beginning partial words through a full-scan, which is of no help at all.
From a technology perspective this is understandable, however there are a couple of pretty simple improvements to full-text indexing that would greatly improve their usability (albeit at the cost of storage and additional search maintenance processing, but that should be a choice that a user can decide).
If anyone has any ideas of places where I might look for solutions to this sort of problem, please drop me a line.
Several times over the past couple of years, in my role as a database consultant, I've come across very, very large databases, where a large percentage of the data is redundant. For instance consider the following two abbreviated tables
| Forms | |
| FormsID (PK) | nvarchar(255) |
| Hits | |
| HitsID (PK) | int identity(1,1) |
| FormsID (FK) | nvarchar(255) |
| Time | datetime |
Imagine that there are only a dozen forms values, each of them averaging about 30 characters in length (so 60 bytes or so, giving that it's unicode). If you have a million records in hits, that's 60MB just for the form value itself. If you have one hundred million records, and a dozen large FKs like this, well you get the picture. It vastly increases the amount of I/O to do searches in the Hits table, and even if Forms is indexed it's still much slower than it could be if Forms had an integer primary key.
While I personally wouldn't layout tables this way, it is an entirely credible and justifiable design - the designer simply decided to use a natural key rather than a surrogate key. Simplicity of design, and clarity of relationships when looking at the data, outweighed I/O concerns for this person/group. Such a design is not a question of normalization.
When you have a million+ records it suddenly becomes a concern, though. There are ways to refactor this design, including "normalizing" the original table a bit and hiding it behind a view, and then adding INSTEAD OF triggers on the view, however that is a leaky abstraction. SQL Server does not completely mimic a real table, and operations like INSERT FROM fail, not to mention oddities with @@IDENTITY and SCOPE_IDENTITY().
Given all of this, I would love if SQL Server had a behind the scenes method of collapsing redundant large field values into a hidden behind the scenes lookup table, similar to what Visual Studio does with string pooling. e.g. In this case it could replace FormID with a internal value to lookup against a tiny relational table. Obviously this should be manually configured, but it would be a relatively easy change that could tremendously improve a lot of existing database designs where a redesign isn't a priority, but I/O costs are onerous.
Apparently mySQL has something similar by way of enums, however it is a fixed set (what I'd like is that new values inserted into the table are automatically added into the behind the scenes set), and again there is some leakiness with the abstraction.
Finally got around to watching this show, while it was doing a marathon showing on the FoodTV channel here in Canada. Quite apart from the food (which is actually almost an overlooked element of the show - it is not a cooking show), this show is an excellent lesson in management. The lessons learned in dealing with peers, "employees" (the dinner ladies), and the kids is absolutely brilliant human nature stuff that everyone should watch and absorb. Very highly recommended.
char *string_value = new char[32];
In computer science we're quite accustomed to using powers of 2 whenever a numerical limit is required. e.g. The string can be 32 characters long, the filename can be 64 characters, while the number of entries in the listbox can number 1024.
These uses seldom require powers of 2 (e.g. while it makes sense for an ASCII string to be multiples of 4 bytes if it's long aligned and you care about that, it could just as efficiently be 28 or 36 characters long), but nonetheless it's ingrained into most developers' minds.
I chuckled seeing the commercial for some overpriced timed-interval air freshener. It allows you to select 9, 18, or 36 minute intervals between sprays. While not exactly compliant (I'll bet that it was originally 8, 16 or 32 minutes, but they added some lag to the minute counter to avoid it seeming computeresque), and in this case I can understand why the microcontroller developer chose powers, the spirit of the power of 2 lives on.
Over the past couple of days I've noticed hundreds upon hundreds of hits in my logs coming from www.skyscrapercity.com. After some analysis I determined that a user there rather rudely embedded an image on this site - a rather large picture of the Scotia Bank office tower in Toronto - in a discussion thread. Quite apart from the fact that the picture is being used unattributed (if it's good enough to use, then it's good enough to attribute), it's basically silently stealing my bandwidth quota. Very rude.
When people have done this in the past I've surprized them with delightful and entertaining image alterations, but in this case I'm just going to ignore it and let the thread die down. After looking at the source of the traffic, however, I've been reminded of the most common, and most successful, pure-.com internet play - put up a site about some sort of fly-by information (for instance skyscraper diagrams), and then add discussion links. Soon enough you'll have a robust community of users who share that interest, spending hours a day debating whether Chicago is a better looking city than Dubai. It seems like a pretty tenuous foundation for a community, but there it is.
Garbage collection in .NET has always rubbed me the wrong way. As a quick recap, garbage collection in .NET (as in Java) works by basically halting the application and scanning all references from the root reference on. It then looks on its heap to determine that every object has someone else pointing to it, and if not the object is freed (through a long process). The heap memory is then compacted and any references are rebased. The program then restarts until garbage collection happens again at some point in the future. This means, for instance, that if you create a System.IO.File object in a short method that opens a file in exclusive mode, and you fail to use the Dispose pattern or explicitly call Dispose (Dispose being a sort of "garbage collection has some gaps, so here's something that you can do to expedite at least part of the process"), the file will be locked until some unpredictable point in the future that garbage collection runs. You can, of course, force garbage collection, but that throws off the entire lifecycle management and can cause other resource management issues.
Ultimately it seems like the sort of solution that works for relatively small or isolated applications (where it works admirably - for web apps and web services, isolated services, and relatively small Windows Forms apps .NET is a fantastic technology, primarily insofar as it reduces development time), but not as a technology that scales up to large scale, highly responsive systems (where you want the loading, resource usage and response times to be predictable and consistent). This case seemed to be somewhat proven by many of the delays and issues with Longhorn (Windows Vista), and the backtracking and reduced reliance on .NET as a system pervasive technology (The Register isn't the most credible source, but it's just a reference to the sort of thing I've heard throughout the industry). Entirely predictable.
One change that I would like to see added to .NET - optional reference counted references, with a second heap specifically for classic, fragmeted allocation. Reference counting, the oft maligned technology behind COM (mostly because people didn't know how to use it properly), is a completely reliable and extremely predictable and useful technology in a completely managed environment for most uses (there are exceptions where reference counting breaks, but you don't have to throw out the baby to clean the bathwater). Python, for instance, works largely based upon reference counting.
Some might note that Visual C++ 2005 has added "stack" reference types. This really is a bit of syntactical sugar - basically it's just a variant of the Dispose pattern that, when compiled, adds an automatic call to dispose when the object leaves the scope. Not the same thing at all.
Several people have written to ask why I don't allow comments or trackbacks. The answer is not censorship, but rather it's due to the fact that I desired a totally static "blog", so the use of external comments and trackback servers is necessitated. Unfortunately the Radio Userland default ones are unbelievably slow, adding significantly to the time taken to load and render the page. I'm considering options, and if anyone has any suggestions please email me (the link is on the sidebar). Alternately, given that it's an open API, I might just write my own. Sounds like fun.