Dennis Forbes on Pragmatic Software Development   Subscribe to RSS


About the Author
Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development, Linux development, and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 13 years.


Recent Entries


The Feed Bag

 
Monday, September 05 2005

Several times over the past couple of years, in my role as a database consultant, I've come across very, very large databases, where a large percentage of the data is redundant. For instance consider the following two abbreviated tables

Forms
FormsID (PK) nvarchar(255)

Hits
HitsID (PK) int identity(1,1)
FormsID (FK) nvarchar(255)
Time datetime

Imagine that there are only a dozen forms values, each of them averaging about 30 characters in length (so 60 bytes or so, giving that it's unicode). If you have a million records in hits, that's 60MB just for the form value itself. If you have one hundred million records, and a dozen large FKs like this, well you get the picture. It vastly increases the amount of I/O to do searches in the Hits table, and even if Forms is indexed it's still much slower than it could be if Forms had an integer primary key.

While I personally wouldn't layout tables this way, it is an entirely credible and justifiable design - the designer simply decided to use a natural key rather than a surrogate key. Simplicity of design, and clarity of relationships when looking at the data, outweighed I/O concerns for this person/group. Such a design is not a question of normalization.

When you have a million+ records it suddenly becomes a concern, though. There are ways to refactor this design, including "normalizing" the original table a bit and hiding it behind a view, and then adding INSTEAD OF triggers on the view, however that is a leaky abstraction. SQL Server does not completely mimic a real table, and operations like INSERT FROM fail, not to mention oddities with @@IDENTITY and SCOPE_IDENTITY().

Given all of this, I would love if SQL Server had a behind the scenes method of collapsing redundant large field values into a hidden behind the scenes lookup table, similar to what Visual Studio does with string pooling. e.g. In this case it could replace FormID with a internal value to lookup against a tiny relational table. Obviously this should be manually configured, but it would be a relatively easy change that could tremendously improve a lot of existing database designs where a redesign isn't a priority, but I/O costs are onerous.

Apparently mySQL has something similar by way of enums, however it is a fixed set (what I'd like is that new values inserted into the table are automatically added into the behind the scenes set), and again there is some leakiness with the abstraction.

  SQL 

Reader Comments

Add Comment

Name *:

Email Address:

(your email address is not displayed)
Website:

Comment *:


Dennis Forbes