Digg recently started transitioning parts of their platform to the Cassandra open-source, Facebook-originated NoSQL solution.
They're the perfect customer for NoSQL: The value per user and transaction is very low, demanding solutions that allow them to scale at minimal cost; some data loss or inconsistency can be accepted; and a lot of the data can be effectively siloed into islands.
Nonetheless, the article they posted about the move is filled with the sort of thinking that has littered the web with misinformation about the relational database.
The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes
The relational database "mindset" imposes no such burden.
Indexes, for instance, are a rudimentary tool of every competent database user. Each additional index adds an expense to every write to the table, forcing row changes to update every index in addition to the base table, in return easing certain read scenarios.
You apply as appropriate, striving for the perfect balance between read and write performance.
I posted parts I and II of a very simple "introductory to databases" article back in 2005 (never getting around to finishing part III), and I strongly encourage it for anyone who doesn't understand how indexes work, or how important concepts like covering indexes are (which I'll touch upon later in regards to the Digg scenario).
Many relational database users make heavy use of triggers and cascade activities that slow writes while lubricating reads. While many are wary of triggers in general (especially where business logic gets embedded in the data layer), this is common in the relational database world and makes an appearance in most solutions.
For Digg's particular scenario, however, the RDBMS analogy to their NoSQL approach is a basic materialized view (aka indexed view), which is a feature of most RDBMS products, from big to small.
Implementing materialized views adds a sometimes substantial cost to writes in return for supercharged reads. If I have a particular set of joins and functions that are queried often, I can materialize the view with the appropriate indexes and every change to any of the source tables automatically, as an added cost of the DML, updates the materialized view as well.
Some RDBMS systems support deferred materialized view updates where it automatically queues up the view changes without adding cost to the origination tables.
This is very old hat for virtually anyone competent with relational databases, though real-time materialized views need to be used judiciously because they fall under the auspices of ACID and can front-load write operations significantly.
Ignoring that obvious solution of materialized views (which, to be fair, aren't natively supported by MySQL despite being a basic feature of most other database products), it is revealed that they aren't using the database in the appropriate manner — or that MySQL is simply a broken platform and is turning people against the RDBMS when really they should be against MySQL — when they note that they are manually performing the joins in PHP, claiming that the join takes too long to run as a simply query.
A likely contributor to their poor performance is that while they've made the artificial key "id" their clustered index, their userid/friendid index is only a unique index, and I suspect, from the operation of their site, that they are likely making use of the denormalized friendname column in their consumer as well, forcing a full row lookup for every match.
If they retrieve columns outside of a non-clustered index (the most common mistake is doing a "SELECT *" when you don't actually care about all of the columns), on every lookup match the database server pulls the row id (in this case the primary key) and then has to do another lookup for the actual row data. In their case — given that the relationship is unique — they should have made the compound key of user_id/friend_id the clustered index and eliminated the id column altogether.
This oversight means that instead of doing a simple partial index scan by user_id and pulling the limited set, the query engine is forced to pull the list of rows, and then lookup each and every row individually. So someone with 400 friends yields 400 IO cycles, versus 1 with a proper index.
The same problem exists in the Diggs table, but is made worse. The userid index is of limited value given that again it only helps them look up the surrogate record key (again, why not a primary key on itemid/userid with a secondary index of userid/itemid? Surrogate keys are usually a mistake if there's another unique key on the table, though of course it depends upon the scenario: foreign-keys or numerous secondary indexes might make such a simple key the best choice). The query engine is forced to lookup the records by either the itemid or the userid (by the friendid) and then lookup the root record, and then compare the corresponding value.
So many developers are so blissfully ignorant of how databases work, quick to ascribe their own shortcomings to the platform. Most will wave their hands and talk about how hard to come by a "good DBA" is, which is akin to pushing brutal bubble-sort algorithms and just distributing them across a MapReduce deployment, claiming that a good "sort algorithm guy" is hard to find and "scaling out" is what the big boys do.
So they could see a major performance improvement by indexing properly (I'm allowing that maybe they just gave a bad example, though their atrocious query performance seems to validate its accuracy), but even then looking up hundreds of seemingly randomly distributed records can be a costly exercise.
Let's step back for a minute and ignore materialized views and appropriately created and used indexes and look at the core performance issue that Digg faced — looking up several hundred rows in the Friends table, and interrogating the Diggs table by userid/itemid for the same. Presume that the dataset is very large and it can't be cached in memory, which should be a normal design assumption.
Why is looking up several hundred randomly distributed records such a big deal?

That's why. Most hard drives can only manage to seek to different locations on the disk about a hundred times per second. If you're relying on Amazon's EBS you have it even worse, with an esimated 72 IOPS per second.
That's slow.
Imagine that the query engine has a hundred row locations in hand; It would take it a full second to jump over the disk to gather up the data necessary to retrieve the contents of those rows. That's a best case scenario because in the real world it usually has to walk the index b-trees, find the matching data, and in Diggs' inappropriately indexed table case do yet another lookup to find the actual row itself.
This is why database systems often completely ignore indexes if the estimated match count exceeds a relatively small percentage of the data, anemic storage systems forcing them to do expensive operations like full scans because in the end it's a cheaper choice. Why it often just reads and filters a burst of MBs of data rather than select a few sparse records from an index.
It's why it's desirable to have the data in RAM, and why database servers should be loaded with copious memory. [Sidenote: It's also why denormalizing can paradoxically slow down a database in many scenarios because it grows a database beyond RAM unnecessarily. In the Digg case note the username and friendname fields in the Friends table]
The IOPS weak-point is why most enterprise databases add SANs with ranks and ranks of hard drives, ganging them together in such a way that many seeks occur simultaneously, vastly increasing the I/O rate.
A more attainable and far more disruptive advance is moving into reality, however, and that is SSDs.
Take a look at the Anandtech review of the OCZ Vertex LE 100GB MLC SSD. In particular look at the 4KB random read - MB/s results on page 10. Near the bottom are a couple of magnetic disks, including the esteemed VelociRaptor, which are absolutely decimated by the SSDs.
That is the test that is most applicable to the Digg scenario, and it is clearly evident how big of an impact it would have on their situation.
Instead of 100 IOPS, they would be looking at 15,000 IOPS. Put 6 of these in a RAID-10 array and you'd have a yield of 45,000 IOPS and reliability. Even without learning how to properly index they could see an easy 5000x performance improvement in that class of RDBMS queries. Add a materialized view and...the speed would be so obscene it would get banned from the App Store.
Those units are just $400 a piece, and the technology keeps getting bigger and faster and cheaper. SSDs are a deeply, deeply disruptive change, especially to the large-scale database world.
The drive I mentioned is an MLC unit that isn't intended for the enterprise market, but in some ways it fits the same role as NoSQL — less reliable, but it gets the job done. The nature of the Digg table (that it is largely an additive table with likely little churn) is the perfect use-case for an MLC SSDs.
And really, 100GB is a lot of space for an operational database, even for a social media site. While it isn't appropriate for Facebook's 25TB "figure out how to sell you junk you don't want" daily activity log, it is certainly adequate for all of the Diggs and Friend relationships Digg would need, especially when removing denormalization that was put in place because of the poor IOPS of magnetic disks. And of course with the magic of RAID you can scale it up to whatever heights you'd like.
For $400.
Soon we'll have even faster, larger drives that are cheaper, and so on. The nature of flash technology is that they can keep making it more and more parallel, so the IOPS are going to keep going up and up and up.
Optimizing against slow seek times is an activity that is quickly going to be a negative return activity. Many who embrace NoSQL are seeking a solution to yesterday's problem. Digg, for instance, yields their entire NoSQL benefit from optimizing data locality — that all data for a given need is nicely bunched together, which of course is what materialized views do as well.
The people who really demand high levels of database performance usually have a lot of money. Which is why many of the products that deliver options like column-oriented storage (an implementation detail of a RDBMS that is primarily suited to very large-scale column aggregations. It isn't suitable for a OLTP DB), or MPP (Massively Parallel Processing), cost absurdly high amounts.
Greenplum, Vertica, TeraData, parAccel, Oracle RAC, Sybase ASE, DB2 MPP...these things are often priced out of all but the largest enterprise's reach.
Look at the pricing of the upcoming release of SQL Server 2008 R2, in particular the Parallel Data Warehouse product that brings MPP to that server. $58K per processor, which obviously excludes it from contention for the vast majority of applications.
Come on.
If there is one thing that I would like to see come out of the NoSQL advocacy movement, it would be that mainstream databases feel the pressure to push down the functionality that they currently limit to the people with the biggest bank accounts (which they sell using the "how much do you have?" pricing model).
I work in the financial industry. RDBMS’ and the Structured Query Language (SQL) can be found at the nucleus of most of our solutions.
The same was true when I worked in the insurance, telecommunication, and power generation industries.
So it piqued my interest when a peer recently forwarded an article titled “The end of SQL and relational databases”, adding the subject line “We’re living in the past”.
[Though as Michael Stonebraker points out, SQL the query language actually has remarkably little to actually to do with the debate. It would be more clearly called NoACID]
That series focuses on NoSQL as the challenger to the throne. It isn’t alone as the past year has yielded a bountiful crop of articles and blog entries declaring the imminent death of the decrepit relational database at the hands of this new innovation.
Most get posted with incendiary, absolute statements against the RDBMS.
The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more damage to engineers trying to write scalable software.
And they usually see later edits that blunt the original euphoria.
postnote: This isn’t about a complete death of the RDBMS. Just the death of the idea that it’s a tool meant for all your structured data storage needs.
Indeed.
Few hold the RDBMS as the only tool for all of your structured or unstructured data storage needs, though that strawman makes an appearance in many NoSQL advocacy pieces, adding some unintentional comedy (“irony”) given that the same entries usually call for the death of the RDBMS, with NoSQL declared the one true way to store and retrieve data.
Page 493 (as labelled by page) of the article “The Paradoxical Success of Aspect-Oriented Programming” includes a fantastic quote and graphic from an IEEE editorial by James Bezdek in IEEE Transactions on Fuzzy Systems.
[I quote indirectly given that the original source isn’t publicly available]
Every new technology begins with naive euphoria — its inventor(s) are usually submersed in the ideas themselves; it is their immediate colleagues that experience most of the wild enthusiasm. Most technologies are overpromised, more often than not simply to generate funds to continue the work, for funding is an integral part of scientific development; without it, only the most imaginative and revolutionary ideas make it beyond the embryonic stage. Hype is a natural handmaiden to overpromise, and most technologies build rapidly to a peak of hype. Following this, there is almost always an overreaction to ideas that are not fully developed, and this inevitably leads to a crash of sorts, followed by a period of wallowing in the depths of cynicism. Many new technologies evolve to this point, and then fade away. The ones that survive do so because someone finds a good use (= true user benefit) for the basic ideas.
In the case of the NoSQL hype, it isn’t generally the inventors over-stating its relevance — most of them are quite brilliant, pragmatic devs — but instead it is loads and loads of terrible-at-SQL developers who hope this movement invalidates their weakness.
Some sort of Fight Club ground zero wiping of the records, rewriting the rules of the game.
It doesn’t.
Nonetheless there is indisputably a lot of fantastic work happening among the NoSQL camp, with a very strong focus on scalability.
Scalability is a poorly-defined concept that, more often than not, is twisted to suit the speaker’s agenda. Scalability is often the excuse to engage in absurd hypotheticals to sell a particular blend of fanaticism.
Putting aside wordplay — or perhaps to engage in some of my own — scalability is pragmatically the measure of a solution’s ability to grow to the highest realistic level of usage in an achievable fashion, while maintaining acceptable service levels.
Imagine the scenario that you’ve built an internal help ticket tracking system for your branch office of Money Bags Corporation. If you had to describe the data needs in three points, they would be-
You decide to go against the hype and build it on a classic RDBMS system.
Will it scale to the real-world requirements?
There are some real scalability concerns with old school relational database systems. Adam Wiggins does a pretty good job of covering the techniques to scale a SQL database, though I strongly disagree with his end assertion.
You face those concerns on that glorious day the CEO calls to tell you that the board is super excited about your team’s help ticket system, built on SQL Server, and they want you to deploy it corporation wide. For data consistency purposes they want a single instance, instead of alternative deployment scenarios like pushing out an instance (“shard”) for each division.
Can you make it work?
Of course you can. Even on the maligned Windows platform.
From a vertical scaling perspective — it’s the easiest and often the most computationally effective way to scale (albeit being very inefficient from a cost perspective) — you have the capacity to deploy your solution on powerful systems with armies of powerful cores, hundreds of GBs of memory, operating against SAN arrays with ranks and ranks of SSDs.
The computational and I/O capacity possible on a single “machine” are positively enormous. The storage system, which is the biggest limiting factor on most database platforms, is ridiculously scalable, especially in the bold new world of SSDs (or flash cards like the FusionIO).
Such a platform can yield very satisfactory performance for tens or hundreds of thousands of active users in most usage and application scenarios (where generally clients talk to a farm of middleware servers).
Of course if you index poorly or create some horrendous joins you can screw it up, but with competency it will be good times for all. Even with billions upon billions of help tickets.
For the purposes of the application, the scalability requirement is completely satisfied — total scalability is achieved in the context of the application.
But it doesn’t end there.
From a horizontal scaling perspective you can partition the data across many machines, ideally configuring each machine in a failover cluster so you have complete redundancy and availability. With Oracle RAC and Sybase ASE you can even add the classic clustering approach.
Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need because you’ve built a system for a large corporation, deployed in your own datacenter, with few constraints beyond the limits of technology and the platform.
Your solution will cost hundreds of thousands of dollars (if not millions) to deploy, but that isn’t a critical blocking point for most enterprises.
This sort of scaling that is at the heart of virtually every bank, trading system, energy platform, retailing system, and so on.
To claim that SQL systems don’t scale, in defiance of such obvious and overwhelming evidence, defies all reason.
And you don't need to spend a million dollars. A mid-level Dell server can easily handle the vast majority of real-world database needs: No, your project likely isn't going to have the needs of Twitter, Flickr, or Facebook. You can grab a four CPU Dell server hosting a total of 24 cores of latest-tech computing power, with 128GB of RAM, for around $15,000. That is beefier than the systems that ran many enterprises just a few short years ago.
Imagine that you’re a start-up building your big new Social Media site
Obviously you don’t have your own datacenter, but instead you’re going with cloud servers to host your solution.
You don’t have the option (much less the finances) to buy and install a Unisys 7600R, or even a loaded Dell R905. You don’t have TBs of memory or massive I/O at your disposal.
Instead you have to go with the options available on a host like Amazon’s EC2, where the most powerful choice available is the High-Memory Quadruple Extra Large (!) option at $2.40 / hour (at writing), or about $21,024 a year, which is a fairly reasonable rate given that an equivalent purchased server would run you about ten thousand dollars up front.
This is very powerful compared to their historic maxed-out image — the puny large image that used to represent the top end — and is large compared to the max of many other cloud hosts, yet it is entry level in the RDBMS database world.
I/O on the EBS has been measured with a throughput in the 30MB/second range with about 72 IOPS per volume, which is one-half the speed that my Atom-based home NAS achieves. You can stripe multiple volumes into a software RAID array, but you quickly limit the I/O available to your instance.
For comparison we’re currently looking at an entry level $8K 36TB iSCSI device that would offer our database a dedicated 400MB/second throughput and about 1500 IOPS, and this is for a pretty humble low-criticality need with low-end magnetic drives.
As a speculative start-up you don’t want to commit $20K/year to have a single instance hanging around, especially given that your traffic is extremely variable and most of the time it will sit idle. You want to run the smallest database layer possible, ramping up if the need (fingers crossed) arises.
In an ideal world you could float along on a small instance economically until that big day when you get mentioned on Digg, at which point you spool up ten extra large instances, turning them off when the need passes.
These financial and artificial limits explain the strong interest in technologies that allows you to spin up and cycle down as needed. It’s why the old guard has largely remained quiet (because it solves a problem that they don’t have, notwithstanding any manufactured “my friend has a super-duper 512CPU Sun box and it is always overloaded!” scenarios), while a million hopeful start-ups with their small EC2 instances are loudly bleating about the limits of scalability with SQL systems.
The world of financial firms and retailers and other RDBMS users is very different than the popular social media scenario usually played out.
If you had to describe your social media data needs in three points, they would be-
MySQL originally lacked many traditionally mandatory RDBMS elements, such as transactions, without which it is extremely difficult to maintain a high level of data integrity. That didn’t dissuade many of its boosters who declared that it was an unnecessary cost for the purposes that they used it.
They were right. As MySQL has moved towards the values of traditional databases, it has moved away from its original bag-of-data values.
The truth is that you don’t need ACID for Facebook status updates or tweets or Slashdots comments. So long as your business and presentation layers can robustly deal with inconsistent data, it doesn’t really matter. It isn't ideal, obviously, and preferrably you see zero data loss, inconsistency, or service interruption, however accepting data loss or inconsistency (even just temporary) as a possibility, breaking free of by far the biggest scaling "hindrance" of the RDBMS world, can yield dramatic flexibility.
This is the case for many social media sites: data integrity is largely optional, and the expense to guarantee it is an unnecessary expenditure. When you yield pennies for ad clicks after thousands of users and hundreds of thousands of transactions, you start to look to optimize.
The same efficiency applies to highly relational schemas — if you can just serialize object graphs and that’s all you need, why bother normalizing? Many would argue that it’s a premature optimization, but if it’s all you need it might be the best choice.
Both of those decisions would be outrageously negligent in many other industries, but the rules that apply for a banking system have woefully little applicability to a social media site.
The point is one that I think all rational people already realize: The ACID RDBMS isn’t appropriate for every need, nor is the NoSQL solution.
A social media site is not an inventory system. A banking account management system is not a social news aggregator.
Picking and choosing database terminology from the Wikipedia entry on RDBMS’ doesn’t equip the speaker with an expert level of knowledge to declare the truth about the database industry.
Scalability noise based upon the limitations of a cloud vendor’s offerings needs to be put into context: They don’t apply to most of the users of relational databases.
MySQL isn’t the vanguard of the RDBMS world. Issues and concerns with it on high load sites have remarkably little relevance to other database systems.
And of course the SQL/RDBMS world is changing (sidenote: Few love SQL, but I’ve yet to see a viable replacement). Wouldn’t it be a grand world where every desktop (platforms that spend about 99% of their time completely idle) in a corporation was a part of the corporate cloud, all seamlessly acting as a part of the corporate information system in a reliable, redundant way? A simple SQL statement silently and transparently fulfilled by hundreds of distributed systems?
We’ll get there.
Aside: I'm currently building a solution (to fill this space) that significantly leans on Project Voldemort. I have somehow managed to remain rational.
This is one of those rants that strangely gets attention, with several taking it as anti-NoSQL, or even pro-RDBMS, I assume because positions so often seem to be polarized. It is neither, which is quite evident if read with an unbiased mind: Defending the real world practical scalability of the maligned RDBMS merely brings accuracy to the debate. Several have asked if I'm merely attacking a strawman: Aside from several specific links that I gave above (I am remiss to add more as I've engaged in the blog-to-blog arguments too many times before), I find it hard to believe that these people take part in any technology discussion forum or group, where NoSQL is being quite widely, and often without question, held as successor to the RDBMS...the new evolution of database systems.
The motivation of the post is that the discussion is, by nature of the venue, hijacked by people building or hoping to build very large scale web properties (all hoping to be the next Facebook), and the values and judgments of that arena are then cast across the entire database industry — which comprises a set of solutions that absolutely dwarf the edge cases of social media — which is really...extraordinary. It's a bit like moving to the bottom of the ocean and declaring that everyone should start using submarines to commute.
There have been edge conditions in the database world for as long as there has been an industry. High performance logging/data acquisition (often distributed), for instance, has always been a case where traditional RDBMS systems aren't suited, and thus should be jettisoned. The industry didn't rewrite the rules because of those fringe cases, however, for good reason.
For the stand-alone application developer targeting the Windows platform -- that increasingly endangered species -- when the need arose for a database backend, the default choice was historically the Microsoft Jet database engine, usually to interact with an mdb (e.g. Microsoft Access) database file (though Jet supported other, less-used options as well).
This goes all the way back to the terribly awkward DAO dll libraries callable from C, though interaction greatly improved through the years, to the ease and simplicity of today's ADO.NET.
When deployment time came, you could distribute with the freely redistributable Jet runtime (the only significant limitation being that you couldn't create a direct MS Access competitor, with table designers and the like), and while the client had no need for a Microsoft Access installation or license, if they did have it they could interact with the database directly where the need arose, or for extended functionality (e.g. reporting via Access). These mdb databases could also be accessed through other tools such as SQL Server linked servers, etc.
I write that in past-tense, as Microsoft has been beating Jet to death as of late.
The first punch was the complete lack of a 64-bit migration path -- existing or planned -- meaning that in a 64-bit instance of SQL Server you can't add linked Access databases, nor can you interact with Jet-supported databases from the 64-bit runtime of SSIS (though thankfully in that case you have the fallback of using the 32-bit runtime). And while Access 2007 runs atop a heavily modified version of Jet called ACE, the new library itself isn't redistributable being intended only for use from Access.
So what is the replacement? The primary replacement was the former MSDE (Microsoft Database Engine), called SQL Server Express in the latest iteration. Microsoft really wanted to push developers to the SQL Server platform from the smallest need to the largest need.
With a liberal, free redeployment, a very easy upgrade path to a "real" instance of SQL Server (and the easy integration and interoperation with other instances of SQL Server, taking part in functions such as replication), this is a compelling choice but for the fact that it is a resource-intensive overkill for many simple needs, with a multi-hundred megabyte install, a separate service which itself is far from slim (it is an actual instance of the same SQL Server database product that might be hosting the corporate HR databases, with only a couple of minor hardcoded limits differentiating it), and then the potential administration headaches in the future (the Slammer worm primarily infected MSDE installations that many users weren't even aware they had running. SQL Server's default configuration is far more intelligent now, such that by default it only listens on localhost, and it supports robust attaching/detaching of databases, but there still is an excessive surface-area for attack if only a basic database was desired).
If all you want is some basic table structures with simple indexing, the bulk of the database management system meant for large-scale corporate data warehouses just isn't reasonable.
Microsoft has another option now, albeit growing from a product path that has been around for a while, called SQL Server CE aka Compact Edition (confusingly you'll find it called Mobile Edition on many of the supporting documents, as that was a prior product name). Supporting a subset of T-SQL, and basic tables (no views, stored procedures, or triggers), it's an in-process, very lightweight option if you want a simple backend database in your application, and want it to easily interoperate with some other Microsoft technologies.
It's primarily unmanaged, but provides excellent .NET interoperability.
It isn't SQL Server, though. The code doesn't come from SQL Server. T-SQL is limited to a subset, the database format is completely different, and basic functionality like full-text search is missing. From an integration perspective, while it is manageable from SQL Server Management studio, the only real consistency is in the object explorer and the query analyzer, while index configuration and schema design occurs in completely new modules.
In its implementation some questionable decisions were made, such as the lack of non-unicode text types (there are a significant number of very legitimate uses for ASCII text. Going UCS-2 makes it more likely that surrogate keys get used where natural keys could have been the better choice at one-half the size, and is just a waste of space and performance if it isn't actually necessary).
So why am I talking about SQL Server CE at all?
The compelling feature is that it runs on the gamut of Windows targets, including mobile editions (e.g. smart cellphones and PDAs), including tools to sync between the devices. If you wanted to make a Getting Things Done task-tracking style app atop a reasonably robust, feature-rich-enough database, and you wanted it to be usable from a desktop and a PDA whether connected or not, it presents a very interesting option.
I'm going to play around with it a bit in the coming weeks when
I need a distraction from more critical work, and this is just the
intro piece to further analysis of this product. I may look at the
unfortunately named VistaDB
and other embedded, in-process options (I'd love to consider the
embedded PostgreSQL -- itself a fabulous RDBMS -- however
resources for it are few and far between).
Some time back I posted the following on a discussion board I frequented at the time, wishing for the described feature to bless the SQL Server database engine:
Automatic relational lookup fields. e.g. Someone has a table
where they store, for instance, a form name. They store this
nvarchar(64) (say averaging 30 characters, averaging 60 bytes a
record) over and over again, redundantly. In a table with millions
of records there are only a dozen variations of the form name, and
the end result is tremendous bloat of the data, indexes, and of
course lookups are that much slower as well because of the I/O
requirements. To some database designers this is a reasonable
design because it is using a natural key (rather than an
artificially generated autonumber), but it is extremely inefficient
with space.
I would love the ability to toggle a boolean switch on the column
and SQL Server will automatically setup a hidden lookup table;
which it will automatically maintain based upon values inserted.
Concievably it could also scale the relational value (e.g. tinyint,
smallint, int, bigint) based upon the number of values in the
lookup. Of course I normally do this myself, but when you walk into
a large system that exhibits this sort of issue pervasively, it's
difficult to fix - you can hide the table behind a view, and use
some INSTEAD OF triggers to do the auto-mungification, however that
is a leaky abstraction, as Joel would say (e.g. INSERT FROM fails,
Enterprise Manager doesn't use it properly, and so on). It's such a
brainless, rudimentary task, it is one of those simple but
effective features that would be worthwhile and would improve SQL
Server.
Back to your regularly scheduled program.
Most of the replies completely missed the point, going off on tangents about how the database should be normalized better (not always an option when you have largescale, widely deployed enterprise systems. Not to mention that using large natural keys can be completely normalized, but still benefitting from this sort of improvement). Others suggested that the database engine should be as dumb as possible, doing nothing beyond storing tuples in the simplest and most obvious manner.
This came back to mind seeing a whitepaper about IBM's just released DB2v9 ("Viper"), offering optional lempel-ziv row-level compression, yielding many of the benefits that I mentioned above. This sort of compression, similar to the "compression" described above, would do wonders for a Sugar CRM database, for instance. Of course other implementations have something similar, for instance MySQL's enum field type.
Why is it that "90% done" (and
its partner in crime - the ubiquitous "almost done!") is
the progress report for virtually any project, over virtually all
of its life-cycle?
Why has 90% become the fictional number of choice? Why not the more conservative 80%, or the bolder 95%? Given that it usually has little correlation with reality, they're just as real.
Projects should be reported as 87% done. Even when there's the ominous "we'll solve that problem when we get to it" task maliciously eyeballing you from later in the project plan, or the "it doesn't work and we have no clue why?" runtme reality, still say 87% with confidence and pride.
Joel Spolsky, the well-known blogger and ISV owner, kicked up quite a storm recently with his piece entitled Language Wars [for those following the `debate', yes, I'm late to the party on this. I make it a general standard to avoid responding to blogs on here -- the whole blog thing is entirely too recursive -- but some recent reactions to his piece pushed me to post].
The article leads off with some pragmatic wisdom, advising enterprise-y, low-risk type shops to use well-known and well-proven technology stacks -- solid advice that's hard to argue with -- yet he then ends the piece with a comment about an in-house, next-generation, super-duper language being used to develop FogCreek's premiere product, FogBugz.
The discord was so great that most readers presumed that the Wasabi thing was a joke, or alternately that the rest of the article was the joke (which would have been an awesome revelation). Much confusion ensued, to the point that Joel had to put up a post clarifying that he was actually serious about the Wasabi thing
Aside from the seeming hypocrisy, what really instantiated some JoelCritic<T> instances (via the BlogCriticFactory) were Joel's comments about Ruby, where he seemingly indicated that it wasn't ready for prime time.
...but for Serious Business Stuff you really must recognize that there just isn't a lot of experience in the world building big mission critical web systems in Ruby on Rails, and I'm really not sure that you won't hit scaling problems, or problems interfacing with some old legacy thingamabob, or problems finding programmers who can understand the code, or whatnot...
...I for one am scared of Ruby because (1) it displays a stunning antipathy towards Unicode and (2) it's known to be slow, so if you become The Next MySpace, you'll be buying 5 times as many boxes as the .NET guy down the hall.
I'm sure Joel anticipated the backlash. Perhaps it was even the motivation behind the posting: The resulting torrent of discussion brought quite a few visitors to his blog, and earned him a lot of inbound links, both of which have definitely helped with his new business ventures. No publicity is bad publicity, they say, especially if it's timed to coincide with the launch of a new job board (as an aside, Ruby, Wikipedia, OSX, Python, Lisp, and ERLang are all terrible! People with the letters J or P in their names are jerks!).
Ruby is still new enough, and with a small enough community, that many of its users double as evangelists -- think of the Amiga computer, the BeOS operating system, or any other contextually-superior alternative embraced by a small enough group that many feel an ego-intersection with the technology, motivated to defend and advocate it when the opportunity arises. Linux once had such an attack-dog core of rabid enthusiasts, though as the user base has grown, and it has become more pedestrian, you really have to target a Linux-niche (such as a little used distro) if you're aiming to stir up a hornet's nest.
That entire lead-up was just some context for the actual topic of this entry: So-called premature optimization.
A common response to Joel's complaint that Ruby is slow or resource inefficient is the frequently incanted declaration that such complaints are nothing but "premature optimization!"
I've seen the same deflection shield used to defend abhorrent database designs, convoluted, overly-abstracted class designs or message patterns, and virtually anything else where a realist might proactively ponder "but won't performance be a problem doing it like this?", only to yield the response "You know, premature optimization is a classic beginners mistake!"
If you don't want to be lumped in with beginners, the lesson goes, it's best to pretend that performance simply doesn't matter. We'll cross that bridge when we get to it.
Premature optimization is the root of all evil (or at least most of it) in programming.
Donald Knuth
I remember the early days: I once spent about 16 work hours optimizing a date munging function, increasing its performance from something like 2 million iterations per second to 4 million iterations. In the grand scheme of things, the performance difference was completely negligible, but from the perspective of artificial benchmarks it seemed like tremendous progress was being made.
That was premature optimization.
Indeed, anyone who's done time in the software development industry can identify with what Mr. Knuth was saying, probably having been involved with (or responsible for) project plans gone awry when efforts focused on highly-complex caching infrastructures, or ultra-optimizing some seldom used edge function.
Yet what is arguable, and situation specific, is deciding what qualifies as premature, versus what is simply proactive, predictive, professional performance prognostications.
NOT ALL PERFORMANCE CONSIDERATIONS ARE PREMATURE OPTIMIZATION!
While there is no doubt that there is such a thing as premature optimization -- it is an evil distraction that sidetracks many projects -- there are critical decisions made early in a project that can cripple the performance potential (both resource efficiency, and resource maximum), making later optimizations enormously expensive, if not impossible without an entire rewrite.
Whether it's heavily normalizing the database (or its nefarious doppelgänger, the classic database-within-the-database: "This single table can handle anything! Just put a comma separated array of serialized objects in each of the 256 varbinary(max) columns! Look at the flexibility! Query it? Don't you bother me with your premature optimizations!"), creating an application design that's incongruent with caching, or choosing an inefficient platform.
There are credible performance considerations that need to be addressed at the outset, and revisited as development proceeds. It is absolute insanity, and entirely irresponsible professionally, to simply stick one's head in the sand and hope that some magical virtual machine improvements or subcolumn indexing decomposition and querying technology will occur before deployment, or before the economics of scaling come into play.
And speaking of scaling, the canard that the horizontal-scalabilty intrinsic with most web apps (unless you really screwed up the design -- as many people do -- and made horizontal scalability impossible) makes the problem a nonissue is absurd: Perhaps if your project has a high transaction value then you have the luxury of adding more servers to serve a small number of clients, yet for most real-world projects adding resources is a big, big deal. And it isn't simply the cost of a low-end Dell 1850: Whether you're colocating or hosting in an expensively rigged corporate server room, the cost of each server is substantial.
You end up in the dilemma that you're financially (or physically) limited to a set quantity of resources, having to limit or scale-back the functionality provided to each user due to the inefficiencies caused by early decisions. "Sorry we can't implement that cool AJAX type-ahead lookups because the callbacks would kill our servers - we're already saturating them with our stack of inefficiency, so there's no overhead left."
I think the lackadaisical attitude towards efficiency is a result of experience derived from countless unvisited or seldom used web apps deployed across millions of PCs, colocated with equally as spartanly used peers. When a site sees a dozen visitors in a day, it's easy to declare that performance is a seeming nonissue nowadays - that it's only a concern for game programmers and nuclear modelling engineers. Then one day the page gets mentioned on Digg or Reddit or Slashdot or BobOnHardware and in that potential moment of glory the app falls over and dies, again and again.
None of this really has anything to do with Ruby. Personally I haven't used it beyond the tutorials, though I do know that it does very, very poorly on the standardized benchmarks. However it is distressing seeing so many people dismiss Joel's comments (or comments about Python, or ERlang, or XML, or any other technology) as premature optimization.
Way back in junior high I had a good friend who was a huge fan of military aircraft.
His bedroom walls were covered with huge, hard to procure and often expensive posters of these deadly devices. His desk featured an actual (albeit non-functional) 20mm shell, of the variety used in the depleted-uranium spewing gatling gun.
His favourite military fighter jet happened to be the F-15 Eagle.
Feeling a little left out, I started pouring over his resources, carefully reading his encyclopedia's of fighter aircraft, absorbing all of their attributes. I decided that my favourite fighter jet was the F-14 Tomcat: Clearly its ability to land on carriers, its swing-wing engineering, and the long range phoenix missiles it supported, made it the superior aircraft.
There was no way the F-15 Eagle compared, I argued. The F-14 Tomcat was obviously the choice of those in the know. The enlightened ones, if you will.
Yet the reality -- and I think my friend Brian always knew it -- is that I chose the F-14 primarily because it wasn't the F-15. After picking a natural alternative, I started building layers and layers of justifications for my decision.
I see the same sort of thing fairly typically in software development: Big up front design versus agile designs; Getters/Setters versus fields; namespace naming guidelines of type A or type B; variable naming standards; stored procedure naming standards (or the religious "stored procedure versus dynamic SQL" argument that rages on in teams across the lands); the sorts of types to use for primary keys; the languages and platforms to choose; whether or not to use XML, and what to use it for.
So many times, it seems, people choose their positions based not on actual analysis and honest beliefs, but rather because they're countering someone else in their team -- especially when attempting to undermine authority, actual or perceived -- or they battling someone else in their organization (that dastardly team in Sector G that's trying to get kudos by setting the development guidelines!), or they're deriding someone in the industry.
Often They're just trying to be different and difficult, and the beauty of software development is that there are many, many right ways to do it, and it's easy to find allies in discussion groups to assure one that everyone else is idiots, and their new position is the One True Way.
It's easy to appeal to authority, given that there's some big name or organization that, in some form, promotes just about every software development practice and standard imaginable (Microsoft is a particularly good example of this, as throughout the organization they follow so many standards and practices, that one can easily find an example conforming with their dogma, using it as an example that it's the "Microsoft way", ignoring the many exceptions).
Of course all of this doesn't preclude disagreement on standards and processes and techniques -- people often truly disagree because they legitimately and rationally believe something different. In a full of intelligent, self-directed professionals, such disparate beliefs and conclusions can be enormously beneficial. The problem is when interpersonal issues materialize as technical disagreements.