For the stand-alone application developer targeting the Windows platform -- that increasingly endangered species -- when the need arose for a database backend, the default choice was historically the Microsoft Jet database engine, usually to interact with an mdb (e.g. Microsoft Access) database file (though Jet supported other, less-used options as well).
This goes all the way back to the terribly awkward DAO dll libraries callable from C, though interaction greatly improved through the years, to the ease and simplicity of today's ADO.NET.
When deployment time came, you could distribute with the freely redistributable Jet runtime (the only significant limitation being that you couldn't create a direct MS Access competitor, with table designers and the like), and while the client had no need for a Microsoft Access installation or license, if they did have it they could interact with the database directly where the need arose, or for extended functionality (e.g. reporting via Access). These mdb databases could also be accessed through other tools such as SQL Server linked servers, etc.
I write that in past-tense, as Microsoft has been beating Jet to death as of late.
The first punch was the complete lack of a 64-bit migration path -- existing or planned -- meaning that in a 64-bit instance of SQL Server you can't add linked Access databases, nor can you interact with Jet-supported databases from the 64-bit runtime of SSIS (though thankfully in that case you have the fallback of using the 32-bit runtime). And while Access 2007 runs atop a heavily modified version of Jet called ACE, the new library itself isn't redistributable being intended only for use from Access.
So what is the replacement? The primary replacement was the former MSDE (Microsoft Database Engine), called SQL Server Express in the latest iteration. Microsoft really wanted to push developers to the SQL Server platform from the smallest need to the largest need.
With a liberal, free redeployment, a very easy upgrade path to a "real" instance of SQL Server (and the easy integration and interoperation with other instances of SQL Server, taking part in functions such as replication), this is a compelling choice but for the fact that it is a resource-intensive overkill for many simple needs, with a multi-hundred megabyte install, a separate service which itself is far from slim (it is an actual instance of the same SQL Server database product that might be hosting the corporate HR databases, with only a couple of minor hardcoded limits differentiating it), and then the potential administration headaches in the future (the Slammer worm primarily infected MSDE installations that many users weren't even aware they had running. SQL Server's default configuration is far more intelligent now, such that by default it only listens on localhost, and it supports robust attaching/detaching of databases, but there still is an excessive surface-area for attack if only a basic database was desired).
If all you want is some basic table structures with simple indexing, the bulk of the database management system meant for large-scale corporate data warehouses just isn't reasonable.
Microsoft has another option now, albeit growing from a product path that has been around for a while, called SQL Server CE aka Compact Edition (confusingly you'll find it called Mobile Edition on many of the supporting documents, as that was a prior product name). Supporting a subset of T-SQL, and basic tables (no views, stored procedures, or triggers), it's an in-process, very lightweight option if you want a simple backend database in your application, and want it to easily interoperate with some other Microsoft technologies.
It's primarily unmanaged, but provides excellent .NET interoperability.
It isn't SQL Server, though. The code doesn't come from SQL Server. T-SQL is limited to a subset, the database format is completely different, and basic functionality like full-text search is missing. From an integration perspective, while it is manageable from SQL Server Management studio, the only real consistency is in the object explorer and the query analyzer, while index configuration and schema design occurs in completely new modules.
In its implementation some questionable decisions were made, such as the lack of non-unicode text types (there are a significant number of very legitimate uses for ASCII text. Going UCS-2 makes it more likely that surrogate keys get used where natural keys could have been the better choice at one-half the size, and is just a waste of space and performance if it isn't actually necessary).
So why am I talking about SQL Server CE at all?
The compelling feature is that it runs on the gamut of Windows targets, including mobile editions (e.g. smart cellphones and PDAs), including tools to sync between the devices. If you wanted to make a Getting Things Done task-tracking style app atop a reasonably robust, feature-rich-enough database, and you wanted it to be usable from a desktop and a PDA whether connected or not, it presents a very interesting option.
I'm going to play around with it a bit in the coming weeks when
I need a distraction from more critical work, and this is just the
intro piece to further analysis of this product. I may look at the
unfortunately named VistaDB
and other embedded, in-process options (I'd love to consider the
embedded PostgreSQL -- itself a fabulous RDBMS -- however
resources for it are few and far between).
Some time back I posted the following on a discussion board I frequented at the time, wishing for the described feature to bless the SQL Server database engine:
Automatic relational lookup fields. e.g. Someone has a table
where they store, for instance, a form name. They store this
nvarchar(64) (say averaging 30 characters, averaging 60 bytes a
record) over and over again, redundantly. In a table with millions
of records there are only a dozen variations of the form name, and
the end result is tremendous bloat of the data, indexes, and of
course lookups are that much slower as well because of the I/O
requirements. To some database designers this is a reasonable
design because it is using a natural key (rather than an
artificially generated autonumber), but it is extremely inefficient
with space.
I would love the ability to toggle a boolean switch on the column
and SQL Server will automatically setup a hidden lookup table;
which it will automatically maintain based upon values inserted.
Concievably it could also scale the relational value (e.g. tinyint,
smallint, int, bigint) based upon the number of values in the
lookup. Of course I normally do this myself, but when you walk into
a large system that exhibits this sort of issue pervasively, it's
difficult to fix - you can hide the table behind a view, and use
some INSTEAD OF triggers to do the auto-mungification, however that
is a leaky abstraction, as Joel would say (e.g. INSERT FROM fails,
Enterprise Manager doesn't use it properly, and so on). It's such a
brainless, rudimentary task, it is one of those simple but
effective features that would be worthwhile and would improve SQL
Server.
Back to your regularly scheduled program.
Most of the replies completely missed the point, going off on tangents about how the database should be normalized better (not always an option when you have largescale, widely deployed enterprise systems. Not to mention that using large natural keys can be completely normalized, but still benefitting from this sort of improvement). Others suggested that the database engine should be as dumb as possible, doing nothing beyond storing tuples in the simplest and most obvious manner.
This came back to mind seeing a whitepaper about IBM's just released DB2v9 ("Viper"), offering optional lempel-ziv row-level compression, yielding many of the benefits that I mentioned above. This sort of compression, similar to the "compression" described above, would do wonders for a Sugar CRM database, for instance. Of course other implementations have something similar, for instance MySQL's enum field type.
Why is it that "90% done" (and
its partner in crime - the ubiquitous "almost done!") is
the progress report for virtually any project, over virtually all
of its life-cycle?
Why has 90% become the fictional number of choice? Why not the more conservative 80%, or the bolder 95%? Given that it usually has little correlation with reality, they're just as real.
Projects should be reported as 87% done. Even when there's the ominous "we'll solve that problem when we get to it" task maliciously eyeballing you from later in the project plan, or the "it doesn't work and we have no clue why?" runtme reality, still say 87% with confidence and pride.
Joel Spolsky, the well-known blogger and ISV owner, kicked up quite a storm recently with his piece entitled Language Wars [for those following the `debate', yes, I'm late to the party on this. I make it a general standard to avoid responding to blogs on here -- the whole blog thing is entirely too recursive -- but some recent reactions to his piece pushed me to post].
The article leads off with some pragmatic wisdom, advising enterprise-y, low-risk type shops to use well-known and well-proven technology stacks -- solid advice that's hard to argue with -- yet he then ends the piece with a comment about an in-house, next-generation, super-duper language being used to develop FogCreek's premiere product, FogBugz.
The discord was so great that most readers presumed that the Wasabi thing was a joke, or alternately that the rest of the article was the joke (which would have been an awesome revelation). Much confusion ensued, to the point that Joel had to put up a post clarifying that he was actually serious about the Wasabi thing
Aside from the seeming hypocrisy, what really instantiated some JoelCritic<T> instances (via the BlogCriticFactory) were Joel's comments about Ruby, where he seemingly indicated that it wasn't ready for prime time.
...but for Serious Business Stuff you really must recognize that there just isn't a lot of experience in the world building big mission critical web systems in Ruby on Rails, and I'm really not sure that you won't hit scaling problems, or problems interfacing with some old legacy thingamabob, or problems finding programmers who can understand the code, or whatnot...
...I for one am scared of Ruby because (1) it displays a stunning antipathy towards Unicode and (2) it's known to be slow, so if you become The Next MySpace, you'll be buying 5 times as many boxes as the .NET guy down the hall.
I'm sure Joel anticipated the backlash. Perhaps it was even the motivation behind the posting: The resulting torrent of discussion brought quite a few visitors to his blog, and earned him a lot of inbound links, both of which have definitely helped with his new business ventures. No publicity is bad publicity, they say, especially if it's timed to coincide with the launch of a new job board (as an aside, Ruby, Wikipedia, OSX, Python, Lisp, and ERLang are all terrible! People with the letters J or P in their names are jerks!).
Ruby is still new enough, and with a small enough community, that many of its users double as evangelists -- think of the Amiga computer, the BeOS operating system, or any other contextually-superior alternative embraced by a small enough group that many feel an ego-intersection with the technology, motivated to defend and advocate it when the opportunity arises. Linux once had such an attack-dog core of rabid enthusiasts, though as the user base has grown, and it has become more pedestrian, you really have to target a Linux-niche (such as a little used distro) if you're aiming to stir up a hornet's nest.
That entire lead-up was just some context for the actual topic of this entry: So-called premature optimization.
A common response to Joel's complaint that Ruby is slow or resource inefficient is the frequently incanted declaration that such complaints are nothing but "premature optimization!"
I've seen the same deflection shield used to defend abhorrent database designs, convoluted, overly-abstracted class designs or message patterns, and virtually anything else where a realist might proactively ponder "but won't performance be a problem doing it like this?", only to yield the response "You know, premature optimization is a classic beginners mistake!"
If you don't want to be lumped in with beginners, the lesson goes, it's best to pretend that performance simply doesn't matter. We'll cross that bridge when we get to it.
Premature optimization is the root of all evil (or at least most of it) in programming.
Donald Knuth
I remember the early days: I once spent about 16 work hours optimizing a date munging function, increasing its performance from something like 2 million iterations per second to 4 million iterations. In the grand scheme of things, the performance difference was completely negligible, but from the perspective of artificial benchmarks it seemed like tremendous progress was being made.
That was premature optimization.
Indeed, anyone who's done time in the software development industry can identify with what Mr. Knuth was saying, probably having been involved with (or responsible for) project plans gone awry when efforts focused on highly-complex caching infrastructures, or ultra-optimizing some seldom used edge function.
Yet what is arguable, and situation specific, is deciding what qualifies as premature, versus what is simply proactive, predictive, professional performance prognostications.
NOT ALL PERFORMANCE CONSIDERATIONS ARE PREMATURE OPTIMIZATION!
While there is no doubt that there is such a thing as premature optimization -- it is an evil distraction that sidetracks many projects -- there are critical decisions made early in a project that can cripple the performance potential (both resource efficiency, and resource maximum), making later optimizations enormously expensive, if not impossible without an entire rewrite.
Whether it's heavily normalizing the database (or its nefarious doppelgänger, the classic database-within-the-database: "This single table can handle anything! Just put a comma separated array of serialized objects in each of the 256 varbinary(max) columns! Look at the flexibility! Query it? Don't you bother me with your premature optimizations!"), creating an application design that's incongruent with caching, or choosing an inefficient platform.
There are credible performance considerations that need to be addressed at the outset, and revisited as development proceeds. It is absolute insanity, and entirely irresponsible professionally, to simply stick one's head in the sand and hope that some magical virtual machine improvements or subcolumn indexing decomposition and querying technology will occur before deployment, or before the economics of scaling come into play.
And speaking of scaling, the canard that the horizontal-scalabilty intrinsic with most web apps (unless you really screwed up the design -- as many people do -- and made horizontal scalability impossible) makes the problem a nonissue is absurd: Perhaps if your project has a high transaction value then you have the luxury of adding more servers to serve a small number of clients, yet for most real-world projects adding resources is a big, big deal. And it isn't simply the cost of a low-end Dell 1850: Whether you're colocating or hosting in an expensively rigged corporate server room, the cost of each server is substantial.
You end up in the dilemma that you're financially (or physically) limited to a set quantity of resources, having to limit or scale-back the functionality provided to each user due to the inefficiencies caused by early decisions. "Sorry we can't implement that cool AJAX type-ahead lookups because the callbacks would kill our servers - we're already saturating them with our stack of inefficiency, so there's no overhead left."
I think the lackadaisical attitude towards efficiency is a result of experience derived from countless unvisited or seldom used web apps deployed across millions of PCs, colocated with equally as spartanly used peers. When a site sees a dozen visitors in a day, it's easy to declare that performance is a seeming nonissue nowadays - that it's only a concern for game programmers and nuclear modelling engineers. Then one day the page gets mentioned on Digg or Reddit or Slashdot or BobOnHardware and in that potential moment of glory the app falls over and dies, again and again.
None of this really has anything to do with Ruby. Personally I haven't used it beyond the tutorials, though I do know that it does very, very poorly on the standardized benchmarks. However it is distressing seeing so many people dismiss Joel's comments (or comments about Python, or ERlang, or XML, or any other technology) as premature optimization.
Way back in junior high I had a good friend who was a huge fan of military aircraft.
His bedroom walls were covered with huge, hard to procure and often expensive posters of these deadly devices. His desk featured an actual (albeit non-functional) 20mm shell, of the variety used in the depleted-uranium spewing gatling gun.
His favourite military fighter jet happened to be the F-15 Eagle.
Feeling a little left out, I started pouring over his resources, carefully reading his encyclopedia's of fighter aircraft, absorbing all of their attributes. I decided that my favourite fighter jet was the F-14 Tomcat: Clearly its ability to land on carriers, its swing-wing engineering, and the long range phoenix missiles it supported, made it the superior aircraft.
There was no way the F-15 Eagle compared, I argued. The F-14 Tomcat was obviously the choice of those in the know. The enlightened ones, if you will.
Yet the reality -- and I think my friend Brian always knew it -- is that I chose the F-14 primarily because it wasn't the F-15. After picking a natural alternative, I started building layers and layers of justifications for my decision.
I see the same sort of thing fairly typically in software development: Big up front design versus agile designs; Getters/Setters versus fields; namespace naming guidelines of type A or type B; variable naming standards; stored procedure naming standards (or the religious "stored procedure versus dynamic SQL" argument that rages on in teams across the lands); the sorts of types to use for primary keys; the languages and platforms to choose; whether or not to use XML, and what to use it for.
So many times, it seems, people choose their positions based not on actual analysis and honest beliefs, but rather because they're countering someone else in their team -- especially when attempting to undermine authority, actual or perceived -- or they battling someone else in their organization (that dastardly team in Sector G that's trying to get kudos by setting the development guidelines!), or they're deriding someone in the industry.
Often They're just trying to be different and difficult, and the beauty of software development is that there are many, many right ways to do it, and it's easy to find allies in discussion groups to assure one that everyone else is idiots, and their new position is the One True Way.
It's easy to appeal to authority, given that there's some big name or organization that, in some form, promotes just about every software development practice and standard imaginable (Microsoft is a particularly good example of this, as throughout the organization they follow so many standards and practices, that one can easily find an example conforming with their dogma, using it as an example that it's the "Microsoft way", ignoring the many exceptions).
Of course all of this doesn't preclude disagreement on standards and processes and techniques -- people often truly disagree because they legitimately and rationally believe something different. In a full of intelligent, self-directed professionals, such disparate beliefs and conclusions can be enormously beneficial. The problem is when interpersonal issues materialize as technical disagreements.
I've been playing with Team Foundation Server, Whidbey (Visual Studio 2005), and Yukon (SQL Server 2005) since early in the beta cycles. All three of them are remarkable products, with enormous advances over their predecessors (in the case of TFS, I'm spuriously considering Visual SourceSafe the predecessor, although TFS is a elephant compared to the mouse of VSS), and all of them should be critical components for anyone developing in the Microsoft camp.
All three of them also happen to be a little unpolished, with odd little quirks and errata, hilariously incomplete documentation, and a tendency towards resource hoggishness.
One thing I've found remarkable, however, given that the three of them have been in final form for anywhere from two months to over half a year, is how little real information and first-hand accounts are available online. I'm continually hitting roadblocks where there are marginal functions or incomplete documentation, and it's surprizing to find zero references to the same problems or questions on any of the normal forums (e.g. Google Groups, online searches, etc). Among the development community, outside of the desperate-to-get-anointed-free-support-MVP crowd, they just don't have the aura of excitement they probably deserve.
Given that there are literally millions of developers and technology hobbyists out there, it's usually the case that any problems one faces are well trodden, and a quick search on the newsgroup usually yields exactly the answer one needs, so this dearth of time-travel support really is disconcerting.
The only conclusion I can draw is that there simply aren't that many developers seriously using these technologies. Visual Studio 2005 is of course seeing some use, but there are still huge armies of developers sticking with 2003 (given the break between .NET 1.1 and 2.0). A lot of SQL shops are still taking a wait-and-see approach with 2005. Team Foundation Server, primarily because of the cost of the Team editions, and the cost of a TFS Server license if you grow past a 5-user team, seems to be fairly rare.
Data security has been on my mind lately, mostly after learning that approximately 700,000 laptops are stolen in the US per year. Add the armies of desktops stolen, the backup tapes lost, and the system compromises that occur, and the situation starts to look pretty grim for data security.
How secure is your data?
If someone stole your desktop, or snatched your laptop from under you at a coffee shop, what confidential information could they gain?
While most thieves aren't of the capacity or motivation to crack the syskey or circumvent NTFS permissions (which is as easy as booting up with a knoppix disc. File ACLs only matter if the expected host operating system is in charge), your response should be to assume that they do, and that they are now reading all of your documents, looking at all of your shortcuts and form entry values, browsing your Outlook notes of account numbers and passwords, and are playing with your tax returns.
The real-world cost of such a compromise can be extraordinary. Losing an expensive piece of equipment can be annoying, but it pales compared to the wholesale loss of data privacy.
Do you use EFS (more information here)? Do you have a Data Recovery key with the private key stored offline in a protected location? Do you know what syskey does? Are you aware of the upcoming Secure Startup (which basically is whole volume encryption)?
Are you comfortable enough with your procedures that the physical loss of a computer to theft would be nothing more than a financial expense and setup hassle, with marginal or no data exposure?