Tuesday, May 25 2010

A reader wrote me regarding a performance issue they were having with PostgreSQL, and I thought the case study would make an interesting follow-up note on the whole SQL/NoSQL debate.

The scenario was that they needed to look up batches of geo-locations by postal code, passing in sets of 100 postal codes and retrieving the corresponding set(s) of latitudes and longitudes.

It is a real-world scenario, whether for a mail processing system, census analysis, sales tracking, or many other common data processing needs.

You could simulate such a scenario with data like this. I did just that with the Canada Postal Code data.

In the reader's situation they were finding that batches of 100 postal code lookups took over a second.

That’s far too slow for any system that needs to perform a large number of rapid lookups.

It is not the sort of task that I would normally do in a database. Hard to believe, perhaps, after the prior entries, but this is a process that I consider highly specialized – a unique snowflake, if you will.

The data is extremely static, and the usage is very specialized. Performance, rather than generalization, is paramount.

To validate the performance assumption I built a simple .NET test app that populated a Dictionary<string, List<Location>> with all 765,344 Canadian postal codes (there are, in some cases, multiple entries for a single postal code, so each dictionary element contains a list that contains 1..n results), and then looked up random sets of 4000 postal codes (hint: Create randomly sorted recordsets in SQL Server by ordering the results by NewId()).

It could look up results at a rate of some 3,000,000+ lookups per second, with no parallelization, running on a single mid-range core. Adding parallelization (extremely easy in .NET 4.0 using Parallel.ForEach) was of limited benefit, as the reduce stage and thread safety efforts ate up any savings for all but the most unrealistically large test sets.

That was an ultra simple solution with very few lines of code, specialized for the purpose. It did consume some 140MB of memory, but memory is bountiful and cheap.
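As a rough illustration of that in-memory approach, here is a minimal sketch in Python rather than the .NET of the original test app. The names (Location, build_index, lookup_batch) and the sample rows are made up for the example; only the shape, a hash map from postal code to a list of 1..n locations, comes from the description above.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    latitude: float
    longitude: float


def build_index(rows):
    """Group (postal_code, latitude, longitude) rows by postal code."""
    index = defaultdict(list)
    for postal_code, latitude, longitude in rows:
        index[postal_code].append(Location(latitude, longitude))
    return dict(index)


def lookup_batch(index, postal_codes):
    """Return {postal_code: [Location, ...]} for every code in the batch.

    Codes that are not in the index map to an empty list.
    """
    return {code: index.get(code, []) for code in postal_codes}


# Made-up sample rows; the real data set had 765,344 postal codes.
SAMPLE_ROWS = [
    ("K1A 0B1", 45.4215, -75.6972),
    ("M5V 2T6", 43.6426, -79.3871),
    ("M5V 2T6", 43.6430, -79.3860),  # a second entry for the same code
]

postal_index = build_index(SAMPLE_ROWS)
batch_results = lookup_batch(postal_index, ["M5V 2T6", "K1A 0B1", "X0X 0X0"])
for code, locations in batch_results.items():
    print(code, locations)
```

Each lookup is a single hash probe, which is why this kind of structure is so hard for a general-purpose engine to compete with.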

Doing the same lookups in SQL Server, with optimized fill factors and a perfectly covering index, and even after priming the cache (by doing a full select of the covering index), yielded a return rate of approximately 5,000 lookups per second on the same hardware, per core. The generalized execution engine simply isn't optimized for such a trivial lookup usage, and imposes significant overhead that isn't beneficial in this case.
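For comparison, the database side of such a test has roughly the following shape. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in, not SQL Server, so it says nothing about the numbers above, and the table, index, and function names are assumptions made for the example.

```python
import sqlite3

# Made-up sample rows standing in for the postal code data.
SQL_SAMPLE_ROWS = [
    ("K1A 0B1", 45.4215, -75.6972),
    ("M5V 2T6", 43.6426, -79.3871),
    ("M5V 2T6", 43.6430, -79.3860),
]


def create_database(rows):
    """Load the rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postal_codes ("
        "postal_code TEXT NOT NULL, latitude REAL NOT NULL, longitude REAL NOT NULL)"
    )
    # The index holds every column the query reads, so it covers the query.
    conn.execute(
        "CREATE INDEX ix_postal_codes_covering "
        "ON postal_codes (postal_code, latitude, longitude)"
    )
    conn.executemany("INSERT INTO postal_codes VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup_batch_sql(conn, postal_codes):
    """Fetch all locations for a batch of postal codes in one query."""
    codes = list(postal_codes)
    placeholders = ", ".join("?" for _ in codes)
    query = (
        "SELECT postal_code, latitude, longitude FROM postal_codes "
        f"WHERE postal_code IN ({placeholders})"
    )
    results = {code: [] for code in codes}
    for postal_code, latitude, longitude in conn.execute(query, codes):
        results[postal_code].append((latitude, longitude))
    return results


sql_conn = create_database(SQL_SAMPLE_ROWS)
sql_results = lookup_batch_sql(sql_conn, ["M5V 2T6", "K1A 0B1", "X0X 0X0"])
print(sql_results)
```

Even in this tiny form the query has to be parsed, planned, and executed, and the rows marshalled back to the caller, which is the kind of per-request overhead a plain hash lookup never pays.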

3,000,000 versus 5,000 (per core) = an incredible reason to use a specialized solution, especially given that it’s the long solved problem of KV pairs.

The reader, after some correspondence, mentioned that they had settled on Redis, which is a solution that is midway between the custom in-application hashing solution and a generalized SQL solution (leaning much further to the former than the latter). The performance with such a solution will almost certainly be incredibly high, albeit bound by the overhead of IPC. Redis is a highly optimized solution for that task, and is quickly proving itself to be a viable part of most solutions.

It is the right solution for that problem. In no way is it a “new world order of social media and intraconnected graphs realigning the stars to herald the new way of using data”, but instead is a very appropriate use of the right tool. Redis, like Memcache, has a lot of metrics on its side, much unlike many of the other NoSQL solutions.

Using the right tool is what we should all strive to do.

 SQL  NoSQL 
   
Tuesday, May 11 2010

Some recent NPD data showing the Android platform overtaking the iPhone in the US has set the web on fire. Apple apologists like John Gruber are working overtime to try to spin it.

Remember back when Daring Fireball was actually an interesting site? Now it’s like I imagine Pravda was during the Cold War.

In any case, I was a little shocked to see the gains come this quickly. Android was obviously making a dent, but I expected another generation of products to flow through the market – like the Incredible and the EVO 4G, and eventually the long-anticipated onslaught of Dell vapourware – before Android really took hold with mainstream consumers.

And of course iPhones are still selling. They’re selling like hotcakes, for that matter: Apple reported a more than doubling of sales during the period in question (kind of undermining the notion that it’s some sort of calm before the 4G storm).

At this point smartphones have a limited market saturation, which is why it’s a critical juncture and why there’s a bit of a race to perform a land grab. It remains anyone’s game.

In a year we might be looking at the emergent dominance of Windows Mobile 7 phones. Or maybe the next iPhone model will storm the market. Or WebOS will be reborn into a real contender. Or maybe Blackberry – the strangely unmentioned leader of the pack – will really wow the market with their next cycle.

It is critically important that there is competition in the space. Competition is great for everyone, including for iPhone users (see the upcoming multitasking that is a virtual clone of the implementation in Android). The market can’t be dominated by one vendor, especially one with walled gardens.

Speaking of walled gardens, as much as HTML5 got talked up during the Great Jobs-Adobe War of 2010, in reality HTML5 performance on the iPhone and iPad is abysmal and is close to unusable for rich content.

The real alternative to Flash on the iDevices has been native, single-platform, completely-proprietary apps. Which is arguably a perfectly fine approach (take full advantage of the platform and all), but it was incredibly frustrating seeing HTML5 used like such a cheap sacrificial prop during the debacle when it really had perilously little to do with the debate. It was so grossly dishonest.

Of course I’m speaking outside of videos. For videos Flash is almost always a completely unnecessary wrapper around an h.264 stream (which is what many Flash videos come encoded in these days). Separating the video from the container is a no-brainer.

Obviously HTML5 needs to keep improving. With Android 2.2 around the corner there are wide expectations that they’ll introduce a JIT compiler for the Dalvik engine (“native” apps largely run in a VM runtime and are currently seriously hobbled by the lack of optimizations in the same), and it would be fantastic to see something like V8 for the browser as well. While the platform has far-and-away leading canvas and dynamic graphics support, it falls behind in scripting and that would be a nice gap to eliminate. It would be nice to see a decent dynamic HTML5 capability built into iPhone OS 4, as currently it is seriously deficient.

   
Sunday, April 11 2010

Why the web should be destroyed, by Steve Jobs:

“We’ve been there before, and intermediate layers between the platform and the developer ultimately produces sub-standard apps and hinders the progress of the platform.”

I tried to ignore the whole section 3.3.1 debacle, as I’ve already commented on Apple’s moves and motives too many times, and really this change is just a new wrinkle of existing restrictions: developers for the iPhone are already subject to the capriciousness and fickle whim of Apple, with no recourse.

If Apple said that you have to wear a purple bow-tie while developing for the iPhone it would arguably be a change for the better. At least then you wouldn’t have uncertainty on the whole bow-tie color issue. From that perspective, Section 3.3.1 should be a welcomed clarifier.

While claims that it’s for the greater good of quality are discountable as a ludicrous smoke-screen, if you were gullible enough to believe that, and you accept the asinine notion that development technology dictates app quality, is Apple promising to filter app submissions by quality?

Given what is already in their app store they have a lot of pruning to do.

They could carpet-bomb out the crapulence, with acceptable collateral damage, by banning any apps made by small development shops. This would be great for everyone, right?

Kudos to them for removing consumer choice: If someone liked an app created via one of the targeted tools, clearly it’s because they don’t know what’s good for them. Personally I choose based upon reviews and user ratings, but it’s a win if these sorts of personal decisions are made for me.

Like I said, I’m not going to get drawn into this whole section 3.3.1 debate.

So let’s get back to Steve Jobs’ statement above.

“We’ve been there before, and intermediate layers between the platform and the developer ultimately produces sub-standard apps and hinders the progress of the platform.”

Where do web standards fit into this equation, given that web standards are almost perfectly defined as intermediate layers between the platform and the developer?

Stop. Pause. Seriously think about this. Imagine that Steve Ballmer was saying this about SVG (just use GDI and DirectX or at worst VML) and JavaScript and the canvas element, while discussing banning non-ActiveX controls from the Windows ecosystem.

Where does Steve Jobs think the web fits? Is it just a convenient stop gap?

We know that Apple’s existence through the lean years, and then its resurgence, was made possible because of the cross-platform web, and paradoxically because of the cross-platform Flash technology. iMacs sold to the general population only because they knew they could still use the open web, consuming and interacting with apps that could certainly have been built richer and better if they targeted just the Windows platform.

They could use the popular sites, read the news, do their banking, pay bills, and send eCards, despite being on a platform used by only the very few.

Yet now Jobs has made it plenty clear that the web is for trivial, simple stuff. For richer apps you need to target the iPhone alone, using a process that in no way can allow you to target other platforms as well.

Flash is a no go because it enriches the web. There’s a lot of hate out there from people who know Flash only as a simple video player (where, it’s worth noting, it punted Apple's QuickTime) and as a platform for irritating ads. As a parent of young kids, however, I see Flash as the enabler of an incredible array of rich and entertaining educational tools for kids (PBS Kids, TVO Kids, CBC Kids, Disney Kids, among countless others), and Apple has done nothing in the mobile web space to make up for the gap…because at that point you’re supposed to bridge over to the iPhone market and embrace it with fervor and loyalty. Sorry, but no thanks.

   
Sunday, April 04 2010

The iPad represents a great opportunity for HTML5, with many large web properties already shifting gears to ensure that they take advantage of the platform. The iPad as a web consumer is all about modern, open standards that empower and enrich the user experience. It has one of the best mobile web experiences going.

This is expected, really, as Apple's very existence hinged upon an open standards web. There was a dangerous period in the late 90s when the web almost got Windowsified — look to South Korea as an example of this happening — and if that came to fruition Apple would have been dead in the water. Thankfully a few who saw past short term interests rallied around keeping the ecosystem open for innovative companies like Apple to thrive.

yafla is finally going to be born into a remarkable web application that exploits the rich functionality of the iPad and iPhone's web platform, along with Android, the Blackberry webkit browser, and virtually any other modern HTML5 consumer, from big to small. I'm putting my actions where my mouth has been (this includes architecturally on the back-end, with decisions that I will document and explain along the way).

The arguably proprietary Flash platform has rough times ahead. While the actual numbers of iPhone and iPad users combined represent a small percentage of the web consumer ecosystem, it occupies a disproportionately large area of the mindspace.

Boo for the iPad! It brings us back to the obsolete era of walled gardens

While the iPad supports HTML5, that isn't the primary focus of content providers: The vast majority of them are rolling out solutions that target the walled garden of the iPad. Video content, books, magazines, or newspapers, you're entering the land of made-for-the-iPlatform solutions that have nothing to do with the modern web.

And of course it isn't just big media. Many web sites (Engadget, Digg, and so on) are rolling out apps for the platform as quickly as they can. In most cases those apps do nothing that can't be done as well or better as simple web apps, but such is the return-to-mistakes-of-the-past era that we're in. In the cases mentioned they also built Android apps (others, including Big Banks, are far more myopic about this), but there's still the question of why they built anything platform specific at all, beyond the obvious explanation that they're hopping on the bandwagon.

Everything old is new again.

John Gruber argues that the iPad represents a more open and innovative ecosystem than the Atari 2600 circa 1978. Hard to argue with that. But what about the 32 years in between, John?

It's a clear sign that there's something seriously wrong when you have to base your comparison on an early home game machine from three decades ago.

For the past 20 years we've had a computing market where anyone and everyone could build applications for the vast majority of devices. Since the inception of the web, those creators have had the ability to have just as much presence as the likes of EA. There is nothing new there. The only "advantage" that the iPhone cum iPad offers the little guy is that the market was so nascent and novel that a million made-on-a-weekend apps could sell thousands. That early ease is quickly disappearing, and the natural size advantage of shops like EA is coming to fruition. Small-shop, single-trick apps are going to very quickly get crowded into an unlit corner.

The Apple app model is horribly, horribly broken, though Apple has enough goodwill, and enough people deluded into thinking that it's the underdog little company, that it will be able to float along with it for a while longer while apologists continue to present their questionable defense. The iPhone and now the iPad are not simple game machines, whether from 2010 or 1978 (comparisons with the Atari 2600 or even the Xbox 360 are highly deluded), but represent a serious movement into the domain of general computing, and it is against that domain that they should be compared. Pretty remarkable how Apple has managed to retroactively turn Microsoft into the good guys.

The Apple web model is brilliant, representing a fantastic web appliance of the best kind.

Let's just hope the web survives through this, and there isn't a rush from open standards to the opposite-of-open-standards walled garden.

   
Thursday, April 01 2010

Just a couple of minor notes:

  • Tonight I switched data centers. There may be a small outage for some because I forgot to lower the TTL on the old DNS entries before the move: I guess being the, uh, "world's most pre-eminent domainologist" doesn't mean I'm always on top of such things, at least not for what now is just a wrapping around a blog.
  • I made the move because I'm finally making use of the domain, and the reason for needing a more powerful platform will become clear.
  • I often do minor edits of published pieces after the fact, without noting it as an edit. Often I'll read back through the archives and see typos, wording that I'd like to correct, etc, so I fix it for future readers. The most common edits see me removing parenthetical asides that add nothing. The software I wrote to run this does store every revision, so technically I could provide a diff'able history and have long considered doing that.
  • Speaking of blog software, one solved problem that strangely causes havoc again and again is dynamic sites that fall over whenever they get any attention. Over the past while I've had days with torrents of incoming visitors, and the CPU needle barely ever spikes to the point of being measurable. It is just inconceivable that a simple blog dies when it gets on Reddit or the like. Seriously - caching, figure out how to use it. If you can't get it into your software itself, stick your site behind an nginx instance and use its wonderful functionality.
  • Every entry that I post is automatically run through tidy. I really consider it important that mark-up claiming to subscribe to a given standard actually honors that standard.
  • I've withdrawn from the whole NoSQL debate because it got incredibly boring. My closing note is simply to say that the way that many are using NoSQL is like discovering the buggy whip at the beginning of the automotive era.
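On the caching point in the list above, the core idea fits in a few lines. The following is a minimal sketch of a time-based page cache sitting in front of a slow render function; it is not the software that runs this site, and PageCache, render_page, and the 60 second lifetime are all assumptions made for the example.

```python
import time


class PageCache:
    """Serve a stored copy of each page until it is older than ttl_seconds."""

    def __init__(self, render, ttl_seconds=60.0, clock=time.monotonic):
        self._render = render
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # path -> (expires_at, html)

    def get(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no rendering, no database work
        html = self._render(path)
        self._entries[path] = (now + self._ttl, html)
        return html


render_calls = []


def render_page(path):
    """Hypothetical stand-in for the expensive dynamic page build."""
    render_calls.append(path)
    return f"<html><body>Rendered {path}</body></html>"


cache = PageCache(render_page, ttl_seconds=60.0)
for _ in range(1000):
    page = cache.get("/blog/latest")

print(len(render_calls))
```

A thousand requests for the same page inside the lifetime window trigger a single render, which is the whole trick. A reverse proxy such as nginx applies the same idea in front of the application.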

Incredible times ahead.

   
Saturday, March 27 2010

Joe Stump – the former Digg lead architect with the coolest name in tech – posted a peripheral response to my recent entry about SSDs and NoSQL.

Rebuttal in tl;dr; Form

The original post was motivated by claims found on Digg’s technology blog.

  • They say that the RDBMS “mindset” favours writes over reads: BLATANTLY WRONG CLAIM.
  • They show poor index and schema use: WRONG DATABASE USAGE.
  • They show that their database product can’t join: BAD DATABASE SERVER. RED FLAG.
  • They report very poor performance without adequate detail: MEANINGLESS PROPAGANDA.
  • They use this to show that the RDBMS can’t cope: SEE ABOVE.
  • They say that if you don’t use all of an RDBMS’ feature set, you’re essentially using NoSQL: ABSURD.
  • They describe scaling out issues with databases: TRUE FOR MYSQL.
  • They described their move to NoSQL: GREAT FOR THEM. THOUGH REALLY THEIR SOLUTION WAS EXTREME DENORMALIZATION.

And on Joe’s post.

  • You need an expensive DBA with the RDBMS, not with NoSQL: SPECIOUS, FLAWED REASONING.
  • Capital expenses suck. Services are better: BUSINESSES GENERALLY LEASE THESE DAYS.
  • $7,500 “just for disks”: FOR A SaaS BUSINESS THIS IS CHEAP.
  • 50 node cluster: 50 NODES IS A COMPENSATION FOR ABHORRENT I/O RATES.
  • SSD drives are expensive: NO THEY AREN’T. YOUR ARGUMENT IS OBSOLETE.
  • Commercial database products are pricey: VIGOROUS AGREEMENT.
  • NoSQL $/read and $/write win: MAYBE, MAYBE NOT. DIGG COULD LIKELY DO MORE WITH A COUPLE OF SSDs THAN THEY CAN WITH THEIR MASSIVE DENORMALIZATION.

The Non-ADD Version

Joe has been in the Web 2.01 trenches. He built a solution that powers one of the top sites on the net.

Remember when getting "Slashdotted" was a big deal? Getting on the front-page of Digg makes a Slashdotting-at-its-peak look like a little traffic bump. There are probably a hundred PR reps busy trying to botnet their clients onto the front-page of Digg for every one punished into spamming Slashdot these days.

Far more people know Joe’s all-out-of-bubblegum name than will ever know mine, and rightly so.

A Strawman Built on Cliches and Appeals To...

Joe comes out of the gate resorting to the venerable old-versus-new tactic: "It's just those old-school DBAs upset that us kids are rewriting the rules," he says in not so many words, while nailing himself and his peers onto a cross, seeking pity for the flames they doth receive for their unconventional, rebellious ways.

This is a bit strange, really. Barely a day goes by lately without Hacker News or Reddit’s /r/programming featuring another front-pager about how the Incredible NoSQL is rewriting the rules of, well, everything. The general demeanour is one that, I think, is far more sympathetic to completely unsupported and undemonstrated pro-NoSQL claims than it is to anything that questions the hype.

Countless NoSQL blogs have appeared (though if you browse them looking for actual content you’ll instead find that most feature few facts but lots of zealous punditry. Advocacy seems to be the primary focus right now). Anyone involved with any sort of NoSQL initiative is spinning off their own start-up to capitalize on this sure-win formula, acting like it’s some sort of magic ingredient that will assure them of success.

It is very reminiscent of the XML heyday – I’m a very big fan of XML in its place, as an aside – when countless start-ups appeared with business models that could be boiled down to “something to do with XML”.

The big database vendors have remained quiet, largely because the minuscule-budget operations all clamouring for their piece of the NoSQL pie aren’t worth bothering with.

“But what about Google, Amazon, and Twitter!” you say. Joe resorted to that same appeal to authority by incanting the same magical trio (say it three times quickly and your TPS rate will quadruple!). Not really much to bother with there, beyond pointing out what a cargo cult is. Your bamboo headset won't make you successful like Google. It really won’t.

Unless you are targeting the same problem space as those companies – say like providing very low performance but highly “scalable” database solutions for countless low-value start-ups – their solution choices are utterly irrelevant.

I'm not a DBA (though knowing how indexes work now strangely qualifies one for such a title). I'm just a technically curious solutions guy that has an innate need to keep asking questions and probing deeper until the Want-To-Believe fog that often hides hype dissipates.

On Rinky-Dink Operations

In Joe’s entry he focuses a lot of attention on the costs of RDBMS solutions.

One such argument is that it’s better to use computing hardware as a service than to buy, seemingly implying that while you can buy good hardware to run an RDBMS, it is better to rent less-good virtual hardware to run your NoSQL instances.

Yet leasing is what all the cool kids are doing these days, largely for the same financial reason. Writing it all off beats dealing with depreciation BS, and it makes financial planning a lot easier.

On the leasing front, $600 a month gets you an insanely powerful, makes-an-Extra-Memory-Quadruple-Extra-Large-EC2-Instance-Look-Like-A-Pile-Of-Puke server.

You’ll probably be paying 20x that for every developer you have working on your solutions. Is this really so astronomically high?

That less-than-the-cost-of-the-office-cleaners price tag gets you a server with a bank of striped SSDs that will almost certainly demolish your impressive-in-count-but-not-in-throughput big scale out cluster, at least with a non-broken RDBMS system.

No really, it will. Of course for any sort of reliable system you’d have to pay for some DB licenses (presuming you aren’t going with PostgreSQL), and then you’ll want to double everything up into mirrors or some other reliable setup, so triple the price.

And really, is the $7,500 spent by 37signals on a disk array really even worth mentioning? I suspect that sort of number ends up almost as a rounding error on their expense sheets, and given that it's pivotal to their operation – it sits under the very foundations of their business – I doubt they spent many sleepless nights over it.

What sort of rinky-dink operations are we talking about here? Does Digg still qualify as a start-up? Don't they have a payroll and all of that, yet they're clamouring to wire up a collection of discount bin servers?

I posted the SSD entry because SSDs really do fascinate me, and I do think they change a lot of the rules of the game. It just happened to dovetail nicely with my investigation of the Digg scenario, where Digg solved their very real I/O issue by essentially pre-caching every possible query result for a targeted need.

Through extreme denormalization they traded storage to reduce I/O needs.
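To make that trade concrete, here is a small sketch of the general pattern. Nothing in it is Digg's actual schema: the users, follow lists, votes, and function names are hypothetical. The first query joins two sets at read time; the second reads an answer that was written out in advance, at the cost of storing it once per follower.

```python
from collections import defaultdict

# Hypothetical normalized data: who follows whom, and who voted on what.
follows = {"alice": {"bob", "carol"}, "dave": {"bob"}}
votes = defaultdict(set)  # item -> users who voted on it

# Reverse mapping, used only when a vote is written.
followers_of = defaultdict(set)
for user, followed in follows.items():
    for person in followed:
        followers_of[person].add(user)

# Denormalized store: (user, item) -> friends of that user who voted.
friend_votes = defaultdict(set)


def record_vote(voter, item):
    """Write the vote once, then fan it out to every follower's bucket."""
    votes[item].add(voter)
    for follower in followers_of.get(voter, ()):
        friend_votes[(follower, item)].add(voter)


def friends_who_voted_joined(user, item):
    """Normalized read: intersect two sets at query time."""
    return follows.get(user, set()) & votes.get(item, set())


def friends_who_voted_precomputed(user, item):
    """Denormalized read: a single key lookup, no join."""
    return friend_votes.get((user, item), set())


record_vote("bob", "story-1")
record_vote("carol", "story-1")
record_vote("erin", "story-1")

print(sorted(friends_who_voted_precomputed("alice", "story-1")))
```

Both reads give the same answer. The precomputed version moves the work to write time and multiplies storage by the number of followers, which is exactly the storage-for-I/O exchange described above.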

This is a very important point, because it’s far more pivotal to Digg’s solution than the NoSQL versus RDBMS debate.

Call up your old Digg coworkers, Joe, and have them set up a real database server with a couple of SSD drives and see how it compares with their impressive cluster. I’ll bet Dell would happily lend them a real server.

All of this is a bit humorous, really: The whole point of my original entry on this NoSQL topic was simply to say “what is good for Digg isn't necessarily appropriate for all database needs”, so it’s a bit unfortunate that it has come to this, with Digg’s former architect justifying their decision when they were held up as a scenario where it is likely the perfect solution.

Then, after seeing the Digg case-study, I felt obliged to respond to their RDBMS claims because I saw them as flawed, indicative that the movement should really be called NoMySQL instead of NoSQL. It still doesn’t diminish the correctness of their choice.

But really, while I originally entered into this debate believing simply that NoSQL is being oversold (it is grossly inappropriate for the vast majority of non web 2.0 projects), the more I investigate the more I’m coming to think that it is a solution for the rapidly disappearing problem of pathetic I/O rates, at least assuming that you aren’t running on several of the cloud solutions where that is your only choice.

There are many other differences that come with NoSQL (many strongly questionable, like the oft lauded “no schema” claim for some of the solutions), but the I/O restriction is by far what sold it on the high end, and the high end is what convinced the little guy that it’s the way to go.

Oracle, DB2, SQL Server, Teradata, Vertica, Greenplum, Sybase and Friends All Cost Way Too Much

I very strongly agree with Joe about one thing: the licensing costs of the big RDBMS products are way too high.

They know that 2% of their potential customer base have giant budgets, and that they can squeeze more from that 2% than they could ever get from the other 98% who then get relegated to fighting over scraps like MySQL.

Not really sure how to solve that problem, but I concede that it is a non-trivial issue. PostgreSQL is probably the best low-to-no-cost database server, but even then quite a few performance features are missing (like real-time materialized views or SQL Server style clustered indexes).

   
Saturday, March 27 2010

Seven years ago I had an article published in MSDN Magazine demonstrating how to target SVG from ASP.NET. It was a speculative submission that I sent in because I really believed in the technology and its importance to the web.

In my various drafts I had included details about Microsoft's participation in the various SVG groups, and how that boded well for its eventual adoption by Microsoft. Those got edited out before publication.

Alas, while it took far longer than it should have, it is a wonderful development to see that Internet Explorer 9 finally adds SVG to the browser.

   


About the Author
Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 15 years.





 
