Wednesday, March 24 2010

Fighting The NoSQL Mindset, Though This Isn't an anti-NoSQL Piece

Shocked by the incredibly poor database performance described on the Digg technology blog, and baffled that they cast it as demonstrative of performance issues with RDBMSs in general, I was motivated to build a simulation of their database problem.

While they posted that entry six months ago, they recently followed up with more statements on the NoSQL / RDBMS divide, and their posts are now heavily cited in that debate.

For instance, Dare Obasanjo held up Digg's moves as a rebuttal of my prior entry on SQL scaling, which then got picked up in other blogs. My entry explicitly excluded incredibly rare edge cases like Digg's, and my core point was that the majority of database uses don't have the needs of a site like Digg, but I'm always one to take on a challenge.

Digg's case is an example of an entry-level RDBMS product used arguably suboptimally on under-powered hardware, and it is questionable whether it proves anything of substance about either database technology. Yet it's held up as demonstrative of something, in particular the failing of the RDBMS, which is why I focus on it. NoSQL and the RDBMS are different tools in the toolbox, arguably for different purposes, but that isn't the focus of this entry.

So let's take a look at Digg's scenario.

I do this to evaluate their performance claims, to confirm my previous statements about indexing improvements, and to determine the impact that SSDs have on the problem space, because I strongly believe that SSDs (and cheap memory) completely change the equation.

The focus of this entry is not to question or answer whether NoSQL is the right choice for Digg (though there are some ramifications as SSDs take over, which is, I think, an interesting aside), or whether Google or Amazon or anyone else should use it.

SQL Server 2008 Developer Edition on Windows 7, itself viewed as almost a training-wheels RDBMS by many, was the most convenient platform for me when I ran this test, so I wrote a quick script to create what I think is a pretty accurate reproduction of the database described in their blog entry:

  • 500,000 users, having…
  • 10,000,000 friend relationships (using a power law distribution)
  • ..and 500,000,000 “Diggs”, randomly distributed among 500,000 virtual “items” (which might be comments, submissions, etc) with a date range covering four years.

The database weighed in at a svelte 30GB.
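For readers who want to experiment, the shape of that dataset can be approximated at a much smaller scale. The sketch below is not the script used for this test (that was written for SQL Server); it is a Python and SQLite stand-in, and the table names, column names, and the Pareto shape parameter of 1.5 are all assumptions chosen for illustration.

```python
import random
import sqlite3

FOUR_YEARS_SECONDS = 4 * 365 * 24 * 3600


def build_sample_db(n_users=500, n_friendships=10_000, n_items=500,
                    n_diggs=50_000, seed=1):
    """Build a small in-memory database shaped like the one described above."""
    rng = random.Random(seed)
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE friends (user_id INTEGER, friend_id INTEGER,
                              PRIMARY KEY (user_id, friend_id));
        CREATE TABLE diggs (item_id INTEGER, user_id INTEGER, dug_at INTEGER,
                            PRIMARY KEY (item_id, user_id));
    """)

    users = list(range(1, n_users + 1))
    # Heavy-tailed weights: a few users own most of the friend relationships.
    weights = [rng.paretovariate(1.5) for _ in users]
    owners = rng.choices(users, weights=weights, k=n_friendships)
    pairs = ((owner, rng.randint(1, n_users)) for owner in owners)
    # Duplicate pairs are ignored and self-friendships are skipped, so the
    # final row count can be lower than n_friendships.
    db.executemany("INSERT OR IGNORE INTO friends VALUES (?, ?)",
                   ((u, f) for u, f in pairs if u != f))

    # Diggs are spread uniformly over items, users, and a four-year window.
    db.executemany(
        "INSERT OR IGNORE INTO diggs VALUES (?, ?, ?)",
        ((rng.randint(1, n_items), rng.randint(1, n_users),
          rng.randrange(FOUR_YEARS_SECONDS)) for _ in range(n_diggs)))
    db.commit()
    return db


if __name__ == "__main__":
    sample = build_sample_db()
    print(sample.execute("SELECT COUNT(*) FROM friends").fetchone()[0])
    print(sample.execute("SELECT COUNT(*) FROM diggs").fetchone()[0])
```

Scaling the arguments up by a factor of a thousand gets to the row counts listed above, though an on-disk database and batched inserts would be needed well before that point.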

I ran this on a two-year-old desktop machine with a Q6600 processor and 6GB of RAM, on a standard 7200 RPM consumer drive. You can easily find laptops with more processing and I/O power.

I opted against running it on a real server (you know, like a 24-core, 128GB, banks-of-SSDs monster that many real databases run on) simply because I knew it wasn’t necessary, and it would have run contrary to the demonstration that even a mediocre machine can beat their results.

DISCLAIMER: This is not a high-fidelity reproduction of Digg's situation, as is pointed out many times in many ways in this post. However Digg took the time to post metrics to support their claims that they are some sort of extreme case, at the edge of database limits, and I simply don't believe that is true. Digg's data quantities are relatively small and lend themselves to sharding. The second point, which again is hammered home many times, is that SSDs present a solution that changes the equation, and, I think, provides some interesting inputs to the situation.

The First Clue That Something Isn’t Right: You Can’t Do a Simple Join

The Digg blog entry detailed how they had to manually build an IN clause given their selected database product’s inability to adequately run a trivial join, with the resulting query taking 14 seconds to find the Diggs for a given user’s friends against a single selected item.

This yielded a results return rate of 0.07 per second.

If hand-building an IN clause fixes a query that a join should functionally handle just as well, there is a much larger underlying issue that needs to be dealt with. I'm not a MySQL user, but apparently it offers minimal plan investigation tools, so there are few options for fleshing out what the query engine is doing. Nonetheless, it is a warning sign of a foundational product issue.

I ran a similar query in SQL Server, albeit without the hand-coded SQL builder, looking up friend Diggs for randomly selected combinations of users and items. It returned so quickly relative to Digg’s experience, even from a cold cache, that I had to up the iteration count to 1000 to get good test durations.

SQL Server was returning a fairly constant 36 result sets per second, probing the friend table’s ten million relationships to find the selected user’s friends, and then probing the five hundred million Diggs for the pertinent records, sorting it in the manner that Digg sorted their results. The query needed to draw data from all over each of the respective table populations, ensuring that it wouldn’t benefit from localized hot-spot caching. To prove this, limiting SQL Server to only have access to 1GB of memory had a negligible impact on the performance.
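The lookup itself is nothing exotic. As a rough illustration, again in Python with SQLite rather than SQL Server, and with assumed table and column names and an assumed newest-first sort (the exact ordering Digg used isn't reproduced here), the join looks something like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE diggs (item_id INTEGER, user_id INTEGER, dug_at INTEGER);
    INSERT INTO friends VALUES (1, 2), (1, 3), (4, 2);
    INSERT INTO diggs VALUES (10, 2, 300), (10, 3, 100), (10, 5, 200), (11, 3, 400);
""")


def friend_diggs(db, user_id, item_id):
    """Return (friend_id, dug_at) for each friend of user_id who dugg item_id."""
    return db.execute("""
        SELECT d.user_id, d.dug_at
        FROM friends AS f
        JOIN diggs AS d ON d.user_id = f.friend_id
        WHERE f.user_id = ? AND d.item_id = ?
        ORDER BY d.dug_at DESC
    """, (user_id, item_id)).fetchall()


# User 1 has friends 2 and 3, and both dugg item 10; user 5 is not a friend.
print(friend_diggs(db, 1, 10))  # [(2, 300), (3, 100)]
```

The point is that the database is handed one declarative statement and left to pick the join strategy, rather than having the application fetch the friend list and paste it into an IN clause.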

CPU usage was marginal, with the limiting factor being the slovenly I/O of the lowly magnetic disk. The iterations were run sequentially, as parallel runs yielded no net benefit, the magnetic disk moving as quickly as it possibly could already.

Already we’re running at close to 500x the rate reported by Digg, without doing anything beyond using an arguably better database product, at least in that it can join properly. MySQL's many weaknesses are well, well known, so the core point is not to question Digg (though their indexes were suboptimal), but to put their database product under a cloud. They often do so themselves when posting about their move, usually openly declaring that their option set was restricted because they limited themselves to open-source products. That obviously eliminated from consideration many of the clusterable, very high performance RDBMS options, even if one of those would have been a better choice, which is completely uncertain.

Their dataset distribution may be entirely different; however, even if I doubled or quadrupled or octupled the count in every table it would only marginally impact performance.

At this point I implemented the indexing changes I described in my prior entry – removing the surrogate keys and cluster-indexing on the unique columns – and the lookup rate jumped to 71 result sets per second, or around 1000x the speed reported by Digg. If I massively increased the data quantity and return counts, the gap between their poor indexing and proper indexing would widen dramatically, with the properly indexed solution showing little slowdown even at significantly increased data counts.

If the database were cached in memory those index changes would have had a much more profound impact.

What If Localized Data Isn’t Your Primary Optimization Strategy?

I had been meaning to get an SSD for Eclipse Android development, so when my new 100GB SSD arrived I detached the database files and moved them over. It’s an MLC unit that did well in an Anandtech review, though I won’t mention specifics as they aren't pertinent: any decent SSD will perform at a similar level. Of course, for real-world production use you would want an SLC drive.

A quick reattach later and the 30GB was very amply hosted in the 100GB MLC SSD.

I fired up the benchmark and was pleasantly surprised to find it returning results at a rate of 4100 result sets per second. The write performance, while not a focus of this test, also hit extraordinarily high levels (which would conveniently lubricate the use of copious indexes).

Correcting the indexes and moving the database to a single inexpensive consumer-grade SSD, running on a dated desktop, had results coming back at a rate 60,000x what Digg reported.
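The multipliers quoted in this entry are just the measured rates divided by Digg's reported rate of one result every 14 seconds. A short worked version of that arithmetic follows; the dictionary labels are mine, and the only inputs are the figures already given above.

```python
DIGG_SECONDS_PER_QUERY = 14  # Digg's reported time for one friend-Digg lookup

# Result sets per second measured in this entry.
measured_rates = {
    "magnetic disk, original indexes": 36,
    "magnetic disk, corrected indexes": 71,
    "SSD, corrected indexes": 4100,
}


def speedup(rate_per_second):
    """How many times faster than one query per 14 seconds a given rate is."""
    return rate_per_second * DIGG_SECONDS_PER_QUERY


for label, rate in measured_rates.items():
    print(f"{label}: {speedup(rate):,.0f}x")
# magnetic disk, original indexes: 504x
# magnetic disk, corrected indexes: 994x
# SSD, corrected indexes: 57,400x
```

These line up with the rounded "close to 500x", "around 1000x" and "60,000x" figures in the text.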

None of this is intended to be a serious benchmark of SQL Server (I don’t wish to fall on the wrong side of a DeWitt clause), or even of Digg's use of MySQL. This is not a disciplined benchmark, and during parts of it I hopped into some windowed online matches of Battlefield: Bad Company 2 while tests ran, after seeing that it had a limited impact on the results. I knew that the primary weakness was simply the movement of the hard drive head, and different technology choices (NoSQL versus RDBMS, normalized versus denormalized, clustered versus heap, etc) primarily impact how often and how far that head has to move.

And of course I don’t have Digg’s data, so it is completely speculative on my part, based upon some rough descriptions given in the Digg posting. Maybe the post hugely understated their data counts, or their data entropy is vastly different.

This is a macro-benchmark: Digg’s claimed results were so poor that I went in knowing that the difference would be very large.

Their described data quantities are small in the world of large databases. Most decent relational database products don’t even start to sweat with tens or hundreds of millions, or billions, of rows.

The key, of course, is proper indexing, trading write performance for read performance targeting your specific needs.

Indexes could be viewed as ways of creating “virtual tables” that are maintained in lock-step with your base table. Decent database products like SQL Server even allow you to include unordered but included columns in your index to ensure that you have a covering index (the best kind) for all scenarios. And that’s before you even get to the magical world of materialized views.
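A covering index is easy to see in a query plan. The following is a small sketch using Python's built-in sqlite3 module, with made-up table and index names. SQLite has no equivalent of SQL Server's INCLUDE clause, so the extra column is simply appended to the index key here, which approximates the included-column idea rather than reproducing the same feature.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE diggs (id INTEGER PRIMARY KEY, item_id INTEGER,
                        user_id INTEGER, dug_at INTEGER);
    CREATE INDEX ix_diggs_user ON diggs (user_id, item_id);
""")


def plan(sql):
    """Return SQLite's query plan for sql as one line of text."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT item_id, dug_at FROM diggs WHERE user_id = 1"

# dug_at is not in the index, so every match needs a second lookup in the table.
before = plan(query)

# Replace the index with a wider one holding every column the query touches.
db.executescript("""
    DROP INDEX ix_diggs_user;
    CREATE INDEX ix_diggs_user ON diggs (user_id, item_id, dug_at);
""")
after = plan(query)

print(before)  # an index search followed by table lookups
print(after)   # the plan text now mentions a COVERING INDEX
```

The exact wording of the plan varies between SQLite versions, which is why no literal output is shown; what matters is that the second plan is answered from the index alone.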

So either MySQL is an atrociously bad product at the larger limits, which ample evidence seems to support, or the Digg staffers simply weren’t getting the most out of their systems. Either way it’s hard to take their statements about the RDBMS field seriously, and their arguments invalidate MySQL more than they invalidate the RDBMS.

The fact that Digg is a large site says nothing about their technical leadership or mastery. Their site has sped up by leaps and bounds over the past year, so I suspect that they know what they are doing, but I'm wary of any cargo-cult-like "they did {x} and they're a big site therefore it must be a good option" appeals to example.

On The Role of the DBA

What is most disturbing about this whole database debate is the number of commentators who excuse horrific database usage (not in relation to Digg's issue, but as a general conversation point whenever people make comments about proper database use in virtually any online discussion), treating rudimentary database performance design and knowledge as something that is limited to the elusive “DBA”.

This is ignorant and frightful.

You don’t know what a b-tree is? Don’t know how indexes work? Don’t know what a red-black tree is? Please get away from the compiler and save the world from your monstrosities until you have some knowledge of these basic concepts.

This is not esoteric knowledge, and instead is rudimentary comp. sci. knowledge.

DBAs are the guys who set up user accounts and monitor security, schedule backups and determine macro-optimizations like how to allocate file groups on the SAN arrays. They might probe lowest-hanging fruit performance issues and flag offenders or offer up suggestions.

Rudimentary database design and proper usage is the basic responsibility of developers, and if you don’t know it then it is your responsibility to learn it. Alternately you can just clutch onto NoSQL and bleat about how it changes all of the rules anyways, which is the route quite a few have decided to pursue (I fully expect to get the standard angry responses from those who take this like a religion).

Is NoSQL a Solution for Yesterday’s Problems?

Database servers really like having a lot of RAM. Ideally you should have more RAM than you have data, allowing it to cache the entirety of your DB (or at least the working-set quantity of DB on that partition) making incredible read performance achievable.

Joining rows is not a hard activity for database servers. It can do it at unfathomable rates if the data can be fed to it at the appropriate pace and in the right form. Even heavily normalized databases can be high performance.

What normally makes joins a performance issue is data locality: if you have to load two rows from different places on the disk, that’s two seeks instead of just one (or three, four, five or more instead of one). When seeks are as costly as they are on a magnetic disk, you avoid them (either by striving for a database that fits in memory, which paradoxically often calls for heavy normalization, or by de-normalizing).

Writes are obviously important too, yet on a site like Digg I would guess that reads outweigh writes, from a user interaction perspective, by a factor of 10000:1 or more outside of logging (which usually goes to a log-specific technology anyways).

In contrast to all of the “everyone is a publisher and the internet changes everything” bluster that is used to herald the wave of change that NoSQL brings, the reality is that it’s a very small percentage of users that post submissions and add comments, or even that do the simplest possible action of clicking an arrow.

Users overwhelmingly simply consume data, whether it’s the latest tech news, Ashton Kutcher’s tweets, or just browsing through the comments on a Slashdot article to see if they add any additional insight.

Despite Digg’s recent claim that they are “write intensive” (maybe because they’ve decided to dramatically explode the number of writes a simple action causes?), at its root their platform is primarily read focused, which is why they pursued Cassandra in the first place. Take note that their NoSQL solution for friend Digg lookups is to take every Digg and massively explode the number of writes it causes to happen (in the case of a Digg by Kevin Rose, a single write becomes 40,000+ large writes).

Hardware Is Cheap. Manpower is Expensive

If I had 48GB of RAM in the test machine (which is fairly pedestrian outside of gerbil-sized cloud instances; note that you can now add 128GB of RAM to servers for around $4000 in some cases), then outside of the initial caching period the select rates would be stratospheric regardless of storage medium, though SSDs would still hold a very, very strong lead when it came to write performance.

For the same $4000 you could chain five Intel X25-E drives for 320GB of intensely high performance – and persistent – storage. Just keep going up until you have more throughput, I/O and storage than you could dream of.

Some high-end enterprise solutions now tier storage and automatically place data as appropriate, choosing between magnetic, SSD, and memory caching systems. The pages of the table that are never touched end up on the magnetic storage while the hot area – say Diggs within the last 6 months – is moved to SSDs or to huge banks of memory caching.

There are bountiful options to achieve incredible performance, even on a budget, balancing memory and high performance storage systems.

Throwing Storage at the Problem

I didn’t waste the disk space, but as mentioned before I could do a simple join between the tables, materialize the view, and the performance would be very high even on magnetic disk, although it would add a serious cost to writes: When a user with a large number of people who befriended them dug something, their record creation would branch out into the write of potentially thousands of records.

That was the route that Digg took: They are pre-computing the sets of data that a user might possibly want, even apparently for reams and reams of long abandoned accounts.

They do this because looking up data that can’t be cached in memory is an expensive operation. Yet as has been shown, SSDs, which are getting faster and cheaper regularly, completely flip the I/O equation.

SSDs change everything.

Turning a small amount of data into a massive amount of data to improve performance paradoxically makes SSDs much less attainable (because the cost per GB is so much higher), and humorously may thwart the end goal. It also reduces the ability to memory-cache the relevant data.

By pursuing this solution, Digg has limited their ability to choose other solutions that are clearly hitting the mainstream.

Coming Next – PostgreSQL versus Cassandra

There is a complete absence of objective measures of the performance of Cassandra. In place of real performance comparisons and load metrics is a lot of hand-waving and comparison against completely broken database products running horrendously malignant queries. (Never, ever hold MySQL up as the vanguard of the RDBMS world. It is comical to do that.)

Not anymore. I’m on the case.

My goal is not to belittle the product (which I think is elegant, beautiful and concise, and serves a very important role), but simply to bring some rationality to the argument, as it is currently missing.

[EDIT: The following statement has been proven to be a wrong interpretation, but I leave it here out of humbled shame] Digg claims that Cassandra brings them “linear scalability”, yet every one of their Cassandra nodes is a 100% replica of the others, meaning that a write (or 40,000 writes) on one is communicated and then replicated on every single other instance.

Response to Criticisms - 2010-03-25

This entry got picked up on a couple of excellent tech-oriented sites: Hacker News and Reddit r/programming. Included in the comments of a lot of very smart people are a couple of common criticisms that I thought worthy of specific response.

"Your benchmark stinks. How about you..."

My benchmark, if you can even call it that, was focused on O(n) complexity and the difficulty of joins among very large tables with a half-decent database product, with the core take-away being "it's a solved problem. With proper indexing and a decent database system most datasets are 'small'."

On the topic of concurrency, I mentioned that in the entry, noting that executing many parallel runs of the test yielded the same net output on the magnetic disk, while it actually significantly improved performance on the SSD and then leveled off. Database servers are fairly smart about concurrency and task queuing.

The top result of 4100 resultsets per second, which was achieved using many simultaneous runs, still wasn't fully exploiting the I/O capabilities of the SSD, owing to the tuned-for-magnetic-disks nature of the database server that I didn't bother resolving.

However, the Digg case study lacked significant details beyond a couple of spurious size details there to indicate, I believe, that "we think our data is large and the RDBMS can't service our needs". What I based my run on was quantity-of-data (which is not large in the land of databases) and key phrases like "from a cold-cache" (which can be reasonably interpreted as "on a test instance"). There is a lack of details in the Digg benchmark, given that I don't think they were intending it to be an industry-standard metric, so it isn't reasonable to expect so much more regimented discipline from mine. However, let me say that I did take the meager stats that were given and, where possible, erred to the high end: where "hundreds of million" appeared, I went with 500 million (if I had gone with a billion it would have barely impacted the results, but I was impatient and didn't want to wait for the data setup script to run that long). Where they said "millions" I went with 10 million. Some of the responses are demonstrative of how fact-free the debate has become, so it's not particularly surprising that NoSQL blogs group-hug around it.

This is not a replication of Digg's runtime environment, and any suggestion that it pretended to be one intentionally misinterprets it. If it were a serious apples-to-apples comparison I would have run it on a serious server with serious load simulations, where the results would have been orders of magnitude beyond what I achieved on a dated desktop.

In fact, 36 or even 71 results per second is still far too slow for Digg's use (especially given that they are stuck with a web technology that forces synchronous database calls), and I'm not purporting that to be a viable option for them as they add a lot of data-intensive personalization options. It's simply to contrast against their abhorrent performance numbers, which I think are grossly misleading.

"But Google and Facebook and..."

Sure. That has nothing to do with this.

"So you're an RDBMS guy who hopes SSD prevents change..."

I'm not an "RDBMS guy". In a prior outing I was declared a DBA because I didn't just roll over for the NoSQL propaganda, and now I'm cast as a guy who holds himself as a database expert. Actually neither is my primary competency, and I think that's the point: I don't purport to be Joe Celko, or even a remote approximation, yet even I can see some massive issues in Digg's case study.

I'm just a solutions guy that looks at technologies and tries to digg (har har) through the cruft and get to the truth, which can be tough amidst tech religions: Warp back to 2001 and try to have a rational discussion about XML. In the case of Cassandra (and many NoSQL solutions) there is stunning ease with which many make absolute statements about RDBMS, such as the many "relational databases can't handle large amounts of data, just look at Digg" claims that litter the web, while cheering on vagueness and unsubstantiated hype about NoSQL solutions.

Show me real performance numbers for NoSQL solutions: They are disturbingly rare. Instead the argument is dominated by noise comments and hand-waving about how grand NoSQL is because it just simply solves everything and makes everything great.

Digg's NoSQL performance advantage is achieved by localizing all of the data necessary for a given request — in this case "tell me all of my friends who Dugg this item/parent item" which they had precomputed and cast out — ignoring the problem of MySQL not competently doing joins (apparently it has troubles sorting as well).

That is overwhelmingly a storage seek issue, and Digg's solution was to turn many seek actions into one or two by massively exploding their core dataset so the data for every need is repeated and persisted for every possible use. I can say right now that there is no question that if I performed the same benchmark on Cassandra, drawing randomly distributed user-item buckets from the same magnetic disk, my performance would max out at the number of seeks per second of the disk, which in the case of a normal desktop drive is somewhere in the range of 100-200 seeks per second.
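That ceiling is simple division. The sketch below is a deliberately crude model, assuming every lookup costs a fixed number of random seeks and ignoring caching, request queuing and sequential reads; the function name is mine.

```python
def max_lookups_per_second(seeks_per_second, seeks_per_lookup):
    """Upper bound on random lookups per second for a seek-bound disk."""
    return seeks_per_second / seeks_per_lookup


for seeks_per_second in (100, 200):    # desktop drive range quoted above
    for seeks_per_lookup in (1, 2):    # "one or two" seeks per request
        ceiling = max_lookups_per_second(seeks_per_second, seeks_per_lookup)
        print(f"{seeks_per_second} seeks/s, {seeks_per_lookup} seek(s) per lookup: "
              f"at most {ceiling:.0f} lookups/s")
```

Under those assumptions a single magnetic disk tops out somewhere between 50 and 200 random lookups per second, whatever software sits on top of it.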

Of course NoSQL yields the same massive seek gain of SSDs, but that's where you encounter the competing optimizations: By massively exploding data to optimize seek patterns, SSD solutions become that much more expensive. Digg mentioned that they turned their friend data, which I would estimate to be about 30GB of data (or a single X25-E 64GB with room to spare per "shard") with the denormalizing they did, into 1.5TB, which in the same case blows up to 24 X25-Es per shard.
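The drive counts follow directly from the capacities. As a quick check, treating 1.5TB as 1500GB and ignoring any RAID, replication or free-space overhead (all simplifying assumptions):

```python
import math

DRIVE_GB = 64  # capacity of one Intel X25-E, as used above


def drives_needed(data_gb, drive_gb=DRIVE_GB):
    """Smallest number of drives whose combined capacity holds data_gb."""
    return math.ceil(data_gb / drive_gb)


print(drives_needed(30))    # 1  (normalized friend data)
print(drives_needed(1500))  # 24 (the same data after denormalizing)
```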

This is interesting, is it not? Maybe it rains a little on the NoSQL parade, but to me it's a pretty fascinating development.

"Why do you have to insult the Digg crew?"

I don't intend to insult them, but at the same time I don't fall in the camp that gives them credibility simply because they're behind a large site. Many of the largest sites on the internet made technology mistake after mistake, yet succeeded regardless because they have a good product: These are some serious examples where ideas beat out execution. PHP somehow formed the base of a good number of the internet's largest sites, yet are there many that will seriously argue for its technical superiority?

"These are two different tools. I'm sick of this argument! Let's get back to the NoSQL Propaganda Parade."

In many cases they are used to solve the same problem. In the specific entry I refer to for this post the whole point was "we were using an RDBMS, and now we're using a NoSQL and it's so much better", so is it really rational to claim that they're two completely different worlds?

"NoSQL solves different problems like scaling out, data centers, etc."

Orthogonal. Cassandra solves problems that you can't as easily solve with MySQL. MySQL != the RDBMS industry.

   
Sunday, March 14 2010

Cassandra has gotten a lot of hype lately, having been recently chosen as the nucleus of the Digg upgrade simultaneous with Reddit taking baby-steps to the platform. Digg is promoting their revised technology stack as enabling a "wicked fast" experience that is much more individualized, while Reddit is thus far only really using it as a drop-in key/value replacement for MemcacheDB.

And of course Cassandra is well known for its use by Facebook and Twitter.

Naturally, given the white-hot hype, most want to see what the big deal is. Emerging web technologies often require that you have a Linux box available (either a physical box or a virtual instance in a product like VirtualBox). However, with just a couple of minor config changes and deviations from the docs, you can do a trial run and kick the tires on Windows as a first-class host, even if any possible production use would almost certainly see it deployed to Linux.

Cassandra is layered over Java, and of course a benefit of that platform is that it is inherently cross platform.

  1. Download and install the latest Java JDK. Ensure, post install, that the JAVA_HOME environment variable is pointed at the root of your JDK install (which on a 64-bit machine might be C:\Program Files (x86)\Java\jdk1.6.0_18).
  2. Download Apache Ant. Uncompress to the folder C:\Apache\Ant (giving you files like C:\Apache\Ant\bin\ant.bat).
  3. Download Cassandra. Given that you're probably going to be playing around for a while, go with the 0.6.0-beta2 copy, downloading the bin version. Uncompress the package into the location C:\Apache\Cassandra.
  4. Open a command prompt and navigate to the cassandra directory (e.g. after running C:, do a cd C:\Apache\Cassandra). Run the command C:\Apache\Ant\bin\ant ivy-retrieve. This will download Cassandra dependencies.
  5. Edit C:\Apache\Cassandra\conf\storage-conf.xml, updating lines 188 - 193 to replace each instance of /var/lib/cassandra/ references with C:\Apache\Cassandra\Files\.
  6. Copy the files from C:\apache\cassandra\build\lib\jars (which are the files that ant downloaded) to C:\apache\cassandra\lib. It isn't the most elegant solution but it's the most concise in point form.
  7. In a command prompt, after running C: and cd \Apache\Cassandra, run the command bin\cassandra. Cassandra should start up successfully (and if applicable the Windows firewall will ask if it should make an exception). If it doesn't start successfully you likely didn't follow one of the prior points correctly.
  8. In another console window navigate to C:\Apache\Cassandra and run the command bin\cassandra-cli. At the prompt run the command connect localhost/9160 and you should connect. You can now try out some of the simple set and get commands you can find in the README.txt.
  9. Start reading up on the Thrift API, the basics of data storage, what a "super-column" is, and so on.
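If the connect in step 8 fails, it helps to know whether anything is listening on the port at all before digging through configuration. The helper below is a generic TCP check and not part of Cassandra or its tooling; the function name is hypothetical, and the default port is the 9160 used in step 8.

```python
import socket


def is_listening(host="localhost", port=9160, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print("Something is listening on port 9160:", is_listening())
```

A True result only shows that the port is open, not that the process behind it is a healthy Cassandra node.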

I've been playing around with various NoSQL* solutions for some time, however given the incredible hype — which is strangely coupled with a complete lack of any objective measure — I've decided to put it to the test. In the next couple of days a high-performance SSD will arrive and I'll gather some metrics for objective purposes, because the message being sold doesn't technically pass the B.S. test.

* - A better name than "NoSQL" is desperately needed. Backronyms and revisionist history — seriously, guys, "Not Only SQL?" — don't solve the problem that the name is incendiary, inaccurate, and a little ridiculous. KVDBMS works for some of the products, but isn't quite applicable to richer solutions like Cassandra.

Cassandra is a very, very cool product, and I immediately see lots of very interesting uses for it, but what is most striking is what is missing from the product. It is so intensely bare-bones at the moment, which is exactly how MySQL made inroads: When it first became the first-love of many of the same people and sites that now herald NoSQL (the same people who almost without fail rallied behind PHP which...well...enough said), it was almost comically deficient as a database product, but as it grew those features it grew away from its core contingent.

Exciting times regardless. There are many niches in the technology space, in which the appropriate solutions should be applied, so it is always worth keeping an open mind.

   
Tuesday, March 09 2010

Digg And Cassandra, sitting in a B-Tree

Digg recently started transitioning parts of their platform to the Cassandra open-source, Facebook-originated NoSQL solution.

They're the perfect customer for NoSQL: The value per user and transaction is very low, demanding solutions that allow them to scale at minimal cost; some data loss or inconsistency can be accepted; and a lot of the data can be effectively siloed into islands.

Nonetheless, the article they posted about the move is filled with the sort of thinking that has littered the web with misinformation about the relational database.

"The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes"

The relational database "mindset" imposes no such burden.

It's All About Finding Balance

Indexes, for instance, are a rudimentary tool of every competent database user. Each additional index adds an expense to every write to the table, forcing row changes to update every index in addition to the base table, in return easing certain read scenarios.

You apply as appropriate, striving for the perfect balance between read and write performance.

I posted parts I and II of a very simple "introduction to databases" article back in 2005 (never getting around to finishing part III), and I strongly recommend it to anyone who doesn't understand how indexes work, or how important concepts like covering indexes are (which I'll touch upon later in regards to the Digg scenario).

Many relational database users make heavy use of triggers and cascade activities that slow writes while lubricating reads. While many are wary of triggers in general (especially where business logic gets embedded in the data layer), this is common in the relational database world and makes an appearance in most solutions.

For Digg's particular scenario, however, the RDBMS analogy to their NoSQL approach is a basic materialized view (aka indexed view), which is a feature of most RDBMS products, from big to small.

Implementing materialized views adds a sometimes substantial cost to writes in return for supercharged reads. If I have a particular set of joins and functions that are queried often, I can materialize the view with the appropriate indexes and every change to any of the source tables automatically, as an added cost of the DML, updates the materialized view as well.
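To make the write-cost trade concrete, here is a rough sketch in Python with SQLite. SQLite has no materialized or indexed views, so a trigger-maintained table stands in for one; the schema and all names are assumptions, and a real materialized view would also handle deletes and changes to the friends table, which this does not.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER,
                          PRIMARY KEY (user_id, friend_id));
    CREATE TABLE diggs (item_id INTEGER, user_id INTEGER,
                        PRIMARY KEY (item_id, user_id));

    -- Stand-in for the materialized view: one row per (viewer, item, friend).
    CREATE TABLE friend_diggs (viewer_id INTEGER, item_id INTEGER,
                               friend_id INTEGER,
                               PRIMARY KEY (viewer_id, item_id, friend_id));

    -- Each new digg fans out to everyone who lists the digger as a friend.
    CREATE TRIGGER diggs_fan_out AFTER INSERT ON diggs
    BEGIN
        INSERT INTO friend_diggs (viewer_id, item_id, friend_id)
        SELECT f.user_id, NEW.item_id, NEW.user_id
        FROM friends AS f
        WHERE f.friend_id = NEW.user_id;
    END;

    INSERT INTO friends VALUES (1, 9), (2, 9), (3, 9);
    INSERT INTO diggs VALUES (100, 9);
""")

# One digg by user 9 produced three precomputed rows, one per befriender.
print(db.execute("SELECT COUNT(*) FROM friend_diggs").fetchone()[0])  # 3

# Reads become a single keyed lookup with no join.
print(db.execute(
    "SELECT friend_id FROM friend_diggs WHERE viewer_id = 1 AND item_id = 100"
).fetchall())  # [(9,)]
```

The single insert into diggs paid for three extra writes so that the later read needs no join, which is the same exchange a real materialized view makes on your behalf.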

Some RDBMS systems support deferred materialized view updates where it automatically queues up the view changes without adding cost to the origination tables.

This is very old hat for virtually anyone competent with relational databases, though real-time materialized views need to be used judiciously because they fall under the auspices of ACID and can front-load write operations significantly.

Digg Don't Do Indexing (properly)

Ignoring the obvious solution of materialized views (which, to be fair, aren't natively supported by MySQL despite being a basic feature of most other database products), they reveal that they aren't using the database in the appropriate manner when they note that they are manually performing the joins in PHP, claiming that the join takes too long to run as a simple query. Either that, or MySQL is simply a broken platform that is turning people against the RDBMS when really they should be against MySQL.

A likely contributor to their poor performance is that while they've made the artificial key "id" their clustered index, their userid/friendid index is only a unique index, and I suspect, from the operation of their site, that they are likely making use of the denormalized friendname column in their consumer as well, forcing a full row lookup for every match.

If they retrieve columns outside of a non-clustered index (the most common mistake is doing a "SELECT *" when you don't actually care about all of the columns), on every lookup match the database server pulls the row id (in this case the primary key) and then has to do another lookup for the actual row data. In their case — given that the relationship is unique — they should have made the compound key of user_id/friend_id the clustered index and eliminated the id column altogether.

This oversight means that instead of doing a simple partial index scan by user_id and pulling the limited set, the query engine is forced to pull the list of rows and then look up each and every row individually. So someone with 400 friends yields 400 IO cycles, versus 1 with a proper index.
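The difference is visible in a query planner's output. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL; the table and index names are hypothetical, and SQLite's WITHOUT ROWID tables play the role of a clustered compound key here. The exact plan wording varies between SQLite versions, so treat the strings as indicative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layout as described: surrogate key plus a unique (user_id, friend_id) index.
CREATE TABLE friends_surrogate (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    friend_id   INTEGER NOT NULL,
    friend_name TEXT NOT NULL
);
CREATE UNIQUE INDEX ix_user_friend ON friends_surrogate (user_id, friend_id);

-- Suggested layout: the compound key is the table's own ordering.
CREATE TABLE friends_clustered (
    user_id     INTEGER NOT NULL,
    friend_id   INTEGER NOT NULL,
    friend_name TEXT NOT NULL,
    PRIMARY KEY (user_id, friend_id)
) WITHOUT ROWID;
""")

def plan(sql):
    """Return the planner's description of how it would execute sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# friend_name is not in the index, so every match needs a second lookup.
needs_row_lookup = plan(
    "SELECT friend_name FROM friends_surrogate WHERE user_id = 1")
# friend_id is in the index, so the index alone answers the query.
index_only = plan(
    "SELECT friend_id FROM friends_surrogate WHERE user_id = 1")
# With the compound key as the table itself there is nothing else to fetch.
clustered = plan(
    "SELECT friend_name FROM friends_clustered WHERE user_id = 1")

for label, text in [("surrogate + name", needs_row_lookup),
                    ("surrogate, index only", index_only),
                    ("clustered compound key", clustered)]:
    print(label, "->", text)
```

Only the first plan leaves the index to fetch each row, which is the per-match cost described above.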

The same problem exists in the Diggs table, but is made worse. The userid index is of limited value given that, again, it only helps them look up the surrogate record key. (Again, why not a primary key on itemid/userid with a secondary index of userid/itemid? Surrogate keys are usually a mistake if there's another unique key on the table, though of course it depends upon the scenario: foreign keys or numerous secondary indexes might make such a simple key the best choice.) The query engine is forced to look up the records by either the itemid or the userid (by the friendid), then look up the root record, and then compare the corresponding value.
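A minimal sketch of that alternative layout, again using Python's sqlite3 module as a stand-in engine. The schema, index name, sample rows, and helper functions are assumptions for illustration, not Digg's real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friends (
    user_id   INTEGER NOT NULL,
    friend_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, friend_id)
) WITHOUT ROWID;

-- No surrogate id: the natural unique key is the primary key...
CREATE TABLE diggs (
    item_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    PRIMARY KEY (item_id, user_id)
) WITHOUT ROWID;
-- ...and a secondary index serves lookups in the other direction.
CREATE INDEX ix_diggs_user_item ON diggs (user_id, item_id);
""")

# Made-up sample data: user 1 has friends 2, 3 and 4.
conn.executemany("INSERT INTO friends VALUES (?, ?)", [(1, 2), (1, 3), (1, 4)])
conn.executemany("INSERT INTO diggs VALUES (?, ?)",
                 [(10, 2), (10, 4), (10, 9), (11, 3)])

def friends_who_dugg(user_id, item_id):
    """Join on the two compound keys; both sides are keyed lookups."""
    rows = conn.execute(
        "SELECT f.friend_id FROM friends AS f "
        "JOIN diggs AS d ON d.user_id = f.friend_id "
        "WHERE f.user_id = ? AND d.item_id = ? "
        "ORDER BY f.friend_id",
        (user_id, item_id),
    )
    return [row[0] for row in rows]

def items_dugg_by(user_id):
    """Served by the secondary (user_id, item_id) index."""
    rows = conn.execute(
        "SELECT item_id FROM diggs WHERE user_id = ? ORDER BY item_id",
        (user_id,),
    )
    return [row[0] for row in rows]

print(friends_who_dugg(1, 10), items_dugg_by(2))
```

Every column the join needs lives in one of the two keys, so there is no root record left to chase.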

So many developers are blissfully ignorant of how databases work and quick to ascribe their own shortcomings to the platform. Most will wave their hands and talk about how hard it is to find a "good DBA", which is akin to pushing brutal bubble-sort algorithms and just distributing them across a MapReduce deployment, claiming that a good "sort algorithm guy" is hard to find and "scaling out" is what the big boys do.

So they could see a major performance improvement by indexing properly (I'm allowing that maybe they just gave a bad example, though their atrocious query performance seems to validate its accuracy), but even then looking up hundreds of seemingly randomly distributed records can be a costly exercise.

Change Is In The Air

Let's step back for a minute and ignore materialized views and appropriately created and used indexes and look at the core performance issue that Digg faced — looking up several hundred rows in the Friends table, and interrogating the Diggs table by userid/itemid for the same. Presume that the dataset is very large and it can't be cached in memory, which should be a normal design assumption.

Why is looking up several hundred randomly distributed records such a big deal?

Hard Drive

That's why. Most hard drives can only manage to seek to different locations on the disk about a hundred times per second. If you're relying on Amazon's EBS you have it even worse, with an estimated 72 IOPS.

That's slow.

Imagine that the query engine has a hundred row locations in hand; it would take it a full second to jump over the disk to gather up the data necessary to retrieve the contents of those rows. That's a best-case scenario because in the real world it usually has to walk the index b-trees, find the matching data, and, in the case of Digg's inappropriately indexed table, do yet another lookup to find the actual row itself.
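The arithmetic is simple enough to sketch. The function below is a hypothetical helper that assumes one random read per row and no caching; the 100 and 72 IOPS figures are the ones quoted above.

```python
def random_read_seconds(lookups, iops):
    """Seconds to service `lookups` random reads on a device sustaining `iops`."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return lookups / iops

# A hundred row locations on a typical magnetic disk.
hundred_rows_hdd = random_read_seconds(100, 100)
# 400 friends, one row lookup each, on the same disk.
four_hundred_rows_hdd = random_read_seconds(400, 100)
# The same 400 lookups at the estimated EBS rate.
four_hundred_rows_ebs = random_read_seconds(400, 72)

print(hundred_rows_hdd, four_hundred_rows_hdd, round(four_hundred_rows_ebs, 2))
```

Any extra b-tree pages that have to come from disk add further reads on top of these figures.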

This is why database systems often completely ignore indexes if the estimated match count exceeds a relatively small percentage of the data: anemic storage systems force them to do expensive operations like full scans because in the end it's the cheaper choice. It's why the engine often just reads and filters a burst of MBs of data rather than selecting a few sparse records from an index.
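A toy cost model shows where that crossover sits. This is a deliberately crude sketch, not how any real optimizer works: it assumes one random read per matching row, a purely sequential full scan, and a hypothetical 1,000 MB table read at a hypothetical 100 MB/s.

```python
def cheaper_access_path(matching_rows, table_mb, iops, sequential_mb_per_s):
    """Pick the cheaper of index lookups and a full scan under a toy model."""
    index_seconds = matching_rows / iops             # one seek per match
    scan_seconds = table_mb / sequential_mb_per_s    # read everything once
    if index_seconds < scan_seconds:
        return ("index", index_seconds)
    return ("scan", scan_seconds)

# Magnetic disk at roughly 100 IOPS: the index only wins for small matches.
few_matches = cheaper_access_path(500, 1000, 100, 100)
many_matches = cheaper_access_path(5000, 1000, 100, 100)
print(few_matches, many_matches)
```

Under these assumptions the break-even is at 1,000 matching rows, a tiny fraction of a table that size, and raising the IOPS figure is what moves it.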

It's why it's desirable to have the data in RAM, and why database servers should be loaded with copious memory. [Sidenote: It's also why denormalizing can paradoxically slow down a database in many scenarios because it grows a database beyond RAM unnecessarily. In the Digg case note the username and friendname fields in the Friends table]

The IOPS weak-point is why most enterprise databases add SANs with ranks and ranks of hard drives, ganging them together in such a way that many seeks occur simultaneously, vastly increasing the I/O rate.

A more attainable and far more disruptive advance is moving into reality, however, and that is SSDs.

Take a look at the Anandtech review of the OCZ Vertex LE 100GB MLC SSD. In particular look at the 4KB random read - MB/s results on page 10. Near the bottom are a couple of magnetic disks, including the esteemed VelociRaptor, which are absolutely decimated by the SSDs.

That is the test that is most applicable to the Digg scenario, and it is clearly evident how big of an impact it would have on their situation.

Instead of 100 IOPS, they would be looking at 15,000 IOPS. Put 6 of these in a RAID-10 array and you'd have a yield of 45,000 IOPS and reliability. Even without learning how to properly index they could see an easy 450x performance improvement (45,000 IOPS against 100) in that class of RDBMS queries. Add a materialized view and...the speed would be so obscene it would get banned from the App Store.
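For the array figure, here is a small sketch of the RAID-10 arithmetic. The function and its flag are hypothetical; whether a given controller spreads reads across both halves of each mirror is an assumption, and the 45,000 figure corresponds to the conservative case where it does not.

```python
def raid10_random_read_iops(drives, per_drive_iops, read_from_both_mirrors=False):
    """Estimate random-read IOPS for a RAID-10 set of identical drives."""
    if drives < 4 or drives % 2:
        raise ValueError("RAID-10 needs an even number of drives, at least four")
    mirrored_pairs = drives // 2  # data is striped across the pairs
    readers = drives if read_from_both_mirrors else mirrored_pairs
    return readers * per_drive_iops

conservative = raid10_random_read_iops(6, 15_000)
balanced = raid10_random_read_iops(6, 15_000, read_from_both_mirrors=True)
print(conservative, balanced)
```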

Those units are just $400 apiece, and the technology keeps getting bigger and faster and cheaper. SSDs are a deeply, deeply disruptive change, especially to the large-scale database world.

The drive I mentioned is an MLC unit that isn't intended for the enterprise market, but in some ways it fits the same role as NoSQL: less reliable, but it gets the job done. The nature of the Digg table (that it is largely an additive table with likely little churn) is the perfect use case for an MLC SSD.

And really, 100GB is a lot of space for an operational database, even for a social media site. While it isn't appropriate for Facebook's 25TB "figure out how to sell you junk you don't want" daily activity log, it is certainly adequate for all of the Diggs and Friend relationships Digg would need, especially when removing denormalization that was put in place because of the poor IOPS of magnetic disks. And of course with the magic of RAID you can scale it up to whatever heights you'd like.

For $400.

Soon we'll have even faster, larger drives that are cheaper, and so on. The nature of flash technology is that they can keep making it more and more parallel, so the IOPS are going to keep going up and up and up.

Optimizing against slow seek times is quickly becoming an activity with negative returns. Many who embrace NoSQL are seeking a solution to yesterday's problem. Digg, for instance, derives their entire NoSQL benefit from optimizing data locality (all data for a given need is nicely bunched together), which of course is what materialized views do as well.

There Are Incredible MPP Options in the RDBMS World...But They'll Cost You

The people who really demand high levels of database performance usually have a lot of money, which is why many of the products that deliver options like column-oriented storage (an implementation detail of an RDBMS that is primarily suited to very large-scale column aggregations; it isn't suitable for an OLTP DB) or MPP (Massively Parallel Processing) cost absurdly high amounts.

Greenplum, Vertica, TeraData, parAccel, Oracle RAC, Sybase ASE, DB2 MPP...these things are often priced out of all but the largest enterprises' reach.

Look at the pricing of the upcoming release of SQL Server 2008 R2, in particular the Parallel Data Warehouse product that brings MPP to that server. $58K per processor, which obviously excludes it from contention for the vast majority of applications.

Come on.

If there is one thing that I would like to see come out of the NoSQL advocacy movement, it would be that mainstream databases feel the pressure to push down the functionality that they currently limit to the people with the biggest bank accounts (which they sell using the "how much do you have?" pricing model).

Tuesday, March 02 2010

Getting Defensive

I work in the financial industry. RDBMS’ and the Structured Query Language (SQL) can be found at the nucleus of most of our solutions.

The same was true when I worked in the insurance, telecommunication, and power generation industries.

So it piqued my interest when a peer recently forwarded an article titled “The end of SQL and relational databases”, adding the subject line “We’re living in the past”.

[Though as Michael Stonebraker points out, SQL the query language actually has remarkably little to do with the debate. It would be more clearly called NoACID]

That series focuses on NoSQL as the challenger to the throne. It isn’t alone, as the past year has yielded a bountiful crop of articles and blog entries declaring the imminent death of the decrepit relational database at the hands of this new innovation.

Most get posted with incendiary, absolute statements against the RDBMS.

The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more damage to engineers trying to write scalable software.

And they usually see later edits that blunt the original euphoria.

postnote: This isn’t about a complete death of the RDBMS. Just the death of the idea that it’s a tool meant for all your structured data storage needs.

Indeed.

Few hold the RDBMS as the only tool for all of your structured or unstructured data storage needs, though that strawman makes an appearance in many NoSQL advocacy pieces, adding some unintentional comedy (“irony”) given that the same entries usually call for the death of the RDBMS, with NoSQL declared the one true way to store and retrieve data.

Page 493 (as labelled by page) of the article “The Paradoxical Success of Aspect-Oriented Programming” includes a fantastic quote and graphic from an IEEE editorial by James Bezdek in IEEE Transactions on Fuzzy Systems.

[I quote indirectly given that the original source isn’t publicly available]

Every new technology begins with naive euphoria — its inventor(s) are usually submersed in the ideas  themselves; it is their immediate colleagues that experience most of  the wild enthusiasm. Most technologies are overpromised, more often  than not simply to generate funds to continue the work, for funding is an integral part of scientific development; without it, only the most  imaginative and revolutionary ideas make it beyond the embryonic stage. Hype is a natural handmaiden to overpromise, and most technologies build rapidly to a peak of hype. Following this, there is almost always an  overreaction to ideas that are not fully developed, and this inevitably leads to a crash of sorts, followed by a period of wallowing in the depths of cynicism. Many new technologies evolve to this point, and then fade away. The ones that survive do so because someone finds a good use (= true user benefit) for the basic ideas.

In the case of the NoSQL hype, it isn’t generally the inventors over-stating its relevance — most of them are quite brilliant, pragmatic devs — but instead it is loads and loads of terrible-at-SQL developers who hope this movement invalidates their weakness.

Some sort of Fight Club ground zero wiping of the records, rewriting the rules of the game.

It doesn’t.

Nonetheless there is indisputably a lot of fantastic work happening among the NoSQL camp, with a very strong focus on scalability.

So what is scalability, anyways?

Scalability is a poorly-defined concept that, more often than not, is twisted to suit the speaker’s agenda. Scalability is often the excuse to engage in absurd hypotheticals to sell a particular blend of fanaticism.

Putting aside wordplay — or perhaps to engage in some of my own — scalability is pragmatically the measure of a solution’s ability to grow to the highest realistic level of usage in an achievable fashion, while maintaining acceptable service levels.

Imagine the scenario that you’ve built an internal help ticket tracking system for your branch office of Money Bags Corporation. If you had to describe the data needs in three points, they would be-

  • Data is highly interrelated (relational)
  • High-value users and transactions
  • Data consistency and reliability are primary concerns

You decide to go against the hype and build it on a classic RDBMS system.

Will it scale to the real-world requirements?

There are some real scalability concerns with old school relational database systems. Adam Wiggins does a pretty good job of covering the techniques to scale a SQL database, though I strongly disagree with his end assertion.

You face those concerns on that glorious day the CEO calls to tell you that the board is super excited about your team’s help ticket system, built on SQL Server, and they want you to deploy it corporation wide. For data consistency purposes they want a single instance, instead of alternative deployment scenarios like pushing out an instance (“shard”) for each division.

Can you make it work?

When Money Is No Object

Of course you can. Even on the maligned Windows platform.

From a vertical scaling perspective — it’s the easiest and often the most computationally effective way to scale (albeit being very inefficient from a cost perspective) — you have the capacity to deploy your solution on powerful systems with armies of powerful cores, hundreds of GBs of memory, operating against SAN arrays with ranks and ranks of SSDs.

The computational and I/O capacity possible on a single “machine” is positively enormous. The storage system, which is the biggest limiting factor on most database platforms, is ridiculously scalable, especially in the bold new world of SSDs (or flash cards like the FusionIO).

Such a platform can yield very satisfactory performance for tens or hundreds of thousands of active users in most usage and application scenarios (where generally clients talk to a farm of middleware servers).

Of course if you index poorly or create some horrendous joins you can screw it up, but with competency it will be good times for all. Even with billions upon billions of help tickets.

For the purposes of the application, the scalability requirement is completely satisfied — total scalability is achieved in the context of the application.

But it doesn’t end there.

From a horizontal scaling perspective you can partition the data across many machines, ideally configuring each machine in a failover cluster so you have complete redundancy and availability. With Oracle RAC and Sybase ASE you can even add the classic clustering approach.
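A minimal sketch of the routing side of that partitioning. The shard names and the choice of a division identifier as the partition key are hypothetical; in practice each entry would be a failover cluster rather than a single host, and the mapping would live in the middleware tier.

```python
import zlib

# Hypothetical shard names, one per partition of the help ticket data.
SHARDS = ["tickets-db-0", "tickets-db-1", "tickets-db-2", "tickets-db-3"]

def shard_for(partition_key, shards=SHARDS):
    """Map a partition key to a shard, stable across processes and restarts."""
    # zlib.crc32 is used instead of hash() because Python randomizes
    # string hashes per process.
    digest = zlib.crc32(partition_key.encode("utf-8"))
    return shards[digest % len(shards)]

print(shard_for("division-7"))
```

Plain modulo routing remaps most keys whenever the shard count changes, so a real deployment would more likely use a lookup table or consistent hashing.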

Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need because you’ve built a system for a large corporation, deployed in your own datacenter, with few constraints beyond the limits of technology and the platform.

Your solution will cost hundreds of thousands of dollars (if not millions) to deploy, but that isn’t a critical blocking point for most enterprises.

This sort of scaling is at the heart of virtually every bank, trading system, energy platform, retailing system, and so on.

To claim that SQL systems don’t scale, in defiance of such obvious and overwhelming evidence, defies all reason.

And you don't need to spend a million dollars. A mid-level Dell server can easily handle the vast majority of real-world database needs: No, your project likely isn't going to have the needs of Twitter, Flickr, or Facebook. You can grab a four CPU Dell server hosting a total of 24 cores of latest-tech computing power, with 128GB of RAM, for around $15,000. That is beefier than the systems that ran many enterprises just a few short years ago.

Artificially Limited Scalability

Imagine that you’re a start-up building your big new Social Media site.

Obviously you don’t have your own datacenter, but instead you’re going with cloud servers to host your solution.

You don’t have the option (much less the finances) to buy and install a Unisys 7600R, or even a loaded Dell R905. You don’t have TBs of memory or massive I/O at your disposal.

Instead you have to go with the options available on a host like Amazon’s EC2, where the most powerful choice available is the High-Memory Quadruple Extra Large (!) option at $2.40 / hour (at writing), or about $21,024 a year, which is a fairly reasonable rate given that an equivalent purchased server would run you about ten thousand dollars up front.
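The yearly figure and the break-even point can be checked with a few lines of arithmetic. This sketch uses only the two prices quoted above, works in integer cents to avoid floating-point rounding, and ignores power, hosting, bandwidth, and administration costs on the purchased side.

```python
HOURLY_RATE_CENTS = 240            # $2.40 per hour, as quoted above
PURCHASE_PRICE_CENTS = 1_000_000   # "about ten thousand dollars" up front
HOURS_PER_YEAR = 24 * 365

# Cost of leaving one instance running all year.
yearly_cents = HOURLY_RATE_CENTS * HOURS_PER_YEAR
yearly_dollars = yearly_cents // 100

# Hours of rental that add up to the purchase price.
break_even_hours = PURCHASE_PRICE_CENTS / HOURLY_RATE_CENTS
break_even_days = break_even_hours / 24

print(yearly_dollars, round(break_even_days, 1))
```

On those numbers alone, the rented instance overtakes the purchase price a little under six months in.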

This is very powerful compared to their historic maxed-out image — the puny large image that used to represent the top end — and is large compared to the max of many other cloud hosts, yet it is entry level in the RDBMS database world.

I/O on the EBS has been measured with a throughput in the 30MB/second range with about 72 IOPS per volume, which is one-half the speed that my Atom-based home NAS achieves. You can stripe multiple volumes into a software RAID array, but you quickly limit the I/O available to your instance.

For comparison we’re currently looking at an entry level $8K 36TB iSCSI device that would offer our database a dedicated 400MB/second throughput and about 1500 IOPS, and this is for a pretty humble low-criticality need with low-end magnetic drives.

As a speculative start-up you don’t want to commit $20K/year to have a single instance hanging around, especially given that your traffic is extremely variable and most of the time it will sit idle. You want to run the smallest database layer possible, ramping up if the need (fingers crossed) arises.

In an ideal world you could float along on a small instance economically until that big day when you get mentioned on Digg, at which point you spool up ten extra large instances, turning them off when the need passes.

These financial and artificial limits explain the strong interest in technologies that allow you to spin up and cycle down as needed. It’s why the old guard has largely remained quiet (because it solves a problem that they don’t have, notwithstanding any manufactured “my friend has a super-duper 512CPU Sun box and it is always overloaded!” scenarios), while a million hopeful start-ups with their small EC2 instances are loudly bleating about the limits of scalability with SQL systems.

The Needs of a Bank Aren’t Universal

The world of financial firms and retailers and other RDBMS users is very different than the popular social media scenario usually played out.

If you had to describe your social media data needs in three points, they would be-

  • Largely unrelated islands of data
  • Very low user/transaction value
  • Data integrity is not critical. If you lose a Status Update, or several thousand of them, it will likely go unnoticed, or at least won't cause a major situation.

MySQL originally lacked many traditionally mandatory RDBMS elements, such as transactions, without which it is extremely difficult to maintain a high level of data integrity. That didn’t dissuade many of its boosters who declared that it was an unnecessary cost for the purposes that they used it.

They were right. As MySQL has moved towards the values of traditional databases, it has moved away from its original bag-of-data values.

The truth is that you don’t need ACID for Facebook status updates or tweets or Slashdot comments. So long as your business and presentation layers can robustly deal with inconsistent data, it doesn’t really matter. It isn't ideal, obviously, and preferably you see zero data loss, inconsistency, or service interruption. However, accepting data loss or inconsistency (even just temporary) as a possibility breaks you free of by far the biggest scaling "hindrance" of the RDBMS world, and that can yield dramatic flexibility.

This is the case for many social media sites: data integrity is largely optional, and the expense to guarantee it is an unnecessary expenditure. When you yield pennies for ad clicks after thousands of users and hundreds of thousands of transactions, you start to look to optimize.

The same efficiency applies to highly relational schemas — if you can just serialize object graphs and that’s all you need, why bother normalizing? Many would argue that it’s a premature optimization, but if it’s all you need it might be the best choice.

Both of those decisions would be outrageously negligent in many other industries, but the rules that apply for a banking system have woefully little applicability to a social media site.

SQL is Scalable and NoSQL Isn’t For Everyone

The point is one that I think all rational people already realize: The ACID RDBMS isn’t appropriate for every need, nor is the NoSQL solution.

A social media site is not an inventory system. A banking account management system is not a social news aggregator.

Picking and choosing database terminology from the Wikipedia entry on RDBMS’ doesn’t equip the speaker with an expert level of knowledge to declare the truth about the database industry.

Scalability noise based upon the limitations of a cloud vendor’s offerings needs to be put into context: They don’t apply to most of the users of relational databases.

MySQL isn’t the vanguard of the RDBMS world. Issues and concerns with it on high load sites have remarkably little relevance to other database systems.

And of course the SQL/RDBMS world is changing (sidenote: Few love SQL, but I’ve yet to see a viable replacement). Wouldn’t it be a grand world where every desktop (platforms that spend about 99% of their time completely idle) in a corporation was a part of the corporate cloud, all seamlessly acting as a part of the corporate information system in a reliable, redundant way? A simple SQL statement silently and transparently fulfilled by hundreds of distributed systems?

We’ll get there.

Aside: I'm currently building a solution (to fill this space) that significantly leans on Project Voldemort. I have somehow managed to remain rational.

Postnote

This is one of those rants that strangely gets attention, with several taking it as anti-NoSQL, or even pro-RDBMS, I assume because positions so often seem to be polarized. It is neither, which is quite evident if read with an unbiased mind: Defending the real world practical scalability of the maligned RDBMS merely brings accuracy to the debate. Several have asked if I'm merely attacking a strawman: Aside from several specific links that I gave above (I am loath to add more as I've engaged in the blog-to-blog arguments too many times before), I find it hard to believe that these people take part in any technology discussion forum or group, where NoSQL is being quite widely, and often without question, held as successor to the RDBMS...the new evolution of database systems.

The motivation of the post is that the discussion is, by nature of the venue, hijacked by people building or hoping to build very large scale web properties (all hoping to be the next Facebook), and the values and judgments of that arena are then cast across the entire database industry — which comprises a set of solutions that absolutely dwarf the edge cases of social media — which is really...extraordinary. It's a bit like moving to the bottom of the ocean and declaring that everyone should start using submarines to commute.

There have been edge conditions in the database world for as long as there has been an industry. High performance logging/data acquisition (often distributed), for instance, has always been a case where traditional RDBMS systems aren't suited, and thus should be jettisoned. The industry didn't rewrite the rules because of those fringe cases, however, for good reason.

Wednesday, February 24 2010

More Apple/Android Junk?

I have so many things I’d like to write about – topics having nothing whatsoever to do with Apple or Android, like IoC, ARM assembly, rational NoSQL, and of course pragmatic software architecture – but various mobile issues keep mentally demanding that I hop up on a soapbox about them for a bit. I have to get this out of the way.

Mobile computing is absolutely the most important realm for this industry over the coming decade, so pay attention to what is happening because it really, really matters.

CIBC – Leader and Innovator, or Me-Too Wannabe?

[Image: CIBC app. Caption: The CIBC Platinum Card]

CIBC, a large Canadian bank, just launched a nationwide advertising campaign to promote their newly released iPhone banking application. You can see the video on YouTube.

It's notable that CIBC isn't targeting iPhone-only venues with these ads, which they could easily and cost-effectively do, but instead they are promoting this during primetime Olympics coverage. They're putting it front and center on their website.

If you caught the commercial you might have mistaken it for an Apple ad, given that the strongest takeaway is a subtle "there's an app for that" message, followed by the implicit declaration that iPhone customers – a small minority of CIBC customers – are the elite, their walled garden needing its own special flowers.

Maybe Apple subsidized the ads and the product itself. It’s a little surprising for a large and profitable bank to look for ad subsidies, but it’s the only conclusion I can draw.

It’s either that or CIBC has an Apple fanboy wreaking havoc from the executive level.

In any case, big deal: CIBC makes an app for the iPhone (first bank in Canada to do so, they proudly boast). Just serving customers, right?

Consider that CIBC has never offered a rich-client Windows application for banking, which is a statement that is true for every Canadian bank as far as I know.

They will let you download data to Quicken or Money or what have you, on whatever platform you’d like, but if you want to bank electronically as an end-user the cross-platform web browser is and has always been your electronic banking tool, even when it limited them to a very simplistic interface.

They knew not to fall into the ActiveX quagmire like, say, South Korea. The banks have always supported just about any modern client equally.

Think about that for a bit: They have never directly supported the rich interface of the overwhelmingly dominant client platform for PCs.

And for very good reasons.

Yet now they have special, premier support for one far-from-dominant smartphone.

Given that history of device and platform treatment, it’s natural to presume that they have some fantastic and compelling reason for making this change: Maybe they’re using amazing 3D graphics of money flow or something on the device. Maybe a breathtaking augmented reality experience that allows you to visualize your debt load increasing when the camera is pointed at that new must-have device you really want at the electronics superstore, a virtual banker sternly shaking his head no.

There must be something that they just couldn't do without "going native", right?

Nope.

Their iPhone app features shockingly basic functionality. The single place where it could use something even remotely client-rich – to get the user's location to find the nearest branch – they screw it up and force you to type your location in.

This Is What Web Apps Are Made For

[Image: Really HTML5. Caption: This Is An HTML5 App on the iPhone]

This application was made for HTML 5, which humorously would easily allow them to use the Geolocation API to get the user's current location for richer and more intuitive mapping.

And let’s give credit where it is definitely due: The iPhone features excellent web app support, arguably best of class, likely because that was originally the only way to create applications for the device.

Jobs’ original vision was that the phone would offer a native Apple experience enhanced by a rich and robust web application ecosystem. That was the phone that they originally delivered.

That web richness allows you to make apps that look and act just like an iPhone application with some simple targeted styles and scripting, offering rich and robust functionality and features. It also allows you to avoid going hat-in-hand through Apple's app review process for every update, as is demanded when you publish via the App Store.

So imagine a world where CIBC decided that they didn’t need to kneel in worship before Apple, trying to suck some Apple-idolizing droppings from the dirty ground, and they’d release this as an HTML 5 app.

It would feature the same look and feel, could easily support all of the same functionality (without breaking a sweat). It would almost certainly be far more maintainable, and could function like a minimized version of the web app they already have for PCs, without necessarily demanding new public-facing web service APIs.

Win all around, right? Well that’s just the start.

If CIBC did it the correct way – as an HTML 5 app – it would also work on Android devices (including crazy features like local databases and geolocation and all of the snazzy dynamics), such as the hot new Milestone coming to Telus, and the anticipated Acer Liquid and Sony X10 coming to Rogers, and more importantly – this is the land of RIM – it would work on the newly revamped Blackberry webkit browser coming shortly, which is worthwhile given that Blackberry remains atop Apple in the “smartphone” category, especially here in Canada.

Remember that Apple far from dominates the Smartphone category, and competition is only getting fiercer. Canada has never been as iPhone-crazed as the US, and a number of compelling non-iPhone smartphones are just hitting our shores.

So if they went the HTML 5 route, they could offer a rich experience on all capable devices, easily stylable and feature-scaling to optimize the platform experience. Anything would be better than the WAPishly rudimentary "everyone else" dumpbin interface they currently support for every other mobile device.

Didn’t They Listen To Steve Jobs?

Juxtapose Steve Jobs telling us that the iPad doesn't need Flash because HTML 5 makes it irrelevant – a premature statement, but the time will come when his words will seem prophetic – with organizations like CIBC porting absolutely rudimentary web functionality into native apps, wasting time and resources and cash, primarily benefiting Apple, while undermining marketplace choice.

Very backwards move, CIBC. It doesn't make you look hip and on-the-ball, but instead makes you look like Apple-salesmen hoping for your little bit of me-tooism hipster credibility.

Given how boastful CIBC is about being the first bank to feature an iPhone interface, it would be delightful to see another Canadian bank, such as my old workplace RBC, take the high road and come out with rich and robust mobile web apps that don’t favour one walled garden without cause.

They could show it running just as richly on a Blackberry, gaining benefit from the glow of Canadian patriotism.

A Place for Web Apps and a Place for Native Apps

While Jobs is quick to declare the end of Flash in pitching the iPad, the reality is that there are serious gaps in what HTML 5 web apps are capable of.

Graphical games, for instance, aren’t a web app reality without either adding Flash to the equation or going native (e.g. OpenGL on Android or the iPhone). Some day down the road it will be possible, but that isn’t reasonably the case right now, aside from some tech demos that make a high-end desktop grind to a halt.

Apps that exploit special features and functions of the hardware generally can’t be web apps either, at least until the feature is so common and prolific that it gets baked into the shared standards, as geolocation has. I’m sure at some future point we’ll have “camera” and “webcam” and “DSP” APIs to access from JavaScript, but for now those are native app domains.

Mr. Jobs’ statement could more honestly be worded as “Flash isn’t necessary when you have HTML....and apps from the Apple App Store”.

Platform specific apps are needed for a lot of solutions. That goes without saying.

Still, porting absolutely rudimentary functionality to native apps is a backwards repeat of mistakes made in the past, walling the garden off for no logical reason.

CIBC is hardly alone in making this foolish, foolish move, but given that they seem to be so proud of this mistake they deserve particular criticism.

So You Own An iPhone and You Don’t See The Problem

Hanging with My Youngest
My Youngest Doesn't Have an iPhone...yet

Even if you own an iPhone, and you happily imagine a world where your children’s children will have iPhones, you should still view moves like CIBC’s with intense cynicism.

Not only are they limiting your choice unnecessarily if you ever decide to consider alternatives (as everyone should always be doing), but even if you’ve declared fealty to Apple forever and ever, the moves of organizations like CIBC are diminishing Apple’s need to be competitive.

Recall what happened to Internet Explorer after so much of the web (outside of Canadian banks, notably) decided that they would treat IE users as first class citizens and everyone else as ignorable chumps. Once that lock-in was established Microsoft had little incentive to work on their browser and they took their users for granted. We’re still trying to pull ourselves out of that mistake.

Apple isn’t Microsoft, but by the end of 2010 they will likely exceed Microsoft’s market capitalization, which is absolutely shocking. Corporate sludge is inevitable at some point. If something happens to Steve Jobs, for instance, and they recruit Ballmer to run the place, you might decide to consider alternatives only to find that you're tied to the platform in a thousand seemingly minor but cumulative ways.

Competition is good. Building up the walled-garden of the iPhone undermines competition, and encourages a foolish Windowsification of the mobile world.

   
Thursday, January 28 2010

Reporting On A Twitter Feed Live

I passively monitored Apple’s much anticipated announcement yesterday via a TechCrunch live feed. Apple makes a lot of brilliant products, and their announcements have a big impact, so it's beneficial for anyone in this industry to keep interested.

The TechCrunch show consisted of a couple of people monitoring the twitter feed of someone actually invited to the event while incompetently dealing with technical challenges like “show a graphic” or “don’t abruptly inject a floor audio feed without warning”.

One of the hosts demonstrated why so many of us have an automatic skepticism about the critical reception of new Apple products: As the picture of the product came onto their feed – carried down from the mountain by Jobs – her reaction was “Uhhhhh....it’s gorrrrrrrrgeous!”

Her observation is only shared by the truly faithful, though surely the rest of the Apple herd will inevitably come around. You can be sure that going forward this nondescript rectangle will become the new benchmark of product beauty.

Everything that follows will either be ugly in comparison, or declared a rip off. I just discovered a digital photo frame beside me which I sadly must report is a rip-off of the pure, blessed genius of Apple.

Early Prototype

A Big iPhone

We now know that the iPad is essentially an iPhone with a larger (low resolution, 4:3 ratio) screen, minus voice. Clearly it runs an ARM-derived processor, with performance likely very similar to a 1GHz Snapdragon. Apple is talking a big game about the A4 system on a chip (saying things like “Intel is looking to do this with their Atom”, ignoring all that came before to pretend that they lead the pack. It's like coming in last in the marathon yet talking about how you finished before next year’s winner), so it would be interesting to see it put to the test against, for instance, the Tegra 2.

One other feature of the iPad is that you can change the background. Apparently that’s a pretty big deal.

The iPad seems to be the continuation of Apple’s platform royalty play, and may be subsidized in the same way that Microsoft or Sony sell their consoles. With this device Apple is going upscale, moving beyond the repackaged web pages and novelty water cooler apps that overwhelmingly dominate the app store. Getting a cut of magazines and books and even more media will surely pad their pockets.

To repeat what I said before, Apple and Sony would be a perfect union. Their modus operandi is virtually identical. Aside from the common quest to act as the troll under the bridge collecting a toll, they share a profound propensity for endlessly reinventing things that came before, cluttering their devices with proprietary plugs and connectors and cards and slots.

The iPad puts into focus why Apple has been so vigilant about maintaining their strict ecosystem command and control of the iPlatform. While some points were debatable with the iPhone (and were cause for much stupidity when otherwise intelligent technical commentators made ridiculous excuses for the restrictions and limitations of the platform, trying to sell some piss water as lemonade), with the iPad it’s clear that it’s for the same reason that the console makers lock down their platform, though the lame excuses are already being doled out.

It certainly isn’t to benefit the consumer. We had shades of this years back as Microsoft built out the trusted-computing platform, and one feared possibility at the time was that we'd end up with a dominant platform where software had to pay a fee and pass a gatekeeper ("DENIED! Competes with Excel!"). Thankfully the massive chill was unfounded, or the objection was so loud that it discouraged that initiative.

Alas, the iPad is real. The faithful are pouring forth to tell us that it’s the end of netbooks. It’s the end of eReaders. It’s the future of computing! While usually it’s the Mac faithful that preach the message, in this case it’s the tech media that is pouring on the unabashed praise with no critical perspective. They’re all afraid of posting something negative, only to be mocked when Apple inevitably succeeds. They point nervously at the Slashdot summary of lore.

As Jobs creepily says during his demonstration, “It’s that easy.” Then again Jobs also told us that it will be the “best browsing experience you’ve ever had”, while showing us the device rendering websites like the NY Times, sans Flash or other accoutrements, much more slowly and less usably than virtually any PC, including higher resolution, vastly more capable $400 netbooks, manages to do the same.

“Flash is so yesterday! HTML 5 is the future!” you say. I agree with you, at least if you’re talking through a wormhole from about two or three years in the future, and with a vastly more powerful device. JavaScript and the canvas element can almost yield usable Flash similes on a PC many orders of magnitude more powerful than this device. Even just for video it’s grossly premature, though Apple will be overjoyed if you’re restricted to their little ghetto, paying your toll while thanking them for it.

Alas, such is the pure innovation of the sort that only Apple can bring us.

A Blog Exclusive!

As a reader-of-my-blog exclusive, I want to let you all into some secret iPad specifications I stumbled across.

http://www.tabletpc2.com/Review-HPTC1100.htm

I knew I couldn’t fool you. That’s actually a tablet PC from 6 years ago. It’s a follow-up to tablet and hybrid PCs that existed since the turn of the century (and of course supermarkets and science centers have had touch screens, including the revered multi-touch, for much longer. Am I the only one who finds that people endlessly pinching and unpinching on the screen look positively ridiculous?)

Of course it was far more expensive than the coming iPad. It weighed more too, and had a much shorter battery life.

Then again, it was probably faster than the iPad. It was completely open and could run hundreds of thousands of very rich applications (applications not gimped to a smartphone). It also had lots of standard expansion ports and capabilities.

The market generally didn’t care for it or its ilk because the only people who really wanted a screen like that are inventory takers at Home Depot. Most of the demonstrations of it were laughable.

Despite all of Bill Gates’ prayers before he went to bed, the format floundered. They're trying once more to make it stick.

Of course that device used a screen technology that required a stylus. Apple is into the capacitive touch screen technology, so maybe that is super new and innovative for a device like this?

No, it isn’t unique. That touchscreen device came out before Apple’s very first iPod (you know the one. It was the “me too” music player that saved Apple from dying at the hands of a failed computer business — though some gimmickry with the iMac kept it on life-support for a while longer — which they’ve since rebirthed by rebadging PC components, amazingly fooling the faithful into believing that these somehow came from the premium bin).

Where is the Innovation?

The iPad isn't innovative. Everything it does has been done many times before. Claiming that its restrictions are a benefit is like saying North Korea has a more refined sense of freedom.

Executing well is not innovation. Apple executes very well indeed, and they put incredible care and attention into their products. That is hugely laudable and worthwhile, but it isn’t innovative.

As to predictions that the iPad will take over the eReader market, while it may come to pass it ignores precedent.

People don’t read books on LCD screens for the simple reason that people couldn’t accept that as a substitute for print when they wanted print. That led to the creation and adoption of e-Ink, mirroring how actual reflective print works. I have no doubt that a lot of teary-eyed iPad adopters will tell us that it’s the cat’s meow, but we’ve been down this path many times before. Yes, even with IPS screens.

That’s Apple innovation for you. If standards change for your product, how can you fail?

Of course, all of this is for naught. Apple has a precedent of going into markets with products that cost more while doing less, and achieving remarkable success. So this is my final cry before I smile and nod politely as I’m told about how Apple invented IPS display technology, the ARM reference processor, flash memory, and so on. The leader is truly wise and great.

 Apple  netbook  tablet 
   
Thursday, January 21 2010

The NAS Gets a New PSU

In March of last year I wrote about replacing the home NAS with a custom-built Linux box.  

Almost a year in and the device has served the purpose well, providing a solid foundation for a connected home. I’ve been very satisfied with the change.

The only downsides of the unit are the higher power consumption (averaging around 38W), and the groan of the two fans inside: the power supply and chipset fans. The audible part isn’t really an issue given that it’s stashed away, but considering that a probable failure point on most new electronics is the fans, it becomes a reliability concern.

I junked a laptop because of an impossible to repair broken fan. I’ve lost several video cards for the same reason.

I can even hear the irritating whirring of my blu-ray player’s fan (do not buy the Samsung BDP1600. The thing is complete junk even without factoring in the noisy fan trying to upstage the even noisier optical unit. Speaking of junk, the Sony alpha-200 is another garbage product that made me regret ever turning my back on Canon).

As promised in the original entry, I got around to replacing the power supply with a PicoPSU 90W unit, which was basically a plug and play swap.

In my original entry I estimated a 4-8W power reduction, which turned out to be an underestimation. With this PSU the power consumption dropped a whole 10W, going down to a constant 28W (only slightly spiking under load), making me feel a little less enviro-guilt. There’s still the noisy chipset fan, but that’ll be another project.

The case was built around the expectation of a power supply fan exhausting heat, so some extra natural ventilation was required. With that the sensor readings now hover at low operating levels.

Economically this is a change that will not pay off. From NCIX the new PSU cost me $73.49 all in. Given a savings of 0.01kWh per hour (roughly 88kWh a year for a box that runs around the clock), and a fully loaded electric cost around $0.16/kWh, it would take a little over 5 years for the 10W to pay for the change.
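For anyone who wants to plug in their own numbers, here is a minimal sketch of that payback arithmetic in Python. It assumes the NAS runs 24 hours a day, every day, and that the electricity rate stays flat; the function name `payback_years` is just an illustrative helper, not anything from a library.

```python
# Rough payback estimate for a more efficient power supply.
# Assumptions (mine, for illustration): the box runs 24/7 and the
# electricity price is constant over the whole period.

HOURS_PER_YEAR = 24 * 365


def payback_years(cost_dollars, watts_saved, dollars_per_kwh):
    """Years until the energy savings cover the purchase price."""
    kwh_saved_per_year = watts_saved / 1000 * HOURS_PER_YEAR
    dollars_saved_per_year = kwh_saved_per_year * dollars_per_kwh
    return cost_dollars / dollars_saved_per_year


# Figures from this entry: $73.49 PSU, 38W down to 28W, $0.16/kWh.
years = payback_years(73.49, 38 - 28, 0.16)
print(f"{years:.1f} years")
```

With the figures above the savings come to about $14 a year, so the break-even lands just past the five year mark. A machine that sleeps part of the day would take proportionally longer.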

It would be nice if all power supplies were mandated to be efficient (they aren’t for most devices because they know it plays zero part in your purchasing criteria. It’s unfortunately one of those areas where legislation is really the only effective solution), because right now inefficiency is the standard. Of course environmental choices don't always yield the expected results.

The Dream is Over...Wake Up With New Phone

In July of last year I wrote about choosing a new smartphone to replace the MotoQ that I had been using. While the MotoQ served a good tour of duty, it was seriously showing its age and was falling behind in the empowering mobile revolution.

While I’d been using variants of Windows CE since before the turn of the century, Windows Mobile was obviously lost in the wilderness. Not only was each equipped device essentially abandoned right after being released, the clearest sign that Microsoft lost the plot could be seen in PocketIE, where the preloaded bookmarks to various Microsoft Mobile pages led to 404 errors.

The team moved on to something new and shiny and had no concern at all for the existing base. Microsoft has a very short attention span for products that don't earn them Windows or Office type revenue numbers, so it wasn't a surprise.

For various reasons I did not want an iPhone (we don’t need another restrictive and innovation crushing Microsoft scenario playing out, and I want to develop for the device without embracing the whole cult), despite it being the easy choice. I opined in the first entry that Android seemed to have a very bright future ahead, which is a prediction that seems quite obvious now given that it is the platform of so many incredible devices recently released or on the horizon.

The future is so bright for Android that the robots have to wear shades.

The options in Canada were (and remain) limited, so I went with an HTC Dream (G1) given that it had a keyboard and otherwise had largely the same specs as the newer HTC Magic, aside from what seemed like a minor difference in memory capacity.

 I have to confess to being disappointed with the device.

Functionally it is amazing, and even with Android 1.5 the platform is simply brilliant. When everything operates correctly I am over the moon with the device.

The problem is that not everything operated correctly. For whatever reason the device seems to be horrendously overloaded, so even with virtually no apps installed and nothing beyond the base system running, most actions are plagued by obnoxious pauses, even on a fresh start-up.

I hate pauses.

I stopped using brilliant apps like Weatherbug because they seemed to make the situation worse.

Alas, my long term plan was always that I would buy one of the newer, faster phones when they came to market, while using the starter device for development purposes until that time. If an unlocked Nexus One or Droid/Milestone worked on Rogers’ wireless band, I’d grab one of those when it was a possibility.

Nonetheless, I was pleasantly surprised recently to find that Rogers was offering all HTC Dream owners a free HTC Magic, with the caveat that your term length pushes out. Given that Dream owners can only possibly be 6 or 7 months into their term, that isn’t that tough of a demand. I am on a very reasonable family plan that allows me 5GB / month (which I seldom use more than 1% of), so I feel fairly future-proofed with that foundation and for me it was all win.

So the next day a Magic arrived in the mail and moments later I was up and running with it. With the SIM card removed my existing Dream still works on wifi, where it can browse the web and play media and respond to emails and take pictures, and I can of course put another card in it and continue using it online. I’ll likely install Cyanogen on it now.

Quite pleased about that.

The most shocking thing, though, is that this Magic is much more responsive. It has the same processor as the Dream, so that doesn’t explain the difference. If I had to guess, I’d point to RAM, which on this device comes in at 288MB, compared to the 192MB in the Dream. For comparison both the Droid and the iPhone 3GS feature 256MB of RAM.

The extra headroom over the base OS seems to make all the difference in the world. On the Magic I can see that the free memory is usually less than 90MB, even on a fresh start-up, meaning that roughly 198MB or more of the 288MB is in use, which notably is more than the Dream’s entire 192MB.

HTC and Rogers claim that they’ll release Android 2.1 for this device in the near future, which makes me especially pleased.

Great move, Rogers. The new HTC Sense update and free month of data are icing on the cupcake.

Firefox 3.6 Released – Web Worker Performance Remains the Same

Back in June I wrote about Web Workers, a fantastic new method to move processing out of the UI thread. To support the entry I posted a variation of the SunSpider benchmark I named Moonbat.

Safari kicked Firefox around in this benchmark. I just tried it with the just released 3.6, and it doesn’t look like much has changed: FF 3.6 does 10 iterations with 4 threads in ~11 seconds, Chrome does it in 2.6 seconds, while Safari leads the pack at 2.3 seconds.

Alas, web worker performance isn’t a critical factor in choosing a browser (my favourite browser remains Firefox), but it would be nice to see it moving in the right direction.

Celebrating My First Home High Speed Overage

Got the cable bill (a bill that pushes into the $250 range per month these days) to find a surprising $11.25 "internet overage fee". Apparently I used 67.5GB last month, while my limit is 60GB. The Steam sales, several purchased HD movies and a couple of on-demand games for the kids on the 360, on top of the normal internet usage, added up to a very atypically throughput-intensive month.

I'm not going to cry many tears about it, even though I do think $1.50 a GB is a bit absurd (in an average month I doubt I use 10GB, so now I almost feel obligated to max it out), given that I think usage-based pricing would lead to a far better, more open, more honest system for everyone.
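The fee itself checks out against the numbers on the bill. A small Python sketch of the calculation, assuming a simple flat per-GB charge on anything over the cap (the function name `overage_fee` is hypothetical, and a real provider's billing may round or cap the charge differently):

```python
# Flat per-GB overage charge, as the bill in this entry implies.
# Assumption (mine): no rounding to whole GB and no maximum fee.

def overage_fee(used_gb, cap_gb, dollars_per_gb):
    """Charge for usage beyond the monthly cap; zero if under it."""
    over_gb = max(0.0, used_gb - cap_gb)
    return over_gb * dollars_per_gb


# 67.5GB used against a 60GB cap at $1.50 per GB.
fee = overage_fee(67.5, 60, 1.50)
print(f"${fee:.2f}")
```

In a more typical 10GB month the same function returns zero, which is the whole complaint: the pricing only ever moves in one direction.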

   


About the Author
Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 15 years.





 

Dennis Forbes