Dennis Forbes on Software and Technology   Subscribe to RSS


About the Author
Dennis Forbes Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 15 years.




The Feed Bag

 
Tuesday, May 25 2010

A reader wrote me regarding a performance issue they were having with PostgreSQL, and I thought the case study would make an interesting follow-up note on the whole SQL/NoSQL debate.

The scenario was that they needed to look up batches of geo-locations by postal code, passing in sets of 100 postal codes and retrieving the corresponding set(s) of latitudes and longitudes.

It is a real-world scenario, whether for mail processing system, census analysis, sales tracking, or many other common data processing needs.

You could simulate such a scenario with data like this. I did just that with the Canada Postal Code data.

In the reader's situation they were finding that batches of 100 postal code lookups took over a second.

That’s suboptimal, and not ideal for any system that needs to perform a large number of rapid lookups.

It is not the sort of task that I would normally do in a database. Hard to believe, perhaps, after the prior entries, but this is a process that I consider highly specialized – a unique snowflake, if you will.

The data is extremely static, and the usage is very specialized. Performance, rather than generalization, is paramount.

To validate the performance assumption I built a simple .NET test app that populated a Dictionary<string, List<Location>> with all 765,344 Canadian postal codes (there are, in some cases, multiple entries for a single postal code, so each dictionary element contains a list that contains 1..n results), and then looked up random sets of 4000 postal codes (hint: Create randomly sorted recordsets in SQL Server by ordering the results by NewId()).

It could lookup results at a rate of some 3,000,000+ lookups per second, with no parallelization running on a single mid-range core. Adding parallelization (extremely easy in .NET 4.0 using Parallel.ForEach) was of limited benefit as the reduce stage and thread safety efforts ate up any savings for all but the most unrealistically large test set.

That was an ultra simple solution with very few lines of code, specialized for the purpose. It did consume some 140MB of memory, but memory is bountiful and cheap.

Doing the same lookups in SQL Server, with optimized fill factors and a perfectly covering index — even after priming the cache (by doing a full select of the covering index) — yielding a return rate of approximately 5000 lookups per second on the same hardware, per core. The generalized execution engine simply isn't optimized for such a trivial lookup usage, and imposes significant overhead that isn't beneficial in this case.

3,000,000 versus 5,000 (per core) = an incredible reason to use a specialized solution, especially given that it’s the long solved problem of KV pairs.

The reader, after some correspondence, mentioned that they had settled on Redis, which is a solution that is midway between the custom in-application hashing solution and a generalized SQL solution (leaning much further to the former than the latter). The performance with such a solution will almost certainly be incredibly high, albeit bound by the overhead of IPC. Redis is a highly optimized solution for that task, and is quickly proving itself to be a viable part of most solutions.

It is the right solution for that problem. In no way is it a “new world order of social media and intraconnected graphs realigning the stars to herald the new way of using data”, but instead is a very appropriate use of the right tool. Redis, like Memcache, has a lot of metrics on its side, much unlike many of the other NoSQL solutions.

Using the right tool is what we should all strive to do.

  SQL   NoSQL 

Reader Comments

Can I inquire as to how you tested it in SQL Server? Was the query optimized with a TVP or was it 5000 straight SELECT statements?

I wouldn't normally choose a database as a solution for this sort of problem, either, but as long as we're comparing, it helps to know that all tools under comparison are being used as effectively as possible. I suspect that you could achieve much better performance than 5000 lookups/sec if you did the lookups in batches of a few hundred (or thousand).
Aaron G @ 5/26/2010 8:25:58 AM
Why not to use MS Velocity - Windows Server AppFabric (http://msdn.microsoft.com/en-ca/windowsserver/ee695849.aspx)?
I had a lot of success with it.
Arthur @ 5/26/2010 8:43:45 AM
@Aaran-

It was a TVP. I inserted the random set into the TVP, started the timer, and then pulled the inclusion set. SQL Server is very fast, and this task is highly parallel, but the execution overhead is hugely overwrought for such a simple task.

@Arthur-

I hadn't realized they'd renamed Velocity, and hadn't really paid much attention to it but I am taking a look now. It is unfortunate, though, that off the bat I see that there's significant coupling in the product with various other products, so immediately I'm bound to PowerShell 2, .NET 4, IIS 7 hosts, and so on. I happen to have that environment, but still wish these things (especially a caching system) had more of a capacity to live alone. Even Redis isn't ideal on Windows, though, as it couples you to the Cygwin stack and all of the irritations that it brings along. Thank you very much for the heads up.
Dennis Forbes @ 5/26/2010 10:16:00 AM
I'm very interesting in how AppFabric does compare. If you run some tests please pass the metrics along. Thanks!
Jeff A @ 5/26/2010 2:13:37 PM

Add Comment

Name *:

Email Address:

(your email address is not displayed)
Website:

Comment *:


Dennis Forbes