Dennis Forbes on Pragmatic Software Development


About the Author
Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development, Linux development, and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 13 years.


Tuesday, September 27 2005

Telecommuting remains a fringe activity, particularly in software development.

Despite all of the technology that we now have available. Despite pervasive, inexpensive, high-speed communications networks. Despite the rising cost of transportation, and literally choking urban congestion. Despite all of these changes, we still live in a world where the luxury of telecommuting is afforded to only a few. Indeed, we live in a world where most organizations are more likely to outsource work to a faceless team half a world away before they'd let an employee work more than a marginal amount of time on-the-clock at home.

Many organizations would counter this by claiming that they do offer some sort of telecommuting foundation. Many, for instance, offer Virtual Private Networks (VPNs), allowing you to hook into the corporate network from home. However, the motivation driving the adoption of such technologies generally isn't to replace in-office work, but rather to enable workers to put in additional hours from home, and to ensure that there isn't a minute of downtime when on the road.

So why is legitimate telecommuting so rare?

The most common suspicion in the industry, and I believe that it's accurate, is that managers need face time with their charges. If you're physically present and occasionally visible, regardless of output, the presumption is that you're "doing work". Your contribution is measured not by your actual output, but rather by your mere presence (and your grunting and groaning and presentation of great exertion). If you aren't physically present, however, the presumption is that you're slacking. When you're remote your contribution is measured only by actual output, and you're an easy target for the non-telecommuters to use to explain their troubles.

The root problem is that output in the software development industry, indeed in most industries, is often less than predicted (which is why estimates are almost universally low-balled). For every delivered component, there were likely half-a-dozen false starts. For every neat looking GUI, there is a massive bulk of praiseless code behind the scenes that actually makes it happen. For every technology chosen there were dozens evaluated, and for every design agreed upon there were hours of planning. Outside of some assembly line coding, this is an industry where a lot of effort and brainpower goes into even the smallest creations.

Thus when you're a manager and after a week your remote worker submits a small component, it's much less satisfying than seeing a team holding onerous daily meetings, ruminating loudly around the water cooler about the trouble that damn Microsoft caused them, and so on. Face time, and the illusion of work, is a powerful force in overcoming the slow pace of most development.

The Solution

There is absolutely no doubt that this is a serious problem, and it will be exacerbated as fuel prices rise.

The solution is fairly simple:

  • An accessible webcam monitoring the remote employee's workspace during work hours.
  • An accessible screen monitor allowing managers to monitor computer activities.

Would you allow your manager to monitor your remote activities through a webcam 9-5? Would you allow them to monitor your computer screen?

Most tech workers would gasp at these conditions; however, I think they are the necessary groundwork for pervasive remote tech work. Give managers throughout the land the ability to assuage their fears that their remote employees are playing Battlefield 2 in their underwear during work hours, and give the remote workers all of the advantages that pseudo-"face time" affords. Win/win for everyone.

Tuesday, September 27 2005

Here's a super high-value C# console application that's going to corner the GUID creation market.

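using System;
using System.Windows.Forms; // Clipboard lives here; reference System.Windows.Forms.dll

namespace yafla.yaflaGuid
{
    class GuidToClipboard
    {
        // Clipboard access requires a single-threaded apartment.
        [STAThread]
        static void Main(string[] args)
        {
            // "B" format: hyphenated digits wrapped in braces.
            string guidString = System.Guid.NewGuid().ToString("B");
            // true = leave the data on the clipboard after the application exits.
            Clipboard.SetDataObject(guidString, true);
            Console.WriteLine(guidString);
        }
    }
}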

I'm constantly in need of GUIDs ("Can you spare a GUID please, sir?"), and generally use guidgen.exe, included with the Visual Studio tools. The problem with it, however, is twofold: it takes too many steps, and the GUID it copies arrives in Notepad with a line feed on the end. Annoying.

Finally I made the incredibly complex console application that you find above, generating a GUID (with no trailing line feed), copying it to the clipboard for use elsewhere, and then spitting it out on the console (if I didn't want the console, err, "feature", this would be one unique line over the standard console wizard output).

I'm even benevolent enough to include a binary (requires .NET 1.1 Framework. Interesting that the compiler padded it up to 16KB - will have to look at why it did that). 

Enjoy this revolution in GUID creation.

Wednesday, September 28 2005

A couple of simple rules that can help ensure code clarity, maintainability, performance, and stability:

  • Avoid using exceptions for flow control

    This is a seemingly obvious hint, yet many applications are tossing tens or hundreds of thousands of exceptions a second. While exceptions are a legitimate part of the flow control of the platform and language, contrary to the claims of the language-pruning self-anointed purists, they should be an exceptional flow control mechanism, not an alternative to a simple if condition.

    There are some performance costs to exceptions, but more importantly they obfuscate the code flow, wrapping multiple scenarios that should be handled separately into one master grab-all.

    The most evil incarnation of exception usage is the lazy catch(Exception) { (or just catch {), an artifact that should be banished from your code. While it's a benefit that .NET doesn't require checked exceptions, it would have been nice if Microsoft had more clearly documented the exceptions that methods in the .NET Framework raise. It's this lack of clarity that leads developers to use all-encompassing exception handlers.

    Debug your applications with first-chance exceptions turned on (Debug/Exceptions, and set Common Language Runtime Exceptions to Break into the debugger). If you find that you're constantly hitting continue, to the point that it's irritating, then you're probably relying too heavily on exceptions. Some exceptions do happen as a matter of course, and you can filter those out individually in the Debug/Exceptions panel.

  • Strive towards zero compiler warnings (set your compiler to treat warnings as errors)

    This is especially critical in teams: as developers hash out ideas, they're often prone to sticking incomplete code in the codebase. Soon enough you have oodles and oodles of poorly thought out, poorly implemented warts throughout your application (for instance loads of unused variables), generating dozens or hundreds of warning messages. The problem with accepting this, and it's similar to abusing exceptions, is that within all of the noise real problems will get lost.

    Warnings can serve as an extremely valuable sign of coding problems, so make sure they can communicate these issues to you clearly.
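
To make the first rule concrete, here is a minimal sketch contrasting exception-driven parsing with a plain conditional. It assumes .NET 2.0 or later, where int.TryParse is available; the names ParseDemo, ParseWithException and ParseWithCheck are hypothetical and exist only for illustration.

```csharp
using System;

namespace FlowControlSketch
{
    // Hypothetical helper names; only int.Parse and int.TryParse are framework calls.
    public static class ParseDemo
    {
        // Exception-driven version: every malformed input costs a thrown exception.
        // It catches FormatException specifically rather than Exception, so other
        // failures (an OverflowException, for instance) still propagate.
        public static int ParseWithException(string text, int fallback)
        {
            try
            {
                return int.Parse(text);
            }
            catch (FormatException)
            {
                return fallback;
            }
        }

        // Conditional version: bad input is an ordinary branch and nothing is thrown.
        public static int ParseWithCheck(string text, int fallback)
        {
            int value;
            return int.TryParse(text, out value) ? value : fallback;
        }

        public static void Main()
        {
            Console.WriteLine(ParseWithException("12x", -1));
            Console.WriteLine(ParseWithCheck("12x", -1));
            Console.WriteLine(ParseWithCheck("42", -1));
        }
    }
}
```

With first-chance exceptions turned on, the first method breaks into the debugger on every malformed input, while the second never does.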

Friday, September 30 2005

I'm going to implement a basic, but high performance, comment system this weekend. I hope that it, along with the survey component, will be made available for download for those who are interested. I have a lot of interesting thoughts I hope to expand upon about coding practices, .NET technologies, and upcoming technologies. It's all just a matter of finding time...

Friday, September 30 2005

[UPDATE: Also see the entry relating to sequential GUID values]

This topic has come up in discussions quite a few times, with many database designers and developers seemingly believing that GUIDs are the way to go simply because they're large and intimidating. While GUIDs can be the best choice, in many cases they're simply wasteful extravagance.

GUID Basics

Let's start with some definitions: a GUID is a Globally Unique IDentifier, and it's the Microsoft take on the OSF's UUID (Universally Unique IDentifier; I guess Microsoft is more pessimistic than the Open Software Foundation). It is a 128-bit (16-byte) pseudo-randomly generated value that, if the algorithm is correct, should theoretically never collide with another GUID generated in any other place at any other time (at least until the year A.D. 3400). You can generate GUIDs in T-SQL using NEWID(), while in .NET you can use the System.Guid value type's static method NewGuid(). In the Win32 world you can call CoCreateGuid().
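
As a small illustration of the .NET route, the sketch below generates one GUID and prints it in three of the standard formats accepted by Guid.ToString. The namespace and class names are hypothetical.

```csharp
using System;

namespace GuidBasicsSketch
{
    public static class GuidFormats
    {
        public static void Main()
        {
            Guid id = Guid.NewGuid();

            // A GUID is 128 bits, so its byte representation is 16 bytes long.
            Console.WriteLine(id.ToByteArray().Length);

            // Standard format specifiers:
            Console.WriteLine(id.ToString("N")); // 32 hex digits, no punctuation
            Console.WriteLine(id.ToString("D")); // hyphenated groups (the default)
            Console.WriteLine(id.ToString("B")); // hyphenated groups wrapped in braces
        }
    }
}
```

The values differ on every run, so only the shapes are predictable: 32, 36 and 38 characters respectively.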

Historically, GUIDs used the network card's MAC address as the starting point, incorporating a time component and then incrementing a value sequentially with each generation. Privacy concerns abolished that standard: GUIDs, such as those embedded in a Word document, could be associated with a specific network card, and thus a specific computer, presuming the MAC address wasn't cloned. (You can still generate GUIDs in this old-fashioned way via UuidCreateSequential in rpcrt4.dll.) Nowadays GUIDs are virtually random: the current scheme (version 4, marked by the 4 that leads the third group of digits) is built from random bits rather than a MAC address or timestamp, so each generated value seems unrelated to the last.

For instance I just generated the following 6 GUIDs in a row.

FD202BEE-05EC-42FF-A9DE-153C507CAC60
BA5C300C-61DC-4AFB-9DDF-2EDEFED533F2
57BEB108-80D3-40B9-8CFD-0406E544156C
3848803F-01DB-4D14-B1DF-AFBFB3A7544B
23BD5BD5-B2AA-496A-B365-24E02224369B
13A03BD6-9521-4C72-AFF9-121F941EF0DC

Not a very logical progression.

Globally Unique

In cases where you need global uniqueness, however, GUIDs are critical. There were (and still are) tens or hundreds of thousands of development shops throughout the world creating COM components for the Windows platform, with no central registry where they could register specific component names to ensure that no more than one of them was creating an Image.Processor component. As such, COM components were early-bound to GUIDs instead of names (late-binding still had the name conflict issue), so at yafla we might tag our Image.Processor COM component with the generated GUID A2358A9A-C96D-4D72-B0E4-B732332408D6. It was very unlikely that another vendor would unintentionally collide with that value. This sort of global uniqueness, through the use of GUIDs, has carried over to a lot of other technologies as well, including .NET.

GUIDs In Your Database

This same sort of global uniqueness can play a role in our databases as well, primarily when we want to merge the contents of several databases or database sites while maintaining relational integrity and without changing the keys, and where our primary keys are surrogate keys rather than natural keys. In this case using a GUID as a primary key can allow for the relatively painless merging of datasets, and while the represented data may include logical duplicates (e.g. Bob Jones might be in both the SuperHyperMart and HyperMegaMart databases when they merge), the relational integrity and source keying will remain. For planned distributed databases a GUID isn't actually necessary, though: if you have a sales computer logging sales in New York, and another in Tokyo, and nightly you merge these databases, you can avoid collisions by automatically or manually assigning ranges to each database (e.g. New York autonumbers starting at 0, while Tokyo numbers from 1,000,000,000 on). Merge processes can easily rebase keys where necessary as well, again eliminating the need for GUIDs, although the keys will then differ from those on the source system.
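
The range-assignment idea can be sketched in a few lines. This is only a model: in a real system the ranges would be identity seeds on each database, not an in-memory counter. SiteKeyRange is a hypothetical class, and the upper bounds are assumptions, since the text only says where each site starts.

```csharp
using System;

namespace KeyRangeSketch
{
    // Hypothetical per-site allocator that hands out keys from a fixed range.
    public class SiteKeyRange
    {
        private readonly long first;
        private readonly long last;
        private long next;

        public SiteKeyRange(long first, long last)
        {
            this.first = first;
            this.last = last;
            this.next = first;
        }

        // Two sites can be merged without key collisions only if this is false.
        public bool Overlaps(SiteKeyRange other)
        {
            return first <= other.last && other.first <= last;
        }

        public long NextKey()
        {
            if (next > last)
                throw new InvalidOperationException("Key range exhausted.");
            return next++;
        }
    }

    public static class Program
    {
        public static void Main()
        {
            // Assumed bounds: each site gets one billion keys.
            SiteKeyRange newYork = new SiteKeyRange(0, 999999999);
            SiteKeyRange tokyo = new SiteKeyRange(1000000000, 1999999999);

            Console.WriteLine(newYork.Overlaps(tokyo));
            Console.WriteLine(newYork.NextKey());
            Console.WriteLine(tokyo.NextKey());
        }
    }
}
```

The trade-off against GUIDs is that someone has to plan the ranges up front and notice when a site is about to exhaust its own.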

The Cost of GUIDs

GUIDs don't come for free. The algorithms used to create GUIDs are relatively intensive, for instance, and even then the global uniqueness is largely theoretical (I'm a cynical sort, and view the idea of an algorithm that generates pseudo-random numbers "guaranteed" unique across space and time suspiciously). Due to this overhead, inserting into a database using generated GUIDs can be onerous, and far slower than inserting into a table using autonumbers. There are some webpages out there advocating techniques for creating sequential GUIDs, eliminating the onerous GUID creation cost, but then the point of using a GUID in the first place is lost (it is then more accurately a 128-bit number, and there is no rational assurance of global uniqueness).

On top of that, GUIDs are data pigs, taking up 16 bytes each. While this sounds minuscule in an era of monster memory and colossal hard drives, when you're dealing with enterprise databases with hundreds of tables with millions of rows each, such an overhead becomes extraordinary. Several ad hoc benchmarks exist "proving" that the overhead of GUIDs has little impact compared to an int; however, these comparisons almost always deal with query loads and datasets that fit entirely within the memory cache. The story would be vastly different with a real, highly relational enterprise system. In such systems it is the norm that the I/O system is the weak link (even with extremely expensive SAN systems), with the I/O pipe saturated continually. Unfortunately I don't have the resources to set up a high performance enterprise SAN-backed system to demonstrate this point, but I've dealt with some large enterprise systems where the storage I/O was overwhelmingly the weak point.

When used as primary keys that also serve as the clustered index, GUIDs can also lead to significant page splitting, as rows are constantly being inserted amidst the existing data. Compare this to an autonumber, where in the same scenario each new record would be added to the end (historically that led to a hot spot of heavy contention, but all modern database systems deal with it very elegantly).
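
A toy model shows where new keys land. The sketch below keeps keys in a sorted in-memory list and counts how many inserts fall before the current end; it is not a B-tree, and SQL Server orders uniqueidentifier values differently from Guid.CompareTo, so treat it purely as an illustration of random versus ascending keys. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

namespace InsertPositionSketch
{
    public static class InsertPosition
    {
        // Inserts each key into a sorted list and counts the inserts that land
        // before the current end (the analogue of inserting amidst existing rows).
        public static int CountMidInserts<T>(IEnumerable<T> keys) where T : IComparable<T>
        {
            List<T> sorted = new List<T>();
            int midInserts = 0;
            foreach (T key in keys)
            {
                int index = sorted.BinarySearch(key);
                if (index < 0)
                    index = ~index; // not found: the complement is the insertion point
                if (index < sorted.Count)
                    midInserts++;
                sorted.Insert(index, key);
            }
            return midInserts;
        }

        public static List<int> SequentialKeys(int count)
        {
            List<int> keys = new List<int>();
            for (int i = 1; i <= count; i++)
                keys.Add(i);
            return keys;
        }

        public static List<Guid> RandomKeys(int count)
        {
            List<Guid> keys = new List<Guid>();
            for (int i = 0; i < count; i++)
                keys.Add(Guid.NewGuid());
            return keys;
        }

        public static void Main()
        {
            Console.WriteLine("autonumber mid-inserts: {0}", CountMidInserts(SequentialKeys(1000)));
            Console.WriteLine("GUID mid-inserts:       {0}", CountMidInserts(RandomKeys(1000)));
        }
    }
}
```

Ascending keys always append, so the first count is zero; with random keys nearly every insert lands somewhere in the middle.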

GUIDs - The Pros

  • Globally unique. Immediately usable in merge scenarios
  • Can be generated on the middle tier, allowing developers to build all relational rows before pushing it simultaneously to the data store. Compare this to an autonumber scenario where the root row often needs to be pushed to the database to generate an autonumber to be used by the related rows
  • Very large data space - it's unlikely that your database will run out of GUIDs
  • In some cases a GUID can provide a bit of security through obscurity - if you publish a special command executable only by passing a secret GUID, it would take an average of 1.7e+38 attempts to get the right value. Of course this is a pretty marginal advantage
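
The middle-tier point from the list above can be sketched as follows: a parent row and its child rows are fully keyed before anything touches the database. OrderRow, OrderLineRow and OrderBuilder are hypothetical types invented for this example, and the actual push to the data store is left out.

```csharp
using System;
using System.Collections.Generic;

namespace MiddleTierKeysSketch
{
    // Hypothetical row shapes, not taken from any real schema.
    public class OrderRow
    {
        public Guid OrderId;
        public string Customer;
    }

    public class OrderLineRow
    {
        public Guid OrderLineId;
        public Guid OrderId; // foreign key to OrderRow.OrderId
        public string Product;
    }

    public static class OrderBuilder
    {
        public static OrderRow NewOrder(string customer)
        {
            OrderRow order = new OrderRow();
            order.OrderId = Guid.NewGuid(); // the key exists before any database call
            order.Customer = customer;
            return order;
        }

        public static List<OrderLineRow> NewLines(OrderRow order, params string[] products)
        {
            List<OrderLineRow> lines = new List<OrderLineRow>();
            foreach (string product in products)
            {
                OrderLineRow line = new OrderLineRow();
                line.OrderLineId = Guid.NewGuid();
                line.OrderId = order.OrderId; // child rows reference the parent immediately
                line.Product = product;
                lines.Add(line);
            }
            return lines;
        }

        public static void Main()
        {
            OrderRow order = NewOrder("Bob Jones");
            List<OrderLineRow> lines = NewLines(order, "Widget", "Gadget");
            // At this point the whole graph could be sent to the data store in one batch.
            Console.WriteLine("{0} lines keyed to order {1}", lines.Count, order.OrderId);
        }
    }
}
```

With an autonumber key the child rows could not be completed until the parent insert had returned its generated value.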

GUIDs - The Cons

  • It's processor/time intensive to create real GUIDs compared to autonumbers
  • GUIDs are data hogs, taking 4x the space of an int, and twice the space of a bigint. In a highly relational design this data bloat is amplified, impacting related tables and indexes as well
  • Pseudo-random GUIDs lead to significant data and index fragmentation
  • GUIDs can be unintuitive for developers - it's easier to remember SurveyID 72 than it is SurveyID 11DF30D5-FAAF-4896-83D4-C781ACDBB899. Likewise GUIDs can be unintuitive for users - a URL with a GUID is much uglier than the same with an integer
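
The space figures in the list above are easy to check with some back-of-the-envelope arithmetic. The sketch below counts only the key bytes themselves, ignoring row, page and index overhead; the 100 million row count is an assumed figure, and KeySize is a hypothetical helper.

```csharp
using System;

namespace KeySizeSketch
{
    public static class KeySize
    {
        public const int GuidBytes = 16; // a GUID is 128 bits

        // Bytes spent on key values alone for the given number of rows.
        public static long KeyBytes(long rows, int bytesPerKey)
        {
            return rows * bytesPerKey;
        }

        public static void Main()
        {
            long rows = 100000000; // assumed: 100 million rows in one table

            Console.WriteLine("int:    {0} bytes", KeyBytes(rows, sizeof(int)));
            Console.WriteLine("bigint: {0} bytes", KeyBytes(rows, sizeof(long)));
            Console.WriteLine("GUID:   {0} bytes", KeyBytes(rows, GuidBytes));
        }
    }
}
```

Every foreign key column and every index that carries the key repeats this cost, which is why the bloat is amplified in a highly relational design.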

Conclusion

Every situation is different, and there most certainly are appropriate times and places for GUIDs (a universally unique time and place!). Just don't toss rational evaluation to the wind and adopt the GUID by default under the illusion that it's any more "Enterprise" ready: in reality the opposite is often the case.

[RELATED ARTICLE : See High Performance SQL Server]

Saturday, October 01 2005

Today's family outing was to the Caledonia Fall Fair. We love these fall fairs (there's something very wholesome and fun about them, and there is a remarkably rural atmosphere at fairs located just minutes from the city), but it's been a very, very busy late-summer/fall thus far so this is the first one we actually had a chance to visit this year. Turns out that it's actually one of the last of the year. It was a great choice as this was one of the better fall fairs we've been to. We do plan on hitting the Rockton Fair next weekend after hearing some very positive things about it, and pictures will of course follow.


More photos from today's outing can be found here.

Sunday, October 02 2005

One of the justified concerns when using an int identity as your surrogate primary key is that you'll exceed the capacity of the data type. For example, if you accept the defaults, with your autonumbers seeded at 1 with an increment of 1, you have the capacity to store 2,147,483,647 records. While that sounds like a lot of records, and it most certainly is far beyond the lifetime size of most databases, it does have the potential of being exhausted in massive databases, or databases that see lots of rolled-back transactions (which still use up identity values). If it's a realistic possibility that you'll exceed 2 billion records, consider using one of the larger data types, such as a bigint. Avoid using the larger data types unless realistically necessary, however, as there is a storage and I/O cost that needs to be factored in.

Another potential solution is to take advantage of the negative range of the signed int. You could do this by seeding your identity values with -2147483648, incrementing from there. This will make your first record IDs less human friendly (e.g. CustomerID -2147483648 instead of CustomerID 1); however, it will double the identity range available, offering up 4 billion+ identity values.
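
The capacity figures can be confirmed with a little arithmetic. The sketch below uses the C# int, which has the same range as a 32-bit signed SQL Server int; IdentityRange and Capacity are hypothetical names, and a positive increment is assumed.

```csharp
using System;

namespace IdentityRangeSketch
{
    public static class IdentityRange
    {
        // Number of identity values an int column can hand out when counting
        // upward from seed in steps of increment (increment assumed positive).
        public static long Capacity(int seed, int increment)
        {
            return ((long)int.MaxValue - seed) / increment + 1;
        }

        public static void Main()
        {
            Console.WriteLine(Capacity(1, 1));            // default seed of 1
            Console.WriteLine(Capacity(int.MinValue, 1)); // seeded at -2147483648
        }
    }
}
```

The arithmetic is done in long because the size of the full int range does not itself fit in an int.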

You could also do this in already existing and populated tables by resetting the seed to a negative value, for instance

DBCC CHECKIDENT ('YourTableName', RESEED, -2147483648)

However, this will lead to insert issues (new rows will be inserted at the head of the data if your clustered index is on the primary key), and the identity value will get reset the next time you call

DBCC CHECKIDENT ('YourTableName')
