A couple of simple rules that can help ensure code clarity, maintainability, performance, and stability:
[UPDATE: Also see the entry relating to sequential GUID values]
This topic has come up in discussions quite a few times, with many database designers and developers seemingly believing that GUIDs are the way to go simply because they're large and intimidating. While GUIDs can be the best choice, in many cases they're simply wasteful extravagance.
GUID Basics
Let's start with some definitions - A GUID is a Globally Unique IDentifier, and it's the Microsoft take on the OSF's UUIDs (Universally Unique IDentifier. I guess Microsoft is more pessimistic than the Open Software Foundation). It is a 128-bit (16-byte) pseudo-randomly generated value that, if the algorithm is correct, should theoretically never collide with another GUID generated in any other place at any other time (at least until the year A.D. 3400). You can generate GUIDs in T-SQL Using NEWID(), while in .NET you can use the System.Guid value type's static method NewGuid(). In the Win32 world you can call CoCreateGUID().
Historically GUIDs used the network card's MAC address as the starting point, incorporating a time component and then incrementing a value part sequentially with each generation, however privacy concerns abolished that standard (because GUIDs, such as those embedded in a Word document, could be associated with a specific network card - and thus a specific computer - presuming the MAC address wasn't cloned. Note that you can still generate GUIDs in this old-fashioned way via UuidCreateSequential in rpcrt4.dll). Nowadays GUIDs are virtually random (there is a time component, but from the perspective of a user each value generation seems unrelated).
For instance I just generated the following 6 GUIDs in a row.
FD202BEE-05EC-42FF-A9DE-153C507CAC60
BA5C300C-61DC-4AFB-9DDF-2EDEFED533F2
57BEB108-80D3-40B9-8CFD-0406E544156C
3848803F-01DB-4D14-B1DF-AFBFB3A7544B
23BD5BD5-B2AA-496A-B365-24E02224369B
13A03BD6-9521-4C72-AFF9-121F941EF0DC
Not a very logical progression.
Globally Unique
In cases where you need global uniqueness, however, GUIDs are critical. There were (and still are) tens or hundreds of thousands of development shops throughout the world creating COM components for the Windows platform, with no central registry where they could register specific component names to ensure that no more than one of them was creating an Image.Processor component. As such, COM components were early-bound to GUIDs instead of names (late-binding still had the name conflict issue), so at yafla we might tag our Image.Processor COM component with the generated GUID A2358A9A-C96D-4D72-B0E4-B732332408D6. It was very unlikely that another vendor would unintentionally collide with that value. This sort of global uniqueness, through the use of GUIDs, has carried over to a lot of other technologies as well, including .NET.
GUIDs In Your Database
This same sort of global uniqueness can play a role in our databases as well. Primarily when we want to merge the contents of several databases or database sites, maintaining the relational integrity and without changing the keys, and where our primary keys are surrogate keys rather than natural keys. In this case using a GUID as a primary key can allow for the relatively painless merging of datasets, and while the represented data may include logical duplicates (e.g. Bob Jones might be in both the SuperHyperMart database and the HyperMegaMart databases when they merge), the relational integrity and source keying will remain. For planned distributed databases a GUID isn't actually necessary, though: If you have a sales computer logging sales in New York, and another in Tokyo, and nightly you merge these databases, you can avoid collisions by automatically or manually assigning ranges to each database (e.g. New York autonumbers starting at 0, while Tokyo numbers from 1,000,000,000 on). Merge processes can easily rebase keys where necessary as well, again eliminating the need for GUIDs, however the keys will differ from those on the source system.
The Cost of GUIDs
GUIDs don't come for free. The algorithms used to create GUIDs are relatively intensive, for instance, and even then the global uniqueness is largely theoretical (I'm a cynical sort, and view the idea of an alogorithm that generates pseudo-random numbers "guaranteed" unique across space and time suspiciously). Due to this overhead, inserting into a database using generated GUIDs can be onerous, and far slower than inserting in a table using autonumbers. There are some webpages out there advocating techniques of creating sequential GUIDs, eliminating the onerous GUID creation cost, but then the point of using a GUID in the first place is lost (and it is then more accurately a 128-bit number, and there is no rational assurance of global uniqueness).
On top of that, GUIDs are data pigs, taking up 16-bytes each. While this sounds miniscule in an era of monster memory and collosal hard drives, when you're dealing with enterprise databases with hundreds of tables with millions of rows each, such an overhead becomes extraordinary. Several adhoc benchmarks exist "proving" that the overhead of GUIDs has little impact compared to an int, however these comparisons almost always deal with query loads and datasets that entirely fit within the memory cache. The story would be vastly different dealing with a real, highly-relational enterprise system. In such systems it is the norm that the I/O system is the weak link (even with extremely expensive SAN systems), with the I/O pipe saturated continually. Unfortunately I don't have the resources to setup a high performance enterprise SAN-backed system to demonstrate this point, however I've dealt with some large enterprise systems where the storage I/O was overwhelmingly the weak point.
When used as primary keys that also serve as the clustered index, GUIDs can also lead to significant page splitting, as rows are constantly being inserted amidst the exist data. Compare this to an autonumber where in the same scenario each new record would be added to the end (historically that led to a hot spot of heavy contention, but all modern database systems deal with it very elegantly).
GUIDs - The Pros
GUIDs - The Cons
Conclusion
Every situation is different, and there most certainly are appropriate times and places for GUIDs (a universally unique time and place!). Just don't toss rational evaluation in the wind and adopt the GUID by default under the illusion that it's any more "Enterprise" ready: In reality the opposite is often the case.
[RELATED ARTICLE : See High Performance SQL Server]



One of the justified concerns when using an int identity as your surrogate primary keys is that you'll exceed the capacity of the data type. e.g. if you accept the defaults, with your autonumbers seeded at 1 with an increment of 1, you have the capacity to store 2,147,483,647 records. While that sounds like a lot of records, and it most certainly is far beyond the lifetime size of most databases, it does have the potential of being exhausted in massive databases, or databases that see lots of rolled-back transactions (which still use up identity values). If it's a realistic possibility that you'll exceed 2 billion records, consider using one of the larger data types, such as a bigint. Avoid using the larger data types unless realistically necessary, however, as there is a storage and I/O cost that needs to be factored in.
Another potential solution is to take advantage of the negative range of the signed int. You could do this by seeding your identity values with -2147483648, incrementing from there. This will make your first record IDs less human friendly (e.g. CustomerID -214783648 instead of CustomerID 1), however it will double the identity range available, offering up 4 billion+ identity values.
You could also do this in already existing and populated tables by resetting the seed to a negative value, for instance
DBCC CHECKIDENT ('YourTableName', RESEED, -2147483648)
However this will lead to insert issues (as it'll be inserting at
the head of the data if you've cluster indexed on your primary
key), and the ident will get reset the next time you call
DBCC CHECKIDENT('YourTableName')
Some time back, around a year ago, I released a relatively simple command line utility - PureJPEG - to filter EXIF data (along with application data blocks, thumbnails, and so on) out of JPEGs. The utility took me an evening to throw together (it's a pretty straightforward C++ app), and was actually just a research branch of some image search algorithms I was working on - a project that I need to return to at some point.
Since I released that tool I've had literally 10s of thousands of downloads...
Nonetheless, it really filled a need: An enormous number of people were unaware of the types of EXIF data in their images, or the impact that it had on their data size (many images on the net are over 50% EXIF and thumbnail data, in cases where it is just extraneous waste). Since I released that tool I've had literally 10s of thousands of downloads (interesting note: a largely disproportionate number of the downloads are from people located in Russia - I have to guess that it piqued the interest of a Connector [terminology courtesy of the Tipping Point - great book] in Russia, and they spread it to their network. I think I'll get a Russian translation of the page put up). It is enormously satisfying as a software developer when I see something I've done has helped someone, however marginally.
In any case, on the topic of EXIF - I recently upgraded to a Canon Digital Rebel XT. I absolutely love this camera, but for whatever reason it sticks the camera serial number in the EXIF. Perhaps the serial number of a camera isn't really top secret, but nonetheless this seems like a completely needless piece of info to be sitting in every image I put online or elsewhere. It just seems like a piece of info that could be used in insurance fraud, retail deception, or some other nefarious activity. Perhaps it isn't secret, but it really isn't the sort of thing you shout from the rooftops (similar to how people and organizations obscure license plate numbers, yet they're really ridiculously un-private).
A much more profound privacy concern could come up when cameras finally start making use of the geographical coordinate points sitting largely unused in EXIF currently. These data points, storing items like latitude, longitude, and altitude, will make for absolutely brilliant geocoded photo databases once cameras start incorporating a GPS. For instance many cell phones are starting to incorporate a GPS to accommodate e911 requirements, and of course many cell phones already have onboard cameras, so it's inevitable that the technologies will collide. Imagine having the ability to search in a tool like Picasa for photos taken at a particular house, or in a particular park, without having the hassle of manually adding keywords categorizing each photo. Imagine a shared service like Flickr with brilliant locational searches.
Even better if cameras also stored the attitude and direction each photo was taken - Imagine seeing a cityscape with view cones emanating out, with colour coded focus zones (which can be determined by a variety of other EXIF data points). With a clean GPS signal, you could tell which photos were taken of someone sitting on the steps of city hall, out of the CN Tower looking towards Toronto Island, or towards the leaning tower of Pisa looking West at dusk during late August, all without relying upon haphazardly scatter-shotted user categorizing and captioning.
When this technology finally hits the mainstream - the merging of quality digital cameras and GPSs (likely through our phones) - the impact is going to be absolutely profound, and it will completely change how we archive and access our images.
[SEE FOLLOWUP - http://www.yafla.com/dforbes/2005/10/06.html#a100, and http://www.yafla.com/dforbes/2005/11/29.html#a201]