[The static location of this piece can be found here]
This entry is a follow-up to Interesting Facts About Domain Names.
It's time for some bubble chart fun in the data analysis of the .COM domain space (analyzing the 47.7 million+ .COM domains, distilled from the 150 million+ row zone file). While my original entry was purely to satisfy personal curiosity, and to serve as a test bed for a publicly obtainable mid-sized dataset, the surprising interest has me revisiting this topic (while finalizing another, much more interesting comparison against logical sequences and dictionary values). This outing is far more obscure than the first entry, and the charts are nowhere near as instantly informative, but I found the results fascinating nonetheless. The next entry on this subject will be much more immediately consumable.

This chart needs a bit of an explanation -- usually a bad sign as charts should normally be self-explanatory, but in this case it's graphing something a bit more complex -- so some clarifications are in order.
Length is of course the length of the domain. While 0 is plotted on the axis, only domains 3 or more characters long are charted. For instance yafla.com is a 5-character domain, as I'm excluding the TLD (top-level domain, which in this case is .com) portion.
Diversity is a measure of how repetitious a domain name is, with the vertical scale going from those domains comprised of a single repeating character (e.g. aaaaaaaaaaaa.com) at the bottom, to domains where every character is unique (abcdefghijkl.com) at the top (in this case the diversity calculation was implemented as a C# .NET scalar function, used directly from the SQL set operations). The bubble sizes vary based upon the number of samples that match a particular diversity and domain length, and of course bubbles that are too small are not displayed.
For instance reddit.com has a calculated diversity of 80%, while the shorter yafla.com has a calculated diversity of 75%.
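Both sample figures are consistent with a simple formula: (distinct characters - 1) / (length - 1), which also gives 0% for a single repeated character and 100% when every character is unique. That formula is inferred from the examples here, not lifted from the original C# function, so treat the following Python sketch (the name diversity is hypothetical) as an assumption:

```python
def diversity(label: str) -> float:
    """Score how repetitious a domain label is, from 0.0 to 1.0 (TLD excluded).

    Assumed formula, inferred from the reddit/yafla examples:
    (distinct characters - 1) / (length - 1).
    """
    label = label.lower()
    if len(label) < 2:
        raise ValueError("diversity needs at least two characters")
    return (len(set(label)) - 1) / (len(label) - 1)


print(f"{diversity('reddit'):.0%}")  # 80%
print(f"{diversity('yafla'):.0%}")   # 75%
```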
The bubbles have been normalized so that each is sized relative to the total count at that length, which means less popular lengths are intentionally drawn disproportionately large (otherwise they would be drowned out). At the smaller lengths the logical, "legitimate" domains vastly outnumber the repetitious or random domains, whereas at the longer lengths a larger percentage of the domains are repeating characters or random sequences, and this is evident on the chart.
As the length of the domain increases, the probability of character collisions mathematically increases, explaining why the diversity declines at a fairly predictable rate. The highly diverse domains at longer lengths are usually seemingly nonsensical domains (such as 9876543210ZYXWVUTSRQPONMLKJIHGFEDCBA.com), as are the low diversity domains (e.g. 401K-401K-401K-401K-401K-401K-401K-401K-401K-401K-401K-401K.com, A-------------------------------------------------------------A.com, FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE.com).
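The collision effect can be roughed out with a toy model. The sketch below assumes every character is drawn uniformly at random from a 37-character alphabet (a-z, 0-9, and the hyphen), which real domains obviously are not, and it assumes diversity is scored as (distinct - 1) / (length - 1). Under those assumptions the expected number of distinct characters in a string of length n is k(1 - (1 - 1/k)^n):

```python
ALPHABET_SIZE = 37  # assumption: a-z, 0-9 and the hyphen, all equally likely


def expected_distinct(length: int, alphabet: int = ALPHABET_SIZE) -> float:
    """Expected number of distinct characters in a uniformly random string."""
    return alphabet * (1 - (1 - 1 / alphabet) ** length)


def expected_diversity(length: int, alphabet: int = ALPHABET_SIZE) -> float:
    """Expected diversity under the assumed (distinct - 1) / (length - 1) score."""
    return (expected_distinct(length, alphabet) - 1) / (length - 1)


# The expected score falls steadily as the label gets longer.
for length in (5, 10, 20, 40, 63):
    print(length, f"{expected_diversity(length):.0%}")
```

This only illustrates the trend; the real curve also reflects the letter frequencies of actual words.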

Using a bubble chart again, this chart details the clusters of domains starting with various numeric characters at differing lengths. The more domains of a given starting character and length, the larger the bubble.
What intrigued me about this chart was the fact that some numbers have odd distribution patterns. For instance 8 as a domain starting character sees a generally declining prevalence as the length increases, with 3,352 domains starting with the character 8 having a length of 9 characters, but then suddenly there are 8,940 domains starting with 8 at a length of 10. Looking at the actual matching data made it instantly clear -- 1-800 numbers. Dropping the 1 and dashes, 1-800 numbers are 10 characters in length (e.g. 8004INJURY.com).
Similarly, 1 holds steady on a gradual decline as the length increases, but then suddenly at 11 characters it spikes (from 18,328 instances at 10 characters long, to 24,993 instances at 11 characters). This is for the same reason that 8 spiked, but in this case with the 1 prefix.
6 spiking at 8 characters long is an oddity, but I discovered that Netflix registered a huge array of largely sequential 8 character values starting with 6 (e.g. 60142240.com, 60155520.com, etc), letting them sit as parked pages. Not sure what the speculation is on these (SKUs perhaps?). Maybe they're going to give every customer their own domain by customer id.
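The counts behind a chart like this are just a tally keyed on starting digit and length. The original numbers came from SQL set operations; the Python below is only an illustrative sketch, with a hypothetical function name and a tiny made-up sample list standing in for the zone file:

```python
from collections import Counter


def digit_bubbles(labels):
    """Count labels (TLD already stripped) by (starting digit, length)."""
    counts = Counter()
    for label in labels:
        if label and label[0] in "0123456789":
            counts[(label[0], len(label))] += 1
    return counts


# Hypothetical sample: a 1-800 style name with and without the leading 1,
# plus two sequential 8-digit names.
sample = ["8004injury", "18004injury", "60142240", "60155520", "yafla"]
for (digit, length), count in sorted(digit_bubbles(sample).items()):
    print(digit, length, count)
```

Each (digit, length) pair becomes one bubble, with the count driving its size.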

On the same theme, this bubble chart shows the population distribution of domains starting with alphabetic characters at various lengths, with A to Z running from left to right (the charting tool completely disallowed characters on the X-axis, and I haven't had time to image them in). This is pretty much as expected. S is the fattest teardrop, 8 in from the right.
International Domain Names have been filtered out of both results.
Speaking of S, many speculated that the reason S was the dominant starting character for domains was due to sex related domains. While that's a reasonable guess, domains starting with SEX actually only comprise 80,277 (about 1.9%) of the 4,330,172 domains starting with S (of course there are more that mask it in variations like S-E-X, but they're relatively few in comparison). Instead S is just a popular starting character, particularly among domains starting with STA, SAN, SOU, SHO, STE, SHA, SUP, STO, and STR (which together comprise 1 million of the domains).

Of course that chart naturally raises the question of which 3-letter sequence is most prevalent.
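Ranking starting sequences is a one-pass tally. This is a minimal Python sketch rather than the SQL actually used; the function name and the sample labels are hypothetical:

```python
from collections import Counter


def top_prefixes(labels, size=3, limit=10):
    """Return the most common leading sequences of the given size."""
    prefixes = Counter(
        label[:size].upper() for label in labels if len(label) >= size
    )
    return prefixes.most_common(limit)


# Made-up sample labels (TLD already stripped).
print(top_prefixes(["station", "stapler", "sandbox", "st"], limit=1))
# [('STA', 2)]
```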

Just some cute charts while I find time to complete the more interesting, human-interest domain name analysis.
For those who are interested, the following is a non-exhaustive, unsorted list of some of the sites that have been linking in over the past week. I've discovered some great sites in the process of seeing where people are coming from, and thought this might be interesting for others as well.
http://www.reddit.com
http://www.waxy.org
http://www.digg.com
http://www.popurls.com
http://joel.reddit.com
http://www.newsgator.com
http://www.stumbleupon.com
http://www.instantdomainsearch.com
http://www.netvibes.com
http://bella.blog.sme.sk/
http://www.namespros.com
http://jwz.livejournal.com (contains some images that are probably NSFW)
http://del.icio.us
http://www.pageflakes.com
http://www.lifehacker.com
http://www.scripting.com
http://www.bluesnews.com
http://www.megite.com
http://meneame.net
http://tech.memeorandum.com
http://grumpygamer.com
http://blog.guykawasaki.com
http://xo.typepad.com
http://grabun.com
http://oink.elrellano.com
http://www.namedevelopment.com
http://www.scoopeo.com
http://www.icannwatch.org
http://www.hivelogic.com/
http://weblogtoolscollection.com/
http://www.boingboing.net
http://sethgodin.typepad.com
http://www.zefrank.com
http://www.vilaweb.cat
http://www.ollo.net
http://www.usemycomputer.com
http://www.webdesigntimes.com
http://www.plastic.com
http://www.lostremote.com
http://linkfilter.net
http://dirty.ru
http://sandbox.sourcelabs.com
http://www.geekpress.com
None of these are quid pro quo links (though I have gotten several "we've linked you, so please link us" emails, I automatically trash anything along those lines), and all are based on actual referrals.
I apologize if I've missed anyone -- there was a manual process where I verified that they weren't spam sites with mechanical referral stuffing, so these aren't simply a direct copy from the logs. That manual process may have introduced errors and discrepancies.
Yesterday was quite a traffic day here on yafla.
After seeing a continuous low level of interest coming from Reddit.com, continuing from Wednesday afternoon, mid-yesterday I happened to check the stats to find it reporting ~3500 simultaneous visitors (of course they weren't actually simultaneous GETs, but rather were simultaneous from the perspective of sessions). Initially I presumed that it was a software defect, but I quickly discovered it was real, and was due to the domain name entry appearing on Digg's front page. A number of other great sites linked in as well, and later in the day some Slashdotting was added to the mix, as another entry was apparently referenced in a story there.
Many impressive sites feeding in thousands upon thousands of visitors an hour. All told, from when the interest really kicked off early in the afternoon some 36,000 visitors came through by midnight, browsing well over 100,000 pages. This high level of traffic has continued through today.
I want to give my host a breather (the excellent ISQSolutions), so I'm going to hold off publishing the follow-up to the domain name entry (where I include stats such as dictionary values, phrase variations, etc) for a day or two to let the traffic settle a bit.
Through it all the site never failed to serve up pages (during the height of it the server continued to serve pages virtually instantly), courtesy of the fact that I publish these pages rendered into static form. Not only does this avoid the unnecessary overhead of script interpreting or database access, it also allows IIS 6 to kernel-cache the pages, allowing it to serve cached pages without even leaving kernel mode.
I should also say that Digg's influence is vastly greater than I postulated previously. I've had several pages as the primary focus of Slashdot stories before, and they didn't yield the simultaneous influx that a Digg front page did.
You've thought up a brilliant idea for a new Web 2.0, AJAX-enabled web app, or you're about to release a thus-far-unnamed killer software app. Now you just need to find the perfect domain name for it to live at (and, in true new-economy fashion, you'll base your corporate name upon whatever available domain name you find... PILLAGEANDPLUNDR Corporation).
You pull up GoDaddy and start punching in clever names, along with their many variations, only to find that they're all seemingly taken.
"This can't be!" you cry. "Has every possibility already been registered?"
Given that there are approximately 50 million .COM domains registered, it is indeed true that the low-hanging fruit domain names are overwhelmingly taken. Your chances of lucking upon an unnoticed available three-letter acronym (TLA) are close to zero, and your only recourse would be to haggle with domain speculators.
If you want one of the 676 possible two-letter sequences, for instance for an acronym or abbreviation, you're out of luck: They're all taken. Even allowing for digits, giving 1296 combinations, again every single variation is taken.
Of course, that's ignoring the fact that .COM registrars now mandate a 3-character minimum length, so it wouldn't be an option anyways.
Of the 17,576 possible three-letter sequences, again every single one is already taken. Adding digits to the mix (note that I'm intentionally ignoring obtuse dashes for such short domain names, though technically they are legal from the second character onwards), giving 46,656 permutations, yields a larger number of garbage domain entries (either REGISTRAR-LOCKED, REDEMPTIONPERIOD, or with no nameservers), giving a false hope of 228 seemingly open domains, yet they aren't actually available.
If you're dying to acquire great domains like 8VZ.com or Q6X.com, they'll free up within a month, though it seems evident that there are swaths of domain speculators acquiring every variant when they come available, so they won't go without a fight.
Stepping up to four-letter sequences, choosing among the 456,976 combinations, yields a vastly greater availability -- perhaps the set is a bit too large for domain speculators and their unlikely success with random sequences -- with 97,786 showing as open. A quick check verifies that most are legitimately available, including "choice" domains such as AGJV.com, EIYK.com, GZVW.com, and QFEV.com. Add digits into the mix and there are a massive 1.16 million open domains, so long as you're looking for something like 7RG8.com or U3JZ.com. Choose one and then manufacture a ridiculous backronym to explain it.
Going to 5-letter sequences (yet another five-letter acronym? YAFLA?), and of course the possibilities are rich, again presuming that you're willing to accept an arbitrary sequence of letters and/or digits, creating a backronym to match. Using just letters you have a rich 11,881,376 possibilities, of which approximately 11,015,028 are unclaimed.
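The sizes of these search spaces are plain powers: 26 per position for letters only, 36 once digits are allowed (hyphens ignored, as above). A quick Python check, with the open-domain counts copied from the figures quoted here and the helper name hypothetical:

```python
LETTERS = 26
ALNUM = 36  # letters plus digits; hyphens ignored, as in the text


def sequences(alphabet: int, length: int) -> int:
    """Number of possible sequences of the given length."""
    return alphabet ** length


for length in range(2, 6):
    print(length, f"{sequences(LETTERS, length):,}", f"{sequences(ALNUM, length):,}")
# 2 676 1,296
# 3 17,576 46,656
# 4 456,976 1,679,616
# 5 11,881,376 60,466,176

# Share of letter-only sequences showing as open, using the counts quoted above.
for length, open_count in ((4, 97_786), (5, 11_015_028)):
    print(length, f"{open_count / sequences(LETTERS, length):.1%} open")
```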

Of course many of the registered domains are seldom, if ever, visited, with a huge percentage having nothing more than a parked page (users pay domain registrars to put up ads for themselves). Thus, analyzing the domain database without taking into account popularity/traffic is of limited value, but it does provide for a bit of entertainment.
As mentioned, 100% of 2 and 3 letter domain names are taken, but it starts to free up as the number of possibilities explodes, all the way up to 63-character domain names. The most popular registered domain name length is actually 11 characters long, tailing off from there.

The fun doesn't end at 31 characters, however. There are 253,000+ non-IDN domains that are 32 characters or longer, including 538 that are 63 characters long.

These include such superlative domains as ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ.com, WEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEB.com, and DIDYOUKNOWTHATYOUCANONLYHAVESIXTY-THREECHARACTERSINADOMAIN-NAME.com.
The US Census Bureau has some handy common name files available on their site, so I thought I'd see how one's luck would be trying to register their own name(s).
If you're looking for a masculine domain name, you'll be disheartened to learn that of the 1219 male names listed by the US Census Bureau, every single one is registered. If you're looking for something feminine, you're in luck: As I type this, among the 2841 female names listed by the Census, you can soon grab the lucrative, recently expired Erlinda.com, or Shanita.com, which is sitting in purgatory, though both are technically currently taken.
On the family name front, 100% of the top 10,000 family names are registered.

Cross joining the top 300 male names with the top 300 family names finds that ~10,112 of the 90,000 possibilities aren't registered, to the benefit of anyone named Antonio Hughes and Lawrence Torres out there! Similarly, cross joining the top 300 female names with the top 300 family names finds that ~14,103 possibilities are unclaimed.
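The cross join itself is simple to sketch. The Python below assumes the first and family names are concatenated with no separator, and the registered set is made up purely for illustration; the real check ran as a SQL join against the full zone data:

```python
from itertools import product


def unregistered_full_names(first_names, family_names, registered):
    """Cross join two name lists and keep combinations not yet registered."""
    candidates = (
        f"{first}{family}".lower()
        for first, family in product(first_names, family_names)
    )
    return [name for name in candidates if name not in registered]


# Hypothetical stand-in for the registered .COM set.
registered = {"antoniotorres", "lawrencehughes"}
print(unregistered_full_names(["Antonio", "Lawrence"], ["Hughes", "Torres"], registered))
# ['antoniohughes', 'lawrencetorres']
```

With 300 names on each side, the product yields the 90,000 candidates mentioned above.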
On the love front, 1958 (68.9%) of the 2841 possible 'ILOVE'-prefixed female names (using the census set of names) sit unclaimed, which is surprising, as only 665 (54.5%) of 1219 'ILOVE'-prefixed male names remain available.

Continuing down that path, the seedier side of the internet is hardly a secret, and it's evident in the DNS database as well. 268,971 domains contain the sequence SEX (11,333 of them also containing the sequence FREE), while 143,683 domains contain the sequence LOVE.

The most common letter to start a domain is S, with relatively few domains starting with Q, X, Y or Z.

The most common digit to start a domain is, unsurprisingly, 1.

Every successful company has remoras and haters, so it was interesting to look at the number of suffixed alternatives for some well-known domains. While some of these are actually owned by the root domain owner, most are hangers-on and critics.

Samples include GOOGLE-AMERICA, GOOGLE-BUDDY, MICROSOFT-EBOOKS, SLASHDOTREVIEW, SLASHDOTSLASH, and YAHOO2007.
Hopefully this was a bit entertaining, and maybe even informative. I'm doing a much more intriguing, large-scale analysis (again, it's a nice opportunity to demonstrate some of the new SQL Server 2005 functionality) that I'll publish soon, but these were the low-hanging fruit.
[Also see Domain Name Analysis - More Fascinating But Entirely Useless Charts]
While I live and work in the Greater Toronto Area -- in beautiful Burlington, a town that we love and have adopted as our home -- my origins were in the humble, working-class town of St. Thomas, Ontario.
With a population of 33,000, St. Thomas' claim to fame, oddly, remains that it was the place where Jumbo the elephant was killed by a passing train in 1885. I remember being a kid in the town when they decided to further celebrate this infamy. After much debating, they ordered a statue at the grand cost of $50,000, holding a parade to celebrate its arrival.

For years the speculation raged about which high school football team would paint the elephant pink first, though I can't recall it ever actually happening. Strangely, the elephant's rear is pointed at one of the main roads coming into town.
As St. Thomas is located just a couple of hours down the highway from here (15 minutes South of London, which is another town that my wife and I lived in for several years. Thus far we've remained Ontarians, although Calgary has oft enticed us), we still visit regularly to see friends and family, and did so this weekend to celebrate my son's first birthday. Thanks for hosting it, Michelle. Hopefully you don't find green-icing anywhere unexpected.
I took a couple of pictures of St. Thomas, usually while one or both of my children slept in the backseat. These were generally taken out of the car window, or with my son in my arms, so manage expectations accordingly. Most were taken very early on Sunday morning, after my son decided to wake up at 5:30am, and then promptly fell asleep the moment we left for a coffee run.

The relatively early hour explains the dearth of activity in the city (after a summertime all-night game of Empire during my mid-teens, my friends and I would hop on our bikes and ride down the center of the main street right after sunrise. It was surreal having no other human anywhere in sight).






Growing up in St. Thomas, at least in the pre-teenage years, was fantastic, though I didn't appreciate it as much then: The city hosts numerous natural areas, most with significant elevation features. One great area was, and probably still is, affectionately known as "suicide" because of its narrow, ultra-steep trails, bordered by countless head-cracking trees. Nothing matched the exhilaration of wailing down one on an old, brakeless BMX bike, relying on the perfectly applied use of one's foot in the front wheel forks for any stopping power...hoping to avoid locking the wheel and catapulting headfirst through the air. Another youth delight was courtesy of the railroad history of the town: an extensive network of train tracks and train bridges crisscrossing the town. These railroad networks were the sneaker highway, the fossil searching grounds, and playground. The bridges were the source of many unintentional Stand By Me re-enactments (or rather pre-enactments).

For an adventurous, energetic child with more liberal parents, St. Thomas couldn't be beat.
For the child of "fearing-their-child-playing-in-an-inaccessible-place-called-suicide" parents, things were still pretty good: The main city parks, Waterworks and Pinafore, are both superlative public areas.





Waterworks features a swimming and wading pool, the natural fun of Kettle Creek (I marvel that I used to swim in that muddy, slug and snapping-turtle filled river), and a "waterfall" (a very small dam), and was a frequent lunchtime playground when I went to the nearby Lockes Public School. Pinafore is a large, open-area park with a very small animal exhibit (Tito the deer and friends, birds, some visiting swans, ducks, and so on), an extensive playground, bandshells, ball diamond, and picnicking areas, and a reservoir and small stream.
Many fond childhood memories involved family get-togethers in the parks, or picking up a bucket of KFC and heading down to Waterworks for some swimming. During the summer the city works department does an amazing job with the flowers in both parks, and they're spectacular shows. It's something you don't appreciate when you're young, but when I see it in the summer now I'm amazed.
The general feel of the core of St. Thomas is of a somewhat decayed, almost US rust-belt type town, which makes sense given that it saw the same boom-bust cycle of those US towns: After the geographical happenchance boom of railway faded, the city relied heavily upon the automotive industry (the vulnerable St. Thomas Ford Assembly Plant sits nearby, and many supply firms were located in St. Thomas). This reliance on one of the most vulnerable sectors of the economy meant that every economic cycle was greatly amplified.
The smallest economic blip meant automotive industry shutdowns that reverberated through every industry in the town. During every downturn St. Thomas usually hosts an unemployment rate far above the average.
The general demeanor of the town was often one of paranoia and fear mongering, with the long predicted closing of the Ford plant being the primary worry for years on end. The general attitude was often one of being impotent bit-players, awaiting the moves of some far off business executive to cast one's life asunder.
Throughout the town, gorgeous old turn-of-the-20th-century brick houses have been slashed into numerous apartments, and/or are falling apart due to neglect. Much of the city's retail space sits empty, or full of very low-end retailers (e.g. dollar stores). Other areas of retail have seen a resurgence (such as the building of a new retail complex featuring a new Walmart, Canadian Tire, among many others).
In other parts of the town, old houses are starting to be reclaimed and rejuvenated, and new subdivisions are sprouting up. St. Thomas is becoming a suburb of London.
Amazing to consider this -- when I was a kid in this town, London was some far off mythical place that one only ventured to under extreme conditions. Now it's rightly a short commute, and some wise Londoners are getting a steal for some structurally sound, beautiful old houses in St. Thomas' core.
Nonetheless, St. Thomas' industrial roots still show through. The town has far more big-old-pickups per capita than anywhere -- often featuring the name of the driver and his S.O. stenciled on the doors -- and smoking is far more prevalent than you find in the GTA. Owing to its automotive history, the town has an extraordinarily high prevalence of so-called "domestic" makes, with Hondas and Toyotas being a very rare sight indeed (something I never noticed until we invited a friend to a get together in St. Thomas. He asked if he should worry about his Accord getting vandalized. The concept seemed alien to me, but looking around I was amazed to realize that probably 99% of cars on the road were of the "big 3". Here in Burlington, I doubt the big 3 account for more than 50% of cars on the road).
During a great breakfast at a local restaurant, I overheard a conversation that brought back flashbacks of why I was so desperate to leave the town as a teenager: A group was debating who was a "Chevy Person" and who was a "Ford Person". When I was a youth, such inane conversations, and bizarre corporate loyalties, accounted for most overheard conversation, and sales were brisk of "Calvin urinating on the opposing camp's corporate allegiance" decals. This was the St. Thomas version of crips and bloods.
The dearth of real opportunity, coupled with too much inane chatter (such as car brand allegiances) had me fleeing St. Thomas at the first opportunity, but looking at it now I've grown a fondness for it, warts and all.
After announcing some more delays with Vista, and then a delay with Office 2007, and then a critical hole in IE, and then a restructuring of the entire Windows division, and then some negative press about Vista's usability, Microsoft is reeling right now, and things are looking down. 2006 isn't the year of Microsoft.
As much as I appreciate and understand that they're working on projects of a scope that dwarfs the largest projects most of us will ever touch, one thing that amazes me is seeing people continually defending Microsoft, saying "Well isn't it better that they hold it until they get it right?". Sure, but you're talking about the best choice at the end of a lot of terrible choices. Vista has been a disaster, and surely after this debacle Microsoft will take a cue from Apple and learn how to stream out incremental releases, underpromising and overdelivering.
About a year back, a Microsoft rep, as some sort of standard questionnaire, asked me what I thought the greatest problem with Microsoft was. My reply was that Microsoft ties too many of their products together, in a dangerous cross-relationship where each development group is riskily trying to design for the other, and each is critically endangered when there is a fault or delay in the other (e.g. rather than the OS team making the best OS, and the .NET group making the best application layer platform, and the video group making the best video technology...each is trying to cater to the needs of the other during the design stage. It sounds great in theory, but it SELDOM works in reality).
Give me a call, Bill. I'll help you set things straight.
I'm awaiting the availability of an updated .NET/.COM zone file for performance demonstration purposes (e.g. many of the samples for part III use the whole of the .COM/.NET DNS directories as performance samples). This is public data that people can replicate themselves, rather than confidential internal or client data, or manufactured data, so I thought it a good foundation.
I hope to finish up this series in the next couple of days.