Friday, March 31 2006

Yesterday was quite a traffic day here on yafla.

After seeing a continuous low level of interest coming from Reddit.com since Wednesday afternoon, I happened to check the stats mid-yesterday and found them reporting ~3500 simultaneous visitors (of course they weren't actually simultaneous GETs, but rather were simultaneous from the perspective of sessions). Initially I presumed that it was a software defect, but I quickly discovered it was real, and was due to the domain name entry appearing on Digg's front page. A number of other great sites linked in as well, and later in the day some Slashdotting was added to the mix, as another entry was apparently referenced in a story there.

Many impressive sites were feeding in thousands upon thousands of visitors an hour. All told, from when the interest really kicked off early in the afternoon, some 36,000 visitors came through by midnight, browsing well over 100,000 pages. This high level of traffic has continued through today.

I want to give my host (the excellent ISQSolutions) a breather, so I'm going to hold off publishing the follow-up to the domain name entry (where I include stats such as dictionary values, phrase variations, etc.) for a day or two to let the traffic settle a bit.

Through it all the site never failed to serve up pages (during the height of it the server continued to serve pages virtually instantly), courtesy of the fact that I publish these pages rendered into static form. Not only does this avoid the unnecessary overhead of script interpreting or database access, it also allows IIS 6 to kernel-cache the pages, allowing it to serve cached pages without even leaving kernel mode.
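
As a rough illustration of the "rendered into static form" idea, here is a minimal Python sketch. It is not the publishing pipeline used for this site; the `publish` function, the `PAGE` template and the sample entry are all hypothetical, and the body is written out unescaped purely to keep the example short. The point is only that each page is rendered once, at publish time, so the web server has nothing to do per request beyond handing back a file.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical page template; a real site would have far more markup.
PAGE = Template("<html><head><title>$title</title></head><body>$body</body></html>")


def publish(entries, out_dir):
    """Render each entry to a static .html file once, at publish time."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, (title, body) in entries.items():
        path = out / f"{slug}.html"
        path.write_text(PAGE.substitute(title=title, body=body), encoding="utf-8")
        written.append(path)
    return written


# Demo: publish one made-up entry into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    paths = publish(
        {"domain-names": ("Interesting Facts About Domain Names", "<p>Entry text.</p>")},
        tmp,
    )
    html = paths[0].read_text(encoding="utf-8")
```

No script interpreter or database is touched when a visitor arrives, which is what leaves the server free to cache and serve the file directly.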

I should also say that Digg's influence is vastly greater than I postulated previously. I've had several pages as the primary focus of Slashdot stories before, and they didn't yield the simultaneous influx that a Digg front page did.

   
Wednesday, April 05 2006

For those who are interested, the following is a non-exhaustive, unsorted list of some of the sites that have been linking in over the past week. I've discovered some great sites in the process of seeing where people are coming from, and thought this might be interesting for others as well.

http://www.reddit.com
http://www.waxy.org
http://www.digg.com
http://www.popurls.com
http://joel.reddit.com
http://www.newsgator.com
http://www.stumbleupon.com
http://www.instantdomainsearch.com
http://www.netvibes.com
http://bella.blog.sme.sk/
http://www.namespros.com
http://jwz.livejournal.com (contains some images that are probably NSFW)
http://del.icio.us
http://www.pageflakes.com
http://www.lifehacker.com
http://www.scripting.com
http://www.bluesnews.com
http://www.megite.com
http://meneame.net
http://tech.memeorandum.com
http://grumpygamer.com
http://blog.guykawasaki.com
http://xo.typepad.com
http://grabun.com
http://oink.elrellano.com
http://www.namedevelopment.com
http://www.scoopeo.com
http://www.icannwatch.org
http://www.hivelogic.com/
http://weblogtoolscollection.com/
http://www.boingboing.net
http://sethgodin.typepad.com
http://www.zefrank.com
http://www.vilaweb.cat
http://www.ollo.net
http://www.usemycomputer.com
http://www.webdesigntimes.com
http://www.plastic.com
http://www.lostremote.com
http://linkfilter.net
http://dirty.ru
http://sandbox.sourcelabs.com
http://www.geekpress.com

None of these are quid pro quo links (though I have gotten several "we've linked you, so please link us" emails, I automatically trash anything along those lines), and all are based on actual referrals.

I apologize if I've missed anyone -- there was a manual process where I verified that they weren't spam sites with mechanical referral stuffing, so these aren't simply a direct copy from the logs. That manual process may have introduced errors and discrepancies.

   
Wednesday, April 05 2006

[The static location of this piece can be found here]

This entry is a follow-up to Interesting Facts About Domain Names.

It's time for some bubble chart fun in the data analysis of the .COM domain space (analyzing the 47.7 million+ .COM domains, distilled from the 150 million+ row zone file). While my original entry was purely to satisfy personal curiosity, and as a test-bed of a publicly obtainable mid-sized dataset, the surprising interest has me revisiting this topic (while finalizing another, much more interesting comparison against logical sequences and dictionary values). This outing is far more obscure than the first entry, and the charts are nowhere near as instantly informative, but I found the results fascinating nonetheless. The next entry on this subject will be much more immediately consumable.

Domain Diversity

Diversity Distribution

This chart needs a bit of an explanation -- usually a bad sign as charts should normally be self-explanatory, but in this case it's graphing something a bit more complex -- so some clarifications are in order.

Length is of course the length of the domain. While 0 is plotted on the axis, only domains 3 or more characters long are charted. For instance yafla.com is a 5-character domain, as I'm excluding the TLD (top-level domain, which in this case is .com) portion.

Diversity is a measure of how repetitious a domain name is, with the vertical scale going from those domains comprised of a single repeating character (e.g. aaaaaaaaaaaa.com) at the bottom, to domains where every character is unique (abcdefghijkl.com) at the top (in this case the diversity calculation was implemented as a C# .NET scalar function, used directly from the SQL set operations). The bubble sizes vary based upon the number of samples that match a particular diversity and domain length, and of course bubbles that are too small are not displayed.

For instance reddit.com has a calculated diversity of 80%, while the shorter yafla.com has a calculated diversity of 75%.
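
The exact formula behind the C# function isn't spelled out here, so the following Python sketch is an assumption: it uses (unique characters - 1) / (length - 1), which is one simple formula that reproduces both worked figures above and hits 0% for a single repeated character and 100% when every character is unique. The function name and the case-insensitive comparison are also my own choices for illustration.

```python
def diversity(name: str) -> float:
    """Return 0.0 for one repeated character and 1.0 when every character is unique.

    Assumed formula: (unique characters - 1) / (length - 1), TLD excluded.
    """
    label = name.lower()
    if label.endswith(".com"):
        label = label[:-4]  # drop the TLD, as in the charts
    if len(label) < 2:
        raise ValueError("diversity is undefined for labels shorter than 2 characters")
    return (len(set(label)) - 1) / (len(label) - 1)


for domain in ("reddit.com", "yafla.com", "aaaaaaaaaaaa.com", "abcdefghijkl.com"):
    print(f"{domain}: {diversity(domain):.0%}")
# reddit.com: 80%
# yafla.com: 75%
# aaaaaaaaaaaa.com: 0%
# abcdefghijkl.com: 100%
```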

The bubbles have been normalized such that the bubbles are sized relative to the total count at that length, so less popular lengths are intentionally disproportionately large (otherwise they would be drowned out). At the smaller lengths the logical, "legitimate" domains vastly outnumber the repetitious or random domains, whereas at the longer lengths a larger percentage of the domains are repeating characters or random sequences, and this is evident on the chart.
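
The per-length normalization can be sketched as follows. This is a minimal Python illustration with made-up counts, not the actual charting query; `normalize_by_length` is a hypothetical helper that divides each (length, diversity) count by the total for that length, so a sparsely populated length still gets visible bubbles.

```python
from collections import defaultdict


def normalize_by_length(counts):
    """Turn {(length, diversity): count} into shares of each length's total."""
    totals = defaultdict(int)
    for (length, _diversity), n in counts.items():
        totals[length] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


# Made-up counts for illustration only: length 5 is popular, length 40 is rare.
sample = {(5, 0.75): 900, (5, 0.25): 100, (40, 0.10): 6, (40, 0.90): 4}
shares = normalize_by_length(sample)
```

With these numbers the rare length-40 bucket of 6 domains ends up with a larger share (0.6) than its raw count would ever suggest next to the 900-domain bucket.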

As the length of the domain increases, the probability of character collisions mathematically increases, explaining why the diversity declines at a fairly predictable rate. The highly diverse domains at longer lengths are usually seemingly nonsensical domains (such as 9876543210ZYXWVUTSRQPONMLKJIHGFEDCBA.com), as are the low diversity domains (e.g. 401K-401K-401K-401K-401K-401K-401K-401K-401K-401K-401K-401K.com, A-------------------------------------------------------------A.com, FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE.com).
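
The collision argument can be made concrete with a birthday-problem style calculation. The Python sketch below assumes each character is drawn uniformly at random from a 37-symbol alphabet (a-z, 0-9 and the hyphen), which real domain names certainly are not, so treat it as a rough model of the trend rather than a description of the data. `p_all_unique` is a name introduced here for illustration.

```python
ALPHABET_SIZE = 37  # assumption: a-z, 0-9 and the hyphen


def p_all_unique(length: int, alphabet: int = ALPHABET_SIZE) -> float:
    """Probability that a uniformly random string of this length has no repeated character."""
    p = 1.0
    for used in range(length):
        # The next character must avoid the `used` characters already taken.
        p *= max(alphabet - used, 0) / alphabet
    return p


for n in (3, 5, 10, 20, 38):
    print(n, f"{p_all_unique(n):.4f}")
```

The probability falls steadily as the length grows, and once the length exceeds the alphabet size a repeat is unavoidable, so fully diverse domains cannot exist beyond 37 characters under this alphabet.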

Length Distribution

Domain Length Distribution

Using a bubble chart again, this chart details the clusters of domains starting with various numeric characters at differing lengths. The more domains of a given starting character and length, the larger the bubble.

What intrigued me about this chart was the fact that some numbers have odd distribution patterns. For instance 8 as a domain starting character sees a generally declining prevalence as the length increases, with 3,352 domains starting with the character 8 having a length of 9 characters, but then suddenly there are 8,940 domains starting with 8 at a length of 10. Looking at the actual matching data made it instantly clear -- 1-800 numbers. Dropping the 1 and dashes, 1-800 numbers are 10 characters in length (e.g. 8004INJURY.com).

Similarly, 1 holds steady on a gradual decline as the length increases, but then suddenly at 11 characters it spikes (from 18,328 instances at 10 characters long, to 24,993 instances at 11 characters). This is for the same reason that 8 spiked, but in this case with the 1 prefix.

6 spiking at 8 characters long is an oddity, but I discovered that Netflix registered a huge array of largely sequential 8-character values starting with 6 (e.g. 60142240.com, 60155520.com, etc.), letting them sit as parked pages. Not sure what the speculation is on these (SKUs perhaps?). Maybe they're going to give every customer their own domain by customer ID.
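
The tallies behind this kind of chart boil down to counting domains by (starting character, length). A minimal Python sketch follows; the real analysis was done with SQL set operations, so this is only an equivalent illustration, and `tally_by_start_and_length` and the tiny sample list (built from the examples mentioned above) are assumptions.

```python
from collections import Counter


def tally_by_start_and_length(domains):
    """Count domains by (first character, length), with the .com TLD excluded."""
    counts = Counter()
    for domain in domains:
        label = domain.lower()
        if label.endswith(".com"):
            label = label[:-4]
        if label:
            counts[(label[0], len(label))] += 1
    return counts


sample = ["8004INJURY.com", "60142240.com", "60155520.com", "yafla.com"]
counts = tally_by_start_and_length(sample)
```

Spikes like the 1-800 numbers show up as an unusually large count in a single (character, length) cell, here ("8", 10), compared with its neighbours.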

Length Distribution of Alpha Domains

On the same theme, this bubble chart shows the population distribution of domains starting with alpha-characters (from A to Z) at various lengths (A to Z from left to right. The charting tool completely disallowed characters on the X-axis, and I haven't had time to image them in). This is pretty much as expected. S is the fattest teardrop, 8 in from the right.

International Domain Names have been filtered out of both results.

The Letter S

Speaking of S, many speculated that the reason S was the dominant starting character for domains was due to sex-related domains. While that's a reasonable guess, domains starting with sex actually comprise only 80,277 of the 4,330,172 domains starting with S (of course there are more that mask it in variations like S-E-X, but they're relatively few in comparison). Instead S is just a popular starting character, particularly among domains starting with STA, SAN, SOU, SHO, STE, SHA, SUP, STO, and STR (which together comprise 1 million of the domains).
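
To put the figures quoted above in proportion, and to show how a three-letter prefix tally might be produced, here is a small Python sketch. The two counts come straight from the paragraph above; the `prefix_counts` helper and the sample labels are hypothetical and exist only to demonstrate the counting.

```python
from collections import Counter


def prefix_counts(labels, size=3):
    """Count how often each leading `size`-character sequence occurs."""
    return Counter(label[:size].lower() for label in labels if len(label) >= size)


# Share of S domains that start with "sex", using the counts quoted above.
sex_share = 80_277 / 4_330_172
print(f"{sex_share:.2%}")  # 1.85%

# Made-up labels for illustration only.
sample = ["stanford", "standard", "sandiego", "sexton", "shoes"]
top = prefix_counts(sample).most_common(1)
```

In other words, fewer than 2 in 100 domains starting with S start with sex, which is why the prefix cannot explain the letter's dominance.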

S Domain Starting Sequences

Prevalent Starting 3-Letter Sequences

Of course that chart naturally raises the question of which 3-letter sequence is most prevalent.

Domain Starting Letters

Just some cute charts while I find time to complete the more interesting, human-interest domain name analysis.

   
Wednesday, April 12 2006

I've been extremely busy professionally over the past week, so I apologize for the lack of content. The quiet is also admittedly because it's hard to follow up the domain name entries, given the extraordinary level of interest they received.

Apart from ~30,000+ visitors to those entries, per day, continuing for about a week (and still tapering off), I was also phone interviewed on National Public Radio (broadcast throughout the US), quoted in an ezine, translated to other languages, linked by several hundred other sites (including the blogs of several people in this industry whom I've admired for many years), and parts of the entry are going to be published in a reputable magazine.

That interest completely shocked me.

I obtained the domain name database for purely functional reasons, and threw up the entry of observations purely because I found a couple of the stats interesting (I love digging into data and finding interesting correlations and insights. I imagine how interesting it would be to delve into some of the large datasets like grocery store databases: Who doesn't look in the cart in line ahead of them, drawing conclusions about the personality and lifestyle of the individual based upon their purchases? Imagine all of the fun observations one could derive from the entire database of purchases).

At most I thought the regulars would find it interesting, and was shocked to see the level of traffic. Apart from all of the wonderful comments I've received, and publicity for my consulting/software development business, the benefit to PageRank has been tremendous, and search engine referrals are through the roof.

In any case, I have several entries almost ready for publication, so content should ramp up again shortly.

Have a fantastic day and week ahead.

   
Monday, April 24 2006

Data security has been on my mind lately, mostly after learning that approximately 700,000 laptops are stolen in the US per year. Add the armies of desktops stolen, the backup tapes lost, and the system compromises that occur, and the situation starts to look pretty grim for data security.


How secure is your data?

If someone stole your desktop, or snatched your laptop from under you at a coffee shop, what confidential information could they gain?

While most thieves lack the capacity or motivation to crack the syskey or circumvent NTFS permissions (the latter is as easy as booting from a Knoppix disc, since file ACLs only matter if the expected host operating system is in charge), you should assume that they have both, and that they are now reading all of your documents, looking at all of your shortcuts and form entry values, browsing your Outlook notes of account numbers and passwords, and playing with your tax returns.

The real-world cost of such a compromise can be extraordinary. Losing an expensive piece of equipment can be annoying, but it pales compared to the wholesale loss of data privacy.

Do you use EFS (more information here)? Do you have a Data Recovery key with the private key stored offline in a protected location? Do you know what syskey does? Are you aware of the upcoming Secure Startup (which basically is whole volume encryption)?

Are you comfortable enough with your procedures that the physical loss of a computer to theft would be nothing more than a financial expense and setup hassle, with marginal or no data exposure?

   
Tuesday, April 25 2006

Petr Krčmář has graciously put up a Czech translation of the domain name entry, available at http://www.root.cz/clanky/zajimava-fakta-o-domenovych-jmenech/.

Thank you Petr, and I'm honored that you thought it worthy of the effort. Seeing one's words in a different language

   
Friday, May 05 2006

Came across the following video yesterday, and it serves as a mildly humorous worst-case scenario of the "How Secure Is Your Data?" entry from a bit back.

http://media1.break.com/dnet/media/content/stolenlaptop.wmv

As laughably over-the-top as this professor's claims and grandiose threats are, most concerning to me was the obvious lack of confidence he holds in the integrity of data on his computer (a mobile computer no less, of the sort that close to a million per year are stolen in the US alone).

This computer was obviously stolen while unattended, and if even the rudiments of security best practices were followed -- use of some sort of encrypted file system, be it PGP disk, EFS in Windows, or similar technologies -- he should be able to write it off as a costly and inconvenient loss of some hardware. Instead, his hysterical threats make it out to be a matter of national security, to which every scary government agency will soon swoop down in the black helicopters. The perpetrator(s), we are told, must prove that the data hasn't been tampered with, and that it hasn't been copied (how, pray tell, does one prove that? It's the sort of negative proof that's rather difficult to contrive), and maybe then they won't be sent off to secret Eastern European prisons. Okay, I made that last bit up, but it's along the lines of the hyperbole.

From a professional perspective, I find the diatribe by this professor very self-incriminating, hinting at terrible neglect in the management of data (purportedly other people's data as well, which should rightly make those third parties very angry). While it is almost certainly a ruse to scare a reluctant thief into confessing, it's akin to claiming that the guy who stole your car is in big trouble, because you just happen to store nuclear warheads in the trunk -- I'd have more of a problem with the guy with nukes in his trunk than with a petty thief.

Protect your data. Acting surprised when hardware loss occurs isn't acceptable, and is tantamount to gross neglect.

[Miles Archer has rightly pointed out in the comments that this video is a couple of years old. Nonetheless, we've had powerful encryption options for a long, long time. A decade ago I got the senior management, accounting and HR departments of a firm using PGPDisk for confidential data, separating the administration of systems (e.g. system ACLs) from the need and ability to access the data. It worked beautifully. Since then we've had numerous new, and more transparent, options for securing our data]

   


About the Author
Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 15 years.





 