Dennis Forbes on Pragmatic Software Development
 
Wednesday, March 22 2006

When time affords, I've been looking over two widely hyped web betas - Riya and Ether.

Riya is a "next generation" Flickr online photo web app, enabled with facial and text recognition. Expectations about the facial and text recognition have been hugely scaled back from the hysterical claims being made months ago (now they're giving long lists of caveats, warning that it's just getting started and that you should focus more on the photo album capabilities). Given that this is supposed to be the site's killer feature, it will be interesting to see where they go with that. I'm going to try it out with a load of test photos and see how it compares.

Ether is a rather interesting service that arranges calls through phone service providers and clients, charging a, ahem, "pimp fee" for the service. It's quite a clever idea, and can greatly simplify the infrastructure and billing requirements for small phone service providers. On the downside it requires callers to register with Ether, which is a requirement that will definitely reduce acceptance (if it were "976" style billing, where the billing automatically goes on the phone bill, I would imagine it would do better, though of course that isn't really possible globally).

I can't really see a huge market for this service, but they've built quite a nice web app for the system, and the phone infrastructure seems to work well.

Potentially it could be used for easily billing out phone support for those who follow the software development model of "release the software for free and then charge for support".

Dennis Forbes
1-888-MY-ETHER ext. 01384785

(Note: I don't actually expect anyone to call that, as telephone style services aren't really my forte. Perhaps if I got in the astrology industry it would have more utility).

On the topic of neat web services, and while thinking of Ether, there's another clever one I was pointed to recently - http://www.jajah.com/.

Friday, March 24 2006

Like most people with a website, I regularly check the stats to see how things are going: How many people visited today? Where did they come from? How many times have the search engines sent someone my way? These are metrics that I use to know if I'm hitting internal goals, and allow me to alter plans when things head in the wrong direction (If I lose readers, I'll just have to make up some story about Google buying Digg and then duking it out with Microsoft Reddit Live! That sort of thing seems to play very well these days. Don't say I didn't warn you! Of course there's a bit of hypocrisy in the fact that I'm largely speculating about search algorithms, with very limited facts, while criticizing acquisition-of-the-hour rumors).

While the number of visitors has a fairly constant floor, the daily count ceiling can vary wildly if I've posted something new, if someone posted it to reddit or Digg or Slashdot, based upon how many people added things to their http://del.icio.us bookmarks, and so on.

A Wednesday might have 2200 visitors one week, while it sees 15,000 visitors the next.

Search engine referrals, in contrast, are usually fairly constant, with a generally predictable number being sent over, following a recurring weekly curve: Monday = X, Tuesday = X*1.1, Wednesday = X*1.15, Thursday = X*0.75, Friday = X*0.65, Saturday = X*0.4, Sunday = X*0.3. X has been slowly edging up as I add content, and as more inbound links appear and thus PageRank and similar rankings improve.
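As a rough illustration of that weekly curve, here is a minimal Python sketch. The multipliers come straight from the paragraph above; the function names and the sample Monday count of 1,000 are made up for the example and are not real stats from this site.

```python
# Weekly multipliers from the entry, relative to Monday's referral count X.
WEEKLY_CURVE = {
    "Monday": 1.0,
    "Tuesday": 1.1,
    "Wednesday": 1.15,
    "Thursday": 0.75,
    "Friday": 0.65,
    "Saturday": 0.4,
    "Sunday": 0.3,
}


def expected_referrals(monday_count):
    """Return the expected search referrals per weekday for a given X."""
    return {day: round(monday_count * factor) for day, factor in WEEKLY_CURVE.items()}


def deviation(actual, expected):
    """Fractional deviation of an observed count from the expected count."""
    return (actual - expected) / expected


if __name__ == "__main__":
    baseline = expected_referrals(1000)  # 1000 is a hypothetical X
    for day, count in baseline.items():
        print(f"{day:<10}{count}")
```

A day that comes in 50% over its expected count shows up as a deviation of 0.5, which is the sort of jump that stands out against an otherwise steady curve.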

This week it hasn't been quite as predictable on the search engine front.

After a Sunday drop in search engine referrals, from Google in particular, on Monday Google referrals jumped 50% over the week before. On Tuesday they again jumped 50%. Then on Wednesday they dropped 20% under the mark set the preceding week. Again it came in 20% below on Thursday.

I would write this off as nothing more than normal fluctuations -- maybe users just weren't searching for the sort of content covered here on Wednesday and Thursday, so the referrals dropped off -- but for the fact that Monday and Tuesday coincidentally also saw a large influx of visitors from Reddit, Digg, Delicious, popurl, and a few other meme sites, quadrupling the normal traffic. Of course these new links were far too fresh to affect the PageRank, so by traditional analysis shouldn't affect the search referrals at all.

This got me thinking, in my normal conspiracy theory way: What if Google has started tracking site visits, metered by the Google toolbar (which sends back the sites you're visiting if you have PageRank display on), and has begun using the current values to determine search results?

They could tune this in such a way that a site has to get a certain percentage of non-search referred visitors for each search referral, otherwise the search result is downgraded. The benefit, of course, is that search spam sites that only see visitors courtesy of the search engines would be quickly punished. "Valuable" content that is seeing significant non-search-related traffic would be promoted.
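To make the speculation concrete, here is a tiny Python sketch of that kind of rule. Everything in it is hypothetical: the function name, the 0.25 threshold, and the idea that a simple ratio would be used at all. It illustrates the guess above and says nothing about how any search engine actually works.

```python
def would_be_downgraded(search_referrals, other_visits, min_ratio=0.25):
    """Apply the speculated rule: a site needs at least `min_ratio`
    non-search visits per search referral, or its result is downgraded.

    `min_ratio` is an invented threshold, purely for illustration.
    """
    if search_referrals == 0:
        return False  # no search traffic, so there is nothing to penalize
    return other_visits / search_referrals < min_ratio


if __name__ == "__main__":
    # A spam-like site that lives almost entirely on search referrals.
    print(would_be_downgraded(search_referrals=1000, other_visits=10))
    # A site with plenty of traffic from links, bookmarks, and feeds.
    print(would_be_downgraded(search_referrals=1000, other_visits=4000))
```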

Just some food for thought. I have no proof of this, but I've always felt that there would come a time that their web visit stats would start to influence the search results.

Wednesday, March 29 2006

[The static location of this piece can be found here]

The Search For A Domain Name

I recently had a need for a mid-sized amount of real-world data, which I required for testing purposes on low-end hardware (testing and demonstrating some of the new functionality of SQL Server 2005). I wanted something that wasn't confidential, which excluded the easy choice of using business data, and I refrain from using artificial data. Around the same time I happened across the requisition process for the .COM/.NET and .EDU TLD zones, so I made a request for access.

Soon enough I had the 3.5GB of .COM domain names, along with 650MB of .NET, loaded into the database (although for all results in this entry I only included the .COM TLD, for the data as of 2pm on March 28th, 2006. I'll analyze the other ones at a future date). It was a great foundation for a lot of tests and demonstrations, and served my original goal admirably. I didn't stop there, however; curiosity led me to do some basic analysis to see what sorts of domain names are registered, and how saturated the registry really is.

Note that these are the Verisign distributed zone files, and do not include entries that have no nameservers configured, or which are in a hold state. While those comprise a very small minority of domain names, their absence does skew the results a bit. To improve accuracy when the sample set is small, for some of the tests I have validated the positives using the WHOIS infrastructure (for instance the domain file had several two letter sequences as being "available", and a dozen three letter sequences. All of them were the result of a hold state, or no nameservers configured). For aggregate results where it was inapplicable, I've filtered international domain names (IDN) from the results (prefixed with xn--).

You've thought up a brilliant idea for a new Web 2.0, AJAX-enabled web app, or you're about to release a thus-far-unnamed killer software app. Now you just need to find the perfect domain name for it to live at (and, in true new-economy fashion, you'll base your corporate name upon whatever available domain name you find... PILLAGEANDPLUNDR Corporation).

You pull up GoDaddy and start punching in clever names, along with their many variations, only to find that they're all seemingly taken.

"This can't be!" you cry. "Has every possibility already been registered?"

Given that there are approximately 50 million .COM domains registered, it is indeed true that the low-hanging fruit domain names are overwhelmingly taken. Your chances of lucking upon an unnoticed available three-letter acronym (TLA) are close to zero, and your only recourse would be to haggle with domain speculators.

What About Acronyms?

If you want one of the 676 possible two-letter sequences, for instance for an acronym or abbreviation, you're out of luck: They're all taken. Even allowing for digits, giving 1296 combinations, again every single variation is taken.

Of course, that's ignoring the fact that .COM registrars now mandate a 3-character minimum length, so it wouldn't be an option anyways.

Of the 17,576 possible three-letter sequences, again every single one is already taken. Adding digits to the mix (note that I'm intentionally ignoring obtuse dashes for such short domain names, though technically they are legal from the second character onwards), giving 46,656 permutations, yields a larger number of garbage domain entries (either REGISTRAR-LOCKED, REDEMPTIONPERIOD, or with no nameservers), giving a false hope of 228 seemingly open domains, yet they aren't actually available.

If you're dying to acquire great domains like 8VZ.com or Q6X.com, they'll free up within a month, though it seems evident that there are swaths of domain speculators acquiring every variant when they come available, so they won't go without a fight.

Stepping up to four letter sequences, choosing among the 456,976 combinations, yields a vastly greater availability -- perhaps the set is a bit too large for domain speculators and their unlikely success with random sequences -- with 97,786 showing as open. A quick check verifies that most are legitimately available, including "choice" domains such as AGJV.com, EIYK.com, GZVW.com, and QFEV.com. Add digits into the mix and there are a massive 1.16 million open domains, so long as you're looking for something like 7RG8.com, or U3JZ.com. Choose one and then manufacture a ridiculous backronym to explain it.

Going to 5-letter sequences (yet another five-letter acronym? YAFLA?), and of course the possibilities are rich, again presuming that you're willing to accept an arbitrary sequence of letters and/or digits, creating a backronym to match. Using just letters you have a rich 11,881,376 possibilities, of which approximately 11,015,028 are unclaimed.
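The sequence counts quoted in this section are just powers of the alphabet size (26 letters, or 36 once digits are allowed). A minimal Python sketch of the arithmetic follows, along with a hypothetical helper for listing sequences that are missing from a set of registered names. The helper names and the `registered` set are assumptions for illustration; the actual analysis was run in SQL Server against the zone file.

```python
import string
from itertools import product

LETTERS = string.ascii_lowercase             # 26 symbols
LETTERS_AND_DIGITS = LETTERS + string.digits  # 36 symbols


def sequence_count(alphabet, length):
    """Number of distinct sequences of `length` drawn from `alphabet`."""
    return len(alphabet) ** length


def open_sequences(alphabet, length, registered):
    """Yield every sequence of the given length not found in `registered`.

    `registered` is assumed to be a set of lowercase names without the TLD.
    """
    for chars in product(alphabet, repeat=length):
        name = "".join(chars)
        if name not in registered:
            yield name


if __name__ == "__main__":
    for length in range(2, 6):
        print(
            length,
            sequence_count(LETTERS, length),
            sequence_count(LETTERS_AND_DIGITS, length),
        )
```

For scale, 97,786 open names out of 456,976 four-letter sequences works out to roughly 21%, while 11,015,028 out of 11,881,376 five-letter sequences is roughly 93%.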

How Long Are Most Domains?

Of course many of the registered domains are seldom, if ever, visited, with a huge percentage having nothing more than a parked page (users pay domain registrars to put up ads for themselves). Thus, analyzing the domain database without taking into account popularity/traffic is of limited value, but it does provide for a bit of entertainment.

As mentioned, 100% of 2 and 3 letter domain names are taken, but it starts to free up as the number of possibilities explodes, all the way up to 63-character domain names. The most popular registered domain name length is actually 11 characters long, tailing off from there.

The fun doesn't end at 31 characters, however. There are 253,000+ non-IDN domains that are 32 characters or longer, including 538 that are 63 characters long.

These include such superlative domains as ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ.com, WEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEBWEB.com, and DIDYOUKNOWTHATYOUCANONLYHAVESIXTY-THREECHARACTERSINADOMAIN-NAME.com.
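Tallying name lengths like this is simple enough to sketch in a few lines of Python. The sketch assumes a plain list of names such as EXAMPLE.COM, which is not necessarily how the zone file is laid out, and the function names are invented for the example. It skips IDN entries (the xn-- prefix) the same way the aggregate results here do.

```python
from collections import Counter

MAX_LABEL_LENGTH = 63  # DNS limit on the length of a single label


def second_level_label(domain):
    """Strip the TLD and lowercase: 'EXAMPLE.COM' -> 'example'."""
    return domain.lower().rsplit(".", 1)[0]


def length_histogram(domains):
    """Count non-IDN names by the length of their second-level label."""
    counts = Counter()
    for domain in domains:
        label = second_level_label(domain)
        if label.startswith("xn--"):
            continue  # skip international domain names
        counts[len(label)] += 1
    return counts


def long_names(domains, minimum=32):
    """Return the non-IDN names whose label has at least `minimum` characters."""
    result = []
    for domain in domains:
        label = second_level_label(domain)
        if not label.startswith("xn--") and len(label) >= minimum:
            result.append(domain)
    return result


if __name__ == "__main__":
    sample = ["ABC.COM", "hello.com", "xn--caf-dma.com", "WEB" * 21 + ".com"]
    histogram = length_histogram(sample)
    print(sorted(histogram.items()))
    print(long_names(sample))
```

Calling `most_common(1)` on the returned counter gives the most popular length for whatever list of names is passed in.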

What About Names?

The US Census Bureau has some handy common name files available on their site, so I thought I'd see how one's luck would be trying to register their own name(s).

If you're looking for a masculine domain name, you'll be disheartened to learn that of the 1219 male names listed by the US Census Bureau, every single one is registered. If you're looking for something feminine, you're in luck: As I type this, of the 2841 female names listed by the Census, you can soon grab the lucrative recently expired Erlinda.com, or the sitting in purgatory Shanita.com, though both are technically currently taken.

On the family name front, 100% of the top 10,000 family names are registered.

Cross joining the top 300 male names with the top 300 family names finds that ~10,112 of the 90,000 possibilities aren't registered, to the benefit of anyone named Antonio Hughes and Lawrence Torres out there! Similarly, cross joining the top 300 female names with the top 300 family names finds that ~14,103 possibilities are unclaimed.
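The cross join itself is easy to picture in Python: pair every first name with every family name, glue them together, and keep the ones missing from the registered set. This is only a sketch. The function name is hypothetical, the tiny name lists stand in for the Census files, and it assumes names are simply concatenated with no dash.

```python
from itertools import product


def unregistered_full_names(first_names, family_names, registered):
    """Cross join the two lists and return the combined names not registered.

    `registered` is assumed to be a set of lowercase names without the TLD.
    """
    candidates = (
        f"{first}{family}".lower()
        for first, family in product(first_names, family_names)
    )
    return [name for name in candidates if name not in registered]


if __name__ == "__main__":
    first_names = ["Antonio", "Lawrence"]    # stand-ins for the top 300
    family_names = ["Hughes", "Torres"]      # stand-ins for the top 300
    registered = {"antoniotorres", "lawrencehughes"}  # made-up sample data
    print(unregistered_full_names(first_names, family_names, registered))
```

With 300 names on each side the cross join produces 300 x 300 = 90,000 candidates, which is small enough to check against an in-memory set.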

Domain Name Love

On the love front, 1958 (68.9%) of the 2841 possible 'ILOVE'-prefixed female names (using the census set of names) sit unclaimed, which is surprising, as only 665 (54.5%) of 1219 'ILOVE'-prefixed male names remain available.

Continuing down that path, the seedier side of the internet is hardly a secret, and it's evident in the DNS database as well. 268,971 domains contain the sequence SEX (11,333 of them also containing the sequence FREE), while 143,683 domains contain the sequence LOVE.
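Both kinds of count in this section, prefixed names and names containing a given sequence, boil down to simple string checks. Here is a hedged Python sketch; the function names and the sample data are invented for illustration, and the real numbers came from queries against the full zone file.

```python
def unclaimed_with_prefix(prefix, names, registered):
    """Return the names whose prefixed form is not in `registered`.

    `registered` is assumed to be a set of lowercase names without the TLD.
    """
    return [name for name in names if f"{prefix}{name}".lower() not in registered]


def count_containing(labels, *sequences):
    """Count the labels that contain every one of the given sequences."""
    wanted = [sequence.lower() for sequence in sequences]
    return sum(
        1 for label in labels if all(sequence in label.lower() for sequence in wanted)
    )


if __name__ == "__main__":
    registered = {"ilovemary", "freelovesongs", "lovely", "freestuff"}  # made up
    print(unclaimed_with_prefix("ILOVE", ["Mary", "Erlinda"], registered))
    print(count_containing(registered, "love"))
    print(count_containing(registered, "love", "free"))
```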

Other Tidbits

The most common letter to start a domain is S, with relatively few domains starting with Q, X, Y or Z.

The most common digit to start a domain is, unsurprisingly, 1.

Every successful company has remoras and haters, so it was interesting to look at the number of suffixed alternatives for some well-known domains. While some of these are actually owned by the root domain owner, most are hangers-on and critics.

Samples include GOOGLE-AMERICA, GOOGLE-BUDDY, MICROSOFT-EBOOKS, SLASHDOTREVIEW, SLASHDOTSLASH, and YAHOO2007.

Conclusion

Hopefully this was a bit entertaining, and maybe even informative. I'm doing a much more intriguing, large-scale analysis (again, it's a nice opportunity to demonstrate some of the new SQL Server 2005 functionality) that I'll publish soon, but these were the low-hanging fruit.

[Also see Domain Name Analysis - More Fascinating But Entirely Useless Charts]

Wednesday, April 12 2006

I've been extremely busy professionally over the past week, so I apologize for the lack of content. The quiet is also admittedly because it's hard to follow-up the domain name entries, given the extraordinary level of interest they received.

Apart from 30,000+ visitors to those entries, per day, continuing for about a week (and still tapering off), I was also phone interviewed on National Public Radio (broadcast throughout the US), quoted in an ezine, translated to other languages, linked by several hundred other sites (including the blogs of several people in this industry who I've admired for many years), and parts of the entry are going to be published in a reputable magazine.

That interest completely shocked me.

I obtained the domain name database for purely functional reasons, and threw up the entry of observations purely because I found a couple of the stats interesting (I love digging into data and finding interesting correlations and insights. I imagine how interesting it would be to delve into some of the large datasets like grocery store databases: Who doesn't look in the cart in line ahead of them, drawing conclusions about the personality and lifestyle of the individual based upon their purchases? Imagine all of the fun observations one could derive from the entire database of purchases).

At most I thought the regulars would find it interesting, and was shocked to see the level of traffic. Apart from all of the wonderful comments I've received, and publicity for my consulting/software development business, the benefit to PageRank has been tremendous, and search engine referrals are through the roof.

In any case, I have several entries almost ready for publication, so content should ramp up again shortly.

Have a fantastic day and week ahead.

Monday, April 24 2006

Data security has been on my mind lately, mostly after learning that approximately 700,000 laptops are stolen in the US per year. Add the armies of desktops stolen, the backup tapes lost, and the system compromises that occur, and the situation starts to look pretty grim for data security.


How secure is your data?

If someone stole your desktop, or snatched your laptop from under you at a coffee shop, what confidential information could they gain?

While most thieves lack the capacity or motivation to crack the syskey or circumvent NTFS permissions (the latter is as easy as booting up with a Knoppix disc, since file ACLs only matter if the expected host operating system is in charge), your response should be to assume that they have both, and that they are now reading all of your documents, looking at all of your shortcuts and form entry values, browsing your Outlook notes of account numbers and passwords, and playing with your tax returns.

The real-world cost of such a compromise can be extraordinary. Losing an expensive piece of equipment can be annoying, but it pales compared to the wholesale loss of data privacy.

Do you use EFS (more information here)? Do you have a Data Recovery key with the private key stored offline in a protected location? Do you know what syskey does? Are you aware of the upcoming Secure Startup (which basically is whole volume encryption)?

Are you comfortable enough with your procedures that the physical loss of a computer to theft would be nothing more than a financial expense and setup hassle, with marginal or no data exposure?

Friday, May 05 2006

Came across the following video yesterday, and it serves as a mildly humorous worst-case scenario of the "How Secure Is Your Data?" entry from a bit back.

http://media1.break.com/dnet/media/content/stolenlaptop.wmv

As laughably over-the-top as this professor's claims and grandiose threats are, most concerning to me was the obvious lack of confidence he holds in the integrity of data on his computer (a mobile computer no less, of the sort that close to a million per year are stolen in the US alone).

This computer was obviously stolen while unattended, and if even the rudiments of security best practices were followed -- use of some sort of encrypted file system, be it PGP disk, EFS in Windows, or similar technologies -- he should be able to write it off as a costly and inconvenient loss of some hardware. Instead, his hysterical threats make it out to be a matter of national security, to which every scary government agency will soon swoop down in the black helicopters. The perpetrator(s), we are told, must prove that the data hasn't been tampered with, and that it hasn't been copied (how, pray tell, does one prove that? It's the sort of negative proof that's rather difficult to contrive), and maybe then they won't be sent off to secret Eastern European prisons. Okay, I made that last bit up, but it's along the lines of the hyperbole.

From a professional perspective, I find the diatribe by this professor very self-incriminating, hinting at terrible neglect in the management of data (purportedly other people's data as well, which should rightly make those third parties very angry). While it is almost certainly a ruse to scare a reluctant thief into confessing, it's akin to claiming that the guy who stole your car is in big trouble, because you just happen to store nuclear warheads in the trunk -- I'd have more of a problem with the guy with nukes in his trunk than with a petty thief.

Protect your data. Acting surprised when hardware loss occurs isn't acceptable, and is tantamount to gross neglect.

[Miles Archer has rightly pointed out in the comments that this video is a couple of years old. Nonetheless, we've had powerful encryption options for a long, long time. A decade ago I got the senior management, accounting and HR departments of a firm using PGPDisk for confidential data, separating the administration of systems (e.g. system ACLs) from the need and ability to access the data. It worked beautifully. Since then we've had numerous new, and more transparent, options for securing our data]

Monday, May 22 2006

Some recent software installation trials and tribulations (Microsoft's Team Foundation Server, for those who wonder) have encouraged me to restate the observations of a prior entry, Adoption = (Functionality - Cost) ^ Ease of Use.

Oakville 5 Drive-In

In that outing, I observed that the adoption (or avoidance) of a product is often correlated with the ease of taking the first step, along with the continued ease of using the product. While I focused on the usability and adoption of PVRs relative to VCRs, this premise holds true in the software field as well: Even among enterprise level applications -- huge, complex solutions that drive the engine of corporations -- the initial impression, or beginning evangelism, is often driven by the ability of some random tech guy to get the product installed and delivering some sort of value. All of the specialization, customization, and advanced uses will come later.

This can be demonstrated by analyzing the historic success of many Microsoft products. Compared to the Oracle of old, for instance, SQL Server was brainless to get running, and often found its way into many shops via MSDN subscriptions. Soon enough that MS Access developer was targeting SQL Server, tying themselves and their solutions to the product, in time taking advantage of all of its advanced functionality. The complexity of the product was "time-released". Microsoft Visual SourceSafe is widely considered an also-ran source control system, with a litany of missing functionality and known defects, yet it's the source control product in use by a huge number of software development shops -- given how trivial it is to get going (versus many of the competitors that often demanded a sea of dependencies and configuration steps), many groups adopted it as a de facto source control product.

From hosting that first micro-project, it took hold until it was the foundation of the most complex of solutions.

The examples go on. Of course people would point to Linux as a counter-point, and to a small degree it is, but even among the Linux camp the real adoption began when companies like Red Hat made installation a simple "hit enter to all of the prompts" affair. Linux took off, while the more difficult to configure FreeBSD floundered.

Products took root, and then sprouted, because the first step was easy. This happens while much more capable solutions, with longer feature lists and a promise of a more rewarding long term, sit unloved and unused.


All of this had me wondering what part virtual machines could play in this equation. Virtual machine technology -- where multiple logical machines are virtually hosted on one physical computing box -- is a wonderful (and improving) technology that I still consider somewhat akin to magic. With virtual machine technology, whole platforms, including all required libraries, applications and configurations, can be delivered as an already running box, perhaps requiring nothing more than an IP address and some very rudimentary configuration. From source-control products, to wikis, to web application servers, virtual machine technology could allow for hugely complex solutions to be delivered in a "ready-to-run" solution.

Of course there are downsides to this approach. For instance it sort of eliminates reuse of common components (even requiring a separate OS instance for each virtual machine), yet common components are often the most fragile, perilous element of many applications. It isn't entirely a loss. Also there are licensing issues, such as the fact that you can't simply bundle a copy of Windows Server 2003 R2 with your virtual machine.

It's more a solution that works in the open source world, where you can release virtual machines configured with Linux, Apache, PostgreSQL, PHP, and so on, all along with your custom, ready-to-rock solution.

Unrelated Note: The mood pictures were taken at the Oakville 5 drive-in this weekend, which is one of the few movie experiences we get with two very young children. One usability note that I always observe at the drive-in is how many drivers don't know how to turn their daytime running lights off (here in Canada all vehicles have low intensity lights on whenever the vehicle is running -- even during the daytime -- which has been demonstrated to reduce accident rates). For those who don't know, on most makes of cars you can turn off the daytime running lights by engaging the parking brake before you turn on the vehicle. This isn't universal -- for instance I know some Ford models where it doesn't work -- but I've used it in a number of makes and models to success. This allows paranoid-about-their-battery drivers to start their vehicles at the drive-in without inciting a riot.


Dennis Forbes - Dennis Forbes is a Toronto-based software architect and technology writer