Anchor Votes

Dennis W. Forbes - August 13th, 2002

Foreword

If you're like many, Google has become your primary search engine, either directly or indirectly via a business relationship with another site (for example, Yahoo uses Google to search the web). For me, Google has become my home page: virtually every session begins with a Google search. Maybe I'm looking for some good documentation for Netscape Navigator 4.7's obsolete and incomplete DOM, or a site showing the biggest skyscrapers. Whatever it is, Google usually gets me to my destination quickly. Before I raise the ire of the hordes of fanatical Google fans, I should start this paper off by saying that I am a tremendous fan of Google, and I have yet to see any search engine that has it beat, but that doesn't mean that Google can't be made better. I am also concerned that Google's ranking technology is shrouded in secrecy. This is classic security through obscurity, and while I do believe that there is some merit to that in certain circumstances, Google's rankings are so easy to analyze by cause and effect that the obscurity is transparent to those who want to manipulate them for their own gain.

A Guess At The Technology Behind Google Rankings

DISCLAIMER: I do not have access to Google's page ranking technology, and apart from some partial details on their site, they keep their ranking techniques secret to discourage intentional rank manipulation. As such, everything I say in this article is purely speculative, based upon analysis of search results for various terms and phrases. Please also note that I browse the web using Opera with pop-ups disabled, so follow any link at your own discretion.

Lately I've been fascinated by the techniques that Google uses to rank search results, as these obviously dictate the usability of the results and, in turn, the value of a website to businesses that are trying to get eyeballs on their products and services. Indeed, a case could be made that Google search positioning is becoming one of the most important "real estate value" elements of any web page (more important than acquiring a good domain name, although I note later that the domain name and directory structure may well remain critically important if they are contextual for the good or service that you're selling). If Google were ever to sell page rankings, which they currently do not do, they would reap a windfall as every company rushed to secure the most eyeball potential.

The Google ranking technique, in a nutshell, is that every link to a site is a vote for that site, with the weight of the vote determined by the number of votes that the voting site itself has received (another scenario is that each site indirectly promotes each subpage through internal linking, though effectively this results in the same thing for any aggregate site which provides an index). I emphasize "site" for an important reason: a vote from anywhere within Slashdot carries approximately the voting power of Slashdot as a whole, a site which is one of the most linked on the Internet. The same holds true for other conversation sites such as www.plastic.com or www.kuro5hin.org. The flip side is true as well: the link vote applies not only to the destination page, but also to the destination site as a whole.

By their very nature, aggregate sites like GeoCities or Angelfire get a lot of votes because they contain tens of thousands of pages, and by extrapolation every hosted page itself starts quite high in the rankings, regardless of its own merit. You can prove this yourself by browsing through the GeoCities pages and looking for various papers covering a specific topic, for instance this page on praying mantids (I randomly picked one as an example). Do a search on Google for mantids (note: either `mantis' or `mantids' is correct), and there it is in the #7 position (you can repeat this with virtually any page on an aggregate site). To put that into perspective, Google returns some 750 pages dealing with mantids, and that limit exists simply because it is the maximum number of results that Google will return for that particular search term. A quick check (using the link: search operator) confirms that the page in question is linked by no site other than itself.

Another example of a megalinked aggregate site is the members.aol.com domain (apparently now hometown.aol.com), where AOL members can post web pages. Search for "Ford transmission" (perhaps you're having transmission problems and want to get them fixed, or you're thinking of purchasing a Ford and want to make sure the transmission is of good quality; searches of this nature could correlate with billions of dollars in sales and service) and the #1 result is http://hometown.aol.com/MKBradley/index.html. Again, to put this into perspective, the page in question is linked directly a lowly 11 times (2 of them from itself), yet it has become the #1 voice regarding Ford transmissions (a product in millions of cars), again because it seems to have indirectly acquired the "voting power" of the entire members.aol.com site. Is it really a democracy when every page on these megalinked aggregate sites becomes a premier voice on its topic? Is it valid that this page would be ranked much more favourably if I hosted it on GeoCities or aol.com?
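To make the guess above concrete, here is a minimal sketch in Python of site-level voting, where a page simply inherits the score of the site that hosts it. This only illustrates the speculation in this article and is not Google's actual algorithm. The function names, the base score of 1.0, the 0.5 damping factor, and the example hosts are all assumptions invented for the sketch.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Return the host part of a URL, used here as the 'site'."""
    return urlparse(url).netloc.lower()


def site_scores(links, rounds=20):
    """Score each site by incoming votes, weighted by the voter's own score.

    links is a list of (source_url, target_url) pairs. Links that stay
    within one site are ignored, so a site cannot vote for itself.
    """
    votes = defaultdict(set)  # voting site -> set of sites it links to
    sites = set()
    for src, dst in links:
        s, d = site_of(src), site_of(dst)
        sites.update((s, d))
        if s != d:
            votes[s].add(d)

    scores = {site: 1.0 for site in sites}
    for _ in range(rounds):
        new = {site: 1.0 for site in sites}  # assumed base score per site
        for voter, targets in votes.items():
            share = scores[voter] / len(targets)  # a voter splits its weight
            for target in targets:
                new[target] += 0.5 * share  # 0.5 is an arbitrary damping choice
        scores = new
    return scores


def page_score(url, scores):
    """Under this article's guess, a page inherits its whole site's score."""
    return scores.get(site_of(url), 0.0)
```

With a handful of made-up links in which three sites point at big.example and only one points at small.example, a page such as http://big.example/never-linked.html gets the full score of big.example even though nothing links to that page, which is the behaviour the GeoCities and AOL examples seem to show.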

Google not only ranks pages based upon the gross number of links multiplied by the various weighting factors, and then sorts them based upon the appearance of the search terms in the pages in question, but also compares the text used in the links themselves with the search terms. For instance, a link with the text "premiere Greater Toronto Area software development and consulting company" gives www.yafla.com some bonus points for anyone searching for any of those anchored words. I have no beef with that, and it actually makes a lot of sense, barring tampering (which is inevitable in any tamperable system). This particular ranking method came to the forefront a few years back in a rather hilarious circumstance.
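As a rough illustration of anchor text matching, the following Python sketch collects the text of each link on a page and counts how many query words appear in it. It uses only the standard library; the class and function names are hypothetical, and counting one point per matching word is an assumption, since the real bonus (if any) is unknown.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text)))
            self._href = None


def words(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def anchor_bonus(html, query):
    """Count, per link target, how many query words appear in anchor text."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    wanted = words(query)
    bonus = {}
    for href, text in parser.anchors:
        hits = len(wanted & words(text))
        if hits:
            bonus[href] = bonus.get(href, 0) + hits
    return bonus
```

Note that the bonus goes to the link target, not to the page containing the link, which is why a page can rank for words that never appear in its own content.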

It's clear from analyzing Google's results that not only do votes accumulate for pages via anchor "democracy", but Google also gives a heavy bonus to any page which includes one or more of the search words in its domain name or subdirectory. For instance, a page about fixing Ford transmissions would likely get a far better ranking as http://www.fixingfordtransmissions.com/fix/transmission/fixit.html than it would as http://www.bobthemechanic.com/tipsandtricks/tip27.html.
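The URL bonus described above could be approximated along these lines. This is a sketch under my own assumptions: the function name is made up, and a simple count of matching query words stands in for whatever weighting is really applied.

```python
import re
from urllib.parse import urlparse


def url_keyword_hits(url, query):
    """Count query words that occur in the host name or the path of a URL."""
    parsed = urlparse(url)
    haystack = (parsed.netloc + parsed.path).lower()
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Substring matching, because domain names run words together.
    return sum(1 for term in terms if term in haystack)
```

For the query "ford transmission", the fixingfordtransmissions.com URL above matches both words while the bobthemechanic.com URL matches neither. Substring matching is crude (a short query word could match inside an unrelated longer word), so treat this as a demonstration of the idea only.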

So, anyways, that's a thoroughly amateur and largely obvious analysis of Google's page ranking techniques. Google appears to rank pages not just by the number of anchor tag "votes" multiplied by a weighting factor, but additionally by domain name matches. The weighting does not seem to be page specific, but rather site wide: an obscure, unlinked, and unvisited page in the wilderness of AOL's members pages appears to be given the weight of the entire site, and conversely garners the votes of the entire site.

A Mystery Is Afoot

Of course, then there's the perplexing. A search for "Britney Spears" gives the expected sites with Britney Spears in the URL or with heavily linked Britney Spears content, but then coming in at #9 is a hit for "Shavlik Technologies" (a company which recently earned some fame by having their hotfix checking tool endorsed and distributed by Microsoft). Clearly something is afoot, as the page in question has no information whatsoever about Britney Spears, not even spicy pictures, nor does the URL have anything relating to Britney Spears in it.

The first step in determining why Shavlik's website was earning such a high ranking for an unrelated search was to look for sites which linked to it, easily done with a quick link: search. Among the various sites purportedly linking to Shavlik.com are quite a few whose current and cached versions contain no links whatsoever to the network security company, but instead link to Britney Spears content. One common element, at least at a cursory glance, was that they all linked to a now-defunct "britneyspearsnow.com" website. Conversely, a link check for sites linking to www.britneyspearsnow.com strangely returns the Shavlik page as the very first hit. Clearly, either intentionally or unintentionally, Google is confusing the two sites, and Shavlik has ended up with an inflated page ranking because of it.

I can't comment on what is technically going wrong without knowing how Google determines links. I did, however, do some hash checks to see whether this is a very rare case where hash results collide, but none of the algorithms I tried (MD4, MD5, SHA-1, RIPEMD-160) seemed to give common results for variations of the two URLs with various prefixes and suffixes (I didn't expect that they would, given the entropy involved), though my tests were far from exhaustive.

To add fuel to the suspicion that search results were intentionally manipulated, searching for "Shavlik Britney Spears" brings up a couple of pages that list Britney Spears fan pages, but sitting in the middle is a which-one-of-these-links-doesn't-belong Shavlik link.
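For anyone who wants to repeat the hash check mentioned above, a small Python sketch follows. The prefix and suffix lists are my own guesses at plausible URL spellings, not a record of the exact variants tested, and only MD5 and SHA-1 are enabled by default because Python's hashlib always provides them. Other digest names can be passed in if the local OpenSSL build supports them.

```python
import hashlib
from itertools import product

# Assumed URL spellings; the variants actually tested are not recorded.
PREFIXES = ["", "http://", "http://www.", "www."]
SUFFIXES = ["", "/", "/index.html"]


def url_variants(domain):
    """Build a few spellings of a URL from a bare domain name."""
    return [p + domain + s for p, s in product(PREFIXES, SUFFIXES)]


def shared_digests(domain_a, domain_b, algorithms=("md5", "sha1")):
    """Return (algorithm, url_a, url_b) triples whose full digests match."""
    matches = []
    for name in algorithms:
        seen = {}
        for url in url_variants(domain_a):
            seen[hashlib.new(name, url.encode("ascii")).digest()] = url
        for url in url_variants(domain_b):
            digest = hashlib.new(name, url.encode("ascii")).digest()
            if digest in seen:
                matches.append((name, seen[digest], url))
    return matches
```

Calling shared_digests("shavlik.com", "britneyspearsnow.com") compares full digests only. If a search engine keyed its index on a truncated hash, collisions would be far more likely, and this sketch would not detect them.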

How To Promote Your Own Site

Clearly there is some awareness out there as to how to manipulate the search rankings, and the following are a few methods that I think are common:

In no way am I promoting any method that encourages false search rank increases, but the next time you look at a search page ranking, realize that many of them were achieved via these methods.

Why It Matters, and the Future

Page rankings on Google are tremendously important, to the point that one could make a case that they supersede the relevance of the various DNS authorities (indeed, DNS is largely becoming irrelevant): whether your business is on page 1 or page 30 can be the difference between prosperity and failure. Some studies have indicated that the mean number of result pages viewed for a given query is approximately 1.8: if your result isn't on the first two pages, then the majority of users will never even see that your site exists, much less visit it. Of course not every site can be on the front page, but for a given search phrase there has got to be a better way than simply promoting anyone who hosts on an aggregate site, or who pays a search `optimizing' company to cross-link them hundreds or thousands of times.

It's an important point for the future because of a shift in the net. In the early days the net was largely populated with personal pages that could best be described as online bookmark lists: everyone put up a site basically linking to all of their friends' sites, and among this giant recursive network a couple of neat links could be found. Sites truly could be ranked based upon the "votes" that they received. Very few people actually do that anymore; instead, cross-linking is mostly the domain of search ranking manipulation sites. Among legitimate pages, many actually avoid linking at all, as every link represents a loss of a certain percentage of readers: if I were concerned about whether people would make it this far, I might worry that I lost 1.7% of readers who went off to read about praying mantids, or to download the latest hotfix checker, etc. Most sites now intentionally avoid linking anywhere outside of their own little world.

Anchor `voting' has largely become the victim of rank manipulation, and has proven itself to be a flawed technique for search rankings. Some other techniques are fledgling, such as the Alexa technique of monitoring users' browsing and formulating a "most popular" listing based upon that (and additionally monitoring which similar sites a user visited in a session, for correlations); apart from the privacy issues, it may prove to be a practical approach in the future. Other approaches include representative users voting for pages. So long as the votes apply to specific topical areas (e.g. "computer stores in the Halton region"), and are not judged against a site that is heavily visited due to its Britney Spears content, that may be a viable solution. Of course, in such a case vote stuffing and tampering are again very likely.

Anyways, just some meanderings about search engine technology.

Cheers.

Re: Slashdot Posting - 2002-08-14

Well, it looks like this got linked from Slashdot, as the referral logs expanded quickly. In any case, after browsing through the postings, a couple of quick clarifications seem to be in order:

