|
While Benford's law (a.k.a. the first-digit law) is old hat for those in the mathematics posse, and has long been demystified, it's seeing increasingly frequent references in the online world: from blog subscriber counts, to advice for tax cheats ("make sure to distribute those numbers appropriately!"), to claims that it's a magic technique for telling real sequences of dice rolls from fake ones (dubious) -- it's being portrayed as an infallible method of numerical omniscience, applicable anywhere that sets of numbers can be found. There's a lot of truth out there, but there's also a lot of mistruth. So after seeing yet another incorrect application of Benford's law (where again it was presumed to magically apply to all number sets), I thought it worth throwing a quick entry together, adding in a little scripting goodness to demonstrate the point (the scripted section may not work in some aggregators and readers). This doesn't really relate to the normal subject matter of this blog, but hopefully it's interesting to people regardless.

I should add the warning that I am not a mathematician, and my interest in this subject was only a passing one, several years back. It was then that I caught a television program featuring a pundit describing a technique he was advocating to catch fraudulent tax returns. By analyzing the distribution of leading digits on tax return rows, he claimed, they could accurately predict where numbers were artificially generated, and conversely where they were real. The argument that he was proposing, and the cursory information I then found about this law, struck me as remarkably unintuitive (at the time, though now it seems embarrassingly obvious), so I spent a little time thinking about how this sort of numeric distribution comes about.
What I learned then was that the "law" predicts that approximately 30% of the numbers in certain sets of data -- in particular those with a logarithmic distribution (this will be discussed later) -- begin with the digit 1, with decreasing frequency for each remaining digit (e.g. only about 4.6% of the numbers in a set conforming with the law begin with a "9"; of course this is all with regard to base-10 numbers).

Purportedly the first known inklings of the law were described when Simon Newcomb, a Nova Scotian astronomer, noticed that certain pages of a logarithm book had far more wear than other pages, indicating that certain values were looked up more often than others. The reason for the uneven lookup wear became evident on further analysis: if one were to accumulate a vast reservoir of data on the populations of cities, the prices of menu items, and so on, the eerie presence of Benford's law would become evident, seemingly against common wisdom. Where one would expect leading digits to be spread evenly across the spectrum, instead the leading digit distribution predictions held true.

The following is a demonstration of Benford's Law materialized, with zero magic or alien intervention. Simply choose the settings (the defaults should be fine) and then click on "Initialize Random Set". This will give you a set of randomly distributed numbers between 0 and the max random number chosen. The table will display the prevalence of leading digits. Thus far the numbers should be uniformly distributed, risking a ticket from a Benford's law enforcement officer. Of course random or linearly distributed numbers aren't expected to conform to Benford's law, so that's entirely expected. Now click on the "Inflation / Deflation!" button, which will randomly scale each value in the set to anywhere from 25% to 225% of its original value on each press. Almost immediately the distribution will start to mirror Benford's Law. At most you might require two or three iterations until it accurately conforms.
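For reference, the percentages quoted above come from the usual statement of the law: the expected share of values with leading digit d is log10(1 + 1/d). Here is a minimal Python sketch that prints the table; the function name `benford_expected` is just a label chosen for this example, not something from the demonstration's own script.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9), base 10."""
    return math.log10(1 + 1 / digit)


# Print the predicted share for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.1%}")
```

The first row comes out at roughly 30.1% and the last at roughly 4.6%, and the nine shares sum to 1.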
Try it with a random starting max of 5 (so that the initial set can only contain the starting digits from 1-5) and then start scaling. Does Benford's Law appear?
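For readers whose aggregator strips the scripted demo, the experiment can be approximated offline. This is only a rough Python sketch of the steps described above, not the script behind the buttons: the function names, the seed, the set size of 10,000, and the choice to draw whole numbers from 1 to the max (so zero, which has no leading digit, never appears) are all assumptions made for illustration.

```python
import random
from collections import Counter
from decimal import Decimal


def leading_digit(value: float) -> int:
    """Return the first significant digit of a positive number."""
    # Decimal(value) expands the float exactly; the first digit of its
    # coefficient is the first non-zero digit of the number.
    return Decimal(value).as_tuple().digits[0]


def initialize_random_set(count: int, random_max: int, rng: random.Random) -> list:
    """Uniform random whole numbers from 1 to random_max (assumed stand-in for the demo)."""
    return [float(rng.randint(1, random_max)) for _ in range(count)]


def inflate_deflate(values: list, rng: random.Random) -> list:
    """Scale every value independently to between 25% and 225% of itself."""
    return [v * rng.uniform(0.25, 2.25) for v in values]


def digit_shares(values: list) -> dict:
    """Fraction of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(2007)  # fixed seed so reruns print the same table
values = initialize_random_set(count=10_000, random_max=5, rng=rng)

# Row 0 is the untouched set; each later row follows one more button press.
for press in range(6):
    shares = digit_shares(values)
    print(press, " ".join(f"{shares[d]:5.1%}" for d in range(1, 10)))
    values = inflate_deflate(values, rng)
```

Each printed row lists the share of leading digits 1 through 9, so the drift away from the flat starting distribution can be read straight down the columns.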
The explanation is simple and obvious once described: To go from 1 to 2, a number has to appreciate by 100%, whereas to go from 2 to 3 it would only have to appreciate by 50%. To go from 3 to 4 requires only a 33% increase. This might seem irrelevant, as from a purely additive sense each increase is the same +1 linear increase. In a logarithmic distribution (e.g. funding that increases or decreases 15% a year), however, increases and decreases are proportionate with the underlying value. The same friction-of-appreciation also holds true going from 10 to 20, or 100 to 200, or 100000 to 200000, each representing a much more significant proportionate increase than the following 20 to 30 or 200 to 300 or 200000 to 300000.

For this particular sample, this shows up as random proportionate deltas having a higher probability of "skipping" the higher leading digits, while sticking to the lower leading digits. If an existing value is 50, for instance, and it's going to randomly increase anywhere from 0 to 200%, the result lands somewhere between 50 and 150, and half of that range (100 to 150) has a leading digit of 1. That yields a 50% probability that the resulting value will have a leading digit of 1.

Think of the population increase or decrease of a city -- it generally scales with the city. A large city might grow or shrink by 50,000 people year over year, albeit representing only a small percentage of the total population, while a small city might increase by 500 people. Yet as a percentage of population change they might be the same. Similarly, an item at $10.00 will have to see a lot of inflation until it costs $20, but then it's a short ride to $30, and an even shorter hop to $40 -- proportionately speaking, of course.

And units of measure don't actually matter. After Benford's Law has appeared in the set, click on the Multiply Set button - this will multiply every set member by 3.75 (a completely arbitrary value)...yet the pattern remains.

Hopefully this has delivered a bit of food for thought about the applicability (or inapplicability) of Benford's law.
It generally only fits larger sets of logarithmically distributed values, although that happens to be what many of the values in society, and in nature, are. |
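As a postscript to the units-of-measure point, here is a small Python sketch of the "Multiply Set" step. It is an illustration under stated assumptions rather than the demo's own code: the starting set is drawn as 10 raised to a uniformly random exponent between 0 and 6 (one simple way to get logarithmically distributed values), and the seed, the set size, and the helper name `digit_shares` are arbitrary choices for this example.

```python
import random
from collections import Counter
from decimal import Decimal


def digit_shares(values: list) -> dict:
    """Fraction of values starting with each digit 1-9."""
    # Decimal(v) expands the float exactly, so digits[0] is its first significant digit.
    counts = Counter(Decimal(v).as_tuple().digits[0] for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(42)  # fixed seed so reruns print the same table

# Assumed stand-in for a set that already conforms: log-uniform over six decades.
benford_like = [10 ** rng.uniform(0, 6) for _ in range(50_000)]

# The "Multiply Set" step: every member times the same arbitrary constant.
rescaled = [v * 3.75 for v in benford_like]

before = digit_shares(benford_like)
after = digit_shares(rescaled)
for d in range(1, 10):
    print(f"{d}: {before[d]:.1%} -> {after[d]:.1%}")
```

The two columns should differ only by sampling noise. Swapping 3.75 for any other positive constant (a currency conversion, feet to metres) is an easy way to poke at the claim.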
(C) Dennis Forbes 2007