
## Benford’s law and numeric literals in source code

December 13th, 2008

Benford’s law applies to values derived from a surprising number of natural and man-made processes. I was very optimistic that it would also apply to numeric literals in source code. Measurements of C source showed that I was wrong (the chi-square statistic for the fit was 1,680 for decimal integer literals and 132,398 for floating literals).

Probability that the leading digit of a (decimal or hexadecimal) integer literal has a particular value (dotted lines predicted by Benford’s law).
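For reference, Benford’s law predicts that leading digit d occurs with probability log10(1 + 1/d), and a chi-square statistic sums, over the nine digits, the squared difference between observed and expected counts divided by the expected count. The following is a minimal Python sketch of that calculation. The function names and the sample counts are made up for illustration; they are not the measurements behind the numbers quoted above.

```python
import math

def benford_probability(digit):
    """Probability that the leading decimal digit equals `digit` (1..9)."""
    return math.log10(1 + 1 / digit)

def chi_square(observed_counts):
    """Pearson chi-square statistic for counts of leading digits 1..9.

    observed_counts[0] is the count for digit 1, observed_counts[8] for digit 9.
    """
    total = sum(observed_counts)
    statistic = 0.0
    for digit, observed in enumerate(observed_counts, start=1):
        expected = total * benford_probability(digit)
        statistic += (observed - expected) ** 2 / expected
    return statistic

# Made-up counts, purely for illustration (not measured data).
benford_like = [301, 176, 125, 97, 79, 67, 58, 51, 46]
uniform_like = [111] * 9

print(round(chi_square(benford_like), 3))  # small: close to the prediction
print(round(chi_square(uniform_like), 1))  # large: far from the prediction
```

A small statistic indicates counts close to the Benford prediction. The statistic grows with the sample size as well as with the size of the mismatch, so values from samples of different sizes are not directly comparable.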

What are the conditions necessary for a sample of values to follow Benford’s law? A number of circumstances have been found to result in sample values having a leading digit that follows Benford’s law, including:

• Selecting random samples from different sets of values where each set has a different probability distribution (i.e., select the distributions at random and then collect a sample of values from each of these distributions).
• If the sample values are derived from a process that is scale invariant.
• If the sample values are derived from a process that involves multiplying independent values having a uniform distribution.

Samples that have been found to follow Benford’s law include lists of physical constants and accounting data (so much so that it has been used to detect accounting fraud). However, the number of data-sets containing values whose leading digit follows Benford’s law is not as great as some would make us believe.
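The multiplication condition can be checked with a small simulation. The Python sketch below compares the leading digits of single uniform draws with those of products of ten independent uniform draws. The seed, the sample size, the number of factors and all function names are arbitrary choices made for this illustration.

```python
import math
import random
from collections import Counter

def leading_digit(value):
    """First non-zero decimal digit of a non-zero number."""
    # Scientific notation puts the first significant digit first.
    return int(f"{abs(value):.15e}"[0])

def leading_digit_shares(values):
    """Map each digit 1..9 to the fraction of values that start with it."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

def product_of_uniforms(rng, factors):
    """Multiply `factors` independent draws from the interval (0, 1]."""
    product = 1.0
    for _ in range(factors):
        product *= 1.0 - rng.random()  # random() is in [0, 1), so this avoids 0
    return product

rng = random.Random(2008)  # fixed seed so runs are repeatable
single = leading_digit_shares(product_of_uniforms(rng, 1) for _ in range(20000))
multiplied = leading_digit_shares(product_of_uniforms(rng, 10) for _ in range(20000))

# Columns: digit, Benford prediction, single draw, product of ten draws.
for d in range(1, 10):
    print(d, f"{math.log10(1 + 1 / d):.3f}", f"{single[d]:.3f}", f"{multiplied[d]:.3f}")
```

The single draws should give each leading digit a share of roughly one ninth, while the products should land close to the Benford column.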

Why don’t the leading digits of numeric literals in source code follow Benford’s law?

• Perhaps small values are over-represented because they are used as offsets to access the storage either side of some pointer (in C, C++ and Java, unlike Pascal and Fortran, the availability of the `++`/`--` operators reduces the number of instances of `1` used to increment/decrement a value). But this only applies to integer types, not floating types.

Probability that the leading, first non-zero, digit of a floating literal has a particular value (dashed line predicted by Benford’s law).

• Perhaps there exists a high degree of correlation between the values of literals. I’m not yet sure how to look for this.
• Why is there a huge spike at `5` for the floating-point literals? Have values been rounded to produce `0.5`? This looks like an area where methods used for accounting fraud detection might be applied (not that any fraud is implied, just irregularity).
• Why is the distribution of the leading digit fairly uniform for hexadecimal literals?

These surprising measurements show that there is a lot to the shape of numeric literals that is yet to be discovered.
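For anybody wanting to try a similar tally on their own code, here is a rough Python sketch that pulls decimal, hexadecimal and floating literals out of C-like text and counts their leading digits. It is not the tool used for the measurements above. The regular expression, the function names and the sample text are all assumptions made for illustration, and the sketch makes no attempt to skip comments or string literals, or to handle octal literals and hexadecimal floating literals.

```python
import re
from collections import Counter

# Rough pattern for C numeric literals; a literal zero and octal literals are skipped.
LITERAL = re.compile(r"""
    (?<![\w.])                      # not inside an identifier or another number
    (?:
        (?P<hex>0[xX][0-9a-fA-F]+)
      | (?P<float>(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+)
      | (?P<decimal>[1-9]\d*)
    )
    [uUlLfF]*                       # optional type suffix
    (?![\w.])                       # literal must end here
""", re.VERBOSE)

def leading_digit(kind, text):
    """First non-zero digit of a literal, or None if every digit is zero."""
    if kind == "hex":
        digits = text[2:].lstrip("0")          # drop the 0x prefix and leading zeros
        return digits[0].upper() if digits else None
    if kind == "float":
        mantissa = re.split(r"[eE]", text)[0]  # ignore the exponent part
        return next((ch for ch in mantissa if ch in "123456789"), None)
    return text[0]                             # decimal literals never start with 0 here

def tally_leading_digits(source):
    """Count leading digits separately for each kind of literal."""
    tallies = {"hex": Counter(), "float": Counter(), "decimal": Counter()}
    for match in LITERAL.finditer(source):
        kind = match.lastgroup
        digit = leading_digit(kind, match.group(kind))
        if digit is not None:
            tallies[kind][digit] += 1
    return tallies

# Made-up sample text, purely for illustration.
sample = """
int a = 10; float f = 0.5f; unsigned m = 0xFFu;
x = y1 + 256; d = 1.5e3; a[0] = 007;
"""
tallies = tally_leading_digits(sample)
for kind, counts in tallies.items():
    print(kind, dict(counts))
```

Keeping the three kinds of literal in separate tallies matters because, as the plots show, each kind has a rather different leading digit distribution.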