Using numeric literals to identify application domains
I regularly get to look at large quantities of source and need to quickly get some idea of the application domains used within the code; for instance, statistical calculations, atomic physics or astronomical calculations. What tool might I use to quickly provide a list of the application domains used within some large code base?
Within many domains some set of numbers is regularly encountered, and for several years I have had the idea of extracting numeric literals from source and comparing them against a database of interesting numbers (broken down by application). I have finally gotten around to writing a program to do this extraction and matching; it's imaginatively called numbers. The hard part is creating an extensive database of interesting numbers, and I’m hoping that releasing everything under an open source license will encourage people to add their own lists of domain-specific interesting numbers.
The initial release is limited to numeric literals containing a decimal point or exponent, i.e., floating-point literals. These tend to be much more domain specific than integer literals and should cut down on the noise, making it easier for me to tune the matching algorithms (currently numeric equality within some fuzz-factor and fuzzy-matching of digits in the mantissa).
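As a rough illustration of the extraction and matching described above, here is a minimal Python sketch. It is not the code of the numbers program: the regular expression, the function names, the tiny INTERESTING table and the default tolerance are all assumptions made for the example.

```python
import math
import re

# Literals with a decimal point and/or an exponent, e.g. 3.14, .5, 1e-9, 6.022e23.
# Plain integers such as 42 are deliberately not matched.
FLOAT_LITERAL = re.compile(
    r"(?<![\w.])(?:\d+\.\d*(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+)"
)

# Hypothetical two-entry database; a real one would be far larger
# and broken down by application domain.
INTERESTING = {
    "speed of light (km/s)": 299792.458,
    "pi": 3.141592653589793,
}


def match_literals(source, database, rel_tol=1e-4):
    """Return (literal text, database name) pairs that agree within rel_tol."""
    matches = []
    for text in FLOAT_LITERAL.findall(source):
        value = float(text)
        for name, known in database.items():
            if math.isclose(value, known, rel_tol=rel_tol):
                matches.append((text, name))
    return matches


def mantissa_digits(text):
    """Digits of the mantissa, ignoring the decimal point, exponent and leading zeros."""
    mantissa = re.split(r"[eE]", text)[0]
    return mantissa.replace(".", "").lstrip("0")


def shared_leading_digits(a, b):
    """Count how many leading mantissa digits two literals have in common."""
    count = 0
    for x, y in zip(mantissa_digits(a), mantissa_digits(b)):
        if x != y:
            break
        count += 1
    return count


source = "double c = 299792.458; int n = 42; double pi = 3.14159;"
found = match_literals(source, INTERESTING)
```

Here rel_tol plays the role of the fuzz-factor, while shared_leading_digits is one possible reading of fuzzy-matching on mantissa digits: it lets a truncated literal such as 6.022e23 be compared against a longer database entry.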
Sometimes an algorithm uses a set of numbers (e.g., crc checking) and a match should only occur if all values from this set are encountered.
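The all-or-nothing rule for a set of numbers might be sketched as below. This is only an illustration under assumptions: the function name and tolerance are hypothetical, and because the initial release only handles floating-point literals the example set is the well-known RGB-to-luma weights (0.299, 0.587, 0.114), not a crc table.

```python
import math


def set_matches(found_values, required_values, rel_tol=1e-6):
    """True only if every required value is close to some value found in the source."""
    return all(
        any(math.isclose(found, needed, rel_tol=rel_tol) for found in found_values)
        for needed in required_values
    )


# Illustrative set: weights commonly used to convert RGB to luma.
LUMA_WEIGHTS = [0.299, 0.587, 0.114]

complete = set_matches([0.299, 0.587, 0.114, 2.5], LUMA_WEIGHTS)
partial = set_matches([0.299, 2.5], LUMA_WEIGHTS)
```

With these inputs only the first call reports a match; seeing 0.299 on its own is not treated as evidence of the domain.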
The larger the interesting number database, the larger the probability of matching against a value from an unrelated domain. The list of atomic weights seems to be very susceptible to this problem. I am currently investigating whether the words that co-occur with an instance of a numeric literal can be used to reduce this problem, perhaps by requiring that at least one word from a provided list occur in the source before a match is flagged for some literal.
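One way the word requirement could work is sketched below; it is a guess at the idea, not something the tool currently does. The helper names and the word list are hypothetical, and identifiers are split on non-letters so that carbon_mass yields both "carbon" and "mass".

```python
import re


def words_in(source):
    """Lower-cased alphabetic words in the source, including the parts of identifiers."""
    return {w.lower() for w in re.findall(r"[A-Za-z]+", source)}


def confirmed_by_words(source, required_words):
    """True if at least one of the required words occurs somewhere in the source."""
    return not words_in(source).isdisjoint(w.lower() for w in required_words)


# Hypothetical word list attached to the atomic weights entries.
ATOMIC_WORDS = ["atomic", "weight", "mass", "carbon"]

chemistry = confirmed_by_words("double carbon_mass = 12.011;", ATOMIC_WORDS)
unrelated = confirmed_by_words("double price = 12.011;", ATOMIC_WORDS)
```

A literal that numerically matches an atomic weight would then only be flagged when a call like the first one succeeds for the surrounding source.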
Some numbers frequently occur in several domains. I am hoping that the word analysis might also be used to help reduce the number of domains that need to be considered (ideally to one).
Another problem is how to handle conversion factors, that is, the numeric constants used to convert one unit to another, e.g., meters to furlongs. What is needed is to calculate all unit conversion values from a small set of ‘basic’ units. It would probably be very interesting to find conversions between some rarely seen units occurring in source.
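Generating every conversion value from a small table of basic units could look something like the following sketch. The table and function name are assumptions for illustration; each unit is expressed in meters, and the factor for converting a quantity from one unit to another is the ratio of the two sizes.

```python
from itertools import permutations

# Size of each length unit in meters (exact by definition).
METERS_PER_UNIT = {
    "meter": 1.0,
    "inch": 0.0254,
    "foot": 0.3048,
    "furlong": 201.168,
    "mile": 1609.344,
}


def conversion_factors(sizes):
    """Map (source unit, target unit) to the multiplier that converts source to target."""
    return {
        (src, dst): sizes[src] / sizes[dst]
        for src, dst in permutations(sizes, 2)
    }


factors = conversion_factors(METERS_PER_UNIT)
meters_to_furlongs = factors[("meter", "furlong")]
```

Five basic units already yield twenty candidate literals, so the generated list grows quickly; that is useful for spotting rare conversions but also adds to the unrelated-match problem.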
I have been a bit surprised by how many apparently non-interesting floating-point literals occur in source and am hoping that a larger database will turn them into interesting numbers (I suspect this will only occur for a small fraction of these literals).