January 22, 2017 Derek Jones 2 comments

I was at the Full Fact hackathon last Friday (yes, a weekday hackathon; it looked interesting and interesting hackathons have been very thin on the ground in the last six months). Full Fact is an independent fact checking charity; the event was hosted by Facebook.

Full Fact are aiming to check facts in real-time, for instance tweeting information about inaccurate statements made during live political debates on TV. Real-time responses sounds ambitious, but they are willing to go with what is available, e.g., previously checked facts built up from intensive checking activities after programs have been aired.

The existing infrastructure is very basic, it is still early days.

Being a numbers person I volunteered to help out analyzing numbers. Transcriptions of what people say often contains numbers written as words rather than numeric literals, e.g., eleven rather than 11. Converting number words to numeric literals would enable searches to made over a range of values. There is an existing database of checked facts and Solr is the search engine used in-house, this supports numeric range searches over numeric literals.

Converting number words to numeric literals sounds like a common problem and I expected to be able to choose from a range of fancy Python packages (the in-house development language).

Much to my surprise, the best existing code I could find was rudimentary (e.g., no support for fractions or ranking words such as first, second).

spaCy was used to tokenize sentences and decide whether a token was numeric and text2num converted the token to a numeric literal (nltk has not kept up with advances in nlp).

I quickly encountered a bug in spaCy, which failed to categorize eighteen as a number word; an update was available on github a few hours after I reported the problem+fix :-). The fact that such an obvious problem had not been reported before suggests that few people are using this functionality.

Jenna, the other team member writing code, used beautifulsoup to extract sentences from the test data (formatted in XML).

Number words do not always have clear cut values, e.g., several thousand, thousands, high percentage and character sequences that could be dates. Then there are fraction words (e.g., half, quarter) and ranking words (e.g., first, second), all everyday uses that will need to be handled. It is also important to be able to distinguishing between dates, percentages and ‘raw’ numbers.

The UK is not the only country with independent fact checking organizations. A member of the Chequeado, in Argentina, was at the hack. Obviously number words will have to handle the conventions of other languages.

Full Fact are looking to run more hackathons in the UK. Keep your eyes open for Hackathon announcements. In the meantime, if you know of a good python library for handling word to number conversion, please let me know.

Categories: Uncategorized Tags: numbers, words

Searching for inaccurate literals in R

May 30, 2011 Derek Jones No comments

In creating the numbers tool I wanted to be able to do two things, 1) obtain information about what source did by matching the numeric literals it contained against a database of ‘interesting’ values (now with over 14,000 entries) and 2) flag possible incorrect numeric literals (e.g., 3.1459265 when 3.14159265 had been intended in core/Helix.cpp of the MIFit source {now fixed}).

I have recently been enhancing ‘incorrect numeric literal’ support and using the latest release of R as a test bed (whose floating-point literals are almost identical to the last release I looked at, R-2.11.1, log file here).

The first fault I found (0.20403... instead of 0.020403...) looked very serious until I realised it was involved in calculating an initial value feed into an iterative algorithm (at worst causing an extra iteration or so). It looks like the developer overlooked the “e-1” that appears in the original (click on ‘Page 48’).

The second possible problem turned out to be an ambiguity in the file main/color.c which contains the comment “CIE-XYZ to sRGB” above three expressions that perform a conversion from CIE-XYZ to BT.709 RGB. Did the developer get the comment or the numeric literals wrong? People are known to confuse the two forms of RGB (for an explanation see Annex B) .

Apart from a few minor errors such as 0.950301 instead of 0.9503041 (in …/grDevices/R/postscript.R) nothing else of interest turned up so I shifted attention to the add-on packages available on the Comprehensive R Archive Network.

The 3,000+ packages occupy almost 2 Gig in compressed form (fortunately numbers can operate directly on compressed archives and the files did not need to be unpacked) and I decided to limit the analysis to just the R source files, which cut the number of floating-point literals down to around 2 million (after ignoring the contents of comments, 10M compressed log file here).

The various floating-point literals having a value close to 2.30258509299404568402 (the most common match; no idea why the value ln(10) or 1/log(e) should be so popular) highlight the various issues that crop up when using approximate matching to look for faults. The following are some of these matches (first number is total occurrences, second sequence is the literal appearing in the source with dot denoting the same digit as in the number matched against):

  92 ........              2.30258509299404568402  ln(10) or 1/log(e)
   5 ...............5      2.30258509299404568402  ln(10) or 1/log(e)
   1 .....80528052805      2.30258509299404568402  ln(10) or 1/log(e)
   3 .....6                2.30258509299404568402  ln(10) or 1/log(e)
   2 .....67               2.30258509299404568402  ln(10) or 1/log(e)
   1 .....38               2.30258509299404568402  ln(10) or 1/log(e)
   2 .....8                2.30258509299404568402  ln(10) or 1/log(e)
   1 .....42               2.30258509299404568402  ln(10) or 1/log(e)
   2 ......7               2.30258509299404568402  ln(10) or 1/log(e)
   2 ......2               2.30258509299404568402  ln(10) or 1/log(e)
   1 .......               2.30258509299404568402  ln(10) or 1/log(e)
   2 .....6553             2.30258509299404568402  ln(10) or 1/log(e)
   1 .......4566           2.30258509299404568402  ln(10) or 1/log(e)

Most of those 92 seven digit matches occur in a subdirectory called data implying that they do not occur within code expressions, while .....80528052805 contains enough extra trailing non-matching digits to suggest a different value really was intended. Are there enough unmatched trailing digits in .....6553 to consider it a different value? More experience needs to be gained before attempting to make this call automatically.

At the moment a person has to look at the code containing these ‘close’ values to decide whether the author made a mistake or really did mean to use the value given (unfortunately numbers does not yet have a fancy gui to simplify this task). Sometimes the literals appear in data and other times in an expression that requires domain knowledge to figure out whether it is correct or not. My cursory sampling of the very large data set did not find any serious problems.

Some of the unmatched literals contain so few significant digits they would match many entries in a database of ‘interesting’ values. For instance the numbers database used to contain 745.0, the mean radius of the minor planet Sedna (according to the latest NASA data), but it was removed because of the large number of false positive matches it generated.

Many of the unmatched literals appear to do not appear to have any special interest outside of code that contains them, for instance 0.2.

I am hoping that readers of this blog will download numbers and run their code through it. They might find some faults in their code and add new values to their local ‘interesting’ numbers database to target their own application domain(not forgetting to email me a copy to include in the next release). Suggestions for improving the detection of inaccurate literals always welcome (check to the TODO file first).

An interesting observation from comparing the mathematical equations in the book Computation of Special Functions with the Fortran source provided by its authors is that when a ‘known’ constant (e.g., pi, pi/2) appears in isolation (e.g., as an argument or a value in an assignment) its literal representation often contains as many digits as supported in 64-bits, while when the same constant appears within an expression evaluating a polynomial it often contains the same number of digits as the other literals appearing in that expression (which is usually less than supported in 64-bits).

Categories: Uncategorized Tags: equation, faults, floating-point, fuzzy match, literals, numbers, R, static analysis, tools

Using numeric literals to identify application domains

February 28, 2010 Derek Jones 1 comment

I regularly get to look at large quantities of source and need to quickly get some idea of the application domains used within the code; for instance, statistical calculations, atomic physics or astronomical calculations. What tool might I use to quickly provide a list of the application domains used within some large code base?

Within many domains some set of numbers are regularly encountered and for several years I have had the idea of extracting numeric literals from source and comparing them against a database of interesting numbers (broken down by application). I have finally gotten around to writing a program do this extraction and matching, its imaginatively called numbers. The hard part is creating an extensive database of interesting numbers and I’m hoping that that releasing everything under an open source license will encourage people to add their own lists of domain specific interesting numbers.

The initial release is limited to numeric literals containing a decimal point or exponent, i.e., floating-point literals. These tend to be much more domain specific than integer literals and should cut down on the noise, making it easier for me tune the matching algorithms without (currently numeric equality within some fuzz-factor and fuzzy-matching of digits in the mantissa).

Sometimes an algorithm uses a set of numbers (e.g., crc checking) and a match should only occur if all values from this set are encountered.

The larger the interesting number database the larger the probability of matching against a value from an unrelated domain. The list of atomic weights seem to be very susceptible to this problem. I am currently investigating whether the words that co-occur with an instance of a numeric literal can be used to reduce this problem, perhaps by requiring that at least one word from a provided list occur in the source before a match is flagged for some literal.

Some numbers frequently occur in several domains. I am hoping that the word analysis might also be used to help reduce the number of domains that need to be considered (ideally to one).

Another problem is how to handle conversion factors, that is the numeric constant used to convert one unit to another, e.g., meters to furlongs. What is needed is to calculate all unit conversion values from a small set of ‘basic’ units. It would probably be very interesting to find conversions between some rarely seen units occurring in source.

I have been a bit surprised by how many apparently non-interesting floating-point literals occur in source and am hoping that a larger database will turn them into interesting numbers (I suspect this will only occur for a small fraction of these literals).

Categories: Uncategorized Tags: floating-point, numbers

The Shape of Code

Archive

Full Fact checking of number words

Searching for inaccurate literals in R

Using numeric literals to identify application domains

Recent Posts

Recent Comments

Archives

Meta