Empirical analysis of UK legislation started this weekend

November 23rd, 2014 No comments

Yesterday I was one of a dozen or so people at a hackathon hosted by the Ministry of Justice. I’m sure that the organizers would have liked a lot more people attending, but the McDonald’s hackathon across town was a bigger draw for most of the hackers I know.

The dataset was all UK legislation back to 1988, and a less complete set going back to 1267, in html and various flavors of xml; in all 20G of compressed files, with the compressed html files occupying 7G on my machine. As of this weekend the files are available online.

There were about half a dozen domain experts, in various aspects of the creation of UK legislation, present and they suggested lots of interesting questions that we (i.e., the attendees who could code) might like to try to answer.

I was surprised at the simplicity of some questions, e.g., volume (e.g., word count) of legislation over time and branch of government. The reason these question had not been answered before is that the data had not been available; empirical analysis of UK legislation started this weekend.

The most interesting question I heard involved the relative volume of primary and secondary legislation. Primary legislation is written by Parliament and in some cases creates a framework that is used by secondary legislation which contains the actual details. A lot of this secondary legislation is written by the Executive branch of government (by civil servants in the appropriate branch of government) and may not involve any parliamentary oversight. Comparing the relative volume of primary/secondary legislation over time would show if the Executive branch was gaining more powers to write laws, at the expense of Parliament.

With all the interesting discussions going on, setting ourselves up and copying the data (from memory sticks, not the internet), coding did not really start until 11, and we had to have our projects ‘handed-in’ by 16:00, not enough time to be sure of getting even an approximate answer to the primary/secondary legislation question. So I plumped for solving a simpler problem that I was confident of completing.

Certain phrases are known to be frequently used in legislation, e.g., “for the purposes of”. What phrases are actually very common in practice? The domain experts were interested in phrases by branch of government and other subsets; I decided to keep it simple by processing every file and giving them the data.

Counting sequences of n words is a very well studied problem and it was straightforward to locate a program to do it for me. I used the Lynx web browser to strip out the html (the importance of making raw text available for this kind of analysis work was recognized and this will be available at some future date). I decided that counting all four word sequences ought to be doable on my laptop and did manage to get it all to work in the available time. Code and list of 4-grams over the whole corpus available on Github.

As always, as soon as they saw the results, the domain experts immediately starting asking new questions.

Regular readers of this blog will know of my long standing involvement in the structure and interpretation of programming language standards. It was interesting to hear those involved in the writing/interpretation of legislation having exactly the same issues, and coming up with very similar solutions, as those of us in the language standards world. I was surprised to hear that UK legislation has switched from using “shall” to using “must” to express requirements (one of the hacks plotted the use of shall/must over time and there has been an abrupt change of usage). One of the reasons given was that “must” is more modern; no idea how word modernness was measured. In the ISO standards’ world “shall” is mandated over “must”. Everybody was very busy in the short amount of time available, so I had to leave an insiders chat about shall/must to another time.

The availability of such a large amount of structured English documents having great import should result in some interesting findings and tools being produced.

Cobol 2014, perhaps the definitive final version of the language…

November 6th, 2014 No comments

Look what arrived in the post this morning, a complementary copy of the new Cobol Standard (the CD on top of a paper copy of the 1985 standard).

Photo of Cobol 2014 and 1985 Standards

In the good old days, before the Internet, members of IST/5 received a complementary copy of every new language standard in comforting dead tree form (a standard does not feel like a standard until it is weighed in the hand; pdfs are so lightweight); these days we get complementary access to pdfs. I suspect that this is not a change of policy at British Standards, but more likely an excessive print run that they need to dispose of to free up some shelf space. But it was nice of them to think of us workers rather than binning the CDs (my only contribution to Cobol 2014 was to agree with whatever the convener of the committee proposed with regard to Cobol).

So what does the new 955 page standard have to say for itself?

“COBOL began as a business programming language, but its present use has spread well beyond that to a general purpose programming language. Significant enhancements in this International Standard include:

— Dynamic-capacity tables

— Dynamic-length elementary items

— Enhanced locale support in functions

— Function pointers

— Increased size limit on alphanumeric, boolean, and national literals

— Parametric polymorphism (also known as method overloading)

— Structured constants

— Support for industry-standard arithmetic rules

— Support for industry-standard date and time formats

— Support for industry-standard floating-point formats

— Support for multiple rounding options”

I guess those working with Cobol will find these useful, but I don’t see them being enough to attract new users from other languages.

I have heard tentative suggestions of the next revision appearing in the 2020′s, but with membership of the Cobol committee dying out (literally in some cases and through retirement in others) perhaps this 2014 publication is the definitive final version of Cobol.

Tags: , ,

The POPL 2015 papers involving C

November 4th, 2014 No comments

SIGPLAN (the ACM Special Interest Group on Programming LANguages) has just made available many of the papers that have been accepted for their 2015 POPL conference (Principles of Programming Languages). Good for them. I wish more conferences would do this.

There are three papers involving C, so obviously I have read those first. Two papers are heavy on the mathematics and one not quiet so heavy:

  • Sound Modular Verification of C Code Executing in an Unverified Context: Describes a tool that takes C source annotated with separation logic and translates it to C source containing runtime checks; it is intended that these runtime checks verify the conditions expressed in the separation logic. Why does the developer add the interface checks in separation logic and then translate, rather than adding them in C in the first place? This was question was not addressed
  • Common compiler optimisations are invalid in the C11 memory model and what we can do about it: This sounds like bad news, but the introduction mentions specialist optimizations that are common in that specialist area. There follows 11 pages of mathematics + another five pages in an appendix. Page 12 tells us what it is all about. Some requirements in C11 would be muck up the nice mathematics should CompCert, which currently supports C90, be upgraded to C11. In other words, a compiler implementor is complaining that wording in the standard is making their life difficult (hey, join the queue) and has published a paper about it.
  • Formal verification of a C static analyzer: An interesting piece of work spoiled by claims that a soap powder manufacturer would not be able to get away with. Verasco, the static analysis tool described, does its checking on an intermediate language that is two-steps removed from the original C source. Using the authors’ logic I could bolt on one of the existing Fortran-to-C translators and claim to have a formally-verified Fortran static analyzer, with C being just an intermediate language in the chain. The problem with analyzing an intermediate language is that the transformations that have occurred along the way have changed the semantics of the original code, so the results of any analysis could be different than if applied to the original source. An example from the paper, the code:
    z = f(x) + 2 * g(y)

    is transformed to:

    t1 = f(x); t2 = g(y); z = t1 + 2 * t2;

    The implementation thus selects one of the two possible evaluation orders for the functions f and g. It is possible that calling f before g will result in behavior that is different from calling g before f (no undefined behavior occurs because there is a sequence point before a function returns, using pre-C11 terminology).

    So Verasco is only checking one of the two possible execution paths in this code. Not a particularly sound proof.

    C-semantics is the C formal methods tool that stands head and shoulders above anything else that currently exists (a fun Fibonacci example). It is actually based on the C source and does significantly more checking than verasco, but is not mentioned in the “Related work” section of the paper.

Some of the other POPL papers look a lot more interesting and potentially useful.

Workshop on App Store Analysis

October 29th, 2014 No comments

I was at the 36th CREST Open Workshop, on App Store Analysis, at the start of this week. The attendee list reads like a who’s who of academics researching App stores. What really stood out for me was the disconnect between my view of the software engineering aspects of developing mobile Apps and the view of many, but not all, academics in the room.

Divergent points of view on App development being different because… included:

Academics: they are written by a small number (often one) of developers.
Me: This was true in the early days of microprocessors and the web. When something new comes out only a small number of people are involved in it and few companies are willing to invest in setting up large development teams. If the new thing succeeds (i.e., there is money to be made) the money to create large teams will follow.

Academics: third party libraries make a significant contribution to functionality.
Me: This is true of a lot of web software and it is becoming more common for Apps on all platforms. It was not true in the past because the libraries were not available; Open Source changed all that.

Academics: they are not structured/written according to software engineering principles (someone in the room thought that waterfall was still widely used).
Me: This is true of most software produced by individuals who are writing something out of interest in their spare time or because they are not gainfully employed in ‘real’ work. When microcomputers were new the internal quality of most software on the market was truly appalling; it was primarily written by people who knew a market niche very well and taught themselves programming, the software sold because it addressed the needs to its customers and code quality was irrelevant (of course the successful products eventually needed to be maintained, which in when code quality became important, but they now had money to employ developers who knew about that kind of stuff).

Academics: the rapid rate of change (in tools and libraries etc) being experienced will continue into the foreseeable future.
Me: I was staggered that anyone could think this.

Academics: lots of money to be made for minimal investment:
Me: Those days are past.

Me: power drain issues (may) be a significant design issues.
Academics: Blank look.

Other things to report:

Various concerns raised by people who had encountered the viewpoint that mobile Apps were not considered worthy of serious academic study within software engineering; this point of view seemed to be changing. I don’t recall there every having been academic research groups targeting microcomputer software, but this certainly happened for web development.

I was a bit surprised at the rather rudimentary statistical techniques that were being used. But somebody is working on a book to change this.

Predicting stuff involving the next hour of my life

October 20th, 2014 No comments

Rain-on-me is an idea for an App that I have had for a while and have been trying to get people interested in it at Hackathons I attend. At the Techcrunch hackathon last weekend my pitch convinced Rob Finean, who I worked with at the Climate change hack, and we ended up winning in the Intel Mashery category (we used the wunderground API to get our realtime data).

The Rain-on-me idea is to use realtime rain data to predict how much rain will occur at my current location over the next hour or so (we divided the hour up into five minute intervals). This country, and others, has weather enthusiasts who operate their own weather stations and the data from these stations has been aggregated by the Weather Underground and made available on the Internet. Real-time data from local weather stations upwind of me could be used to predict what rain I am going to experience in the near future.

Anybody who has looked at weather station data, amateur or otherwise, knows that measured wind direction/speed can be surprisingly variable and that sometimes sensor stop reporting. But this is a hack, so lets be optimistic; station reporting intervals seem to be around 30 minutes, with some reporting every 15 mins and others once an hour, which is theory is good enough for our needs.

What really caught peoples’ attention was the simplicity of the user interface (try it and/or download code):

Rain prediction for the next hour

Being techies we were working on a design that showed quantity of rain and probability of occurring (this was early on and I had grand plans for modeling data from multiple stations). Rob had a circular plot design and Manoj (team member on previous hacks who has been bitten by the Raspberry pi bug) suggested designing it to run on a smart watch; my only contribution to the design was the use of five minute intervals.

The simplicity of the data presentation allows viewers to rapidly obtain a general idea of the rain situation in their location over the next hour (the hour is measured from the location of the minute hand; the shades of blue denote some combination of quantity of rain and probability of occurring).

This is the first App I’ve seen that actually makes sense on a smart watch. In fact if the watches communicated rain status at their current location then general accuracy over the next hour could become remarkably good.

Rainfall is only one of the things in my life that I would like predicted for the next hour. I want British rail to send me the predicted arrival time of the train I am on my way to catch (I may not need to rush so much if it is a few minutes late), when is the best time, in the next hour, to turn up at my barber for a hair cut (I want minimum waiting time after I arrive), average number of bikes for hire at my local docking station (should I leave now or is it safe to stay where I am a bit longer), etc.

Predicting events in the next hour of people’s lives is the future of Apps!

The existing rain-on-me implementation is very primitive; it uses the one weather station having the shortest perpendicular distance from the line going through the current location coming from the current wind direction (actually the App uses an hour of Saturday’s data since it was not raining on the Sunday lunchtime when we presented). There is plenty of room for improving the prediction reliability.

Other UK weather data sources include the UK Metoffice which supplies rainfall radar and rainfall predictions at hourly intervals for the next five days (presumably driven from the fancy whole Earth weather modeling they do); they also have an API for accessing hourly data from the 150 sites they operate.

The Weather Underground API is not particularly usable for this kind of problem. The call to obtain a list of stations close to a given latitude/longitude gives the distance (in miles and kilometers, isn’t there a formula to convert one to the other) of those station from what looks like the closest large town, so a separate call is needed for each station id to get their actual location!!! Rather late in the day I found out that the UK Metoffice has hidden away (or at least not obviously linked to) the Weather Observations Website which appears to be making available data from amateur weather stations.

Running Average Power Limit: a new target for viruses

October 10th, 2014 No comments

I have been learning about the Running Average Power Limit, RAPL, feature that Intel introduced with their Sandy Bridge architecture. RAPL is part of a broader framework providing access to all kinds of interesting internal processor state (e.g., detailed instruction counts, cache accesses, branch information etc; use PAPI to get at the numbers on your system, existing perf users need at least version 3.14 of the Linux kernel).

My interest in RAPL is in using it to monitor the power consumed by short code sequences that are alternative implementations of some functionality. There are some issues that need to be sorted out before this is possible at the level of granularity I want to look at.

I suspect that RAPL might soon move from a very obscure feature to something that is very widely known and talked about; it provides a means for setting an upper limit on the average power consumed by a processor, under software control.

Some environmental activists are very militant and RAPL sounds like it can help save the planet by limiting the average power consumed by computers. Operating systems do provide various power saving options, but I wonder how widely they are used aggressively; one set of building based measurements shows a fairly constant rate of power consumption, smaller set showing a bit of daily variation.

How long will it be before a virus targeting RAPL appears?

Limiting the average power consumed by a processor is likely to result in programs running more slowly. Will the average user notice? Slower browser response could be caused by all sorts of things. Users are much more likely to notice a performance problem when watching a high definition video.

For service providers RAPL is another target to add to the list of possible denial-of-service attacks.

A book about some important bits of R

September 27th, 2014 No comments

I see that Hadley Wickham’s new book, “Advanced R”, is being published in dead tree form and will be available a month or so. Hadley has generously made the material available online; I quickly skimmed the material a few months ago when I first heard about it and had another skim this afternoon.

The main problem with the book is its title, authors are not supposed to write advanced books and then call them advanced. When I studied physics the books all had “advanced” in their titles, but when I got to University the books switched to having some variant of “fundamental” in their title. A similar pattern applies to computer books, with the books aimed at people who know a bit and want to learn a bit more having an advanced-like word in their title and the true advanced stuff having more downbeat titles, e.g., Javascript: The Good Parts, “Algorithms in Snobol 4″, Algorithms + Data Structures = Programs.

Some alternative title suggestions: “R: Some important bits”, “The Anatomy of R” or “The nitty gritty of R”.

The book is full of useful technical details that are scattered about and time consuming to find elsewhere; a useful reference manual, covering how to do technical stuff in R, to have on the shelf.

My main quibble with the book is the amount of airplay that the term “functional programming” gets. Does anybody really care that R has a strong functional flavor? Would many R users recognize another functional language if it jumped up and bit them? The die hard functional folk would probably say that R is not really a functional language, but who cares. I think people who write about R should stop using the words “functional programming”, it just confuses R users and serves no useful purpose; just talk about the convenient things that R allows us to write.

A book that I would really like to read is the R equivalent of books such as “Algorithms in Snobol 4″, “Effective C++” and “SQL for Smarties” (ok, that one has advanced in the subtitle), that take a wide selection of relatively simple problems and solve them in ways that highlight different aspects of the language (perhaps providing multiple solutions to the same problem).

Tags: ,

Creating a map showing land covered by rising sea levels

September 15th, 2014 1 comment

I joined the Geekli.st climate Hackathon this weekend at the Hub Westminster (my favorite venue for Hackathons). While the organizers had lots of enthusiasm they had very little in the way of data for us to work on. No problem, ever since the Flood-relief hackathon I have wanted to use the SRTM ‘whole Earth’ elevation data on a flood related hack. Since this was a climate change related hack the obvious thing to do was to use the data to map the impact of any increases in sea level (try it, with wording for stronger believers).

The hacking officially started Friday evening at 19:00, but I only attended the evening event to meet people and form a team. Rob Finean was interested in the idea of mapping the effects of sea a rise in level (he also had previous experience using leaflet, a JavaScript library for interactive maps) and we formed a team, Florian Rathgeber joined us on Saturday morning.

I downloaded all the data for Eurasia (5.6G) when I got home Friday night and arriving back at the hackthon on Saturday morning started by writing a C program to convert the 5,876 files, each 1-degree by 1-degree squares on the surface of the Earth, to csv files.

The next step was to fit a mesh to the data and then locate constant altitude contours, at 0.5m and 1.5m above current sea level. Fitting a 2-D mesh to the data was easy (I wanted to use least squares rather than splines so that errors in the measurements could be taken into account), as was plotting and drawing contours, but getting the actual values for the contour lat/long proved to be elusive. I got bogged down looking at Python code, Florian knew a lot more Python than me and started looking for a Python solution while I investigated what R had to offer. Given the volume of data a Python solution looked like the best fit for the work-flow.

By late afternoon no real progress had been made and things were not looking good. Google searches on the obvious keywords returned lots of links to contour plotting libraries and papers claiming to have found a better contour evaluation algorithm, but no standalone libraries. I was reduced to downloading the source code of R to search for the code it used to calculate contours, with a view to extracting the code for my own use.

Rob wanted us to produce kml (Keyhole Markup Language) that his front end could read to render an overlay on a map.

At almost the same time Florian found that GDAL (Geospatial Data Abstraction Library) could convert hgt files (the raw SRTM file format) to kml and I discovered the R contourLines function. Florian had worked with GDAl before but having just completed his PhD had to leave to finish a paper he was working on, leaving us with instruction on the required options.

The kml output by GDAL was great for displaying contours, but did not fill in the enclosed area. The output I was generating using R filled the area enclosed by the contours but contained lots of noise because independent contours were treated having a connection to each other. I knew a script could be written to produce the desired output from the raw data, but did not know if GDAL had options to do what we wanted.

Its all very well being able to write a script to produce the desired output, but what is the desired output? Rob was able to figure out how the contour fill data had to be formatted in the kml file and I generated this using R, awk, sed, shell scripts and around six hours of cpu time on my laptop.

Rob’s front end uses leaflet with mapping data from Openstreetmap and the kml files to create a fantastic looking user-configurable map showing the effect of 0.5m and 1.5 rises in sea level.

The sea level data on the displayed map only shows the south of England and some of the north coast of Europe because loading any more results in poor performance (it is all loaded statically). Support is needed for dynamically loading of data on an as required basis. All of the kml files for Eurasia with 1.5 sea level rise are up on Github (900M+ of data). At the moment the contour data is only at the most detailed level of resolution and less detailed resolution is needed for when users zoom out. R’s contourLines function has no arguments for changing the resolution of which it returns; if you know of a better contour library please let me know.

The maps show average sea level. When tides are taken into account the sea level at certain times of the day may be a lot higher in some areas. We did not have access to tide data and would not have had time to make use of it anyway, so the effects of tide on sea level are not included.

Some of the speckling in the overlays may be noise caused by the error bounds of the SRTM data (around 6m for Eurasia; an accuracy level that makes our expectation of a difference between 0.5m and 1.5m contours problematic).

Is Early parsing now practical?

September 10th, 2014 1 comment

Language parsing was once a hot topic within computing research. The discovery of LALR parsing, quickly followed by yacc becoming available on Unix, resulted in this approach to language parsing dominating developer mind-share (helped by the first half of most compiler books being devoted to the theory of LR parsing). Until maybe 10 years ago the received wisdom was to implement parsers using Bison (the GNU successor to yacc); this process automatically creates arrays of values that are read by a parser to decide how to process the tokens fed to it by a lexer. The accepted wisdom has now shifted to creating hand written recursive decent parsers (or some variant), where the developer writes code that decides what to do next based on the current token(s); developers are back doing things the way they were done before yacc was written in 1970.

Is this change of implementation choice driven by fashion (despite heroic efforts nobody has been able to produce an industrial strength LALR based parser for C++; all C++ compilers that I am aware of use recursive descent and, sad to say, C++ is a trend setter), existing languages outgrowing existing parsing technology or just developers forgetting what a maintenance nightmare recursive descent can be?

I’m a fan of using tools and the big advantage parser generators have over hand written parsers is that they warn about ambiguities in the syntax, i.e., potential faults in the specification or implementation. Hand written recursive decent is just code that does what is written.

The big disadvantage of LALR parsing are restrictions on the form of the grammars that are accepted (in practice the tools usually complain that an ambiguity cannot be resolved and make use of some default behavior to handle it). Transforming a grammar into a form acceptable to tools, such as Bison, without too many warnings being generated, can take a lot of work by an experienced compiler developer. I once spent a month creating a workable LALR grammar for all of SQL-92 and could have written a recursive decent parser in less time (grammar transformations are a potential source of faults as much as hand written parsers are).

Introductions to parsing sometimes mention how much easier life would be using Early parsing, if only its performance was not so appalling. It turns out that a linear algorithm for Early parsing was published in 1991, followed by various useful refinements in 2001 (all discussed in what is effectively the encyclopedia of parsing sitting on my shelf waiting to be read). Theory will sit on the shelf until somebody implements it and a few days ago I found out about Marpa, a linear time Early parser.

So why does Early parsing make life so much easier, at least for those implementing parsers, than LALR parsing? Early parsing has far fewer restrictions on the form of the grammars it accepts. This means no more spending a month transforming a grammar into something acceptable to the tool being used (at least in theory, I have not tried any large grammars yet; somebody has written one for C).

Another benefit from using an Early parser is the potential for improved syntax error recovery, the drive to reduce the size of the arrays generated by yacc/Bison resulted in information essential for good error recovery being thrown away (the original LALR theory threw some useful information away and over the years several PhDs were awarded to researchers who figured out how to throw even more away). When things go wrong Early parsers have lots of useful information to them.

To check out the hype I’m jumping in at the deep end with the grammar for C++14, can I really cut-and-paste the grammar from the appendix, add in some Marpa syntax and start parsing C++? I will let you know whether I sink or swim.

undefined behavior: pay up or shut up

August 31st, 2014 2 comments

Academia recently discovered undefined behavior in C, twenty five years after industry tool vendors first started trying to help developers catch the problems it causes. Some of the tools that are now being written are doing stuff that we could only dream about back in the day.

The forces that morph occurrences of undefined behavior in source code to unwanted behavior during program execution have changed over the years.

  • When developers paid for their compilers there was an incentive for compiler writers to try to be nice to developers by doing the right thing for undefined behaviors. Twenty five years ago there were lots of commercial compilers all having slightly different views about what the right thing might be; a lot of code was regularly ported to different compilers and got to encounter different compiler writer’s views.
  • These days there is widespread use of open source compilers, which developers don’t pay for, removing the incentive for compilers writers to be nice to developers. Paying customers want support for new processors, enhancements to existing generated code quality and the sexy topic for PhDs is code optimization; what better climate for treating source containing undefined behavior as road kill. Now developers only need to upgrade to a later release of the compiler they are using to encounter an unexpected handling of undefined behavior.

A recent blog post, authored by some of the academics alluded to above, proposes adding a new option to gcc: -std=friendly-c. If developers feel that this kind of option needs to be supported then they should contribute to a crowdfunding campaign (none exists at the time of writing) to raise, say, $500,000 towards supporting the creation and ongoing support for the functionality behind this option. Of course one developer’s friendly is another developer’s unfriendly, so we could end up with multiple funds each promoting an option that supports a view of the world that is specific to one target environment.

At the moment, in response to user complaints, Open source compiler vendors lamely point out that the C standard permits them to handle source containing undefined behaviors the way they do; they stop short of telling people to quit complaining and that they are getting the compiler for free.

If this undefined behavior issue starts to gain substantial publicity, but insufficient funding, open source compiler vendors will need to start putting a positive spin on the decisions they make. Not being in marketing I might have a problem keeping a straight face when giving the following positive messages:

  • We are helping to save the world: optimized programs use less power (ok, every now and again they can use more). Do you really want to stop us adding more optimizations just because you cannot find the time to fix a mistake in your code?
  • We are helping your application gain market share. Applications that are not actively maintained are less and less likely to continue to work with every release of the compiler.