Yesterday I was one of a dozen or so people at a hackathon hosted by the Ministry of Justice. I’m sure that the organizers would have liked a lot more people attending, but the McDonald’s hackathon across town was a bigger draw for most of the hackers I know.
The dataset was all UK legislation back to 1988, and a less complete set going back to 1267, in html and various flavors of xml; in all 20G of compressed files, with the compressed html files occupying 7G on my machine. As of this weekend the files are available online.
There were about half a dozen domain experts, in various aspects of the creation of UK legislation, present and they suggested lots of interesting questions that we (i.e., the attendees who could code) might like to try to answer.
I was surprised at the simplicity of some questions, e.g., volume (e.g., word count) of legislation over time and branch of government. The reason these question had not been answered before is that the data had not been available; empirical analysis of UK legislation started this weekend.
The most interesting question I heard involved the relative volume of primary and secondary legislation. Primary legislation is written by Parliament and in some cases creates a framework that is used by secondary legislation which contains the actual details. A lot of this secondary legislation is written by the Executive branch of government (by civil servants in the appropriate branch of government) and may not involve any parliamentary oversight. Comparing the relative volume of primary/secondary legislation over time would show if the Executive branch was gaining more powers to write laws, at the expense of Parliament.
With all the interesting discussions going on, setting ourselves up and copying the data (from memory sticks, not the internet), coding did not really start until 11, and we had to have our projects ‘handed-in’ by 16:00, not enough time to be sure of getting even an approximate answer to the primary/secondary legislation question. So I plumped for solving a simpler problem that I was confident of completing.
Certain phrases are known to be frequently used in legislation, e.g., “for the purposes of”. What phrases are actually very common in practice? The domain experts were interested in phrases by branch of government and other subsets; I decided to keep it simple by processing every file and giving them the data.
Counting sequences of n words is a very well studied problem and it was straightforward to locate a program to do it for me. I used the Lynx web browser to strip out the html (the importance of making raw text available for this kind of analysis work was recognized and this will be available at some future date). I decided that counting all four word sequences ought to be doable on my laptop and did manage to get it all to work in the available time. Code and list of 4-grams over the whole corpus available on Github.
As always, as soon as they saw the results, the domain experts immediately starting asking new questions.
Regular readers of this blog will know of my long standing involvement in the structure and interpretation of programming language standards. It was interesting to hear those involved in the writing/interpretation of legislation having exactly the same issues, and coming up with very similar solutions, as those of us in the language standards world. I was surprised to hear that UK legislation has switched from using “shall” to using “must” to express requirements (one of the hacks plotted the use of shall/must over time and there has been an abrupt change of usage). One of the reasons given was that “must” is more modern; no idea how word modernness was measured. In the ISO standards’ world “shall” is mandated over “must”. Everybody was very busy in the short amount of time available, so I had to leave an insiders chat about shall/must to another time.
The availability of such a large amount of structured English documents having great import should result in some interesting findings and tools being produced.