Archive for September, 2014

A book about some important bits of R

September 27th, 2014

I see that Hadley Wickham’s new book, “Advanced R”, is being published in dead tree form and will be available in a month or so. Hadley has generously made the material available online; I quickly skimmed it a few months ago when I first heard about it and had another skim this afternoon.

The main problem with the book is its title; authors are not supposed to write advanced books and then call them advanced. When I studied physics the books all had “advanced” in their titles, but when I got to University the books switched to having some variant of “fundamental” in their title. A similar pattern applies to computer books, with the books aimed at people who know a bit and want to learn a bit more having an advanced-like word in their title and the truly advanced stuff having more downbeat titles, e.g., “JavaScript: The Good Parts”, “Algorithms in Snobol 4”, “Algorithms + Data Structures = Programs”.

Some alternative title suggestions: “R: Some important bits”, “The Anatomy of R” or “The nitty gritty of R”.

The book is full of useful technical details that are scattered about and time consuming to find elsewhere; a useful reference manual, covering how to do technical stuff in R, to have on the shelf.

My main quibble with the book is the amount of airplay that the term “functional programming” gets. Does anybody really care that R has a strong functional flavor? Would many R users recognize another functional language if it jumped up and bit them? The die-hard functional folk would probably say that R is not really a functional language, but who cares? I think people who write about R should stop using the words “functional programming”; it just confuses R users and serves no useful purpose. Just talk about the convenient things that R allows us to write.

A book that I would really like to read is the R equivalent of books such as “Algorithms in Snobol 4”, “Effective C++” and “SQL for Smarties” (ok, that one has advanced in the subtitle), that take a wide selection of relatively simple problems and solve them in ways that highlight different aspects of the language (perhaps providing multiple solutions to the same problem).

Creating a map showing land covered by rising sea levels

September 15th, 2014

I joined the Geekli.st climate Hackathon this weekend at the Hub Westminster (my favorite venue for Hackathons). While the organizers had lots of enthusiasm they had very little in the way of data for us to work on. No problem, ever since the Flood-relief hackathon I have wanted to use the SRTM ‘whole Earth’ elevation data on a flood related hack. Since this was a climate change related hack the obvious thing to do was to use the data to map the impact of any increases in sea level (try it, with wording for stronger believers).

The hacking officially started Friday evening at 19:00, but I only attended the evening event to meet people and form a team. Rob Finean was interested in the idea of mapping the effects of a rise in sea level (he also had previous experience using Leaflet, a JavaScript library for interactive maps) and we formed a team; Florian Rathgeber joined us on Saturday morning.

I downloaded all the data for Eurasia (5.6G) when I got home Friday night and, arriving back at the hackathon on Saturday morning, started by writing a C program to convert the 5,876 files, each covering a 1-degree by 1-degree square on the surface of the Earth, to csv files.
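The conversion step can be sketched in a few lines of Python (the original was a C program, which is not shown here). The sketch assumes the usual SRTM .hgt layout: a square grid of big-endian signed 16-bit elevations in metres, stored north row first, with -32768 marking missing samples. The names `hgt_samples` and `samples_to_csv`, and passing in the south-west corner as `lat0`/`lon0` rather than parsing it from the file name, are choices made for this illustration.

```python
import csv
import math
import struct

VOID = -32768  # SRTM marker for a missing sample


def hgt_samples(data, lat0, lon0):
    """Yield (lat, lon, elevation) for every valid sample in one tile.

    data is the raw content of a .hgt file; lat0/lon0 is the
    south-west corner of the 1-degree square it covers.
    """
    count = len(data) // 2
    side = math.isqrt(count)
    if side < 2 or side * side * 2 != len(data):
        raise ValueError("expected a square grid of 16-bit samples")
    values = struct.unpack(f">{count}h", data)  # big-endian int16
    step = 1.0 / (side - 1)  # edge samples are shared with neighbours
    for row in range(side):
        lat = lat0 + 1 - row * step  # first row is the northern edge
        for col in range(side):
            height = values[row * side + col]
            if height != VOID:
                yield lat, lon0 + col * step, height


def samples_to_csv(data, lat0, lon0, out):
    """Write one tile to an open text file as lat,lon,elevation rows."""
    writer = csv.writer(out)
    writer.writerow(["lat", "lon", "elevation_m"])
    writer.writerows(hgt_samples(data, lat0, lon0))
```

One row per sample makes the csv files much larger than the binary input, which matters when there are thousands of tiles.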

The next step was to fit a mesh to the data and then locate constant altitude contours, at 0.5m and 1.5m above current sea level. Fitting a 2-D mesh to the data was easy (I wanted to use least squares rather than splines so that errors in the measurements could be taken into account), as was plotting and drawing contours, but getting the actual lat/long values for the contours proved to be elusive. I got bogged down looking at Python code; Florian knew a lot more Python than me and started looking for a Python solution while I investigated what R had to offer. Given the volume of data a Python solution looked like the best fit for the workflow.

By late afternoon no real progress had been made and things were not looking good. Google searches on the obvious keywords returned lots of links to contour plotting libraries and papers claiming to have found a better contour evaluation algorithm, but no standalone libraries. I was reduced to downloading the source code of R to search for the code it used to calculate contours, with a view to extracting the code for my own use.
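For readers wondering what a standalone contour evaluator involves, here is a minimal marching-squares sketch in Python. It is not the algorithm used by R, and not what we ended up running; it just shows the basic idea of walking each grid cell and interpolating where the surface crosses the wanted level. The function name `contour_segments` is made up for this example, and coordinates are returned in grid units (column, row), so mapping them to lat/long is a separate linear scaling.

```python
def contour_segments(grid, level):
    """Return line segments where the gridded surface crosses `level`.

    grid is a list of equal-length rows of heights. Each segment is a
    pair of (x, y) points in grid units, x being the column position.
    """
    segments = []
    for r in range(len(grid) - 1):
        for c in range(len(grid[0]) - 1):
            # Cell corners, clockwise from top-left.
            corners = [(c, r), (c + 1, r), (c + 1, r + 1), (c, r + 1)]
            heights = [grid[y][x] for x, y in corners]
            crossings = []
            for i in range(4):
                j = (i + 1) % 4
                a, b = heights[i], heights[j]
                if (a < level) != (b < level):  # edge straddles the level
                    t = (level - a) / (b - a)   # linear interpolation
                    (xa, ya), (xb, yb) = corners[i], corners[j]
                    crossings.append((xa + t * (xb - xa),
                                      ya + t * (yb - ya)))
            if len(crossings) == 2:
                segments.append((crossings[0], crossings[1]))
            elif len(crossings) == 4:
                # Saddle cell: use the cell average to decide which
                # pair of opposite corners is joined through the middle.
                top, right, bottom, left = crossings
                centre = sum(heights) / 4.0
                if (centre < level) == (heights[0] < level):
                    segments.append((top, right))
                    segments.append((bottom, left))
                else:
                    segments.append((left, top))
                    segments.append((right, bottom))
    return segments
```

The output is an unordered bag of short segments; stitching them into closed rings, which is what filling an area requires, is extra work on top of this.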

Rob wanted us to produce kml (Keyhole Markup Language) that his front end could read to render an overlay on a map.

At almost the same time Florian found that GDAL (Geospatial Data Abstraction Library) could convert hgt files (the raw SRTM file format) to kml and I discovered the R contourLines function. Florian had worked with GDAL before but, having just completed his PhD, had to leave to finish a paper he was working on, leaving us with instructions on the required options.

The kml output by GDAL was great for displaying contours, but did not fill in the enclosed area. The output I was generating using R filled the area enclosed by the contours but contained lots of noise because independent contours were treated as having a connection to each other. I knew a script could be written to produce the desired output from the raw data, but did not know if GDAL had options to do what we wanted.

It’s all very well being able to write a script to produce the desired output, but what is the desired output? Rob was able to figure out how the contour fill data had to be formatted in the kml file and I generated this using R, awk, sed, shell scripts and around six hours of cpu time on my laptop.
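As an illustration of what a filled area looks like in kml, here is a Python sketch that wraps one closed contour in the standard KML 2.2 Polygon element. This is an assumption about the general shape of such a file, not a copy of the layout Rob worked out, and the function name `kml_polygon` is invented for the example. Note that kml wants longitude before latitude.

```python
from xml.sax.saxutils import escape


def kml_polygon(name, ring):
    """Return a KML document holding one filled polygon.

    ring is a list of (lon, lat) pairs tracing a contour; it is
    closed automatically if the last point differs from the first.
    """
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    # Each vertex is "lon,lat,altitude"; altitude is left at 0.
    coords = " ".join(f"{lon},{lat},0" for lon, lat in ring)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Placemark>\n"
        f"<name>{escape(name)}</name>\n"
        "<Polygon><outerBoundaryIs><LinearRing>\n"
        f"<coordinates>{coords}</coordinates>\n"
        "</LinearRing></outerBoundaryIs></Polygon>\n"
        "</Placemark>\n"
        "</kml>\n"
    )
```

Independent contours each need their own Polygon; running unrelated rings together in one coordinate list is exactly what produces spurious connections between them.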

Rob’s front end uses Leaflet with mapping data from OpenStreetMap and the kml files to create a fantastic looking user-configurable map showing the effect of 0.5m and 1.5m rises in sea level.

The sea level data on the displayed map only shows the south of England and some of the north coast of Europe because loading any more results in poor performance (it is all loaded statically). Support is needed for dynamic loading of data on an as-required basis. All of the kml files for Eurasia with a 1.5m sea level rise are up on GitHub (900M+ of data). At the moment the contour data is only at the most detailed level of resolution and less detailed resolution is needed for when users zoom out. R’s contourLines function has no arguments for changing the resolution of what it returns; if you know of a better contour library please let me know.

The maps show average sea level. When tides are taken into account the sea level at certain times of the day may be a lot higher in some areas. We did not have access to tide data and would not have had time to make use of it anyway, so the effects of tide on sea level are not included.

Some of the speckling in the overlays may be noise caused by the error bounds of the SRTM data (around 6m for Eurasia; an accuracy level that makes our expectation of a difference between 0.5m and 1.5m contours problematic).

Is Earley parsing now practical?

September 10th, 2014

Language parsing was once a hot topic within computing research. The discovery of LALR parsing, quickly followed by yacc becoming available on Unix, resulted in this approach to language parsing dominating developer mind-share (helped by the first half of most compiler books being devoted to the theory of LR parsing). Until maybe 10 years ago the received wisdom was to implement parsers using Bison (the GNU successor to yacc); this process automatically creates arrays of values that are read by a parser to decide how to process the tokens fed to it by a lexer. The accepted wisdom has now shifted to creating hand written recursive descent parsers (or some variant), where the developer writes code that decides what to do next based on the current token(s); developers are back doing things the way they were done before yacc was written in the early 1970s.

Is this change of implementation choice driven by fashion (despite heroic efforts nobody has been able to produce an industrial strength LALR based parser for C++; all C++ compilers that I am aware of use recursive descent and, sad to say, C++ is a trend setter), existing languages outgrowing existing parsing technology or just developers forgetting what a maintenance nightmare recursive descent can be?

I’m a fan of using tools and the big advantage parser generators have over hand written parsers is that they warn about ambiguities in the syntax, i.e., potential faults in the specification or implementation. Hand written recursive descent is just code that does what is written.
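To make the point concrete, here is a small hand written recursive descent parser in Python for arithmetic expressions, with one method per grammar rule. The grammar, the `Parser` class and the `evaluate` helper are all invented for this sketch. Notice that nothing checks the grammar in the comments against the code; if the two disagree, or the grammar is ambiguous, the code silently does whatever it does.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split text into number and operator tokens."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):    # expr : term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):    # term : factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):  # factor : NUMBER | '(' expr ')'
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    """Parse and evaluate an arithmetic expression."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

Operator precedence and left associativity are encoded in which method calls which and in the while loops, which is easy for a grammar this size and much less so for a language with hundreds of rules.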

The big disadvantage of LALR parsing is the restrictions on the form of the grammars that are accepted (in practice the tools usually complain that an ambiguity cannot be resolved and make use of some default behavior to handle it). Transforming a grammar into a form acceptable to tools, such as Bison, without too many warnings being generated, can take a lot of work by an experienced compiler developer. I once spent a month creating a workable LALR grammar for all of SQL-92 and could have written a recursive descent parser in less time (grammar transformations are a potential source of faults as much as hand written parsers are).

Introductions to parsing sometimes mention how much easier life would be using Earley parsing, if only its performance were not so appalling. It turns out that a linear algorithm for Earley parsing was published in 1991, followed by various useful refinements in 2001 (all discussed in what is effectively the encyclopedia of parsing sitting on my shelf waiting to be read). Theory will sit on the shelf until somebody implements it and a few days ago I found out about Marpa, a linear time Earley parser.

So why does Earley parsing make life so much easier, at least for those implementing parsers, than LALR parsing? Earley parsing has far fewer restrictions on the form of the grammars it accepts. This means no more spending a month transforming a grammar into something acceptable to the tool being used (at least in theory; I have not tried any large grammars yet, although somebody has written one for C).
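The lack of restrictions is easy to see in a toy implementation. Below is a sketch of a textbook Earley recognizer in Python; it has nothing to do with Marpa’s code or interface, has none of the refinements that give linear time behavior, and only answers yes or no rather than building a parse tree. It assumes a grammar without empty right-hand sides, and the grammar representation (a dict mapping each nonterminal to a list of tuples) is just a convenience for this example.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start.

    grammar maps each nonterminal to a list of right-hand sides, each
    a tuple of symbols; any symbol that is not a key is a terminal.
    Assumes no empty right-hand sides.
    """
    # chart[i] holds items (lhs, rhs, dot, origin) valid after i tokens.
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            new_items = []
            if dot == len(rhs):
                # Complete: advance every item that was waiting for lhs.
                new_items = [(l, r, d + 1, o)
                             for (l, r, d, o) in chart[origin]
                             if d < len(r) and r[d] == lhs]
            elif rhs[dot] in grammar:
                # Predict: start every alternative of the nonterminal.
                new_items = [(rhs[dot], alt, 0, i)
                             for alt in grammar[rhs[dot]]]
            elif i < len(tokens) and tokens[i] == rhs[dot]:
                # Scan: the next token matches the expected terminal.
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    worklist.append(item)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[-1])


# An ambiguous, left-recursive grammar, used as written.
EXPR = {"E": [("E", "+", "E"), ("E", "*", "E"), ("n",)]}
```

Calling `earley_recognize(EXPR, "E", ["n", "+", "n", "*", "n"])` accepts the input even though the grammar is both ambiguous and left recursive, the kind of grammar that would need rewriting, or precedence declarations, before an LALR tool would take it quietly.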

Another benefit of using an Earley parser is the potential for improved syntax error recovery. The drive to reduce the size of the arrays generated by yacc/Bison resulted in information essential for good error recovery being thrown away (the original LALR theory threw some useful information away and over the years several PhDs were awarded to researchers who figured out how to throw even more away). When things go wrong Earley parsers have lots of useful information available to them.

To check out the hype I’m jumping in at the deep end with the grammar for C++14: can I really cut-and-paste the grammar from the appendix, add in some Marpa syntax and start parsing C++? I will let you know whether I sink or swim.