A Python Data Science hackathon

April 23rd, 2018 (17 hours ago) No comments

I was at the Man AHL Hackathon this weekend. The theme was improving the Python Data Science ecosystem. Around 15, or so, project titles had been distributed around the tables in the Man AHL cafeteria and the lead person for each project gave a brief presentation. Stable laws in SciPy sounded interesting to me and their room location included comfy seating (avoiding a numb bum is an under appreciated aspect of choosing a hackathon team and wooden bench seating is not numbing after a while).

Team Stable laws consisted of Andrea, Rishabh, Toby and yours truly. Our aim was to implement the Stable distribution as a Python module, to be included in the next release of SciPy (the availability had been announced a while back and there has been one attempt at an implementation {which seems to contain a few mistakes}).

We were well-fed and watered by Man AHL, including fancy cream buns and late night sushi.

A probability distribution is stable if the result of linear combinations of the distribution has the same distribution; the Gaussian, or Normal, distribution is the most well-known stable distribution and the central limit theorem leads many to think, that is that.

Two other, named, stable distributions are the Cauchy distribution and most interestingly (from my perspective this weekend) the Lévy distribution. Both distributions have very fat tails; the mean and variance of the Cauchy distribution is undefined (i.e., the values jump around as the sample size increases, never converging to a fixed value), while they are both infinite for the Lévy distribution.

Analytic expressions exist for various characteristics of the Stable distribution (e.g., probability distribution function), with the Gaussian, Cauchy and Lévy distributions as special cases. While solutions for implementing these expressions have been proposed, care is required; the expressions are ill-behaved in different ways over some intervals of their parameter values.

Andrea has spent several years studying the Stable distribution analytically and was keen to create an implementation. My approach for complicated stuff is to find an existing implementation and adopt it. While everybody else worked their way through the copious papers that Andrea had brought along, I searched for existing implementations.

I found several implementations, but they all suffered from using approaches that delivered discontinuities and poor accuracies over some range of parameter values.

Eventually I got lucky and found a paper by Royuela-del-Val, Simmross-Wattenberg and Alberola-López, which described their implementation in C: Libstable (licensed under the GPL, perfect for SciPy); they also provided lots of replication material from their evaluation. An R package was available, but no Python support.

No other implementations were found. Team Stable laws decided to create a new implementation in Python and to create a Python module to interface to the C code in libstable (the bit I got to do). Two implementations would allow performance and accuracy to be compared (accuracy checks really need three implementations to get some idea of which might be incorrect, when output differs).

One small fix was needed to build libstable under OS X (change Makefile to link against .so library, rather than .a) and a different fix was needed to install the R package under OS X (R patch; Windows and generic Unix were fine).

Python’s ctypes module looked after loading the C shared library I built, along with converting the NumPy arrays. My PyStable module will not win any Python beauty contest, it is a means of supporting the comparison of multiple implementations.

Some progress was made towards creating a new implementation, more than 24 hours is obviously needed (libstable contains over 4,000 lines of code). I had my own problems with an exception being raised in calls to stable_pdf; libstable used the GNU Scientific Library and I tracked the problem down to a call into GSL, but did not get any further.

We all worked overnight, my first 24-hour hack in a very long time (I got about 4-hours sleep).

After Sunday lunch around 10 teams presented and after a quick deliberation, Team Stable laws were announced as the winners; yea!

Hopefully, over the coming weeks a usable implementation will come into being.

Influential philosophers of source code

April 19th, 2018 1 comment

Who is the most important/influential philosopher of source code? Source code, as far as I know, is not a subject that philosophers claim to be studying; but, the study of logic, language and the mind is the study of source code.

For many, Ludwig Wittgenstein would probably be the philosopher that springs to mind. Wittgenstein became famous as the world’s first Perl programmer, with statements such as: “If a lion could talk, we could not understand him.” and “Whereof one cannot speak, thereof one must be silent.”

Noam Chomsky, a linguist, might be another choice, based on his specification of the Chomsky hierarchy (which neatly categorizes grammars). But generative grammars (for which he is famous in linguistics) is about generating language, not understanding what has been said/written.

My choice for the most important/influential philosopher of source code is Paul Grice. A name, I suspect, that is new to most readers. The book to quote (and to read if you enjoy the kind of books philosophers write) is “Studies in the Way of Words”.

Grice’s maxims, provide a powerful model for human communication; the tldr:

  • Maxim of quality: Try to make your contribution one that is true.
  • Maxim of quantity: Make your contribution as informative as is required.
  • Maxim of relation: Be relevant.

But source code is about human/computer communication, you say. Yes, but so many developers seem to behave as-if they were involved in human/human communication.

Source code rarely expresses what the developer means; source code is evidence of what the developer means.

The source code chapter of my empirical software engineering book is Gricean, with a Relevance theory accent.

More easily digestible books on Grice’s work (for me at least) are: “Relevance: Communication and Cognition” by Sperber and Wilson, and the more recent “Meaning and Relevance” by Wilson and Sperber.

The C++ committee has taken off its ball and chain

April 14th, 2018 8 comments

A step change in the approach to updates and additions to the C++ Standard occurred at the recent WG21 meeting, or rather a change that has been kind of going on for a few meetings has been documented and discussed. Two bullet points at the start of “C++ Stability, Velocity, and Deployment Plans [R2]”, grab reader’s attention:

● Is C++ a language of exciting new features?
● Is C++ a language known for great stability over a long period?

followed by the proposal (which was agreed at the meeting): “The Committee should be willing to consider the design / quality of proposals even if they may cause a change in behavior or failure to compile for existing code.”

We have had 30 years of C++/C compatibility (ok, there have been some nibbling around the edges over the last 15 years). A remarkable achievement, thanks to Bjarne Stroustrup over 30+ years and 64 full-week standards’ meetings (also, Tom Plum and Bill Plauger were engaged in shuttle diplomacy between WG14 and WG21).

The C/C++ superset/different issue has a long history.

In the late 1980s SC22 (the top-level ISO committee for programming languages) asked WG14 (the C committee) whether a standard should be created for C++, and if so did WG14 want to create it. WG14 considered the matter at its April 1989 meeting, and replied that in its view a standard for C++ was worth considering, but that the C committee were not the people to do it.

In 1990, SC22 started a study group to look into whether a working group for C++ should be created and in the U.S. X3 (the ANSI committee responsible for Information processing systems) set up X3J16. The showdown meeting of what would become WG21, was held in London, March 1992 (the only ISO C++ meeting I have attended).

The X3J16 people were in London for the ISO meeting, which was heated at times. The two public positions were: 1) work should start on a standard for C++, 2) C++ was not yet mature enough for work to start on a standard.

The, not so public, reason given for wanting to start work on a standard was to stop, or at least slow down, changes to the language. New releases, rumored and/or actual, of Cfront were frequent (in a pre-Internet time sense). Writing large applications in a version of C++ that was replaced with something sightly different six months later had developers in large companies pulling their hair out.

You might have thought that compiler vendors would be happy for the language to be changing on a regular basis; changes provide an incentive for users to pay for compiler upgrades. In practice the changes were so significant that major rework was needed by somebody who knew what they were doing, i.e., expensive people had to be paid; vendors were more used to putting effort into marketing minor updates. It was claimed that implementing a C++ compiler required seven times the effort of implementing a C compiler. I have no idea how true this claim might have been (it might have been one vendor’s approximate experience). In the 1980s everybody and his dog had their own C compiler and most of those who had tried, had run into a brick wall trying to implement a C++ compiler.

The stop/slow down changing C++ vs. let C++ “fulfill its destiny” (a rallying call from the AT&T rep, which the whole room cheered) finally got voted on; the study group became a WG (I cannot tell you the numbers; the meeting minutes are not online and I cannot find a paper copy {we had those until the mid/late-90s}).

The creation of WG21 did not have the intended effect (slowing down changes to the language); Stroustrup joined the committee and C++ evolution continued apace. However, from the developers’ perspective language change did slow down; Cfront changes stopped because its code was collapsing under its own evolutionary weight and usable C++ compilers became available from other vendors (in the early days, Zortech C++ was a major boost to the spread of usage).

The last WG21 meeting had 140 people on the attendance list; they were not all bored consultants looking for a creative outlet (i.e., exciting new features), but I’m sure many would be happy to drop the ball-and-chain (otherwise known as C compatibility).

I think there will be lots of proposals that will break C compatibility in one way or another and some will make it into a published standard. The claim will be that the changes will make life easier for future C++ developers (a claim made by proponents of every language, for which there is zero empirical evidence). The only way of finding out whether a change has long term benefit is to wait a long time and see what happens.

The interesting question is how C++ compiler vendors will react to breaking changes in the language standard. There are not many production compilers out there these days, i.e., not a lot of competition. What incentive does a compiler vendor have to release a version of their compiler that will likely break existing code? Compiler validation, against a standard, is now history.

If WG21 make too many breaking changes, they could find C++ vendors ignoring them and developers asking whether the ISO C++ standards’ committee is past its sell by date.

Tags: , , ,

GDPR has a huge impact on empirical software engineering research

April 5th, 2018 3 comments

The EU’s General Data Protection Regulation (GDPR) is going to have a huge impact on empirical software engineering research. After 25 May 2018, analyzing source code will never be the same again.

I am not a lawyer and nothing qualifies me to talk about the GDPR.

People put their name in source code, bug tracking databases and discussion forums; this is personal identifying information.

Researchers use personal names to obtain information about a wide variety of activities, e.g., how much code did individuals write, how many bug reports did they process, contributions in discussions of one sort or another.

Open source licenses give others all kinds of rights (e.g., ability to use and modify source code), but they do not contain any provisions for processing personal data.

Adding a “I hereby give permission for anybody to process information about my name in any way they see fit.” clause to licenses is not going to help.

The GDPR requires (article 5: Principles relating to processing of personal data):

“Personal data shall be: … collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes;”

That is, personal data can only be processed for the specific reason it was collected, i.e., if you come up with another bright idea for analysis of data that has just been collected, it may be necessary to obtain consent, from those whose personal data it is, before trying out the bright idea.

It is not possible to obtain blanket permission (article 6, Lawfulness of processing):

“…the data subject has given consent to the processing of his or her personal data for one or more specific purposes;”, i.e., consent has to be obtained from the data subject for each specific purpose.

Github’s Global Privacy Practices shows that Github are intent on meeting the GDPR requirements, they include: “GitHub provides clear methods of unambiguous, informed consent at the time of data collection, when we do collect your personal data.”. Processing personal information, about an EU citizen, contained in source code appears to be a violation of Github’s terms of service.

The GDPR has many other requirements, e.g., right to obtain information on what information is held and right to be forgotten. But, the upfront killer is not being able to cheaply collect lots of code and then use personal information to help with the analysis.

There are exceptions for: Processing for archiving, scientific or historical research or statistical purposes. Can somebody who blogs and is writing a book claim to be doing scientific research? People who know more about these exceptions than me, tell me that there could be a fair amount of paperwork involved when making use of the exception, i.e., being able to show that privacy safeguards are in place.

Then, there is the issue of what constitutes personal information. Git’s hashing algorithm makes use of the committer’s name and/or email address. Is a git hash personal identifying information?

A good introduction to the GDPR for developers, and one for researchers.

Reliability chapter added to “Empirical software engineering using R”

April 3rd, 2018 No comments

The Reliability chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I have been working on this draft for four months and it still needs lots of work; time to move on and let it stew for a while. Part of the problem is lack of public data; cost and schedule overruns can be rather public (projects chapter), but reliability problems are easier to keep quiet.

Originally there was a chapter covering reliability and another one covering faults. As time passed, these merged into one. The material kept evaporating in front of my eyes (around a third of the initial draft, collected over the years, was deleted); I have already written about why most fault prediction research is a waste of time. If it had not been for Rome I would not have had much to write about.

Perhaps what will jump out at people most, is that I distinguish between mistakes in code and what I call a fault experience. A fault_experience=mistake_in_code + particular_input. Most fault researchers have been completely ignoring half of what goes into every fault experience, the input profile (if the user does not notice a fault, I do not consider it experienced) . It’s incredibly difficult to figure out anything about the input profile, so it has been quietly ignored (one of the reasons why research papers on reported faults are such a waste of time).

I’m also missing an ‘interesting’ figure on the opening page of the chapter. Suggestions welcome.

I have not said much about source code characteristics. There is a chapter covering source code, perhaps some of this material will migrate to reliability.

All sorts of interesting bits and pieces have been added to earlier chapters. Ecosystems keeps growing and in years to come somebody will write a multi-volume tomb on software ecosystems.

I have been promised all sorts of data. Hopefully some of it will arrive.

As always, if you know of any interesting software engineering data, please tell me.

Source code chapter next.

McCabe’s cyclomatic complexity and accounting fraud

March 27th, 2018 2 comments

The paper in which McCabe proposed what has become known as McCabe’s cyclomatic complexity did not contain any references to source code measurements, it was a pure ego and bluster paper.

Fast forward 10 years and cyclomatic complexity, complexity metric, McCabe’s complexity…permutations of the three words+metrics… has become one of the two major magical omens of code quality/safety/reliability (Halstead’s is the other).

It’s not hard to show that McCabe’s complexity is a rather weak measure of program complexity (it’s about as useful as counting lines of code).

Just as it is possible to reduce the number of lines of code in a function (by putting all the code on one line), it’s possible to restructure existing code to reduce the value of McCabe’s complexity (which is measured for individual functions).

The value of McCabe’s complexity for the following function is 5 16, i.e., and there are 16 possible paths through the function:

int main(void)
{
if (W) a(); else b();
if (X) c(); else d();
if (Y) e(); else f();
if (Z) g(); else h();
}

each ifelse contains two paths and there are four in series, giving 2*2*2*2 paths.

Restructuring the code, as below, removes the multiplication of paths caused by the sequence of ifelse:

void a_b(void) {if (W) a(); else b();}
 
void c_d(void) {if (X) c(); else d();}
 
void e_f(void) {if (Y) e(); else f();}
 
void g_h(void) {if (Z) g(); else h();}
 
int main(void)
{
a_b();
c_d();
e_f();
g_h();
}

reducing main‘s McCabe complexity to 1 and the four new functions each have a McCabe complexity of two.

Where has the ‘missing’ complexity gone? It now ‘exists’ in the relationship between the functions, a relationship that is not included in the McCabe complexity calculation.

The number of paths that can be traversed, by a call to main, has not changed (but the McCabe method for counting them now produces a different answer)

Various recommended practice documents suggest McCabe’s complexity as one of the metrics to consider (but don’t suggest any upper limit), while others go as far as to claim that it’s bad practice for functions to have a McCabe’s complexity above some value (e.g., 10) or that “Cyclomatic complexity may be considered a broad measure of soundness and confidence for a program“.

Consultants in the code quality/safety/security business need something to complain about, that is not too hard or expensive for the client to fix.

If a consultant suggested that you reduced the number of lines in a function by joining existing lines, to bring the count under some recommended limit, would you take them seriously?

What about, if a consultant highlighted a function that had an allegedly high McCabe’s complexity? Should what they say be taken seriously, or are they essentially encouraging developers to commit the software equivalent of accounting fraud?

Top, must-read paper on software fault analysis

March 25th, 2018 No comments

What is the top, must read, paper on software fault analysis?

Software Reliability: Repetitive Run Experimentation and Modeling by Phyllis Nagel and James Skrivan is my choice (it’s actually a report, rather than a paper). Not only is this report full of interesting ideas and data, but it has multiple replications. Replication of experiments in software engineering is very rare; this work was replicated by the original authors, plus Scholz, and then replicated by Janet Dunham and John Pierce, and then again by Dunham and Lauterbach!

I suspect that most readers have never heard of this work, or of Phyllis Nagel or James Skrivan (I hadn’t until I read the report). Being published is rarely enough for work to become well-known, the authors need to proactively advertise the work. Nagel, Dunham & co worked in industry and so did not have any students to promote their work and did not spend time on the academic seminar circuit. Given enough effort it’s possible for even minor work to become widely known.

The study run by Nagel and Skrivan first had three experienced developers independently implement the same specification. Each of these three implementations was then tested, multiple times. The iteration sequence was: 1) run program until fault experienced, 2) fix fault, 3) if less than five faults experienced, goto step (1). The measurements recorded were fault identity and the number of inputs processed before the fault was experienced.

This process was repeated 50 times, always starting with the original (uncorrected) implementation; the replications varied this, along with the number of inputs used.

For a fault to be experienced, there has to be a mistake in the code and the ‘right’ input values have to be processed.

How many input values need to be processed, on average, before a particular fault is experienced? Does the average number of inputs values needed for a fault experience vary between faults, and if so by how much?

The plot below (code+data) shows the numbers of inputs processed, by one of the implementations, before individual faults were experienced, over 50 runs (sorted by number of inputs):

Number of inputs processed before particular fault experienced

Different faults have different probabilities of being experienced, with fault a being experienced on almost any input and fault e occurring much less frequently (a pattern seen in the replications). There is an order of magnitude variation in the number of inputs processed before particular faults are experienced (this pattern is seen in the replications).

Faults were fixed as soon as they were experienced, so the technique for estimating the total number of distinct faults, discussed in a previous post, cannot be used.

A plot of number of faults found against number of inputs processed is another possibility. More on that another time.

Suggestions for top, must read, paper on software faults, welcome (be warned, I think that most published fault research is a waste of time).

Estimating the number of distinct faults in a program

March 18th, 2018 No comments

In an earlier post I gave two reasons why most fault prediction research is a waste of time: 1) it ignores the usage (e.g., more heavily used software is likely to have more reported faults than rarely used software), and 2) the data in public bug repositories contains lots of noise (i.e., lots of cleaning needs to be done before any reliable analysis can done).

Around a year ago I found out about a third reason why most estimates of number of faults remaining are nonsense; not enough signal in the data. Date/time of first discovery of a distinct fault does not contain enough information to distinguish between possible exponential order models (technical details; practically all models are derived from the exponential family of probability distributions); controlling for usage and cleaning the data is not enough. Having spent a lot of time, over the years, collecting exactly this kind of information, I was very annoyed.

The information required, to have any chance of making a reliable prediction about the likely total number of distinct faults, is a count of all fault experiences, i.e., multiple instances of the same fault need to be recorded.

The correct techniques to use are based on work that dates back to Turing’s work breaking the Enigma codes; people have probably heard of Good-Turing smoothing, but the slightly later work of Good and Toulmin is applicable here. The person whose name appears on nearly all the major (and many minor) papers on population estimation theory (in ecology) is Anne Chao.

The Chao1 model (as it is generally known) is based on a count of the number of distinct faults that occur once and twice (the Chao2 model applies when presence/absence information is available from independent sites, e.g., individuals reporting problems during a code review). The estimated lower bound on the number of distinct items in a closed population is:

S_{est} ge S_{obs}+{n-1}/{n}{f^2_1}/{2f_2}

and its standard deviation is:

S_{sd-est}=sqrt{f_2 [0.25k^2 ({f_1}/{f_2} )^4+k^2 ({f_1}/{f_2} )^3+0.5k ({f_1}/{f_2} )^2 ]}

where: S_{est} is the estimated number of distinct faults, S_{obs} the observed number of distinct faults, n the total number of faults, f_1 the number of distinct faults that occurred once, f_2 the number of distinct faults that occurred twice, k={n-1}/{n}.

A later improved model, known as iChoa1, includes counts of distinct faults occurring three and four times.

Where can clean fault experience data, where the number of inputs have been controlled, be obtained? Fuzzing has become very popular during the last few years and many of the people doing this work have kept detailed data that is sometimes available for download (other times an email is required).

Kaminsky, Cecchetti and Eddington ran a very interesting fuzzing study, where they fuzzed three versions of Microsoft Office (plus various Open Source tools) and made their data available.

The faults of interest in this study were those that caused the program to crash. The plot below (code+data) shows the expected growth in the number of previously unseen faults in Microsoft Office 2003, 2007 and 2010, along with 95% confidence intervals; the x-axis is the number of faults experienced, the y-axis the number of distinct faults.

Predicted growth of unique faults experienced in Microsoft Office

The take-away point: if you are analyzing reported faults, the information needed to build models is contained in the number of times each distinct fault occurred.

Historians of computing

March 13th, 2018 5 comments

Who are the historians of the computing? The requirement I used for deciding who qualifies (for this post), is that the person has written multiple papers on the subject over a period that is much longer than their PhD thesis (several people have written history of some aspect of computing PhDs and then gone on to research other areas).

Maarten Bullynck. An academic who is a historian of mathematics and has become interested in software; use HAL to find his papers, e.g., What is an Operating System? A historical investigation (1954–1964).

Martin Campbell-Kelly. An academic who has spent his research career investigating computing history, primarily with a software orientation. Has written extensively on a wide variety of software topics. His book “From Airline Reservations to Sonic the Hedgehog: A History of the Software Industry” is still on my pile of books waiting to be read (but other historian cite it extensively). His thesis: “Foundations of computer programming in Britain, 1945-55″, can be freely downloadable from the British Library; registration required.

Paul E. Ceruzzi. An academic and museum curator; interested in aeronautics and computers. I found the one book of his that I have read, ok; others might like it more than me. Others cite him and he wrote an interesting paper on Konrad Zuse (The Early Computers of Konrad Zuse, 1935 to 1945).

James W. Cortada. Ex-IBM (1974-2012) and now working at the Charles Babbage Institute. Written extensively on the history of computing. More of a hardware than software orientation. Written lots of detail oriented books and must have pole position for most extensive collection of material to cite (his end notes are very extensive). His “Buy The Digital Flood: The Diffusion of Information Technology Across the U.S., Europe, and Asia” is likely to be the definitive work on the subject for some time to come. For me this book is spoiled by the author towing the company line in his analysis of the IBM antitrust trial; my analysis of the work Cortada cites reaches the opposite conclusion.

Nathan Ensmenger. An academic; more of a people person than hardware/software. His paper Letting the Computer Boys Take Over contains many interesting insights. His book The Computer Boys Take Over Computers, Programmers, and the Politics of Technical Expertise is a combination of topics that have been figured and back with references and topics still being figured out (I wish he would not cite Datamation, a trade mag back in the day, so often).

Kenneth S. Flamm. An academic and has held senior roles in government. Writes from a industry evolution, government interests, economic perspective. The books: “Targeting the Computer: Government Support and International Competition” and “Creating the Computer: Government, Industry and High Technology” are packed with industry related economic data and covers all the major industrial countries.

Michael S. Mahoney. An academic who is sadly no longer with us. A historian of mathematics before becoming involved with primarily software.

Jeffrey R. Yost. An academic. I have only read his book “Making IT Work: A history of the computer services industry”, which was really a collection of vignettes about people, companies and events; needs some analysis. Must try to track down some of his papers (which are not available via his web page :-(.

Who have I missed? This list is derived from papers/books I have encountered while working on a book, not an active search for historians. Suggestions welcome.

Updates

Completely forgot Kenneth S. Flamm, despite enjoying both his computer books.

Forgot Paul E. Ceruzzi because I was unimpressed by his “A History of Modern Computing”. Perhaps his other books are better.

Tags: , ,

Building a regression model is easy and informative

March 6th, 2018 4 comments

Running an experiment is very time-consuming. I am always surprised that people put so much effort into gathering the data and then spend so little effort analyzing it.

The Computer Language Benchmarks Game looks like a fun benchmark; it compares the performance of 27 languages using various toy benchmarks (they could not be said to be representative of real programs). And, yes, lots of boxplots and tables of numbers; great eye-candy, but what do they all mean?

The authors, like good experimentalists, make all their data available. So, what analysis should they have done?

A regression model is the obvious choice and the following three lines of R (four lines if you could the blank line) build one, providing lots of interesting performance information:

cl=read.csv("Computer-Language_u64q.csv.bz2", as.is=TRUE)
 
cl_mod=glm(log(cpu.s.) ~ name+lang, data=cl)
summary(cl_mod)

The following is a cut down version of the output from the call to summary, which summarizes the model built by the call to glm.

                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.299246   0.176825   7.348 2.28e-13 ***
namechameneosredux  0.499162   0.149960   3.329 0.000878 ***
namefannkuchredux   1.407449   0.111391  12.635  < 2e-16 ***
namefasta           0.002456   0.106468   0.023 0.981595    
namemeteor         -2.083929   0.150525 -13.844  < 2e-16 ***

langclojure         1.209892   0.208456   5.804 6.79e-09 ***
langcsharpcore      0.524843   0.185627   2.827 0.004708 ** 
langdart            1.039288   0.248837   4.177 3.00e-05 ***
langgcc            -0.297268   0.187818  -1.583 0.113531 
langocaml          -0.892398   0.232203  -3.843 0.000123 *** 
  
    Null deviance: 29610  on 6283  degrees of freedom
Residual deviance: 22120  on 6238  degrees of freedom

What do all these numbers mean?

We start with glm's first argument, which is a specification of the regression model we are trying to fit: log(cpu.s.) ~ name+lang

cpu.s. is cpu time, name is the name of the program and lang is the language. I found these by looking at the column names in the data file. There are other columns in the data, but I am running in quick & simple mode. As a first stab, I though cpu time would depend on the program and language. Why take the log of the cpu time? Well, the model fitted using cpu time was very poor; the values range over several orders of magnitude and logarithms are a way of compressing this range (and the fitted model was much better).

The model fitted is:

cpu.s. = e^{Intercept+name+prog}, or cpu.s. = e^{Intercept}*e^{name}*e^{prog}

Plugging in some numbers, to predict the cpu time used by say the program chameneosredux written in the language clojure, we get: cpu.s. = e^{1.3}*e^{0.5}*e^{1.2}=20.1 (values taken from the first column of numbers above).

This model assumes there is no interaction between program and language. In practice some languages might perform better/worse on some programs. Changing the first argument of glm to: log(cpu.s.) ~ name*lang, adds an interaction term, which does produce a better fitting model (but it's too complicated for a short blog post; another option is to build a mixed-model by using lmer from the lme4 package).

We can compare the relative cpu time used by different languages. The multiplication factor for clojure is e^{1.2}=3.3, while for ocaml it is e^{-0.9}=0.4. So clojure consumes 8.2 times as much cpu time as ocaml.

How accurate are these values, from the fitted regression model?

The second column of numbers in the summary output lists the estimated standard deviation of the values in the first column. So the clojure value is actually e^{1.2 pm (0.2*1.96)}, i.e., between 2.2 and 4.9 (the multiplication by 1.96 is used to give a 95% confidence interval); the ocaml values are e^{-0.9 pm (0.2*1.96)}, between 0.3 and 0.6.

The fourth column of numbers is the p-value for the fitted parameter. A value of lower than 0.05 is a common criteria, so there are question marks over the fit for the program fasta and language gcc. In fact many of the compiled languages have high p-values, perhaps they ran so fast that a large percentage of start-up/close-down time got included in their numbers. Something for the people running the benchmark to investigate.

Isn't it easy to get interesting numbers by building a regression model? It took me 10 minutes, ok I spend a lot of time fitting models. After spending many hours/days gathering data, spending a little more time learning to build simple regression models is well worth the effort.