Author Archive

Machine learning in SE research is a bigger train wreck than I imagined

November 23rd, 2015 No comments

I am at the CREST Workshop on Predictive Modelling for Software Engineering this week.

Magne Jørgensen, who almost single-handedly continues to move software cost estimation research forward, kicked off proceedings. Unfortunately he is not a natural speaker and I think most people did not follow the points he was trying to get across; don’t panic, read his papers.

In the afternoon I learned that the use of machine learning in software engineering research is a bigger train wreck than I had realised.

Machine learning is great for situations where you have data from an application domain that you don’t know anything about. Let’s say you want to do fault prediction but don’t have any practical experience of software engineering (because you are an academic who does not write much code); what do you do? Well, you could take some source code measurements (usually on a per-file basis, which is a joke given that many of the metrics often used only have meaning on a per-function basis, e.g., Halstead and cyclomatic complexity) and information on the number of faults reported in each of these files, throw it all into a machine learner to figure out the patterns, and build a predictor (e.g., to predict which files are most likely to contain faults).
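As a rough illustration, here is a minimal sketch of that workflow in R; the file and column names are hypothetical, and the learner is run with nothing but its default settings.

    library(rpart)

    metrics <- read.csv("file_metrics.csv")      # hypothetical per-file metrics plus fault counts
    metrics$faulty <- factor(metrics$faults > 0) # fault-prone vs fault-free files

    # Fit a classification tree using only the default settings.
    fault_tree <- rpart(faulty ~ loc + halstead + cyclomatic, data = metrics,
                        method = "class")
    predicted <- predict(fault_tree, type = "class")  # predicted fault-proneness per file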

There are various ways of measuring the accuracy of the predictions made by a model and there is a growing industry of researchers devoted to publishing papers showing that their model does a better job at prediction than anything else that has been published (yes, they really do argue over a percent or two; use of confidence bounds is too technical for them and would kill their goose).

I long ago learned to ignore papers on machine learning in software engineering. Yes, sooner or later somebody will do something interesting and I will miss it, but I will have retained my sanity.

Today I learned that many researchers have been using machine learning “out of the box”, that is, with whatever default settings the implementation provides. How did I learn this? Well, one of the speakers talked about using R’s caret package to tune the options available in many machine learners to build models with improved predictive performance. Some slides showed that the performance of caret-tuned models was often substantially better than that of the non-caret-tuned models, and many people in the room were aghast; “If true, this means that all existing papers [based on machine learning] are dead” (because somebody will now come along and build a better model using caret; cannot recall whether “dead” or some other term was used, but you get the idea), “I use the defaults because of concerns about breaking the code by using inappropriate options” (obviously somebody untroubled by knowledge of how machine learning works).
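For anyone curious what the tuning looks like in practice, here is a minimal sketch using caret on the same kind of hypothetical fault data as above (the column names are assumptions, not taken from any of the talks):

    library(caret)

    metrics <- read.csv("file_metrics.csv")
    metrics$faulty <- factor(metrics$faults > 0)

    # caret searches over the learner's tuning parameters using cross-validation,
    # rather than accepting whatever the defaults happen to be.
    tuned <- train(faulty ~ loc + halstead + cyclomatic, data = metrics,
                   method = "rpart",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneLength = 10)
    print(tuned)   # compare against an untuned fit of the same learner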

I think that use of machine learning, for the purpose of prediction (using it to build models to improve understanding is ok), in software engineering research should be banned. Of course there are too many clueless researchers who need the crutch of machine learning to generate results that can be included in papers that stand some chance of being published.

The Empirical Investigation of Perspective-Based Reading: Data analysis

November 20th, 2015 No comments

Questions about the best way to perform code reviews go back almost to the start of software development. The perspective-based reading approach focuses reviewers’ attention on the needs of the users of the document/code, e.g., tester, user, designer, etc, and “The Empirical Investigation of Perspective-Based Reading” is probably the most widely cited paper on the subject. This paper is so widely cited I decided it was worth taking the time to email the authors of a 20-year-old paper, asking whether the original data was still available and whether I could have a copy to use in a book I am working on. Filippo Lanubile’s reply included two files containing the data (original files, converted files+code)!

How do you compare the performance of different approaches to finding problems in documents/code? Start with experienced subjects, to minimize learning effects during the experiment (doing this also makes any interesting results an easier sell; professional developers know how unrealistic student performance tends to be); the performance of subjects using what they already know has to be measured first, because learning another technique first would contaminate any subsequent performance measurements.

In this study subjects reviewed four documents over two days; the documents were two NASA specifications and two generic domain specifications (bank ATM and parking garage); the documents were seeded with faults. Subjects were split into two groups and read documents in the following sequences:

                Group 1     Group 2
Day 1
                NASA A      NASA B
                ATM         PG
Day 2
                Perspective-based reading training
                PG          ATM
                NASA B      NASA A

The data contains repeated measurements of the same subject (i.e., their performance on different documents using one of two techniques), so mixed-model regression has to be used to build a model.

I built two models, one for the number of faults detected and another for the number of false positive faults flagged (i.e., something that was not a fault flagged as a fault).
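For the curious, a minimal sketch of the kind of mixed-model fit involved is below; the file and column names are my own inventions, not those in the data files.

    library(lme4)

    reviews <- read.csv("pbr_reviews.csv")   # hypothetical: one row per subject/document review

    # Repeated measurements of the same subject are handled via a per-subject random effect.
    fault_mod <- lmer(percent_detected ~ doc_kind + doc_order + technique +
                      (1 | subject), data = reviews)
    summary(fault_mod)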

The two significant predictors of the percentage of known faults detected were the kind of document (a higher percentage was detected in the NASA documents) and the order in which documents were processed on each day (a higher percentage was reported for the first document; switching the document kind ordering across groups would have enabled more detail to be teased out).

The false positive model was more complicated; predictors included the number of pages reviewed (i.e., the more pages reviewed, the more false positive reports; no surprise here), use of the perspective-based reading technique (this also included an interaction with number of years of experience) and the kind of document.

So use of perspective-based reading did not make a noticeable difference (its false positive impact was in amongst other factors). Possible reasons that come to mind include subjects not being given enough time to switch reading techniques (people need time to change established habits), and that some of the other reading techniques used may have been better/worse than perspective-based reading, averaging out overall to no difference.

This paper is worth reading for the discussion of the issues involved in trying to control factors that may have a noticeable impact on experimental results and the practical issues of using professional developers as subjects (the authors clearly put a lot of effort into doing things right).

Please let me know if you build any interesting model using the data.

Peak memory transfer rate is best SPECint performance predictor

November 18th, 2015 No comments

The idea that cpu clock rate is the main driver of system performance on the SPEC cpu benchmark is probably well entrenched in developers’ psyche. Common knowledge, or folklore, is slow to change. The apparently relentless increase in cpu clock rate, a side-effect of Moore’s law, stopped over 10 years ago, but many developers still behave as if it were still happening today.

The plot below shows the SPEC cpu integer performance of 4,332 systems running at various clock rates; the colors denote the different peak memory transfer rates of the memory chips in these systems (code and data).

SPEC cpu integer performance vs. cpu clock rate

Fitting a regression model to this data, we find that the following equation explains 80% of the variance (more complicated models fit better, but let’s keep it simple):

SPECint = -2.4*10^1 + 7.3*10^{-3}*Processor.MHz + 2.5*10^{-3}*memrate + 1.0*10^{-2}*memfreq

where: Processor.MHz is the processor clock rate, memrate the peak memory transfer rate and memfreq the frequency at which the memory is clocked.

Analysing the relative contribution of each of the three explanatory variables turns up the surprising answer that peak memory transfer rate explains significantly more of the variation in the data than processor clock rate (by around a factor of five).
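A minimal sketch of the kind of fit and comparison involved, assuming a data frame with the column names used in the equation (the file name is made up):

    cpus <- read.csv("spec_cpu_results.csv")   # hypothetical file of SPEC results

    spec_mod <- lm(SPECint ~ Processor.MHz + memrate + memfreq, data = cpus)
    summary(spec_mod)   # coefficients and variance explained (R-squared)
    anova(spec_mod)     # rough view of how much variation each predictor accounts for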

If you are in the market for a new computer and are interested in relative performance, you can obtain a much more accurate estimate of performance by using the peak memory transfer rate of the DIMMs contained in the various systems you were considering. Good luck finding out these numbers; I bet the first response to your question will be “What is a DIMM?”

Developer focus on cpu clock rate is no accident; there is one dominant supplier who is willing to spend billions on marketing, with programs such as Intel Inside and rebates to manufacturers for using their products. There is no memory chip supplier enjoying the dominance that Intel has in cpus, hence nobody is willing to spend the billions in marketing needed to create customer awareness of the importance of memory performance to their computing experience.

The plot below shows the SPEC cpu integer performance against peak memory transfer rates; the colors denote the different cpu clock rates in these systems.

SPEC cpu integer performance against peak memory transfer rates

Word length lexical decision data

November 4th, 2015 No comments

Well chosen identifier names can reduce the effort needed to understand the code containing them, compared to badly chosen identifiers.

An identifier might have a name that creates a semantic association in the mind of the reader about the role of the variable within a function definition (e.g., outer_counter) or an association with the information contained in the variable (e.g., max_fruit).

Code is not always read like prose text; developers might quickly scan through the source looking for something. In this case short identifier names are best, because they reduce the number of characters that need to be scanned.

If you want to make life difficult for anyone who has to read your code, add a visually boring common prefix to every identifier, e.g., uacc. Readers start looking up a word in memory based on the first few characters while they process the remaining characters in the word, and eye tracking studies have found that this character sequence information is used to plan eye saccades (where to move the eyes next). A short bland sequence can really throw a spanner in the works of our over-learned reading skills.

I once worked through a detailed analysis of the issues involved in a cost/benefit analysis of identifier selection. The good news is that I think I covered everything; the bad news is that the various kinds of data on human character sequence usage needed to perform the analysis were/are not available.

Today I got my hands on lots of performance data on the effect of word length on visual word recognition; thanks to Boris New (code and data).

The plot below shows the mean response time of 819 subjects performing a lexical decision task (respond yes/no on whether a character sequence is a word or nonword); each subject was tested on a subset of around 3,000 out of 33,608 words.

Lexical decision time for words of various lengths
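A minimal sketch of how such a plot might be produced from the data (the file and column names here are my guesses; check the downloaded files):

    lexdec <- read.csv("lexical_decision.csv")   # hypothetical: one row per word, mean rt in ms

    lexdec$len <- nchar(as.character(lexdec$word))
    mean_rt <- aggregate(rt ~ len, data = lexdec, FUN = mean)

    plot(mean_rt$len, mean_rt$rt, type = "b",
         xlab = "Word length (characters)", ylab = "Mean response time (ms)")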

Note, this data is for single words. There are bound to be all sorts of interaction effects when two words/nonwords occur together in an identifier, e.g., semantic priming.

Word length is only one of several factors that have been found to affect people’s performance in processing words; others include the word frequency effect and age of acquisition (when the word was learned, which is correlated with word frequency).

Have fun with this.


Operator assignment may be in the next Ada standard

October 28th, 2015 No comments

WG9, the Ada Standard committee, are looking into adding support for operator assignment in the next revision of the language standard. Ada was designed back in the day when C’s operator assignments, such as +=, were a novelty (all the languages we know today that follow C’s lead, including choice of operator precedence, did not yet exist); developers using serious languages knew that typing everything out in full was good for you.

It’s amazing to read about “… those who have been damaged by terser languages…” in the proposal comments. Do I only think that My_Package.My_Array(I).Field +:= 1; is so much easier to read than My_Package.My_Array(I).Field := My_Package.My_Array(I).Field + 1; because my brain has been damaged? In the second statement I have to parse two subexpressions and notice that they refer to the same object, while the first statement only requires me to parse one subexpression, and the +:= operator tells me that something is being added to its value. The first statement requires a lot less effort and is likely to be less error prone (I bet Ada developers make cut-and-paste errors just like the rest of us).

Did you notice the use of round brackets to specify an array index? Yes, the eighties were the decade when the programming language world was badgered into being politically correct, only requiring the use of characters that appeared on every European country’s keyboard.

The Ada proposal started out following the C convention, e.g., +=, *=, etc, then switched to :+, :*, etc. The assignment token in Ada is := and there is a logic to switching out the = for another operator. However, I would argue that +:, *:, etc, would be a less error prone ordering of characters. People pay more attention to the first character in a sequence, and somebody in a hurry may not notice that the colon is not followed by an =; putting the ‘unexpected’ character first is more likely to catch the reader’s attention.

Another proposal is for the @ token to act as a placeholder. The only place I can see this being useful on any regular basis is in bitwise manipulation: My_Package.My_Array(I).Field := (@ << 16) or @ ; (Ada does not support << as an operator, so reality is a bit more complicated than this). I don't think there are enough situations where placeholders would be useful to warrant using a solution that is more complicated than operator assignment.

One proposed advantage of the newfangled operators is making life easier for non-Ada users (see the discussion in the proposal comments). The “I have put lots of effort into Ada, why should others have it easy” die-hards are fighting a rearguard action to make sure life is not too easy for newcomers.

Researching just to create software patents

October 21st, 2015 No comments

I was recently reading the paper “ARC++: Effective Typestate and Lifetime Dependency Analysis” (cannot find a public pdf now). What caught my attention was the fact that it involved static analysis of C++ source, something few researchers get involved in because of the technical complexity of parsing C++. I wanted to know more, and guess what popped up near the top of a Google search: a patent with almost the same title as the paper and with many of the paper’s authors listed as the inventors!

As far as I can tell the Claims section of the patent does not list anything new or novel (in fact the only prior art listed was added by the patent examiner). It is not hard to understand why this patent exists; the main paper author works in an NEC research laboratory and commercial research labs have to ‘pay’ their way by producing patents that the parent company can brandish at competitors.

At least it looks like the patent was applied for before the paper was published. The most extreme case I have encountered, of a paper being published and a patent application following sometime later, was during my spell as an adviser to the Monitoring Trustee appointed by the European Commission in the EU/Microsoft competition court case. One of the non-patented innovations being claimed by Microsoft (or perhaps the patent had been granted, my memory is fuzzy) was for something described in a PhD thesis whose author went to work for Microsoft after completing his PhD; this was back in the day when typing a couple of technical terms and the name listed as the patent inventor into a search engine was still something new for bean counters.

Lisp and functional languages discourage free riders

October 15th, 2015 3 comments

While many developers have a favorite programming language, there are a few who believe they have found the One True Language and refuse to even consider coding in another language.

Why is the One True Language invariably a dialect of Lisp or a functional language?

I think the reason is the same as why strict churches are strong, both these language families make life difficult for free riders, i.e., the casual programmer.

Superficially, Lisp-like languages look unwelcoming because of all those brackets and the tiresome prefix notation, but once past these surface speed bumps life is not a bed of roses; there are mind-bending language challenges to master at every abstraction level, and developers get sucked into the community working on mastering each level (this suggests an alternative explanation, that coding in Lisp is a way of continuing to play Dungeons & Dragons while appearing to work).

True functional languages don’t have global variables and certainly don’t let you create stateful information. These self-flagellating, no-global-variable languages have a limited clientèle; it’s the writing of programs that don’t make use of side-effects (e.g., iteration via recursion, not explicit loops) that marks out the true community members; a short conversation with a developer is enough to tell whether they are one-of-us who joyfully tells the world of their latest assignment-free solution to an apparently intractable problem (intractable in the sense of appearing to require the use of assignment statements). There are always umpteen different ways of writing something in functional languages, providing plenty of scope for sects to splinter off by requiring disciples to follow a particular approved style.

Why have Lisp and functional programming continued to survive for so long? Some interesting research on communal societies has found a correlation between the number of costly requirements entailed by community membership and community longevity, the greater the number of costly requirements the longer a community survives. Having sunk so much time and effort into the costly signaling required for community membership, people are loath to leave it all behind.

Citation patterns in my two books

October 5th, 2015 No comments

When writing my C book, I cited any paper or book whose material I made use of and/or that I thought would be useful to the reader. One of the rules for academic papers is to cite the paper that ‘invented’ the idea; this is intended to incentivize researchers to work hard to discover new things that will be cited many times (citation count is a measure of the importance of the work and these days a metric used when deciding promotions and awarding grants).

When I started writing the C book the premier search engine was AltaVista, with Google becoming number one a few years before the book was completed. Finding papers online was still a wondrous experience and Citeseer was a godsend.

The plot below shows the numbers of works cited by year of publication, for the C book.

C book: Number of papers referenced in any year

These days all the information we could possibly need is said to be online. I don’t think this is true, but it might be a good enough approximation. But being online does not mean being available for free; a lot of academic papers remain behind paywalls.

It used to be possible to visit a good University library and copy papers of interest (I have a filing cabinet full of paper from the C book). Those days are gone, with libraries moving their reference stock off-site (the better ones offer on-premise online access).

My book on empirical software engineering is driven by what data is available, which means most cited work is going to be relatively new. There is another big factor driving the work I cite this time around; I am fed up with tax payer funded research ending up behind paywalls, so I am only citing papers that can be downloaded for free (in practice they also have to be found by a search engine or linked to from somewhere or other) and when the ‘discovery’ paper is not available for download I cite a later work that is.

The plot below shows the numbers of works cited by year of publication, for the empirical software engineering draft book.

Empirical book: Number of papers referenced in any year

The rising slope is much sharper for the latest book. I think most of the difference is driven by the newness of the subject; software researchers tend to be very good (at least the non-business department ones) at making pdfs available for download.

Another factor might be how Citeseer and Google Scholar cross reference papers; Citeseer links to works cited by a paper (i.e., link back in time), while Google Scholar links to works that cite the paper (i.e., link forward in time).


Forces driving patterns found in call graphs of human written code

September 30th, 2015 No comments

It has been a while since I posted anything about generating source that mimics the characteristics of human written code.

Generating function definitions is the easy bit, although variable selection is fiddly and naming needs to be handled.

The hardest part of mimicking human written code is linking functions together, via calls, in apparently meaningful ways. Within one source file it is probably possible to get away with individual calls to other local functions (most function definitions are called once) and a couple of calls to third party libraries. At the complete program level the code needs to tell a story, and doing this is very hard.

I think function calls can be divided into three categories, all based on the relationship between the developer who writes the code that does the call and the developer(s) who wrote the called function:

  • Caller developer is the same person as the callee developer. One person gets to decide how things are done,
  • Caller developer works on the same team as the callee developer. Here an interface needs to be negotiated with one or more other people,
  • Caller developer is making use of a third-party library. The interface is pre-decided, take it or leave it (the same principle applies to library updates, the caller has to decide whether to stay with the existing version or make the changes needed to upgrade to the new version; Android is the (in)famous example of a frequently changed library that developers are under strong pressure to continually adapt to).

Calls to functions in third-party libraries tend to follow stylized sequences, e.g., open_* occurs before read_*/write_* and close_* appears last. Most of the time a generator could get away with not calling functions from third-party libraries, because the number of such calls is often small, but when they do occur they had better look correct.

Needing to call a function written by another developer on the team opens up all sorts of possibilities: there is always the option of them modifying the function to make life easier for the new call, but the cost of modifying all the existing call sites (plus any associated call chains) may be too high, perhaps a new global variable can be used to communicate the desired information, or perhaps the proposed usage is a small cog in a large wheel and has to make do (or perhaps they don’t like the person asking for the change and reasons are invented for staying as-is).

Then there is correlation in time and space (this has a big impact on patterns of evolution, at least I think so; models of forest fires, i.e., growth, death, and fires of various sizes creating space for new growth, are the obvious parallel):

  • The longer a function exists the more likely it is to accumulate callers. A function definition can remain unchanged and yet over time become more and more difficult to change.
  • It can be very expensive to make changes to an existing function definition when there are lots of calls to it.

What do measurements of code have to say? Almost nothing; existing studies mostly do weird things like treating a system’s call graph as if it were a social network (using a currently trendy metaphor is good for getting papers published, even if the mapping is all wrong), plotting power law-like graphs and sprouting portentous nonsense.

What is really needed is measurements of forked systems; comparing systems derived from a common past will tell us how much natural variation exists due to individual choice.

FreeBSD, NetBSD and OpenBSD are the obvious poster children of forked, common heritage, systems written in C; I cannot think of any such systems written in Java or C++.

Workshop on analyzing software engineering data

September 25th, 2015 No comments

I am teaching a workshop, analyzing software engineering data, on 16 January 2016. If you meet the assumed level of know-how (basic understanding of maths to GCSE level, fluent in at least one programming language {i.e., written 10k+ lines of code} and will turn up with a laptop that has R installed), then you are welcome to sign up, it’s free. The event is being organized by ACCU London.

The focus is on extracting information that is useful to practicing software developers for creating software systems; statistics is used as a tool to find patterns in the data (R is used for this and the programs have the form: read_data(); format_data(); appropriate_statistical_function(); plot_results() and are usually contained in 10-30 lines).
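A minimal sketch of that pattern (the file and column names are invented for illustration):

    projects <- read.csv("project_effort.csv")                  # read_data()
    projects$log_effort <- log(projects$effort)                 # format_data()
    effort_mod <- lm(log_effort ~ log(kloc), data = projects)   # appropriate_statistical_function()
    plot(log(projects$kloc), projects$log_effort)               # plot_results()
    abline(effort_mod)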

The maths/programming requirements are there because the focus is on the software engineering ideas implied by the data; people need to implicitly understand how an equation fits together (not because there will be lots of algebra, there isn’t) and to be able to pick up and use a new language quickly.

The material is based on a book I am working on.

It’s a hands-on workshop, with me talking for an hour or so and then everybody analyzing data for an hour, repeating until end-of-day. I have plenty of data for you to work on, but if you do have some software engineering data that you are willing to share with everybody, please bring it along.

The workshop is something of an experiment; as far as I know there are no books or courses aimed at software developers interested in analyzing software engineering data (there are a few books containing an assortment of academic papers). If the material is too easy I can speed up, if it is too hard then I will slow down; if the material is of no practical use we can all leave early.

The plan is to start at the beginning and cover all the important topics in software engineering. Obviously this requires more than a one-day workshop. If there is enough interest there will be more workshops covering different topics (assuming I have time to organize the material and a venue is available).
