Archive

Archive for the ‘Uncategorized’ Category

Data analysis with a manual mindset

October 16th, 2017 No comments

A lot of software engineering data continues to be analysed using techniques designed for manual implementation (i.e., executed without a computer). Yes, these days computers are being used to do the calculation, but they are being used to replicate the manual steps.

Statistical techniques are often available that are more powerful than the ‘manual’ techniques. They were not used during the manual-era because they are too computationally expensive to be done manually, or had not been invented yet; the bootstrap springs to mind.

What is the advantage of these needs-a-computer techniques?

The main advantage is not requiring that the data have a Normal distribution. While data having a Normal, or normal-like, distribution is common is the social sciences (a big consumer of statistical analysis), it is less common in software engineering. Software engineering data is often skewed (at least the data I have analysed) and what appear to be outliers are common.

It seems like every empirical paper I read uses a Mann-Whitney test or Wilcoxon signed-rank test to compare two samples, sometimes preceded by a statement that the data is close to being Normal, more often being silent on this topic, and occasionally putting some effort into showing the data is Normal or removing outliers to bring it closer to being Normally distributed.

Why not use a bootstrap technique and not have to bother about what distribution the data has?

I’m not sure whether the reason is lack of knowledge about the bootstrap or lack of confidence in not following the herd (i.e., what will everybody say if my paper does not use the techniques that everybody else uses?)

If you are living on a desert island and don’t have a computer, then you will want to use the manual techniques. But then you probably won’t be interested in analyzing software engineering data.

Histogram using log scale creates a visual artifact

October 7th, 2017 No comments

The following plot appears in the paper Stack Overflow in Github: Any Snippets There?

Histogram of file and function counts: log scale on x-axis

Don’t those twin peaks in the top-left/bottom-right plots reach out and grab your attention? I immediately thought of fitting a mixture of two Poisson distributions; No, No, No, something wrong here. The first question of data analysis is: Do I believe the data?

The possibility of fake data does not get asked until more likely possibilities have been examined first.

The y-axis is a count of things and the x-axis shows the things being counted; source files per project and functions per file, in this case.

All the measurements I know of show a decreasing trend for these things, e.g., lots of projects have a few files and a few projects have lots of files. Twin peaks is very unexpected.

I have serious problems believing this data, because it does not conform to my prior experience. What have the authors done wrong?

My first thought was that a measuring mistake had been made; for some reason values over a certain range were being incorrectly miscounted.

Then I saw the problem. The plot was of a histogram and the x-axis had a logarithmic scale. A logarithmic axis compresses the range in a non-linear fashion, which means that variable width bins have to be used for histograms and the y-axis represents density (not a count).

Taking logs and using the result to plot a histogram usually produces a curve having a distorted shape, not twin peaks. I think the twin peaks occur here because integer data are involved and the bin width just happened to have the ‘right’ value.

Looking at the plot below, the first bin contains values for x=1 (on an un-logged scale), the second bin for x=2, the third bin for x=3, but the fourth bin contains values for x=c(4, 5, 6). The nonlinear logarithmic compression, mapped to integers, means that the contents of three values are added to a single bin, creating a total that is larger than the third bin.

Histogram of 'thing' counts: log scale on x-axis

The R code that generated the above plot:

x=1:1e6
y=trunc(1e6/x^1.5)
log_y=log10(y)
 
hist(log_y, n=40, main="", xlim=c(0, 3))

I tried to mimic the pattern seen in the first histogram by trying various exponents of a power law (a common candidate for this kind of measurement), but couldn’t get anything to work.

Change the bin width can make the second peak disappear, or rather get smeared out. Still a useful pattern to look out for in the future.

Expected variability in a program’s SLOC

October 2nd, 2017 No comments

If 10 people independently implement the same specification in the same language, how much variation will there be in the length of their programs (measured in lines of code)?

The data I have suggests that the standard deviation of program length is one quarter of the mean length, e.g., 10k mean length, 2.5k standard deviation.

The plot below (code+data) shows six points from the samples I have. The point in the bottom left is based on 6,300 C programs from a programming contest question requiring solutions to the 3n+1 problem and one of the points on the right comes from five Pascal compilers for the same processor.

Mean vs standard deviation of sample program SLOC

Multiple implementations of the same specification, in the same language, are very rare. If you know of any, please let me know.

Tags:

Experimental method for measuring benefits of identifier naming

September 28th, 2017 No comments

I was recently came across a very interesting experiment in Eran Avidan’s Master’s thesis. Regular readers will know of my interest in identifiers; while everybody agrees that identifier names have a significant impact on the effort needed to understand code, reliably measuring this impact has proven to be very difficult.

The experimental method looked like it would have some impact on subject performance, but I was not expecting a huge impact. Avidan’s advisor was Dror Feitelson, who kindly provided the experimental data, answered my questions and provided useful background information (Dror is also very interested in empirical work and provides a pdf of his book+data on workload modeling).

Avidan’s asked subjects to figure out what a particular method did, timing how long it took for them to work this out. In the control condition a subject saw the original method and in the experimental condition the method name was replaced by xxx and local and parameter names were replaced by single letter identifiers; in all cases the method name was replaced by xxx. The hypothesis was that subjects would take longer for methods modified to use ‘random’ identifier names.

A wonderfully simple idea that does not involve a lot of experimental overhead and ought to be runnable under a wide variety of conditions, plus the difference in performance is very noticeable.

The think aloud protocol was used, i.e., subjects were asked to speak their thoughts as they processed the code. Having to do this will slow people down, but has the advantage of helping to ensure that a subject really does understand the code. An overall slower response time is not important because we are interested in differences in performance.

Each of the nine subjects sequentially processed six methods, with the methods randomly assigned as controls or experimental treatments (of which there were two, locals first and parameters first).

The procedure, when a subject saw a modified method was as follows: the subject was asked to explain the method’s purpose, once an answer was given (or 10 mins had elapsed) either the local or parameter names were revealed and the subject had to again explain the method’s purpose, and when an answer was given the names of both locals and parameters was revealed and a final answer recorded. The time taken for the subject to give a correct answer was recorded.

The summary output of a model fitted using a mixed-effects model is at the end of this post (code+data; original experimental materials). There are only enough measurements to have subject as a random effect on the treatment; no order of presentation data is available to look for learning effects.

Subjects took longer for modified methods. When parameters were revealed first, subjects were 268 seconds slower (on average), and when locals were revealed first 342 seconds slower (the standard deviation of the between subject differences was 187 and 253 seconds, respectively; less than the treatment effect, surprising, perhaps a consequence of information being progressively revealed helping the slower performers).

Why is subject performance less slow when parameter names are revealed first? My thoughts: parameter names (if well-chosen) provide clues about what incoming values represent, useful information for figuring out what a method does. Locals are somewhat self-referential in that they hold local information, often derived from parameters as initial values.

What other factors could impact subject performance?

The number of occurrences of each name in the body of the method provides an opportunity to deduce information; so I think time to figure out what the method does should less when there are many uses of locals/parameters, compared to when there are few.

The ability of subjects to recognize what the code does is also important, i.e., subject code reading experience.

There are lots of interesting possibilities that can be investigated using this low cost technique.

Linear mixed model fit by REML ['lmerMod']
Formula: response ~ func + treatment + (treatment | subject)
   Data: idxx
 
REML criterion at convergence: 537.8
 
Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.34985 -0.56113 -0.05058  0.60747  2.15960
 
Random effects:
 Groups   Name                      Variance Std.Dev. Corr
 subject  (Intercept)               38748    196.8
          treatmentlocals first     64163    253.3    -0.96
          treatmentparameters first 34810    186.6    -1.00  0.95
 Residual                           43187    207.8
Number of obs: 46, groups:  subject, 9
 
Fixed effects:
                          Estimate Std. Error t value
(Intercept)                  799.0      110.2   7.248
funcindexOfAny              -254.9      126.7  -2.011
funcrepeat                  -560.1      135.6  -4.132
funcreplaceChars            -397.6      126.6  -3.140
funcreverse                 -466.7      123.5  -3.779
funcsubstringBetween        -145.8      125.8  -1.159
treatmentlocals first        342.5      124.8   2.745
treatmentparameters first    267.8      106.0   2.525
 
Correlation of Fixed Effects:
            (Intr) fncnOA fncrpt fncrpC fncrvr fncsbB trtmntlf
fncndxOfAny -0.524
funcrepeat  -0.490  0.613
fncrplcChrs -0.526  0.657  0.620
funcreverse -0.510  0.651  0.638  0.656
fncsbstrngB -0.523  0.655  0.607  0.655  0.648
trtmntlclsf -0.505 -0.167 -0.182 -0.160 -0.212 -0.128
trtmntprmtf -0.495 -0.184 -0.162 -0.184 -0.228 -0.213  0.673

An Almanac of the Internet

September 22nd, 2017 No comments

My search for software engineering data has turned me into a frequent buyer of second-hand computer books, many costing less than the postage of £2.80. When the following suggestion popped up along-side a search, I could not resist; there must be numbers in there!

internet almanac

The concept of an Almanac will probably be a weird idea to readers who grew up with search engines and Wikipedia. But yes, many years ago, people really did make a living by manually collecting information and selling it in printed form.

One advantage of the printed form is that updating it requires a new copy, the old copy lives on unchanged (unlike web pages); the disadvantage is taking up physical space (one day I will probably abandon this book in a British rail coffee shop).

Where did Internet users hang out in 1997?

top websites 97

The history of the Internet, as it appeared in 1997.

internet history viewed from 1997

Of course, a list of web sites is an essential component of an Internet Almanac:

website list from  1997

Tags:

Investing in the gcc C++ front-end

September 12th, 2017 No comments

I recently found out that RedHat are investing in improving the C++ front-end of gcc, i.e., management have assigned developers to work in this area. What’s in it for RedHat? I’m told there are large companies (financial institutions feature) who think that using some of the features added to recent C++ standards (these have been appearing on a regular basis) will improve the productivity of their developers. So, RedHat are hoping this work will boost their reputation and increase their sales to these large companies. As an ex-compiler guy (ex- in the sense of being promoted to higher levels that require I don’t do anything useful), I am always in favor or companies paying people to work on compilers; go RedHat.

Is there any evidence that features that have been added to any programming language improved developer productivity? The catch to this question is defining programmer productivity. There have been several studies showing that if productivity is defined as number of assembly language lines written per day, then high level languages are more productive than assembler (the lines of assembler generated by the compiler were counted, which is rather compiler dependent).

Of the 327 commits made this year to the gcc C++ front-end, by 29 different people, 295 were made by one of 17 people employed by RedHat (over half of these commits were made by two people and there is a long tail; nine people each made less than four commits). Measuring productivity by commit counts has plenty of flaws, but has the advantage of being easy to do (thanks Jonathan).

The fuzzy line between reworking and enhancing

August 29th, 2017 No comments

One trick academics use to increase their publication count is to publish very similar papers in different conferences/journals; they essentially plagiarize themselves. This practice is frowned upon, but unless referees spot the ‘duplication’, it is difficult to prevent such plagiarized versions being published. Sometimes the knock-off paper will include additional authors and may not include some of the original authors.

How do people feel about independent authors publishing a paper where all the interesting material was derived from someone else’s paper, i.e., no joint authors? I have just encountered such a case in empirical software engineering.

“Software Cost Estimation: Present and Future” by Siba N. Mohanty from 1981 (cannot find a non-paywall pdf via Google; must exist because I have a copy) has been reworked to create “Cost Estimation: A Survey of Well-known Historic Cost Estimation Techniques” by Syed Ali Abbas and Xiaofeng Liao and Aqeel Ur Rehman and Afshan Azam and M. I. Abdullah (published in 2012; pdf here); they cite Mohanty as the source of their data, some thought has obviously gone into the reworked material and I found it useful and there is a discussion on techniques created since 1981.

What makes the 2012 stand out as interesting is the depth of analysis of the 1970s models and the data, all derived from the 1981 paper. The analysis of later models is not as interesting and doe snot include any data.

The 2012 paper did ring a few alarm bells (which rang a lot more loudly after I read the 1981 paper):

  • Why was such a well researched and interesting paper published in such an obscure (at least to me) journal? I have encountered such cases before and had email conversations with the author(s). The well-known journals have not always been friendly towards empirical research, so an empirical paper appearing in less than a stellar publication is not unusual.

    As regular readers will know I am always on the look-out for software engineering data and am willing to look far and wide. I judge a paper by its content, not the journal it was published in

  • Why, in 2012, were researchers comparing effort estimation models proposed in the 1970s? Well, I am, so why not others? It did seem odd that I could not track down papers on some of the models cited, perhaps the pdfs had disappeared since 2012??? I think I just wanted to believe others were interested in what I was interested in.

What now? Retraction watch offers some advice.

The Journal of Emerging Trends in Computing and Information Sciences has an ethics page, I will email them a link to this post and see what happens (the article in question is listed as their second most cited article last year, with 19 citations).

Microcomputers ‘killed’ Ada

August 25th, 2017 No comments

In the mid-70s the US Department of Defense decided it could save lots of money by getting all its contractors to write code in the same programming language. In February 1980 a language was chosen, Ada, but by the end of the decade the DoD had snatched defeat from the jaws of victory; what happened?

I think microcomputers is what happened; these created whole new market ecosystems, some of which were much larger than the ecosystems that mainframes and minicomputers sold into.

These new ecosystems sucked up nearly all the available software developer mind-share; the DoD went from being a major employer of developers to a niche player. Developers did not want a job using Ada because they thought that being type-cast as Ada programmers would overly restricted their future job opportunities; Ada was perceived as a DoD only language (actually there was so little Ada code in the DoD, that only by counting new project starts could it get any serious ranking).

Lots of people were blindsided by the rapid rise (to world domination) of microcomputers. Compilers could profitably sold (in some cases) for tends or hundreds of dollars/pounds because the markets were large enough for this to be economically viable. In the DoD ecosystems compilers had to be sold for thousands or hundreds of thousands of dollars/pounds because the markets were small. Micros were everywhere and being programmed in languages other than Ada; cheap Ada compilers arrived after today’s popular languages had taken off. There is no guarantee that cheap compilers would have made Ada a success, but they would have ensured the language was a serious contender in the popularity stakes.

By the start of the 90s Ada supporters were reduced to funding studies to produce glowing reports of the advantages of Ada compared to C/C++ and how Ada had many more compilers, tools and training than C++. Even the 1991 mandate “… where cost effective, all Department of Defense software shall be written in the programming language Ada, in the absence of special exemption by an official designated by the Secretary of Defense.” failed to have an impact and was withdrawn in 1997.

The Ada mandate was cancelled as the rise of the Internet created even bigger markets, which attracted developer mind-share towards even newer languages, further reducing the comparative size of the Ada niche.

Astute readers will notice that I have not said anything about the technical merits of Ada, compared to other languages. Like all languages, Ada has its fanbois; these are essentially much older versions of the millennial fanbois of the latest web languages (e.g., Go and Rust). There is virtually no experimental evidence that any feature of any language is best/worse than any feature in any other language (a few experiments showing weak support for stronger typing). To its credit the DoD did fund a few studies, but these used small samples (there was not yet enough Ada usage to make larger sample possible) that were suspiciously glowing in their support of Ada.

Tags: ,

We hereby retract the content of this paper

August 17th, 2017 No comments

Yesterday I came across a paper in software engineering that had been retracted, the first time I had encountered such a paper (I had previously written about how software engineering is great discipline for an academic fraudster).

Having an example of the wording used by the IEEE to describe a retracted paper (i.e., “this paper has been found to be in violation of IEEE’s Publication Principles”), I could search for more. I get 24,400 hits listed when “software” is included in the search, but clicking through the pages there are just 71 actual results.

A search of Retraction Watch using “software engineering” returns nine hits, none of which appear related to a software paper.

I was beginning to think that no software engineering papers had been retracted, now I know of one and if I am really interested the required search terms are now known.

Tags:

Two approaches to arguing the 1969 IBM antitrust case

August 16th, 2017 No comments

My search for software engineering data has made me a regular customer of second-hand book sellers; a recent acquisition is: “Big Blue: IBM’s use and abuse of power” by Richard DeLamarter, which contains lots of interesting sales and configuration data for IBM mainframes from the first half of the 1960s.

DeLamarter’s case, that IBM systematically abused its dominant market position, looked very convincing to me, but I saw references to work by Franklin Fisher (and others) that, it was claimed, contained arguments for IBM’s position. Keen to find more data and hear alternative interpretations of the data, I bought “Folded, Spindled, and Mutilated” by Fisher, McGowan and Greenwood (by far the cheaper of the several books that have written on the subject).

The title of the book, Folded, Spindled, and Mutilated, is an apt description of the arguments contained in the book (which is also almost completely devoid of data). Fisher et al obviously recognized the hopelessness of arguing IBM’s case and spend their time giving general introductions to various antitrust topics, arguing minor points or throwing up various smoke-screens.

An example of the contrasting approaches is calculation of market share. In order to calculate market share, the market has to be defined. DeLamarter uses figures from internal IBM memos (top management were obsessed with maintaining market share) and quote IBM lawyers’ advice to management on phrases to use (e.g., ‘Use the term market leadership, … avoid using phrasing such as “containment of competitive threats” and substitute instead “maintain position of leadership.”‘); Fisher et al arm wave at length and conclude that the appropriate market is the entire US electronic data processing industry (the more inclusive the market used, the lower the overall share that IBM will have; using this definition IBM’s market share drops from 93% in 1952 to 43% in 1972 and there is a full page graph showing this decline), the existence of IBM management memos is not mentioned.

Why do academics risk damaging their reputation by arguing these hopeless cases (I have seen it done in other contexts)? Part of the answer is a fat pay check, but also many academics’ consider consulting for industry akin to supping with the devil (so they get a free pass on any nonsense sprouted when “just doing it for the money”).

Tags: ,