Archive

Posts Tagged ‘R’

Reliability chapter added to “Empirical software engineering using R”

April 3rd, 2018 No comments

The Reliability chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I have been working on this draft for four months and it still needs lots of work; time to move on and let it stew for a while. Part of the problem is lack of public data; cost and schedule overruns can be rather public (projects chapter), but reliability problems are easier to keep quiet.

Originally there was a chapter covering reliability and another one covering faults. As time passed, these merged into one. The material kept evaporating in front of my eyes (around a third of the initial draft, collected over the years, was deleted); I have already written about why most fault prediction research is a waste of time. If it had not been for Rome I would not have had much to write about.

Perhaps what will jump out at people most, is that I distinguish between mistakes in code and what I call a fault experience. A fault_experience=mistake_in_code + particular_input. Most fault researchers have been completely ignoring half of what goes into every fault experience, the input profile (if the user does not notice a fault, I do not consider it experienced) . It’s incredibly difficult to figure out anything about the input profile, so it has been quietly ignored (one of the reasons why research papers on reported faults are such a waste of time).

I’m also missing an ‘interesting’ figure on the opening page of the chapter. Suggestions welcome.

I have not said much about source code characteristics. There is a chapter covering source code, perhaps some of this material will migrate to reliability.

All sorts of interesting bits and pieces have been added to earlier chapters. Ecosystems keeps growing and in years to come somebody will write a multi-volume tomb on software ecosystems.

I have been promised all sorts of data. Hopefully some of it will arrive.

As always, if you know of any interesting software engineering data, please tell me.

Source code chapter next.

Data-set update to “Empirical software engineering using R”

November 26th, 2017 No comments

The pile of papers, books and data-sets, relating to previously released draft chapters of my Empirical software engineering book, has been growing, and cluttering up my mind. I decided to have a clear-out.

A couple of things stood out.

There are around 25 data-sets that have been promised but not yet arrived. If you encounter anybody who mentions they promised to send me data, please encourage them to spend some time doing this. I don’t want to add a new category, promised but never delivered, to the list of email responses.

There has been an increase in data-sets not being used because I already have something better. This is a good sign, data quality is increasing. One consequence is that a growing number of ‘historical’ data-sets have fallen by the wayside. This is a good thing, most data-sets analysed in papers are very low quality and only used because nothing else was available.

One of my reasons for making draft releases was to prompt people to suggest data I had missed. This has not happened yet; come on people, suggest some data I don’t yet know about.

About a third of the pile got included in the latest draft, a third had been superseded by something better, and a third are still waiting for promised data.

Now, back to the reliability chapter.

Tags: , ,

Projects chapter added to “Empirical software engineering using R”

October 27th, 2017 2 comments

The Projects chapter of my Empirical software engineering book has been added to the draft pdf (download here).

This material turned out to be harder to bring together than I had expected.

Building software projects is a bit like making sausages in that you don’t want to know the details, or in this case those involved are not overly keen to reveal the data.

There are lots of papers on requirements, but remarkably little data (Soo Ling Lim’s work being the main exception).

There are lots of papers on effort prediction, but they tend to rehash the same data and the quality of research is poor (i.e., tweaking equations to get a better fit; no explanation of why the tweaks might have any connection to reality). I had not realised that Norden did all the heavy lifting on what is sometimes called the Putnam model; Putnam was essentially an evangelist. The Parr curve is a better model (sorry, no pdf), but lacked an evangelist.

Accurate estimates are unrealistic: lots of variation between different people and development groups, the client keeps changing the requirements and developer turnover is high.

I did turn up a few interesting data-sets and Rome came to the rescue in places.

I have been promised more data and am optimistic some will arrive.

As always, if you know of any interesting software engineering data, please tell me.

I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

Reliability chapter next.

Tags: , ,

Ecosystems chapter added to “Empirical software engineering using R”

July 17th, 2017 No comments

The Ecosystems chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I don’t seem to be able to get away from rewriting everything, despite working on the software engineering material for many years. Fortunately the sparsity of the data keeps me in check, but I keep finding new and interesting data (not a lot, but enough to slow me down).

There is still a lot of work to be done on the ecosystems chapter, not least integrating all the data I have been promised. The basic threads are there, they just need filling out (assuming the promised data sets arrive).

I did not get any time to integrate in the developer and economics data received since those draft chapters were released; there has been some minor reorganization.

As always, if you know of any interesting software engineering data, please tell me.

I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

Projects chapter next.

Tags: , ,

Warp your data to make it visually appealing

May 4th, 2017 No comments

Data plots can sometimes look very dull and need to be jazzed up a bit. Now, nobody’s suggesting that the important statistical properties of the data be changed, but wouldn’t it be useful if the points could be moved around a bit, to create something visually appealing without losing the desired statistical properties?

Readers have to agree that the plot below looks like fun. Don’t you wish your data could be made to look like this?

Rabbit image

Well, now you can (code here, inspired by Matejka and Fitzmaurice who have not released their code yet). It is also possible to thin-out the points, while maintaining the visual form of the original image.

The idea is to perturb the x/y position of very point by a small amount, such that the desired statistical properties are maintained to some level of accuracy:

check_prop=function(new_pts, is_x)
{
if (is_x)
   return(abs(myx_mean-stat_cond(new_pts)) < 0.01)
else
   return(abs(myy_mean-stat_cond(new_pts)) < 0.01)
}
 
 
mv_pts=function(pts)
{
repeat
   {
   new_x=pts$x+runif(num_pts, -0.01, 0.01)
   if (check_prop(new_x, TRUE))
      break()
   }
 
repeat
   {
   new_y=pts$y+runif(num_pts, -0.01, 0.01)
   if (check_prop(new_y, FALSE))
      break()
   }
 
return(data.frame(x=new_x, y=new_y))
}

The distance between the perturbed points and the positions of the target points then needs to be calculated. For each perturbed point its nearest neighbor in the target needs to be found and the distance calculated. This can be done in n log n using kd-trees and of course there is an R package, RANN, do to this (implemented in the nn2 function). The following code tries to minimize the sum of the distances, another approach is to minimize the mean distance:

mv_closer=function(pts)
{
repeat
   {
   new_pts=mv_pts(pts)
   new_dist=nn2(rabbit, new_pts, k=1)
   if (sum(new_dist$nn.dists) < cur_dist)
      {
      cur_dist <<- sum(new_dist$nn.dists)
      return(new_pts)
      }
   }
 
}

Now it’s just a matter of iterating lots of times, existing if the distance falls below some limit:

iter_closer=function(tgt_pts, src_pts)
{
cur_dist <<- sum(nn2(tgt_pts, src_pts, k=1)$nn.dists)
cur_pts=src_pts
for (i in 1:5000)
   {
   new_pts=mv_closer(cur_pts)
   cur_pts=new_pts
   if (cur_dist < 13)
      return(cur_pts)
   }
return(cur_pts)
}

This code handles a single statistical property. Matejka and Fitzmaurice spent more than an hour on their implementation, handle multiple properties and use simulated annealing to prevent being trapped in local minima.

An example, with original points in yellow:

Warp towards rabbit image

Enjoy.

Tags: , , , ,

Economics chapter added to “Empirical software engineering using R”

March 26th, 2017 No comments

The Economics chapter of my Empirical software engineering book has been added to the draft pdf (download here).

This is a slim chapter, it might grow a bit, but I suspect not by a huge amount. Reasons include lots of interesting data being confidential and me not having spent a lot of time on this topic over the years (so my stash of accumulated data is tiny). Also, a significant chunk of the economics data I have is used to discuss issues in the Ecosystems and Projects chapters, perhaps some of this material will migrate back once these chapters are finalized.

You might argue that Economics is more important than Human cognitive characteristics and should have appeared before it (in chapter order). I would argue that hedonism by those involved in producing software is the important factor that pushes (financial) economics into second place (still waiting for data to argue my case in print).

Some of the cognitive characteristics data I have been waiting for arrived, and has been added to this chapter (some still to be added).

As always, if you know of any interesting software engineering data, please tell me.

I am after a front cover. A woodcut of alchemists concocting a potion appeals, perhaps with various software references discretely included, or astronomy related (the obvious candidate has already been used). The related modern stuff I have seen does not appeal. Suggestions welcome.

Ecosystems next.

Tags: , ,

Empirical Software Engineering using R: first draft available for download

January 29th, 2017 4 comments

A draft of my book Empirical Software Engineering using R is now available for download.

The book essentially comes in two parts:

  • statistical techniques that are useful for analyzing software engineering data. This draft release contains most of the techniques I plan to cover. I am interested in hearing about any techniques you think ought to be covered, but I only cover techniques when real data is available to use in an example,
  • six chapters covering what I consider to be the primary aspects of software engineering. This draft release includes the Human Cognitive Characteristics chapter and I am hoping to release one each of the remaining chapters every few months (Economics is next).

There is a page for making suggestions and problem reports.

All the code+data is available and I am claiming to have a copy of all the important, publicly available, software engineering data. If you know of any I don’t have, please let me know.

I am looking for a publisher. The only publisher I have had serious discussions with decided not to go ahead because of my insistence of releasing a free copy of the pdf. Self-publishing is a last resort.

Tags: ,

Automatically generated join-the-dots images

December 16th, 2016 1 comment

It is interesting to try and figure out what picture emerges from a join-the-dots puzzle (connect-the-dots in some parts of the world). Let’s have a go at some lightweight automatic generation such a puzzle (some heavy-weight techniques).

If an image is available, expressed as an boolean matrix, R’s sample function can be used to select a small percentage of the black points.

Taking the output of the following equation:

x=seq(-4.7, 4.7, by=0.002)
 
y1 = c(1,-.7,.5)*sqrt(c(1.3, 2,.3)^2 - x^2) - c(.6,1.5,1.75)  # 3
y2 =0.6*sqrt(4 - x^2)-1.5/as.numeric(1.3 <= abs(x))           # 1
y3 = c(1,-1,1,-1,-1)*sqrt(c(.4,.4,.1,.1,.8)^2 -(abs(x)-c(.5,.5,.4,.4,.3))^2) - c(.6,.6,.6,.6,1.5) # 5
y4 =(c(.5,.5,1,.75)*tan(pi/c(4, 5, 4, 5)*(abs(x)-c(1.2,3,1.2,3)))+c(-.1,3.05, 0, 2.6))/
	as.numeric(c(1.2,.8,1.2,1) <= abs(x) & abs(x) <= c(3,3, 2.7, 2.7))                 # 4
y5 =(1.5*sqrt(x^2 +.04) + x^2 - 2.4) / as.numeric(abs(x) <= .3)                            # 1
y6 = (2*abs(abs(x)-.1) + 2*abs(abs(x)-.3)-3.1)/as.numeric(abs(x) <= .4)                    # 1
y7 =(-.3*(abs(x)-c(1.6,1,.4))^2 -c(1.6,1.9, 2.1))/
	as.numeric(c(.9,.7,.6) <= abs(x) & abs(x) <= c(2.6, 2.3, 2))                       # 3

and sampling 300 of the 20,012 points we get images such as the following:

Sampled rabbit image

A relatively large sample size is needed to reduce the possibility that a random selection fails to return any points within a significant area, but we do end up with many points clustered here and there.

library("plyr")
 
rab_points=adply(x, 1, function(X) data.frame(x=rep(X, 18), y=c(
	c(1, -0.7, 0.5)*sqrt(c(1.3, 2, 0.3)^2-X^2) - c(0.6, 1.5 ,1.75),
	0.6*sqrt(4 - X^2)-1.5/as.numeric(1.3 <= abs(X)),
	c(1, -1, 1, -1, -1)*sqrt(c(0.4, 0.4, 0.1, 0.1, 0.8)^2-(abs(X)-c(0.5, 0.5, 0.4, 0.4, 0.3))^2) - c(0.6, 0.6, 0.6, 0.6, 1.5),
	(c(0.5, 0.5, 1, 0.75)*tan(pi/c(4, 5, 4, 5)*(abs(X)-c(1.2, 3, 1.2, 3)))+c(-0.1, 3.05, 0, 2.6))/
		as.numeric(c(1.2, 0.8, 1.2, 1) <= abs(X) & abs(X) <= c(3,3, 2.7, 2.7)),
	(1.5*sqrt(X^2+0.04) + X^2 - 2.4) / as.numeric(abs(X) <= 0.3),
	(2*abs(abs(X)-0.1)+2*abs(abs(X)-0.3)-3.1)/as.numeric(abs(X) <= 0.4),
	(-0.3*(abs(X)-c(1.6, 1, 0.4))^2-c(1.6, 1.9, 2.1))/
		as.numeric(c(0.9, 0.7, 0.6) <= abs(X) & abs(X) <= c(2.6, 2.3, 2))
						)))
rab_points$X1=NULL
rb=subset(rab_points, (!is.na(x)) & (!is.na(y) & is.finite(y)))
 
x=sample.int(nrow(rb), 300)
plot(rb$x[x], rb$y[x],
	bty="n", xaxt="n", yaxt="n", pch=4, cex=0.5, xlab="", ylab="")

A more uniform image can produced by removing all points less than a given distance from some selected set of points. In this case the point in the first element is chosen, everything close to it removed and the the processed repeated with the second element (still remaining) and so on.

rm_nearest=function(jp)
{
keep=((dot_im$x[(jp+1):(jp+window_size)]-dot_im$x[jp])^2+
      (dot_im$y[(jp+1):(jp+window_size)]-dot_im$y[jp])^2) < min_dist
keep=c(keep, TRUE) # make sure which has something to return
return(jp+which(keep))
}
 
window_size=500
cur_jp=1
dot_im=rb
 
while (cur_jp <= nrow(dot_im))
   {
#   min_dist=0.05+0.50*runif(window_size)
   min_dist=0.05+0.30*runif(1)
   dot_im=dot_im[-rm_nearest(cur_jp), ]
   cur_jp=cur_jp+1
   }
 
plot(dot_im$x, dot_im$y,
	bty="n", xaxt="n", yaxt="n", pch=4, cex=0.5, xlab="", ylab="")

Since R supports vector operations I want to do everything without using loops or if-statements. Yes, there is a while loop :-(, alternative, simple, non-loop suggestions welcome.

Removing points with an average squared distance less than 0.3 and 0.5 we get (with around 135-155 points) the images:

Remove closest points rabbit image

I was going to come up with a scheme for adding numbers, perhaps I will do this in another post.

Click for more equations generating images.

Tags: , ,

Producing software for money and/or recognition

October 24th, 2016 No comments

In the commercial environment money makes the world go around, while in academia recognition (e.g., number of times your work is cited, being fawned over at conferences, impressive job titles) is the coin of the realm (there are a few odd balls who do it out of love for the subject or a desire to understand how things work, but modern academia is a large bureaucracy whose primary carrot is recognition).

There is an incentive problem for those writing software in academia; software does not attract much, if any, recognition.

Does the lack of recognition for writing software matter? Surely what counts are the research results, not the tools used to get there (be they writing software or doing mathematics).

A recent paper bemoans the lack of recognition for the development of Python packages for Astronomy researchers. Well, its too late now, they have written the software and everybody gets to make perfect copies for free.

What the authors of Astropy want, is for researchers who use this software to include a citation to it in any published papers. Do all 162 authors deserve equal credit? If a couple of people add a new package, should they get a separate citation? What if a new group of people take over maintenance, when should the citation switch over from the old authors/maintainers to the new ones? These are a couple of the thorny questions that need to be answered.

R is perhaps the most widely used academic developed software ecosystem. A small dedicated group of people has invested a lot of their time over many years to make something special. A lot more people have invested effort to create a wide variety of add-on packages.

The base R library includes the citation function, which returns the BibTeX information for a given package; ready to be added to a research papers work flow.

Both commercial and academic producers need to periodically create new versions to keep ahead of the competition, attract more customers and obtain income. While they both produce software to obtain ‘income’, commercial and academic software systems have different incentives when it comes to support for end users of the software.

Keeping existing customers happy is the way to get them to pay for upgrades and this means maintaining compatibility with what went before. Managers in commercial companies make sure that developers don’t break backwards compatibility (developers hate having to code around what went before and would love to throw it all away).

In the academic world it does not matter whether end users upgrade, as long as they cite the package, the version used is irrelevant; so there is a lot less pressure to keep backwards compatibility. Academics are supposed to create new stuff, they are researchers after all, so the incentives are pushing them to create brand new packages/systems to be seen as doing new stuff (and obtain a whole new round of citations). A good example is Hadley Wickham, who has created some great R packages, who seems to be continually moving on, e.g., reshape -> reshape2 -> tidyr (which is what any good academic is supposed to do).

The run-time performance of a system is something end users always complain about, but often get used to. The reason is invariably that there is little or no incentive to address this issue (for both commercial and academic systems). Microsoft Windows is slower than it need be and the R interpreter could go a lot faster (the design of the interpreter looks like something out of the 1980s; I’m seeing a lot of packages in R only, so the idea that R programs spend all their time executing in C/Fortran libraries may be out of date. Where is the incentive to use post-2000 designs?)

How many new versions of a software package can be produced before enough people stop being willing to pay for an update? How many different packages solving roughly the same problem can academics produce?

I don’t think producing new packages for income has a long term future.

Software engineering data sets

September 5th, 2016 No comments

The pretty pictures from my empirical software engineering book are now online, along with the 210 data sets and R code (330M).

Plotting the number of data sets in each year shows that empirical software engineering has really taken off in the last 10 years (code+data). Around dozen or so confidential data sets are not included; I am only writing about data that can be made public.

Number of data sets per year

It used to be rare to find the data associated with a paper on the author’s website. Of course, before around 1995 there was no web, but since around 2012 the idea has started to take off.

Contact via email goes back to 1985 and before that people sent mag tapes through the post and many years ago somebody sent me punched tape (there is nothing like seeing the bits with the naked eye).

I have sent several hundred emails asking for data and received 55 data sets. I’m hoping this release will spur those who have promised me data to invest some time to send it.

My experience is that research data often lives on laptops and dies when the laptop is replaced (a study of biologists, who have been collecting data for hundreds of years, found a data ‘death rate’ of 17% a year). Had I started actively collecting data before 2010 the red line in the plot would be much higher for earlier years; I often received data from authors when writing my C book at the start of the century (Google went from nothing to being the best place to search, while I wrote).

In nine cases I extracted the data, either from the pdf or an image and then reverse engineered values.

I have around 50 data sets waiting to be processed. Given that lots more are bound to arrive before the book is finished, I expect to easily reach the 300 mark. A tiny number given my aim of writing about all software engineering issues for which public data exists.

If you know of interesting software engineering data, that is not to be found in these plots, please let me know.

Tags: , ,