Background to my book project “Empirical Software Engineering with R”

This post provides background information that can be referenced by future posts.

For the last 18 months I have been working in fits and starts on a book that has the working title “Empirical Software Engineering with R”. The idea is to provide broad coverage of software engineering issues from an empirical perspective (i.e., the discussion is driven by the analysis of measurements obtained from experiments). R was chosen for the statistical analysis because it is becoming the de facto language of choice in a wide range of disciplines and lots of existing books provide example analyses using R, so I am going with the crowd.

While my last book took five years to write, I had a fixed target, a template to work to and a reasonably firm grasp of the subject. Empirical software engineering has only really just started, the time interval between new and interesting results appearing is quite short and nobody really knows what statistical techniques are broadly applicable to software engineering problems (while the normal distribution is the mainstay of the social sciences, a quick scan of software engineering data finds few occurrences of this distribution).

The book is being driven by the empirical software engineering rather than the statistics; that is, I take a topic in software engineering and analyse the results of an experiment investigating that topic, providing pointers to where readers can find out more about the statistical techniques used (once I know which techniques crop up a lot I will write my own general introduction to them).

I’m assuming that readers have a reasonable degree of numeric literacy, are happy dealing with probabilities and have a rough idea about statistical ideas. I’m hoping to come up with a workable check-list that readers can use to figure out what statistical techniques are applicable to their problem; we will see how well this pans out after I have analysed lots of diverse data sets.

Rather than wait a few more years before I can make a complete draft available for review, I have decided to switch to making individual parts available as they are written, i.e., after writing a draft discussion and analysis of each experiment I will publish it on this blog (along with the raw data and R code used in the analysis). My reasons for doing this are:

  • Reader feedback (I hope I get some) will help me get a better understanding of what people are after from a book covering empirical software engineering from a statistical analysis of data perspective.
  • Suggestions for topics to cover. I am being very strict and only covering topics for which I have empirical data and can make that data available to readers. So if you want me to cover a topic, please point me to some data. I will publish a list of important topics for which I currently don’t have any data; hopefully somebody will point me at data that can be used.
  • Posting here will help me stay focused on getting this thing done.

Links to book-related posts

Distribution of uptimes for high-performance computing systems

Break even ratios for development investment decisions

Agreement between code readability ratings given by students

Changes in optimization performance of gcc over time

Descriptive statistics of some Agile feature characteristics

Impact of hardware characteristics on detectable fault behavior

Prioritizing project stakeholders using social network metrics

Preferential attachment applied to frequency of accessing a variable

Amount of end-user usage of code in Firefox

How many ways of programming the same specification?

Ways of obtaining empirical data in software engineering

What is the error rate for published mathematical proofs?

Changes in the API/non-API method call ratio with program size

Honking the horn for go faster memory (over go faster cpus)

How to avoid being a victim of Brooks’ law

Evidence for the benefits of strong typing, where is it?

Hardware variability may be greater than algorithmic improvement

Extracting the original data from a heatmap image

Entropy: Software researchers go to topic when they have no idea what else to talk about

Debian has cast iron rules for package growth & death

Joke: Student subjects in software engineering experiments

  1. June 22, 2012 06:10 | #1

    I’ve been out of the software game for a while (I left and went to grad school) but have been in the statistics game for the same ‘while’. This is a fascinating topic that I would love to know more about, specifically the two topics you mention: 1) What are the distributions (either theoretical or empirical) of user behavior that an engineer can take advantage of? and 2) What are good strategies for designing experiments (that will not alienate your customers/clients)? I’d love to see more!

  2. June 22, 2012 11:10 | #2

    Frank,

    Many source code measurements show an exponential/power law like distribution and using extreme value statistics for fault analysis is on my list of things to look at. In my C book I argue (pdf page 109) that software development expertise does not really exist in the sense that other domains understand expertise, so developer behavior is random punctuated by a few naturally talented individuals.

    In terms of general behavior I think that one important characteristic that is rarely made use of is people’s ability to improve with practice. So doing everything the same way, rather than doing everything differently because it is ‘interesting’, can have big payoffs.

  3. MichaelBerkowitz
    February 25, 2013 07:43 | #3

    Derek,

    I happened on your blog while looking around for empirical data about various aspects of software-engineering practice. So far I’ve found precious little else out there, and I’m wondering whether I’ve now hit the tip of the iceberg or the iceberg itself, so perhaps you can help me:

    Although many of the topics mentioned here interest me, at the moment I’m specifically looking for empirical data on the supposed benefits of “modern” techniques such as OOA/D/P, Agile development and ORMs. If you can direct me to any of that I’ll be grateful.

    Thanks in advance.

    Michael

  4. February 25, 2013 10:27 | #4

    @MichaelBerkowitz
    As far as I can tell you have more or less hit the iceberg. Jørgensen has done some very interesting work that I have not written about yet; most other researchers seem to be doing unimaginative stuff with small data sets that are very old.

    Do let me know if you manage to find any solid data.

  5. MichaelBerkowitz
    February 25, 2013 12:55 | #5

    @Derek Jones
    Gladly, but I can’t say I’m hopeful…

  6. Clifford Dibble
    October 27, 2019 02:27 | #6

    WRT “Many source code measurements show an exponential/power law like distribution and using extreme value statistics for fault analysis is on my list of things to look at.” => Please see Erik Bernhardsson’s “Why software projects take longer than you think – a statistical model” at

    https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html

  7. October 28, 2019 00:56 | #7

    @Clifford Dibble
    Thanks for the link. The data used comes from an article I co-authored, and the post ignores most of the estimates (i.e., all estimates of seven and below), so it probably gives a skewed view (I have not checked).

    I’ve just posted an update to the Projects chapter of my book, which contains some new data (from other papers).

    If you find any people with data, please let them know I am offering a free analysis, provided they are willing to make the data public (in anonymized form).
