
Evidence-based software engineering: book released

My book, Evidence-based software engineering, is now available; the pdf can be downloaded here, here, and here, plus all the code+data. Report any issues here. I’m investigating the possibility of a printed version. There is also a mobile-friendly pdf (layout shaky in places).

The original goals of the book, from 10 years ago, have been met, i.e., to discuss what is currently known about software engineering, based on an analysis of all the publicly available software engineering data, and to make the pdf+data+code freely available for download. The definition of “all the public data” started out as being “all”, but as larger and higher quality datasets were discovered, the corresponding smaller, lower quality datasets were ignored.

The intended audience has always been software developers and their managers. Some experience of building software systems is assumed.

How much data is there? The data directory contains 1,142 csv files and 985 R files; the book cites 895 papers that have data available, of which 556 are cited in figure captions; there are 628 figures. I am currently quoting the figure of 600+ for the ‘amount of data’.
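
For anyone wanting to reproduce these counts from the downloaded code+data, here is a minimal R sketch; it assumes the csv and R files all sit under a single directory called ESEUR-data, which is a placeholder name, so adjust the path to match the layout of the actual download:

    # Count the csv and R files in the downloaded data directory.
    # "ESEUR-data" is a placeholder path; use the directory created by the download.
    csv_files <- list.files("ESEUR-data", pattern = "\\.csv$", recursive = TRUE, ignore.case = TRUE)
    r_files   <- list.files("ESEUR-data", pattern = "\\.R$",   recursive = TRUE)
    cat(length(csv_files), "csv files and", length(r_files), "R files\n")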


[Cover image of the book Evidence-based software engineering.]

Things that might be learned from the analysis have been discussed in previous posts on the chapters: Human cognition, Cognitive capitalism, Ecosystems, Projects, and Reliability.

The analysis of the available data is like a join-the-dots puzzle, except that the 600+ dots are not numbered, some of them are actually specks of dust, and many dots are likely to be missing. The future of software engineering research is joining the dots to build an understanding of the processes involved in building and maintaining software systems; work is also needed to replicate some of the dots, to confirm that they are not specks of dust, and to discover missing dots.

Some missing dots are very important. For instance, there is almost no data on software use, but there can be lots of data on fault experiences. Without software usage data it is not possible to estimate whether the software is very reliable (i.e., few faults experienced per amount of use), or very unreliable (i.e., many faults experienced per amount of use).
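
As a toy illustration of why usage data matters (the numbers below are invented for the example, not taken from any dataset in the book): two systems with identical fault counts can sit at opposite ends of the reliability scale once usage is taken into account.

    # Invented numbers: identical fault counts, very different amounts of use.
    faults_experienced <- c(sys_A = 50,  sys_B = 50)    # reported fault experiences
    hours_of_use       <- c(sys_A = 100, sys_B = 1e6)   # usage data, rarely available

    # Faults experienced per hour of use.
    faults_experienced / hours_of_use
    # sys_A: 0.5 faults/hour (very unreliable); sys_B: 0.00005 faults/hour (very reliable)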

The book treats the creation of software systems as an economically motivated cognitive activity occurring within one or more ecosystems. Algorithms are now commodities and are not discussed. The labour of the cognitariate is the means of production of software systems, and this is the focus of the discussion.

Existing books treat the creation of software as a craft activity, with developers applying the skills and know-how acquired through personal practical experience. The craft approach has survived because building software systems has been a seller’s market: customers have paid what it takes because the potential benefits have been so much greater than the costs.

Is software development shifting from being a seller’s market to a buyer’s market? In a competitive market for development work and staff, paying people to learn from mistakes that have already been made by many others is an unaffordable luxury; an engineering approach, derived from evidence, is a lot more cost-effective than craft development.

As always, if you know of any interesting software engineering data, please let me know.

  1. November 9, 2020 07:46 | #1

    Hi, Derek! I’m reading your book. It’s very interesting, both in the cases described and the data used. Here are some cases of mine, as an open data science researcher: https://rpubs.com/alex-lev/553777, https://rpubs.com/alex-lev/229888.

  2. November 13, 2020 01:43 | #2

    Hi Derek, what a cool idea, thanks for creating this! My initial contribution to software engineering data can be found at: https://www.gitclear.com/line_impact_factors#line_impact_distribution. I’ve spent the last 5 years iterating on an algorithm that measures the rate of repo evolution on a per-committer basis. The data we publish on the linked page is high level, but if you’d like to dig into more granular details, we’ve got gigabytes of data accumulated that could be shared.

    I’m currently trading emails with a couple of the often-cited Comp Sci profs (e.g., Dr. Alain Abran) who specialize in studying developer productivity patterns. The long-term goal is to get a validated academic paper published that could answer questions like
    * How is the rate of repo evolution impacted by the number of developers on a product? Our preliminary findings are that developer throughput decays exponentially with each new dev added to a project
    * How does a developer’s output pattern change as they become more experienced? The preliminary finding is that more senior developers spend substantially more time deleting and revising legacy code, whereas new developers spend a lot more time adding code
    * How can tech debt be quantified on a per-directory basis, and what are the long-term implications of letting tech debt linger?

    If any of this stuff sounds interesting to you, drop me an email and maybe we could collaborate?

  3. November 13, 2020 02:39 | #3

    @Bill Harding
    Thanks for your offer.

    There is lots of work done analysing source code repos because they exist and tools can be written to process them. This analysis tells one side of a multi-sided story.

    What is the work profile of the developers working on source code? For instance, are they spending most of their time coding, or coding and testing, or are they slowly moving into management, etc.?

    Getting timesheet information about what people are doing is very hard. It is often not kept for very long, and companies are not keen to make it available (at least publicly).

    Why does the number of commits made by a developer tend to fall over time? Have they moved into management, or has the project matured so that it’s all small tweaks to keep the users happy? Answering these questions requires lots of hands-on analysis (which rarely gets done).

  4. A nony mouse
    November 13, 2020 04:57 | #4

    Page 16: equation simplifies to P(D|S) = P(D) not P(DS)

  5. November 13, 2020 10:56 | #5

    @A nony mouse
    Oops, fixed (⌒_⌒;) (I’m told that is the ascii for embarrassed smiley)

  6. Kealan
    November 15, 2020 11:07 | #6

    Hi, it looks like the 939 source here is broken; I will post the link you sent me to:

    https://github.com/Derek-Jones/ESEUR

    It 404’s

  7. November 15, 2020 16:48 | #7

    @Kealan
    I have searched for broken links, but cannot find any. Can you be more specific about which link you clicked on? Thanks.

  8. Nemo
    November 16, 2020 16:03 | #8

    I have sent your ESEUR link to a few embedded-s/w managers to gauge their opinions.

  9. November 16, 2020 16:23 | #9

    @Nemo
    Thanks. They might be interested in figure 7.30, probably the most interesting embedded developer data I have (not that I have very much from this huge unresearched field).
