Archive

Posts Tagged ‘projects’

Huge effort data-set for project phases

November 3rd, 2017

I am becoming a regular reader of software engineering articles written in Chinese and Japanese; or to be more exact, I am starting to regularly page through pdfs looking at figures and tables of numbers, every now and again cutting-and-pasting sequences of logograms into Google translate.

A few weeks ago I saw the figure below, and almost fell off my chair; it’s from a paper by Yong Wang and Jing Zhang. These plots are based on data that is roughly an order of magnitude larger than the combined total of all the public data currently available on effort break-down by project phase.

Histogram and density plot of project phase efforts

Projects are often broken down into phases, e.g., requirements, design, coding (listed as ‘produce’ above), testing (listed as ‘optimize’), deployment (listed as ‘implement’), and managers are interested in knowing what percentage of a project’s budget is typically spent on each phase.

Projects that are safety-critical tend to have high percentage spends in the requirements and testing phases, while in fast-moving markets resources tend to be invested more heavily in coding and deployment.

Research papers on project effort usually use data from earlier papers. The small number of papers that provide their own data might list effort break-down for half-a-dozen projects, a few require readers to take their shoes and socks off to count, a small number go higher (one from the Rome period), but none get into three digits. I have maybe a few hundred such project phase effort numbers.

I emailed the first author and around a week later had 2,570 project phase effort (man-hours) percentages (his co-author was on marriage leave, which sounded a lot more important than my data request); see plot below (code+data).

Fraction of effort invested in each project phase

I have tried to fit some of the obvious candidate distributions to each phase, but none of the fits were consistently good across the phases (see code for details).
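The actual fitting code is in the linked code+data; as an illustration of the approach, here is a minimal method-of-moments fit of one obvious candidate, a Beta distribution (a natural choice for fractions bounded between 0 and 1), using synthetic data standing in for the real phase percentages. The function names and parameter values are mine, not from the paper.

```python
import random

def fit_beta_moments(xs):
    """Method-of-moments fit of a Beta(a, b) distribution to fractions in (0, 1)."""
    n = len(xs)
    m = sum(xs) / n                                  # sample mean
    v = sum((x - m) ** 2 for x in xs) / (n - 1)      # sample variance
    common = m * (1 - m) / v - 1                     # moment-matching factor
    return m * common, (1 - m) * common              # (a, b) estimates

# synthetic stand-in for the 2,570 phase-effort fractions (hypothetical data)
random.seed(1)
sample = [random.betavariate(2.0, 5.0) for _ in range(2570)]
a, b = fit_beta_moments(sample)
```

A moments fit is only a quick sanity check; for the real comparison across phases you would use maximum-likelihood fitting (e.g., fitdistrplus in R, or scipy.stats.beta.fit in Python) plus a goodness-of-fit test.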

This project phase data is from small projects, i.e., one person over a few months to ten-ish people over more than a year (a guess based on the total effort seen in other plots in the paper).

A typical problem with samples in software engineering is their small size (bug data is the exception; lots of it is available, at least in uncleaned form). Having a sample of this size means that it should be possible to have a reasonable level of confidence in the results of statistical tests. Now we just need to figure out some interesting questions to ask.


Projects chapter added to “Empirical software engineering using R”

October 27th, 2017

The Projects chapter of my Empirical software engineering book has been added to the draft pdf (download here).

This material turned out to be harder to bring together than I had expected.

Building software projects is a bit like making sausages in that you don’t want to know the details, or in this case those involved are not overly keen to reveal the data.

There are lots of papers on requirements, but remarkably little data (Soo Ling Lim’s work being the main exception).

There are lots of papers on effort prediction, but they tend to rehash the same data and the quality of research is poor (i.e., tweaking equations to get a better fit; no explanation of why the tweaks might have any connection to reality). I had not realised that Norden did all the heavy lifting on what is sometimes called the Putnam model; Putnam was essentially an evangelist. The Parr curve is a better model (sorry, no pdf), but lacked an evangelist.
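For context, Norden's contribution was the observation that project staffing tends to follow a Rayleigh curve. A minimal sketch of that profile (the parameterisation via peak-staffing time is a common one; the function names are mine):

```python
import math

def rayleigh_staffing(t, K, td):
    """Norden/Rayleigh staffing profile: effort per unit time at time t,
    for total project effort K and peak-staffing time td."""
    a = 1.0 / (2.0 * td * td)        # shape parameter implied by the peak time
    return 2.0 * K * a * t * math.exp(-a * t * t)

def cumulative_effort(t, K, td):
    """Closed-form integral of the profile: effort expended by time t."""
    a = 1.0 / (2.0 * td * td)
    return K * (1.0 - math.exp(-a * t * t))
```

One testable prediction of this model is that roughly 39% (i.e., 1 - e^-0.5) of total effort has been expended by the time staffing peaks.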

Accurate estimates are unrealistic: lots of variation between different people and development groups, the client keeps changing the requirements and developer turnover is high.

I did turn up a few interesting data-sets and Rome came to the rescue in places.

I have been promised more data and am optimistic some will arrive.

As always, if you know of any interesting software engineering data, please tell me.

I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

Reliability chapter next.


Failed projects + the Cloud = Software reuse

December 26th, 2016

Code reuse is one of those things that sounds like a winning idea to those outside of software development; those who write software for a living are happy to reuse other people's code but don't want the hassle involved with others reusing their own code. From the management point of view, where is the benefit in having your developers help others get their product out the door when they should be working towards getting your product out the door?

Lots of projects get canceled after significant chunks of software have been produced, some of it working. It would be great to get some return on this investment, but the likely income from selling software components is rarely large enough to make it worthwhile investing the necessary resources. The attractions of the next project soon appear more enticing than hanging around baby-sitting software from a cancelled project.

Cloud services, e.g., AWS and Azure to name two, look like they will have a big impact on code reuse. The components of a failed project, i.e., those bits that work tolerably well, can be packaged up as a service and sold/licensed to other companies using the same cloud provider. Companies are already offering a wide variety of third-party cloud services; presumably the new software got written because no equivalent service was available on the provider's cloud, and perhaps others are looking for just this service.

The upfront cost of sales is minimal: the services your re-purposed software provides get listed in various service directories. The software can just sit there waiting for customers to come along, or you could put some effort into drumming up customers. If sales pick up, it may become worthwhile offering support and even making enhancements.

What about the software built for non-failed projects? Software is a force multiplier and anybody working on a non-failed project wants to use this multiplier for their own benefit, not spend time making it available for others (I’m not talking about creating third-party APIs).

Student projects for 2016/2017

October 6th, 2016

This is the time of year when students have to come up with an idea for their degree project. I thought I would suggest a few interesting ideas related to software engineering.

  • The rise and fall of software engineering myths. For many years a lot of people (incorrectly) believed that there existed a 25-to-1 performance gap between the best/worst software developers (it's actually around 5 to 1). In 1999 Lutz Prechelt wrote a report explaining how this myth came about (somebody misinterpreted values in two tables and this misinterpretation caught on to become the dominant meme).

    Is the 25-to-1 myth still going strong or is it dying out? Can anything be done to replace it with something closer to reality?

    One of the constants used in the COCOMO effort estimation model is badly wrong. Has anybody else noticed this?

  • Software engineering papers often contain trivial mathematical mistakes; these can be caused by cut-and-paste errors, or by mistakenly using values from one study in calculations for another study. Simple consistency checks can catch a surprising number of mistakes, e.g., the quote "8 subjects aged between 18-25, average age 21.3" may be incorrect: ages must sum to a whole number, but 21.3*8 == 170.4, and the nearby whole-number sums 169, 170 and 171 give averages of 21.125, 21.25 and 21.375, none of which is exactly 21.3.

    Psychologists are already on the case (see "Content Mining Psychology Articles for Statistical Test Results") and there is a tool, statcheck, for automating some of the checks.

    What checks would be useful for software engineering papers? There are tools available for taking pdf files apart, e.g., qpdf and pdfgrep, and for extracting table contents.
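A minimal sketch of such a mean-consistency check (in the spirit of the GRIM test used by the psychologists; the function name and rounding convention are my choices, and note that Python's round() rounds halves to even, so 21.25 rounds to 21.2):

```python
def mean_is_consistent(reported_mean, n, decimals=1):
    """Return True if some whole-number sum of n values could round
    to the reported mean at the given number of decimal places."""
    target = round(reported_mean, decimals)
    approx_sum = reported_mean * n
    # try the whole-number sums closest to the implied total
    for total in (int(approx_sum) - 1, int(approx_sum), int(approx_sum) + 1):
        if round(total / n, decimals) == target:
            return True
    return False
```

For the quoted example, mean_is_consistent(21.3, 8) returns False: none of the candidate sums produce a mean that rounds to 21.3 under this rounding convention.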

  • What bit manipulation algorithms does a program use? One way of finding out is to look at the hexadecimal literals in the source code. For instance, source containing 0x33333333, 0x55555555, 0x0F0F0F0F and 0x0000003F in close proximity is likely to be counting the number of bits that are set in a 32-bit value.

    Jörg Arndt has a great collection of bit twiddling algorithms from which hex values can be extracted. The numbers tool used a database of floating-point values to try and figure out what numeric algorithms source contains; I’m sure there are better algorithms for figuring this stuff out, given the available data.
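As a concrete illustration, those four constants appear together in the classic parallel bit-count (this is the textbook algorithm, not taken from any particular program's source):

```python
def popcount32(x):
    """Count the set bits in a 32-bit value using the constants quoted above."""
    x = x - ((x >> 1) & 0x55555555)                  # pairwise 2-bit sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit sums
    x = x + (x >> 8)                                 # 16-bit sums
    x = x + (x >> 16)                                # 32-bit sum
    return x & 0x0000003F                            # the count fits in 6 bits
```

Spotting these constants via a simple grep over a code base would flag candidate files for closer inspection.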

Feel free to add suggestions in the comments.
