Archive

Author Archive

Burger flippers with STEM degrees

May 30th, 2017 2 comments

There continues to be a lot of fuss over the shortage of STEM staff (Science, technology, engineering, and mathematics). But analysis of the employment data suggests there is no STEM shortage.

One figure than jumps out is the unemployment rate for computing graduates, why is this rate of so consistently much higher than other STEM graduates, 12% vs. under 9% for the rest?

Based on my somewhat limited experience of sitting on industrial panels in university IT departments (the intended purpose of such panels is to provide industry input, but in practice we are there to rubber stamp what the department has already decided to do) and talking with recent graduates, I would explain the situation described above was follows:

The dynamics from the suppliers’ side (i.e., the Universities) is that students want a STEM degree, students are the customer (i.e., they are paying) and so the university had better provide degrees that the customer can pass (e.g., minimise the maths content and make having to learn how to program optional on computing courses). Students get a STEM degree, but those taking the ‘easy’ route are unemployable in STEM jobs.

There is not a shortage of people with STEM degrees, but there is a shortage of people with STEM degrees who can walk the talk.

Today’s burger flippers have STEM degrees that did not require students to do any serious maths or learn to program.

Tags:

Evidence for 28 possible compilers in 1957

May 21st, 2017 2 comments

In two earlier posts I discussed the early compilers for languages that are still widely used today and a report from 1963 showing how nothing has changed in programming languages

The Handbook of Automation Computation and Control Volume 2, published in 1959, contains some interesting information. In particular Table 23 (below) is a list of “Automatic Coding Systems” (containing over 110 systems from 1957, or which 54 have a cross in the compiler column):

Computer System Name or      Developed by        Code M.L. Assem Inter Comp Oper-Date Indexing Fl-Pt Symb. Algeb.
           Acronym   
IBM 704 AFAC                 Allison G.M.         C                      X    Sep 57    M2       M    2      X
        CAGE                 General Electric                 X          X    Nov 55    M2       M    2
        FORC                 Redstone Arsenal                            X    Jun 57    M2       M    2      X
        FORTRAN              IBM                  R                      X    Jan 57    M2       M    2      X
        NYAP                 IBM                              X               Jan 56    M2       M    2
        PACT IA              Pact Group                                  X    Jan 57    M2       M    1
        REG-SYMBOLIC         Los Alamos                       X               Nov 55    M2       M    1
        SAP                  United Aircraft      R           X               Apr 56    M2       M    2 
        NYDPP                Servo Bur. Corp.                 X               Sep 57    M2       M    2 
        KOMPILER3            UCRL Livermore                              X    Mar 58    M2       M    2      X
IBM 701 ACOM                 Allison G.M.         C                X          Dec 54    S1       S    0
        BACAIC               Boeing Seattle       A           X          X    Jul 55             S    1      X
        BAP                  UC Berkeley                X     X               May 57                  2
        DOUGLAS              Douglas SM                       X               May 53             S    1
        DUAL                 Los Alamos                 X          X          Mar 53             S    1
        607                  Los Alamos                       X               Sep 53                  1
        FLOP                 Lockheed Calif.            X     X    X          Mar 53             S    1 
        JCS 13               Rand Corp.                       X               Dec 53                  1
        KOMPILER 2           UCRL Livermore                              X    Oct 55    S2            1      X
        NAA ASSEMBLY         N. Am. Aviation                       X
        PACT I               Pact Groupb          R                      X    Jun 55    S2            1
        QUEASY               NOTS Inyokern                         X          Jan 55             S
        QUICK                Douglas ES                            X          Jun 53             S    0
        SHACO                Los Alamos                            X          Apr 53             S    1
        SO 2                 IBM                              X               Apr 53                  1
        SPEEDCODING          IBM                  R           X    X          Apr 53    S1       S    1
IBM 705-1, 2 ACOM            Allison G.M.         C                X          Apr 57    S1            0
        AUTOCODER            IBM                  R    X      X          X    Dec 56             S    2
        ELI                  Equitable Life       C                X          May 57    S1            0
        FAIR                 Eastman Kodak                         X          Jan 57             S    0
        PRINT I              IBM                  R    X      X    X          Oct 56    82       S    2
        SYMB. ASSEM.         IBM                              X               Jan 56             S    1
        SOHIO                Std. Oil of Ohio          X      X    X          May 56    S1       S    1
        FORTRAN              IBM-Guide            A                      X    Nov 58    S2       S    2      X
        IT                   Std. Oil of Ohio     C                      X              S2       S    1      X
        AFAC                 Allison G.M.         C                      X              S2       S    2      X
IBM 705-3 FORTRAN            IBM-Guide            A                      X    Dec 58    M2       M    2      X
        AUTOCODER            IBM                  A           X          X    Sep 58             S    2
IBM 702 AUTOCODER            IBM                       X      X          X    Apr 55             S    1
        ASSEMBLY             IBM                              X               Jun 54                  1
        SCRIPT G. E.         Hanford              R    X      X    X     X    Jul 55    Sl       S    1 
IBM 709 FORTRAN              IBM                  A                      X    Jan 59    M2       M    2      X
        SCAT                 IBM-Share            R           X          X    Nov 58    M2       M    2
IBM 650 ADES II              Naval Ord. Lab                              X    Feb 56    S2       S    1      X
        BACAIC               Boeing Seattle       C           X    X     X    Aug 56             S    1      X
        BALITAC              M.I.T.                    X      X          X    Jan 56    Sl            2
        BELL L1              Bell Tel. Labs            X           X          Aug 55    Sl       S    0
        BELL L2,L3           Bell Tel. Labs            X           X          Sep 55    Sl       S    0
        DRUCO I              IBM                                   X          Sep 54             S    0
        EASE II              Allison G.M.                     X    X          Sep 56    S2       S    2
        ELI                  Equitable Life       C                X          May 57    Sl            0
        ESCAPE               Curtiss-Wright                   X    X     X    Jan 57    Sl       S    2
        FLAIR                Lockheed MSD, Ga.         X           X          Feb 55    Sl       S    0
        FOR TRANSIT          IBM-Carnegie Tech.   A                      X    Oct 57    S2       S    2      X
        IT                   Carnegie Tech.       C                      X    Feb 57    S2       S    1      X
        MITILAC              M.I.T.                    X           X          Jul 55    Sl       S    2
        OMNICODE             G. E. Hanford                         X     X    Dec 56    Sl       S    2
        RELATIVE             Allison G.M.                     X               Aug 55    Sl       S    1
        SIR                  IBM                                   X          May 56             S    2
        SOAP I               IBM                              X               Nov 55                  2
	SOAP II              IBM                  R           X               Nov 56    M        M    2
        SPEED CODING         Redstone Arsenal          X           X          Sep 55    Sl       S    0
        SPUR                 Boeing Wichita            X      X    X          Aug 56    M        S    1
        FORTRAN (650T)       IBM                  A                      X    Jan 59    M2       M    2
Sperry Rand 1103A COMPILER I  Boeing Seattle                  X          X    May 57             S    1      X
        FAP                  Lockheed MSD              X           X          Oct 56    Sl       S    0
        MISHAP               Lockheed MSD                     X               Oct 56    M1       S    1
        RAWOOP-SNAP          Ramo-Wooldridge                  X    X          Jun 57    M1       M    1
        TRANS-USE            Holloman A.F.B.                  X               Nov 56    M1       S    2
        USE                  Ramo-Wooldridge      R           X          X    Feb 57    M1       M    2
        IT                   Carn. Tech.-R-W      C                      X    Dec 57    S2       S    1      X
        UNICODE              R Rand St. Paul      R                      X    Jan 59    S2       M    2      X
Sperry Rand 1103 CHIP        Wright A.D.C.             X           X          Feb 56    S1       S    0
        FLIP/SPUR            Convair San Diego         X           X          Jun 55    SI       S    0
        RAWOOP               Ramo-Wooldridge      R           X               Mar 55    S1            1
        8NAP                 Ramo-Wooldridge      R           X    X          Aug 55    S1       S    1
Sperry Rand Univac I and II AO Remington Rand          X      X          X    May 52    S1       S    1
        Al                   Remington Rand            X      X          X    Jan 53    S1       S    1
        A2                   Remington Rand            X      X          X    Aug 53    S1       S    1
        A3,ARITHMATIC        Remington Rand       C    X      X          X    Apr 56    SI       S    1
        AT3,MATHMATIC        Remington Rand       C           X          X    Jun 56    SI       S    2      X
        BO,FLOWMATIC         Remington Rand       A    X      X          X    Dec 56    S2       S    2
        BIOR                 Remington Rand            X      X          X    Apr 55                  1
        GP                   Remington Rand       R    X      X          X    Jan 57    S2       S    1
        MJS (UNIVAC I)       UCRL Livermore            X      X               Jun 56                  1
        NYU,OMNIFAX          New York Univ.                              X    Feb 54             S    1
        RELCODE              Remington Rand            X      X               Apr 56                  1
        SHORT CODE           Remington Rand            X           X          Feb 51             S    1
        X-I                  Remington Rand       C    X      X               Jan 56                  1
        IT                   Case Institute       C                      X              S2       S    1      X
        MATRIX MATH          Franklin Inst.                              X    Jan 58    
Sperry Rand File Compo ABC   R Rand St. Paul                                  Jun 58
Sperry Rand Larc K5          UCRL Livermore                   X          X              M2       M    2      X
        SAIL                 UCRL Livermore                   X                         M2       M    2
Burroughs Datatron 201, 205 DATACODEI Burroughs                          X    Aug 57    MS1      S    1
        DUMBO                Babcock and Wilcox                    X     X   
        IT                   Purdue Univ.         A                      X    Jul 57    S2       S    1      X
        SAC                  Electrodata               X      X               Aug 56             M    1
        UGLIAC               United Gas Corp.                      X          Dec 56             S    0
                               Dow Chemical                        X
        STAR                 Electrodata                      X
Burroughs UDEC III UDECIN-I  Burroughs                             X              57    M/S       S   1
        UDECOM-3             Burroughs                                   X        57    M         S   1
M.I.T. Whirlwind ALGEBRAIC   M.I.T.               R                      X              S2        S   1      X
        COMPREHENSIVE        M.I.T.                    X      X    X          Nov 52    Sl        S   1
        SUMMER SESSION       M.I.T.                                X          Jun 53    Sl        S   1
Midac   EASIAC               Univ. of Michigan                     X     X    Aug 54    SI        S
        MAGIC                Univ. of Michigan         X      X          X    Jan 54    Sl        S
Datamatic ABC I              Datamatic Corp.                             X   
Ferranti TRANSCODE           Univ. of Toronto     R           X    X     X    Aug 54    M1        S
Illiac DEC INPUT             Univ. of Illinois    R           X               Sep 52    SI        S
Johnniac EASY FOX            Rand Corp.           R           X               Oct 55              S
Norc NORC COMPILER           Naval Ord. Lab                   X          X    Aug 55    M2        M
Seac BASE 00                 Natl. Bur. Stds.          X           X
        UNIV. CODE           Moore School                                X    Apr 55

Chart Symbols used:

Code
R = Recommended for this computer, sometimes only for heavy usage.
C = Common language for more than one computer.
A = System is both recommended and has common language.
 
Indexing
M = Actual Index registers or B boxes in machine hardware.
S = Index registers simulated in synthetic language of system.
1 = Limited form of indexing, either stopped undirectionally or by one word only, or having
certain registers applicable to only certain variables, or not compound (by combination of
contents of registers).
2 = General form, any variable may be indexed by anyone or combination of registers which may
be freely incremented or decremented by any amount.
 
Floating point
M = Inherent in machine hardware.
S = Simulated in language.
 
Symbolism
0 = None.
1 = Limited, either regional, relative or exactly computable.
2 = Fully descriptive English word or symbol combination which is descriptive of the variable
or the assigned storage.
 
Algebraic
A single continuous algebraic formula statement may be made. Processor has mechanisms for
applying associative and commutative laws to form operative program.
 
M.L. = Machine language.
Assem. = Assemblers.
Inter. = Interpreters.
Compl. = Compilers.

Are the compilers really compilers as we know them today, or is this terminology that has not yet settled down? The computer terminology chapter refers readers interested in Assembler, Compiler and Interpreter to the entry for Routine:

Routine. A set of instructions arranged in proper sequence to cause a computer to perform a desired operation or series of operations, such as the solution of a mathematical problem.

Compiler (compiling routine), an executive routine which, before the desired computation is started, translates a program expressed in pseudo-code into machine code (or into another pseudo-code for further translation by an interpreter).

Assemble, to integrate the subroutines (supplied, selected, or generated) into the main routine, i.e., to adapt, to specialize to the task at hand by means of preset parameters; to orient, to change relative and symbolic addresses to absolute form; to incorporate, to place in storage.

Interpreter (interpretive routine), an executive routine which, as the computation progresses, translates a stored program expressed in some machine-like pseudo-code into machine code and performs the indicated operations, by means of subroutines, as they are translated. …”

The definition of “Assemble” sounds more like a link-load than an assembler.

When the coding system has a cross in both the assembler and compiler column, I suspect we are dealing with what would be called an assembler today. There are 28 crosses in the Compiler column that do not have a corresponding entry in the assembler column; does this mean there were 28 compilers in existence in 1957? I can imagine many of the languages being very simple (the fashionability of creating programming languages was already being called out in 1963), so producing a compiler for them would be feasible.

The citation given for Table 23 contains a few typos. I think the correct reference is: Bemer, Robert W. “The Status of Automatic Programming for Scientific Problems.” Proceedings of the Fourth Annual Computer Applications Symposium, 107-117. Armour Research Foundation, Illinois Institute of Technology, Oct. 24-25, 1957.

Programming Languages: nothing changes

May 14th, 2017 No comments

While rummaging around today I came across: Programming Languages and Standardization in Command and Control by J. P. Haverty and R. L. Patrick.

Much of this report could have been written in 2013, but it was actually written fifty years earlier; the date is given away by “… the effort to develop programming languages to increase programmer productivity is barely eight years old.”

Much of the sound and fury over programming languages is the result of zealous proponents touting them as the solution to the “programming problem.”

I don’t think any major new sources of sound and fury have come to the fore.

… the designing of programming languages became fashionable.

Has it ever gone out of fashion?

Now the proliferation of languages increased rapidly as almost every user who developed a minor variant on one of the early languages rushed into publication, with the resultant sharp increase in acronyms. In addition, some languages were designed practically in vacuo. They did not grow out of the needs of a particular user, but were designed as someone’s “best guess” as to what the user needed (in some cases they appeared to be designed for the sake of designing).

My post on this subject was written 49 years later.

…a computer user, who has invested a million dollars in programming, is shocked to find himself almost trapped to stay with the same computer or transistorized computer of the same logical design as his old one because his problem has been written in the language of that computer, then patched and repatched, while his personnel has changed in such a way that nobody on his staff can say precisely what the data processing Job is that his machine is now doing with sufficient clarity to make it easy to rewrite the application in the language of another machine.

Vendor lock-in has always been good for business.

Perhaps the most flagrantly overstated claim made for POLs is that they result in better understanding of the programming operation by higher-level management.

I don’t think I have ever heard this claim made. But then my programming experience started a bit later than this report.

… many applications do not require the services of an expert programmer.

Ssshh! Such talk is bad for business.

The cost of producing a modern compiler, checking it out, documenting it, and seeing it through initial field use, easily exceeds $500,000.

For ‘big-company’ written compilers this number has not changed much over the years. Of course man+dog written compilers are a lot cheaper and new companies looking to aggressively enter the market can spend an order of magnitude more.

In a young rapidly growing field major advances come so quickly or are so obvious that instituting a measurement program is probably a waste of time.

True.

At some point, however, as a field matures, the costs of a major advance become significant.

Hopefully this point in time has now arrived.

Language design is still as much an art as it is a science. The evaluation of programming languages is therefore much akin to art criticism–and as questionable.

Calling such a vanity driven activity an ‘art’ only helps glorify it.

Programming languages represent an attack on the “programming problem,” but only on a portion of it–and not a very substantial portion.

In fact, it is probably very insubstantial.

Much of the “programming problem” centers on the lack of well-trained experienced people–a lack not overcome by the use of a POL.

Nothing changes.

The following table is for those of you complaining about how long it takes to compile code these days. I once had to compile some Coral 66 on an Elliot 903, the compiler resided in 5(?) boxes of paper tape, one box per compiler pass. Compilation involved: reading the paper tape containing the first pass into the computer, running this program and then reading the paper tape containing the program source, then reading the second paper tape containing the next compiler pass (there was not enough memory to store all compiler passes at once), which processed the in-memory output of the first pass; this process was repeated for each successive pass, producing a paper tape output of the compiled code. I suspect that compiling on the machines listed gave the programmers involved plenty of exercise and practice splicing snapped paper-tape.

Computer   Cobol statements   Machine instructions   Compile time (mins)
UNIVAC II     630                    1,950                240
RCA 501       410                    2,132                 72
GE 225        328                    4,300                 16
IBM 1410      174                      968                 46
NCR 304       453                      804                 40

What is empirical software engineering?

May 12th, 2017 No comments

Writing a book about empirical software engineering requires making decisions about which subjects to discuss and what to say about them.

The obvious answer to “which subjects” is to include everything that practicing software engineers do, when working on software systems; there are some obvious exclusions like traveling to work. To answer the question of what software engineers do, I have relied on personal experience, my own and what others have told me. Not ideal, but it’s the only source of information available to me.

The foundation of an empirical book is data, and I only talk about subjects for which public data is available. The data requirement is what makes writing this book practical; there is not a lot of data out there. This lack of data has allowed me to avoid answering some tricky questions about whether something is, or is not, part of software engineering.

What approach should be taken to discussing the various subjects? Economics, as in cost/benefit analysis, is the obvious answer.

In my case, I am a developer and the target audience is developers, so the economic perspective is from the vendor side, not the customer side (e.g., maximizing profit, not minimizing cost or faults). Most other books have a customer oriented focus, e.g., high reliability is important (with barely a mention of the costs involved); this focus on a customer oriented approach comes from the early days of computing where the US Department of Defense funded a lot of software research driven by their needs as a customer.

Again the lack of data curtails any in depth analysis of the economic issues involved in software development. The data is like a patchwork of islands, the connections between them are under water and have to be guessed at.

Yes, I am writing a book on empirical software engineering that contains the sketchiest of outlines on what empirical software engineering is all about. But if the necessary data was available, I would not live long enough to get close to completing the book.

What of the academic take on empirical software engineering? Unfortunately the academic incentive structure strongly biases against doing work of interest to industry; some of the younger generation do address industry problems, but they eventually leave or get absorbed into the system. Academics who spend time talking to people in industry, to find out what problems exist, tend to get offered jobs in industry; university managers don’t like loosing their good people and are incentivized to discourage junior staff fraternizing with industry. Writing software is not a productive way of generating papers (academics are judged by the number and community assessed quality of papers they produce), so academics spend most of their work time babbling in front of spotty teenagers; the result is some strange ideas about what industry might find useful.

Academic software engineering exists in its own self-supporting bubble. The monkeys type enough papers that it is always possible to find some connection between an industry problem and published ‘research’.

Warp your data to make it visually appealing

May 4th, 2017 No comments

Data plots can sometimes look very dull and need to be jazzed up a bit. Now, nobody’s suggesting that the important statistical properties of the data be changed, but wouldn’t it be useful if the points could be moved around a bit, to create something visually appealing without losing the desired statistical properties?

Readers have to agree that the plot below looks like fun. Don’t you wish your data could be made to look like this?

Rabbit image

Well, now you can (code here, inspired by Matejka and Fitzmaurice who have not released their code yet). It is also possible to thin-out the points, while maintaining the visual form of the original image.

The idea is to perturb the x/y position of very point by a small amount, such that the desired statistical properties are maintained to some level of accuracy:

check_prop=function(new_pts, is_x)
{
if (is_x)
   return(abs(myx_mean-stat_cond(new_pts)) < 0.01)
else
   return(abs(myy_mean-stat_cond(new_pts)) < 0.01)
}
 
 
mv_pts=function(pts)
{
repeat
   {
   new_x=pts$x+runif(num_pts, -0.01, 0.01)
   if (check_prop(new_x, TRUE))
      break()
   }
 
repeat
   {
   new_y=pts$y+runif(num_pts, -0.01, 0.01)
   if (check_prop(new_y, FALSE))
      break()
   }
 
return(data.frame(x=new_x, y=new_y))
}

The distance between the perturbed points and the positions of the target points then needs to be calculated. For each perturbed point its nearest neighbor in the target needs to be found and the distance calculated. This can be done in n log n using kd-trees and of course there is an R package, RANN, do to this (implemented in the nn2 function). The following code tries to minimize the sum of the distances, another approach is to minimize the mean distance:

mv_closer=function(pts)
{
repeat
   {
   new_pts=mv_pts(pts)
   new_dist=nn2(rabbit, new_pts, k=1)
   if (sum(new_dist$nn.dists) < cur_dist)
      {
      cur_dist <<- sum(new_dist$nn.dists)
      return(new_pts)
      }
   }
 
}

Now it’s just a matter of iterating lots of times, existing if the distance falls below some limit:

iter_closer=function(tgt_pts, src_pts)
{
cur_dist <<- sum(nn2(tgt_pts, src_pts, k=1)$nn.dists)
cur_pts=src_pts
for (i in 1:5000)
   {
   new_pts=mv_closer(cur_pts)
   cur_pts=new_pts
   if (cur_dist < 13)
      return(cur_pts)
   }
return(cur_pts)
}

This code handles a single statistical property. Matejka and Fitzmaurice spent more than an hour on their implementation, handle multiple properties and use simulated annealing to prevent being trapped in local minima.

An example, with original points in yellow:

Warp towards rabbit image

Enjoy.

Tags: , , , ,

Array bound checking in C: the early products

April 28th, 2017 No comments

Tools to support array bound checking in C have been around for almost as long as the language itself.

I recently came across product disks for C-terp; at 360k per 5 1/4 inch floppy, that is a C compiler, library and interpreter shipped on 720k of storage (the 3.5 inch floppies with 720k and then 1.44M came along later; Microsoft 4/5 is the version of MS-DOS supported). This was Gimpel Software’s first product in 1984.

C-terp release floppy discs

The Model Implementation C checker work was done in the late 1980s and validated in 1991.

Purify from Pure Software was a well-known (at the time) checking tool in the Unix world, first available in 1991. There were a few other vendors producing tools on the back of Purify’s success.

Richard Jones (no relation) added bounds checking to gcc for his PhD. This work was done in 1995.

As a fan of bounds checking (finding bugs early saves lots of time) I was always on the lookout for C checking tools. I would be interested to hear about any bounds checking tools that predated Gimpel’s C-terp; as far as I know it was the first such commercially available tool.

Hedonism: The future economics of compiler development

April 27th, 2017 No comments

In the past developers paid for compilers, then gcc came along (followed by llvm) and now a few companies pay money to have a bunch of people maintain/enhance/support a new cpu; developers get a free compiler. What will happen once the number of companies paying money shrinks below some critical value?

The current social structure has an authority figure (ISO committees for C and C++, individuals or companies for other languages) specifying the language, gcc/llvm people implementing the specification and everybody else going along for the ride.

Looking after an industrial strength compiler is hard work and requires lots of know-how; hobbyists have little chance of competing against those paid to work full time. But as the number of companies paying for support declines, the number of people working full-time on the compilers will shrink. Compiler support will become a part-time job or hobby.

What are the incentives for a bunch of people to get together and spend their own time maintaining a compiler? One very powerful incentive is being able to decide what new language features the compiler will support. Why spend several years going to ISO meetings arguing with everybody about whether the next version of the standard should support your beloved language construct, it’s quicker and easier to add it directly to the compiler (if somebody else wants their construct implemented, let them write the code).

The C++ committee is populated by bored consultants looking for an outlet for their creative urges (the production of 1,600 page documents and six extension projects is driven by the 100+ people who attend meetings; a lot fewer people would produce a simpler C++). What is the incentive for those 100+ (many highly skilled) people to attend meetings, if the compilers are not going to support the specification they produce? Cut out the middle man (ISO) and organize to support direct implementation in the compiler (one downside is not getting to go to Hawaii).

The future of compiler development is groups of like-minded hedonists (in the sense of agreeing what features should be in a language) supporting a compiler for the bragging rights of having created/designed the languages features used by hundreds of thousands of developers.

Average maintenance/development cost ratio is less than one

April 20th, 2017 3 comments

Part of the (incorrect) folklore of software engineering is that more money is spent on maintaining an application than was spent on the original development.

Bossavit’s The Leprechauns of Software Engineering does an excellent job of showing the probably source of the folklore did not base their analysis on any cost data (I’m not going to add to an already unwarranted number of citations by listing the source).

I have some data, actually two data sets, each measuring a different part of the problem, i.e., 1) system lifetime and 2) maintenance/development costs. Both sets of measurements apply to IBM mainframe software, so a degree of relatedness can be claimed.

Analyzing this data suggests that the average maintenance/development cost ratio, for a IBM applications, is around 0.81 (code+data). The data also provides a possible explanation for the existing folklore in terms of survivorship bias, i.e., most applications do not survive very long (and have a maintenance/development cost ratio much less than one), while a few survive a long time (and have a maintenance/development cost ratio much greater than one).

At any moment around 79% of applications currently being maintained will have a maintenance/development cost ratio greater than one, 68% a ratio greater than two and 51% a ratio greater than five.

Another possible cause of incorrect analysis is the fact we are dealing with ratios; the harmonic mean has to be used, not the arithmetic mean.

Existing industry practice of not investing in creating maintainable software probably has a better cost/benefit than the alternative because most software is not maintained for very long.

Early compiler history

April 12th, 2017 No comments

Who wrote the early compilers, when did they do it and what were the languages compiled?

Answering these questions requires defining what we mean by ‘compiler’ and deciding what counts as a language.

Donald Knuth does a masterful job of covering the early development of programming languages. People were writing programs in high level languages well before the 1950s and Konrad Zuse is the forgotten pioneer from the 1940s (it did not help that Germany had just lost a war and people were not inclined to given German’s much public credit).

What is the distinction between a compiler and an assembler? Some early ‘high-level’ languages look distinctly low-level from our much later perspective; some assemblers support fancy subroutine and macro facilities that give them a high-level feel.

Glennie’s Autocode is from 1952 and depending on where the compiler/assembler line is drawn might be considered a contender for the first compiler. The team led by Grace Hopper produced A-0, a fancy link-loaded, in 1952; they called it a compiler, but today we would not consider it to be one.

Talking of Grace Hopper, from a biography of her technical contributions she sounds like a person who could be technical but chose to be management.

Fortran and Cobol hog the limelight of early compiler history.

Work on the first Fortran compiler started in the summer of 1954 and was completed 2.5 years later, say the beginning of 1957. A very solid claim to being the first compiler (assuming Autocode and A-0 are not considered to be compilers).

Compilers were created for Algol 58 in 1958 (a long dead language implemented in Germany with the war still fresh in people’s minds; not a recipe for wide publicity) and 1960 (perhaps the first one-man-band implemented compiler; written by Knuth, an American, but still a long dead language). The 1958 implementation sounds like it has a good claim to being the second compiler after the Fortran compiler.

In December 1960 there were at least two Cobol compilers in existence (one of which was created by a team containing Grace Hopper in a management capacity). I think the glory attached to early Cobol compiler work is the result of having a female lead involved in the development team of one of these compilers. Why isn’t the other Cobol compiler ever mentioned? Why is the Algol 58 compiler implementation that occurred before the Cobol compiler ever mentioned?

What were the early books on compiler writing?

“Algol 60 Implementation” by Russell and Randell (from 1964) still appears surprisingly modern.

Principles of Compiler Design, the “Dragon book”, came out in 1977 and was about the only easily obtainable compiler book for many years. The first half dealt with the theory of parser generators (the audience was CS undergraduates and parser generation used to be a very trendy topic with umpteen PhD thesis written about it); the book morphed into Compilers: Principles, Techniques, and Tools. Old-timers will reminisce about how their copy of the book has a green dragon on the front, rather than the red dragon on later trendier (with less parser theory and more code generation) editions.

Tags: ,

Collecting all of the publicly available source code

April 6th, 2017 2 comments

Collecting together all the software ever written is impossible, but collecting everything that can be found would create something very useful (since I have always been interested in source code analysis, my opinion is biased).

Collecting together source available on the internet is easy. Creating a copy of Github is one of the first actions of anybody with ambitions of collecting lots of source (back in the day it was collecting shareware 5″ 1/4, then 3″ 1/2 floppy discs, then CDs and then DVDs).

The very hard to collect items “exist” on line-printer paper, punched cards, rolls of punched tape and mag tape. These are the items where serious collectors should concentrate their efforts (NASA lost a lot of Voyager data when magnetic particles fell off the plastic strips of tape reels in storage, because the adhesive had degraded).

The Internet Archive is doing a great job of collecting and making available ‘antique’ source code (old computer games is a popular genre; other collectors concentrate on being able to execute ROM images of games), but they are primarily US based.

Collecting the World’s source code requires collection organizations in every country. Collecting old code is a people intensive business and requires lots of local knowledge.

A new source code collection organization has recently been setup in France; the Software Heritage currently aims to collect all software that is publicly available. So far they have done what everybody else does, made a copy of Github and a couple of the well-known source repos.

I hope this organization is not just the French government throwing money at another one-upmanship US vs. France project.

If those involved are serious about collecting source code, rather than enjoying the perks of a tax-funded show project, they will realise that lots of French specific source code is dotted around the country needing to be collected now (before the media decomposes and those who know how to read it die).