Archive

Posts Tagged ‘C’

An ISO Standard for R (just kidding)

July 24th, 2014 4 comments

IST/5, the British Standards’ committee responsible for programming languages in the UK, has a new(ish) committee secretary and like all people in a new role wants to see a vision of the future; IST/5 members have been emailed asking us what we see happening in the programming language standards’ world over the next 12 months.

The answer is, off course, that the next 12 months in programming language standards is very likely to be the same as the previous 12 months and the previous 12 before that. Programming language standards move slowly, you don’t want existing code broken by new features and it would be a huge waste of resources creating a standard for every popular today/forgotten tomorrow language.

While true the above is probably not a good answer to give within an organization that knows its business intrinsically works this way, but pines for others to see it as doing dynamic, relevant, even trendy things. What could I say that sounded plausible and new? Big data was the obvious bandwagon waiting to be jumped on and there is no standard for R, so I suggested that work on this exciting new language might start in the next 12 months.

I am not proposing that anybody start work on an ISO standard for R, in fact at the moment I think it would be a bad idea; the purpose of suggesting the possibility is to create some believable buzz to suggest to those sitting on the committees above IST/5 that we have our finger on the pulse of world events.

The purpose of a standard is to create agreement around one way of doing things and thus save lots of time/money that would otherwise be wasted on training/tools to handle multiple language dialects. One language for which this worked very well is C, for which there were 100+ incompatible compilers in the early 1980s (it was a nightmare); with the publication of the C Standard users finally had a benchmark that they could require their suppliers to meet (it took 4-5 years for the major suppliers to get there).

R is not suffering from a proliferation of implementations (incompatible or otherwise), there is no problem for an R standard to solve.

Programming language standards do get created for reasons other than being generally useful. The ongoing work on C++ is a good example of consultant driven standards development; consultants who make their living writing and giving seminars about the latest new feature of C++ require a steady stream of new feature to talk about and have an obvious need to keep new versions of the standard rolling down the production line. Feeling that a language is unappreciated is another reason for creating an ISO Standard; the Modula-2 folk told me that once it became an ISO Standard the use of Modula-2 would take off. R folk seem to have a reasonable grip on reality, or have I missed a lurking distorted view of reality that will eventually give people the drive to spend years working their fingers to the bone to create a standard that nobody is really that interested in?

C++ vs. Ada: Which language is more strongly typed?

April 17th, 2014 No comments

Programming languages are sometimes categorized as being either weakly or strongly typed. I’m not going to join the often rabid debates over which category a particular language belongs to, but rather discuss the relative type strengths of two languages, C++ and Ada, to see if it is possible to claim that one of them is more strongly typed than the other.

Most programming languages support variables having more than one type (e.g., integer and float are two very common types) and have rules specifying which combinations of differently typed values/variables are permitted to occur in a given context, e.g., C++ allows a value of type int to be assigned to a variable of type float (an implicit conversion is performed), but Ada not perform this implicit conversion and the integer value has to be explicit converted to float before it can be assigned (otherwise a compile time error will be generated).

The more implicit type conversions a language supports the weaker its type system is said to be.

C++ supports more implicit conversions than Ada (others include boolean/int and char/int) and loose type strength points because of this (there is plenty of scope for debate about whether some implicit conversions are more evil than others, but cost/benefit debates are harder to come by).

While C++/Ada differ in their support for implicit conversions they are pretty equal in their support for explicit conversions (e.g., in Ada the code float(23) would convert the integer 23 to a float type). In some cases Ada requires that various hoops be jumped through to make the conversion happen (representation clauses are a great topic to bring up when being lectured about how type safe Ada is, a bit like telling somebody being snobbish that they go to the bathroom like everybody else).

The underlying idea is that the compilation errors generated by these ‘undesirable’ attempted implicit conversions alert the developer to a possible mistake in what they have written. These kinds of messages from the compiler have certainly caught errors in my code, but often the error has been a failure to write the required explicit conversion; every now and again a ‘real’ error is flagged. But I digress, this discussion is about what weak/strong typing is, not about what its costs and benefits might be.

Does Ada have any other feature that increases its type strength with respect to C++?

Both languages allow names to be given to existing types: typedef length_t int; in C++ and subtype length_t is integer; in Ada both define length_t to be a synonym for the integer type, but without resulting in any extra type checks occurring. However, Ada supports a kind of type definition mechanism that does result in extra checks being made by the compiler. In the following code:

subtype length_t is integer;
type time_t is integer;
 
a : integer;
b : length_t;
c : time_t;
 
begin
a := b;          -- OK
a := c;          -- Error, type mismatch
a := integer(c); -- OK, explicit conversion

time_t is defined to have a type that is not compatible with integer, even although its underlying representation is the same as the integer type. Mixing variables having types integer and time_t results in a compile time error.

The intended purpose for defining a ‘new’ type and creating variables having that type is to restrict operations on those variables to being with other variables having the same type, e.g., assignment and addition between any variables having type time_t is fine but involving other types results in a compile time error (there are special rules that allow integer literals to general get mixed in). I find that the errors flagged by this kind of checking are mostly very useful.

It is also possible to achieve the same kind of type checking in C++ using template metaprogramming, e.g., the SIunits library. In fact using this technique enables C++ to support a much more general and user friendly range of of ‘strong type’ functionality than is supported by the built-in Ada functionality (it is also possible to use general language functionality in Ada to implement the kind of checking possible in C++, however prior to the 2012 Ada standard the checks occurred at runtime but it now looks like there is a mechanism for doing them at compile time {because it is often possible to switch off runtime checks some people consider them to be weaker than compile time checks})

Fans of subranges (I dearly miss this feature when using C-like languages) can get their fix here.

Is there a rule that extra type strength points are given if a language contains explicit type creation syntax (Ada contains a menagerie of syntax and semantics for doing this kind of stuff), compared to languages that require the use of constructs having many other uses? I don’t see why there should be. The fact that template metaprogramming makes a lot of C++ developers’ head’s hurt means that many will limit themselves to using what others have created, rather than growing project specific libraries; but since when have usability and frequency of use been a major issue in the weak/string type debate?

The score so far is that C++ has lots points to Ada because of its greater support for implicit type conversions, but has held its ground everywhere else.

Can either language pick up any more points?

There is the culture angle. Ada developers have a culture of making use of the type checking functionality provided by the language; this is a skill that has to be learned, you get some type checking for free but the rest has to be designed into the code. C++ developers also have a culture of making use of the type checking functionality provided by the language, i.e., most do not use add-on packages like SIunits.

I am not aware of any studies that have investigated the extent to which developers make use of type checking functionality; pointers to such studies welcome. If there is more ‘strongly typed’ C++ than Ada code out there it is only because there is a lot more C++ code out there.

It is my experience that culture and existing code do color developers’ position on where to draw the line in the weak/strong debate, but don’t effect relative language orderings.

The conclusion is that Ada is more strongly typed than C++, but how much more strongly typed remaines an open question. Both languages require effort from the developer to make full use of the typing functionality that is available.

Testing compiler semantics with minimal manual input

November 11th, 2013 3 comments

The 2011 revision of the C++ Standard added lots of new constructs to the language and in the past few months both the GCC and LLVM teams have been claiming that the next release of their C++ compilers will fully support the 2011 Standard. How true are these claims? One way of answering this question is to run both compilers over an extensive test suite. There are commercial C++ compiler test suites available, but I don’t have access to them and if I did the license agreement would probably not allow me to talk in detail about the results. Writing compiler tests cases requires a very detailed knowledge of the language; I have done it often enough in previous lives to know that more than a year or so of my time would be needed just to get my head around the semantics of the new C++ features, before I could produce anything half decent.

Is there a way of automating the generation of test cases for language semantics? Automated statement/expression generation is very effective at finding problems with optimizers and code generators. Can this technique be applied to check the semantic requirements of a language?

Having concocted various elaborate schemes over the years I recently realised that life would be a lot simpler if I was willing to accept a very high percentage of erroneous test programs (the better manually written test suites usually contain around as many test cases that intended to fail to compile as tests that pass, i.e., intended to compile correctly; the not so good ones have few failing tests).

If two or more compilers are available the behavior of each of them on a given source file can be compared: differential testing. If both compile a file or fail to compile it, they may both be right or wrong; either way this shared behavior is not interesting, but is likely to be the common case. The interesting case is if one compiles a file and the other fails to compile it; this could be a fault in one of them, or one of those cases where the Standard permits compilers to do their own thing.

I hereby jump to the conclusion that behavior differences is a good proxy for compiler conformance to the language Standard (actually developers are often more interested in all compilers they are likely to use having the same external behavior than conforming to a Standard).

Lets implement this (source code here)!

First we need to generate lots of test cases. The process I used is based on program templates, such as the following (lines starting with ! are places where various constructs can be inserted):

int v;
 
! declare_id 2
 
int main(void)
{
! declare_id 2
 
! ref_id 2
 
}

the identifier after the ! is the name of a file containing lines to be inserted at the given location (the number after the identifier is the maximum number of lines that can be inserted at that point; default 1 if no number given). The following is example file contents for the above template:

declare_id

int i;
enum {i, j};
enum i {x, y};
struct i {int f;};
typedef int i;

ref_id

enum i ev_i;
struct i sv_i;
typedef i tv_i;
v=i;

It is then simply a matter of permuting through all of the possible combinations of lines that can be inserted in the program template, creating a stand alone file for each possibility (18,000 of them in the above example); I used the Python Natural Language Toolkit to do the heavy lifting.

A shell script compiles each source file and compares the compiler exit codes. For the above example there were 16,366 failures, 1,634 passes and no differences (this example contains well established C constructs and any difference would have been surprising).

Next, a feature new in C++11, lambda functions!

Here is the template used:

! declare_xy 2
 
int main(void)
{
 
! declare_xy 2
 
auto foo_bar =
! define_lambda
;
 
return 0;
}

I cut and pasted some examples from the Standard to create the following optional lines:

define_lambda

[](float a, float b) { return a + b; }
[=](float a, float b) { return a + b; }
[=,x](float a, float b) { return a + b; }
[y](float a, float b) { return a + b; }
[=]()->int { return operator()(this->x + y); }
[&, i]{ }
[=] { decltype(x) y1; decltype((x)) y2 = y1; decltype(y) r1 = y1; decltype((y)) r2 = y2; }

which generated 6,300 source files of which 5,865 failed, 396 passed and 39 were treated differently by the compilers (g++ version 4.7.2, clang version 3.3).

How should the percentages be calculated? If we take the human written numbers for well written test suites containing (roughly) equal numbers of pass/fail tests, then we have around 800 tests of which (say) 40 gave different behavior, giving us a 5% fault rate. Do we share that 5% equally between both compilers or assign 3% for both being wrong and 1% for each being uniquely wrong?

Submitting a bug report to both compiler teams pointing out that their behavior is different from the other’s is a sure fire way to make myself unpopular. Any suggestions for how to resolve this issue, that does not involve me having to study the tiresomely long and convoluted C++ Standard, welcome.

Single-quote as a digit separator soon to be in C++

September 30th, 2013 4 comments

At the C++ Standard’s meeting in Chicago last week agreement was finally reached on what somebody in the language standards world referred to as one of the longest bike-shed controversies; the C++14 draft that goes out for voting real-soon-now will include support for single-quotation-mark as a digit separator. Assuming the draft makes it through ISO voting you could soon be writing (Compiler support assumed) 32'767 and 0.000'001 and even 1'2'3'4'5'6'7'8'9 if you so fancied, in your conforming C++ programs.

Why use single-quote? Wouldn’t underscore have been better? This issue has been on the go since 2007 and if you feel really strongly about it the next bike-shed C++ Standard’s meeting is in Issaquah, WA at the start of next year.

Changing the lexical grammar of a language is fraught with danger; will there be a change in the behavior of existing code? If the answer is Yes, then the next question is how many people will be affected and how badly? Let’s investigate; here are the lexical details of the proposed change:

pp-number:
    digit
    . digit
    pp-number digit
    pp-number ' digit
    pp-number ' nondigit
    pp-number identifier-nondigit
    pp-number e sign
    pp-number E sign
    pp-number .

Ideally the change of behavior should cause the compiler to generate a diagnostic, when code containing it is encountered, so the developer gets to see the problem and do something about it. The following conforming C++ code will upset a C++14 compiler (when I write C++ I mean the C++ Standard as it exists in 2013, i.e., what was called C++11 before it was ratified):

#define M(x) #x   // stringize the macro argument
 
char *p=M(1'2,3'4);

At the moment the call to the macro M contains one argument, the sequence of three tokens {1}, {'2,3'} and {4} (the usual convention is to bracket the characters making up one token with matching curly braces).

In C++14 the call to M will contain the two arguments {1'2} and {3,4}. conforming compiler is required to complain when the number of arguments to a macro invocation don’t match the definition…. Unless the macro is defined to accept a variable number of arguments:

#define M(x, ...) __VA_ARGS__
 
          int x[2] = { M(1'2,3'4) };
// C++11: int x[2] = {};
// C++14: int x[2] = { 3'4 };

This is the worst kind of change in behavior, known as a silent change, the existing code compiles without complaint but has different behavior.

How much existing code contains either of these constructs? I suspect very very little human written code, maybe even none. This is the sort of stuff that is more likely to be produced by automatic code generators. But how much more likely? I have no idea.

How much benefit does the new feature provide? It certainly looks useful, but coming up with a number for the benefit is hard. I guess it has the potential to shave a fraction of a second off of the attention a developer has to pay when reading code, after they have invested in learning about the construct (which is lots of seconds). Multiplied over many developers and not that many instances (the majority of numeric literals contain a single digit), we could be talking a man year or two per year of worldwide development effort?

All of the examples I have seen require the ‘assistance’ of macros, here is another (courtesy of Jeff Snyer):

#define M(x) A ## x
#define A0xb
 
int operator "" _de(char);
int x = M(0xb'c'_de);

Are there any examples of a silent change that don’t involve the preprocessor?

2013 in the programming language standards’ world

August 29th, 2013 1 comment

Yesterday I was at the British Standards Institute for a meeting of IST/5, the committee responsible for programming languages. Things have been rather quiet since I last wrote about IST/5, 18 months ago. Of course lots of work has been going on (WG21 meetings, C++ Standard, are attracting 100+ people, twice a year, to spend a week trying to reach agreement on new neat features to add; I’m not sure how the ability to send email is faring these days ;-) .

Taken as a whole programming languages standards work, at the ISO level, has been in decline over the last 10 years and will probably decline further. IST/5 used to meet for a day four times a year and now meets for half a day twice a year. Chairman of particular language committees, at international and national levels, are retiring and replacements are thin on the ground.

Why is work on programming language standards in decline when there are languages in widespread use that have not been standardised (e.g., Perl and PHP do not have a non-source code specification)?

The answer is low hardware/OS diversity and Open source. These two factors have significantly reduced the size of the programming language market (i.e., there are far fewer people making a living selling compilers and language related tools). In the good old-days any computer company worth its salt had its own cpu and OS, which of course needed its own compiler and the bigger vendors had third parties offering competing compilers; writing a compiler was such a big undertaking that designers of new languages rarely gave the source away under non-commercial terms, even when this happened the effort involved in a port was heroic. These days we have a couple of cpus in widespread use (and unlikely to be replaced anytime soon), a couple of OSs and people are queuing up to hand over the source of the compiler for their latest language. How can a compiler writer earn enough to buy a crust of bread let alone attend an ISO meeting?

Creating an ISO language standard, through the ISO process, requires a huge amount of work (an estimated 62 person years for C99; pdf page 20). In a small market few companies have an incentive to pay for an employee to be involved in the development process. Those few languages that continue to be worked on at the ISO level have niche markets (Fortran has supercomputing and C had embedded systems) or broad support (C and now C++) or lots of consultants wanting to be involved (C++, not so much C these days).

The new ISO language standards are coming from national groups (e.g., Ruby from Japan and ECMAscript from the US) who band together to get the work done for local reasons. Unless there is a falling out between groups in different nations, and lots of money is involved, I don’t see any new language standards being developed within ISO.

Arm waver or expert?

May 30th, 2013 No comments

I was at a workshop yesterday where one of the speakers claimed his tool supported all of Standard C; needless to say I went over to chat during one of the breaks. Now a lot of the time speakers will immediately admit to implementing a subset when chatting over coffee, but this guy was claiming all of Standard C and had a very favorable opinion of his own expertise. It is impolite and somewhat confrontational to stand in front of somebody with a checklist of questions; what topic should be worked into the conversation to best gauge whether a person really does have a detailed grasp of C?

I tossed the phrase strictly conforming into the conversation and a bit later integer promotions, these resulted in just more huff and puff from him. Now people with a detailed knowledge of C are thin on the ground and an encounter with one usually results in a warm mutual exchange of war stories on the problems encountered during the implementation of some feature or other. Perhaps this guy’s tool had been implemented by a student and this was not part of the message; perhaps this was his first encountered with somebody who also had a detailed knowledge of the language and he did not know how to react (I imagine such people are even rarer in academia than industry).

Thinking about it on the train home I decided that “sequenced before” is the phrase I should have tossed into the conversation. The concept of sequence points existed in C90 and C99, but was replaced by “sequenced before” and unsequenced in C11 (a more complicated memory ordering model was necessitated by the newly added support for sharing objects between processes). Yesterday’s speaker was not there today, so I was not able to reinvestigate whether his knowledge was pre/post C11 or just bluster.

The C++ phrase to toss into a conversation used to be One definition rule, but I don’t know if this is still true today. I once saw an email exchange where a supposed expert had never heard about the “one definition” rule and jokingly asked if it was connected to the “one ring” in Lord of the Rings; oops, he had obviously never read the C++ Standard.

What about other languages? Suggestions for phrases I might use to gauge whether somebody really is an expert in some other language welcome (this would be pure bluff on my part and I accept the consequences of using any suggestion).

Writing language standards is a cottage industry

August 8th, 2012 No comments

In the beginning programming language standards were written by one country’s National Standards body (e.g., ANSI did C/Cobol/Fortran for the USA and BSI did Pascal for the UK) and other countries were free to write their own version, adopt the existing work or do nothing (I don’t know of any country writing their own version, a few countries sometimes stuck their own front page on an existing document and the majority did nothing; update 4 Dec 2012, thanks to David Muxworthy for pointing out that around 1974 the UK, US, Japan and ECMA were all independently developing a standard for BASIC, by 1982 this had evolved to just ANSI and ECMA).

The UK people who created the Pascal Standard wanted the rest of the world (i.e., the US) to adopt it and the way to do this was to have it adopted as an ISO Standard. The experience of making this happen convinced the folk at BSI that in future language standards should be produced as an international effort within ISO (those pesky Americans wanted changes made to the document before they would vote for it).

During the creation of the first C Standard various people from Europe joined the ANSI committee, X3J11, so they could take part. Initially the US members were not receptive to the European request for a mechanism to handle keyboards that did not contain certain characters (e.g., left/right square brackets) but responded promptly on hearing that those (pesky) Europeans planned to publish an ISO C Standard that would contain those changes to the ANSI Standard needed to support trigraphs; the published ANSI Standard included support for trigraphs. The C ANSI committee were very receptive to the idea of future work being done at the ISO level; Bill Plauger/Tom Plum did a lot of good work to ensure it happened.

The C++ language came along and long story short an ISO committee was set up to create an ISO Standard for it, then Java came along and the Java Study Group failed to become an ISO committee and then various nonspecific language committees happened.

A look at the SC22 web site shows that ISO Standards exist for Forth and ECMAscript (it has not yet been updated to include Ruby) with no corresponding ISO committees. What is going on?

One could be cynical and say that special interests are getting a document of their choosing accepted by ECMA and then abusing the ISO fast track procedure to sidestep the need to setup an international committee that has the authority to create a document of its choosing. The reality is that unless a language is very widely used by lots of people (e.g., in the top five or so most commonly used languages) there are unlikely to be enough people (or employers) willing and able to commit the time and money needed to be actively involved in an ISO Standard committee.

Once a documented has been fast tracked to become ISO Standard any updates to it are supposed to be carried out under ISO rules (i.e., an ISO committee). In practice this is not happening with ECMAscript which continues to be very active (I don’t know what is happening with Forth or how the Ruby people plan to handle any updates), holding bi-monthly meetings; over the years they have fast tracked two revisions to the original fast tracked document (the UK did raise the issue during balloting but nothing came of it, I don’t think anybody really cares).

Would moving the ECMAscript development work from ECMA to ISO make a worthwhile difference? There might be a few people out there who would attend an ISO meeting who are not currently attending ECMA meetings (to join ECMA companies with five or less employees pay an annual fee of 3,500 Swiss francs {about the same number of US dollars} and larger companies pay a lot more) but I suspect the number would not be large enough to make up for the extra hassle of running an ISO committee (e.g., longer ISO balloting timescales).

Production of programming language standards is really a cottage industry that relies on friends in high places (e.g., companies with an existing membership of ECMA or connections into the local country standards’ body) for them to appear on the international stage.

Would you buy second hand software from a formal methods researcher?

May 23rd, 2012 No comments

I have been reading a paper on formally proving software correct (Bridging the Gap: Automatic Verified Abstraction of C by Greenaway, Andronick and Klein) and as often the case with papers on this topic the authors have failed to reach the level of honest presentation required by manufacturers of soap power in their adverts.

The Greenaway et al paper describes a process that uses a series of translation steps to convert a C program into what is claimed to be a high level specification in Isabelle/HOL (a language+support tool for doing formal proofs).

The paper was published by an Australian research group; I could not find an Australian advertising standards code dealing with soap power but did find one covering food and beverages. Here is what the Australian Association of National Advertisers has to say in their Food & Beverages Advertising & Marketing Communications Code:

“2.1 Advertising or Marketing Communications for Food or Beverage Products shall be truthful and honest, shall not be or be designed to be misleading or deceptive or otherwise contravene Prevailing Community Standards, and shall be communicated in a manner appropriate to the level of understanding of the target audience of the Advertising or Marketing Communication with an accurate presentation of all information…”

So what claims and statements do Greenaway et al make?

2.1 “Before code can be reasoned about, it must first be translated into the theorem prover.” A succinct introduction to one of the two main tasks, the other being to prove the correctness of these translations.

“In this work, we consider programs in C99 translated into Isabelle/HOL using Norrish’s C parser … As the parser must be trusted, it attempts to be simple, giving the most literal translation of C wherever possible.”

“As the parser must be trusted”? Why must it be trusted? Oh, because there is no proof that it is correct, in fact there is not a lot of supporting evidence that the language handled by Norrish’s translator is an faithful subset of C (ok, for his PhD Norrish wrote a formal semantics of a subset of C; but this is really just a compiler written in mathematics and there are umpteen PhDs who have written compilers for a subset of C; doing it using a mathematical notation does not make it any more fault free).

The rest of the paper describes how the output of Norrish’s translator is generally massaged to make it easier for people to read (e.g., remove redundant statements and rename variables).

Then we get to the conclusion which starts by claiming: “We have presented a tool that automatically abstracts low-level C semantics into higher-level specifications with automatic proofs of correctness for each of the transformation steps.”

Oh no you didn’t. There is no proof for the main transformation step of C to Isabelle/HOL. The only proofs described in the paper are for the post processing fiddling about that was done after the only major transformation step.

And what exactly is this “high-level specification”? The output of the Norrish translator was postprocessed to remove the clutter that invariably gets generated in any high-level language to high-level language translator. Is the result of this postprocessing a specification? Surely it is just a less cluttered representation of the original C?

Actually this paper does contain a major advance in formally proving software correct, tucked away at the start it says “As the parser must be trusted…”. There it is in black and white, if you have some software that must be trusted don’t bother with formal proofs just simply follow the advice given here.

But wait a minute you say, I am ignoring the get out of jail wording “… shall be communicated in a manner appropriate to the level of understanding of the target audience …”. What is the appropriate level of understanding of the target audience, in fact who is the target audience? Is the target audience other formal methods researchers who are familiar with the level of intellectual honesty within their field and take claims made by professional colleagues with a pinch of salt? Are non-formal methods researchers not the target audience and so have no redress to being misled by the any claims made by papers in this field?

A free pdf that is not the C++ Standard

April 14th, 2012 No comments

One of the most annoying things about working on programming language standards is to see the exorbitant price ISO charge for the final published document created by experts toiling away for free over many years. This practice is unlikely to change.

Once a document has been through a successfully ballot (i.e., accepted for publication as a Standard) ISO has very strict rules about what changes can be made to it prior to actual publication (only typos of the most very innocuous kind can be fixed). Of course programming language committees live on after the publication of a Standard by ISO and it is very useful for them if the document they start deliberating on is one that has had all the typos corrected, not just the really innocuous ones.

The list of typos in the 2011 C++ Standard have been fixed in the freely down loadable committee document N3337 that is not the official C++ Standard, however uncannily similar it may look (no ISO gobbledygook at the front for instance). The equivalent committee document for C is not yet available on the WG14 site.

If you really do need a copy of the C++ Standard the 2011 (i.e., latest) document in pdf form is available for $30 from ANSI; the 2011 C Standard is also available for the same price. Don’t worry about the ANSI version being dated 2012 rather than 2011; National Standards bodies must sell the ISO version at ISO prices but are allowed to publish localized versions for which they can set their own price (so you pay less and get various US specific acronyms and verbiage printed on the front cover).

Generating code that looks like it is human written

April 11th, 2012 No comments

I am very interested in understanding the patterns of developer behavior that lead to the human characteristics that can be found in code. To help me get some idea of how well I understand this behavior I have decided to build a tool that generates source code that appears to be written by human programmers. I hope to reach a point where I can offer a challenge to tell the difference between generated code and human written code.

The three main production techniques I plan to use are, in increasing order of relatedness to humans production techniques, are:

  1. Random generation based on percentage occurrence of language constructs obtained from measurements of existing source. This is the simplest approach and the one furthest away from common developer behavior; even so there are things that can be learned from this information. For instance, the theory that developers are more likely to create a function once code becomes heavily nested code implies that the probability of encountering an if-statement decreases as nesting depth increases; measurements show the probability of encountering an if-statement remaining approximately constant as depth of nesting increases.
  2. Behavior templates. People have habits in everyday life and also when writing software. While some habits are idiosyncratic and not encountered very often there are some that appear to be generally used. For instance, developers tend to assign a fixed role to every variable they define (e.g., stepper for stepping through a succession of values and most-recent holder holding the latest value encountered in going through a succession of values).

    I am expecting/hoping that generation by behavioral templates will result in code having some of the probabilistic properties seen in human code, removing the need for purely random generation driven by low level language probability measurements. For instance, the probability of a local variable appearing in a function is proportional to the percentage of its previous occurrences up to that point in the source of the function (percentage = occurrences_of_X / occurrences_of_all_local_variables) and I am hoping that this property appears as emergent behavior from generating using the role of variable template.

  3. Story telling. A program is like the plot of a story, it has a cast of characters (e.g., classes, functions, libraries) that perform various actions and interact with each other in order to achieve various goals, there are subplots (intermediate results are calculated, devices are initialized, etc), there are resource limits, etc.

    While a lot of stories are application domain specific there are subplots common to many stories; also how a story is told can be heavily influenced by the language used, for instance Prolog programs have a completely different structure than those written in procedural languages such as Java. I want to stay away from being application specific and I don’t plan to tackle languages too far outside the common-or-garden procedural variety.

    Researchers have created automatic story generators; the early generators were template based while more recent systems have used an agent based approach. Story based generation of code is my ideal, but I am a long way away from having enough knowledge of developer behavior to be more than template based.

In a previous post I described a system for automatically generating very simply C programs. I plan to build on this system to incrementally improve the ‘humanness’ of the generated code. At some point, hopefully before the end of this year, I will challenge people to tell the difference between automatically generated and human written code.

The language I have studied the most is C and this will be the main target. I don’t want to be overly C specific and am trying to decide on a good second language (i.e., lots of source available for measurement, used by lots of developers and not too different from C). JavaScript is the current front runner, it is a class-less object oriented language which is not ‘wildly’ OO (the patterns of usage in human written OO code continue to evolve at a rapid rate which can make a lot of human C++/Java code look automatically generated).

As well as being a test bed for understanding of human generated code other uses for an automatic generator include compiler stress testing and providing code snippets to an automated fault fixing tool.