Posts Tagged ‘Python’

Simple generator for compiler stress testing source

April 25th, 2011 3 comments

Since writing my C book I have been interested in the problem of generating source that has the syntactic and semantic statistical characteristics of human written code.

Generating code that obeys a language’s syntax is straight forward. Take a specification of the syntax (say is some yacc-like form) and ‘generate’ each of the terminals/nonterminals on the right-hand-side of the start symbol. Nonterminals will lead to rules having right-hand-sides that in turn need to be ‘generated’, a random selection being made when a nonterminal has more than one possible rhs rule. Output occurs when a terminal is ‘generated’.

For the code to mimic human written code it is necessary to bias the random selection process; a numeric value at the start of each rhs rule can be used to specify the percentage probability of that rule being chosen for the corresponding nonterminal.

The following example generates a subset of C expressions; nonterminals in lowercase, terminals in uppercase and implemented as a call to a function having that name:

first_rule : def_ident " = " expr " ;\n" END_EXPR_STMT ;
def_ident : MK_IDENT ;
constant : MK_CONSTANT ;
identifier : KNOWN_IDENT ;
primary_expr :
	       30 constant |
               60 identifier |
               10 " (" expr ") " ;
multiplicative_expr :
		50 primary_expr |
                40 multiplicative_expr " * " primary_expr |
                10 multiplicative_expr " / " primary_expr ;
additive_expr :
		50 multiplicative_expr |
                25 additive_expr " + " multiplicative_expr |
                25 additive_expr " - " multiplicative_expr ;
expr : START_EXPR additive_expr FINISH_EXPR ;

A 250 line awk program (awk only because I use it often enough for simply text processing that it is second nature) translates this into two Python lists:

productions = [ [0],
[ 1, 1, 1, # first_rule
0, 5, [2, 1001, 3, 1002, 1003, ],
[ 2, 1, 1, # def_ident
0, 1, [1004, ],
[ 4, 1, 1, # constant
0, 1, [1005, ],
[ 5, 1, 1, # identifier
0, 1, [1006, ],
[ 6, 3, 0, # primary_expr
30, 1, [4, ],
60, 1, [5, ],
10, 3, [1007, 3, 1008, ],
[ 7, 3, 0, # multiplicative_expr
50, 1, [6, ],
40, 3, [7, 1009, 6, ],
10, 3, [7, 1010, 6, ],
[ 8, 3, 0, # additive_expr
50, 1, [7, ],
25, 3, [8, 1011, 7, ],
25, 3, [8, 1012, 7, ],
[ 3, 1, 1, # expr
0, 3, [1013, 8, 1014, ],
terminal = [ [0],
[ STR_TERM, " = "],
[ STR_TERM, " ;\n"],
[ STR_TERM, " ("],
[ STR_TERM, ") "],
[ STR_TERM, " * "],
[ STR_TERM, " / "],
[ STR_TERM, " + "],
[ STR_TERM, " - "],

which can be executed by a simply interpreter:

def exec_rule(some_rule) :
 while (cur_action < rule_len) :
    if (some_rule[cur_action] > term_start_base) :
    else :
while (ns < 2000) : # Loop generating lots of test cases

Naive syntax-directed generation results in a lot of code that violates one or more fundamental semantic constraints. For instance the assignment (1+1)=3 is syntactically valid in many languages, which invariably specify a semantic constraint on the lhs of an assignment operator being some kind of modifiable storage location. The simplest solution to this problem is to change the syntax to limit the kinds of constructs that can be generated on the lhs of an assignment.

The hardest semantic association to get right is the connection between variable declarations and references to those variables in expressions. One solution is to mimic how I think many developers write code, that is to generate the statements first and then generate the required definitions for the appropriate variables.

A whole host of minor semantic issues require the syntax generated code to be tweaked, e.g., division by zero occurs more often in untweaked generated code than human code. There are also statistical patterns within the semantics of human written code, e.g., frequency of use of local variables, that need to be addressed.

A few weeks ago the source of Csmith, a C source generator designed to stress the code generation phase of a compiler, was released. Over the years various people have written C compiler stress testers, most recently NPL implemented one in Java, but this is the first time that the source has been released. Imagine my disappointment on discovering that Csmith contained around 40 KLOC of code, only a bit smaller than a C compiler I had once help write. I decided to see if my ‘human characteristics’ generator could be used to create a compiler code generator stress tester.

The idea behind compiler code generator stress testing is to generate a program containing some complicated sequence of code, compile and run it, comparing the value produced against the value that is supposed to be produced.

I modified the human characteristics generator to produce pairs of statements like the following:

i = i_3 * i_6 & i_2 << i_7 ;
chk_result(i, 3 * 6 & 2 << 7, __LINE__);

the second argument to chk_result is the value that i should contain (while generating the expression to assign to i the corresponding constant expression with the variables replaced by their known values is also created).

Having the compiler evaluate the constant expression simplifies the stress tester and provides another check that the compiler gets things right (or gets two different things wrong in the same way, in which case we probably don’t get to see any failure message). The first gcc bug I found concerned this constant expression (in fact this same compiler bug crops up with alarming regularity in the generated code).

As previously mentioned connecting variables in expressions to a corresponding definition is a lot of work. I simplified this problem by assuming that an integer variable i would be predefined in the surrounding support code and that this would be the only variable ever assigned to in the generated code.

There is some simple house-keeping that wraps everything within a program and provides the appropriate variable definitions.

The grammar used to generate full C expressions is 228 lines, the awk translator 252 lines and the Python interpreter 55 lines; just over 1% of Csmith in LOC and it is very easy to configure. However, an awful lot functionality needs to be added before it starts to rival Csmith, not least of which is support for assignment to more than one integer variable!

SEC wants prospectus source code to be published

April 23rd, 2010 No comments

The US Securities and Exchange Commission are proposing new rules involving the prospectuses for public offerings of asset-backed securities including publishing the source code used to calculate the contractual cash flow provisions.

Requiring that the source code used to perform the financial modeling for a prospectus be made available is an excellent idea. A prospectus document contains a huge number of technical details and more importantly for anybody trying to understand the thinking behind it, a lots of assumption. Writing a program requires that all necessary details be enumerated and appropriately connected together and more importantly creating code that can be meaningfully executed usually means making explicit any assumptions that were previously implicit.

There are parallels here with having access to the source code and data used to make climate predictions.

The authors of the proposals are naive to think that simply requiring source to be written in a language for which there is an open source implementation (i.e., Python) is all they need to specify for others to duplicate the program output generated by the proposer (I have submitted some suggestions to the SEC about the issues that need to be addressed). The suggestions that a formally defined language be used is equally naive.

The availability of this source code opens up some interesting commercial prospects. No, not selling analysis tools to financial institutions but selling them program fault information, e.g., under circumstance X the program incorrectly predicts A will happen when in fact B will actually happen. Of course companies know this will happen and will put a lot more effort into ensuring that their models/code is correct.

Will these disclosure rules change the characteristics of financial software? One characteristic that I’m sure will change is the percentage of swear words in the comments and identifiers.

How not to treat loyal customers

December 6th, 2008 No comments

The designer of Python is about to get some on-the-job education in how loyal customers should be treated. The latest release of Python, version 3.0, is the first that is “ever intentionally backwards incompatible”, see the release notes for the litany of broken constructs. Just in case customers are slow to get the message, this latest release is also 10% slower.

If, six months from now, most people/sites are running Python 3.0 we can deduce that very few significant programs are written in the language. On the other hand if very few people/sites are running Python 3.0 we can deduce that many users of the language have significant amounts of code written in the language. At least one measure of program language usage puts Python in the top 10, so there should be plenty of data points.

Should we expect some back-peddling in 2009? Perhaps not; the tone of the release notes is breathtakingly casual about the pain users will experience if they update, not even trying to soften the blow by selling the benefits of the latest release. The ‘you are with me or against me’ attitude is nailed to the mast with “It is not recommended to try to write source code that runs unchanged under both Python 2.6 and 3.0;”