Archive

Posts Tagged ‘tools’

Parsing Fortran 95

December 20th, 2009 Derek-Jones

I have been looking at doing some dimensional analysis of the Climategate code and so needed a Fortran parser.

The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03. I decided to take the opportunity to learn something about the new features by writing a Fortran parser that did not require a symbol table.

The Eli project had a Fortran 90 grammar that was close to having a form acceptable to bison, and a few hours of editing and debugging got me a grammar containing 6 shift/reduce conflicts and 1 reduce/reduce conflict. These conflicts looked like they could all be handled using GLR parsing. The grammar contained 922 productions, which is somewhat large, but I was only interested in actively making use of parts of it.

For my lexer I planned to cut and paste an existing C/C++/Java lexer I have used for many projects. Now this sounds like a fundamental mistake: these languages treat whitespace as being significant while Fortran does not. This important difference is illustrated by the well-known situation where a Fortran lexer needs to look ahead in the character stream to decide whether the next token is the keyword do or the identifier do5i (if the 1 is followed by a comma it must be the keyword):

      do 5 i = 1 , 10
      do 5 i = 1 . 10        ! assign 1.10 to do5i
5     continue
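
The lookahead can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the front ends discussed here: the name classify_do is hypothetical, and the sketch assumes one statement per line with no string literals or continuation lines. After discarding blanks, the statement is treated as a DO loop only if a comma appears outside parentheses after the =.

```python
import re

def classify_do(stmt):
    """Classify a statement as 'do-loop', 'assignment' or 'other'.

    Assumption: a trailing '!' starts a comment (no string literals).
    """
    code = stmt.split("!", 1)[0]
    text = re.sub(r"\s+", "", code).lower()   # blanks are not significant
    head = re.match(r"do(\d+)([a-z][a-z0-9_]*)=", text)
    if not head:
        return "other"
    depth = 0
    for ch in text[head.end():]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "do-loop"        # do 5 i = 1 , 10
    return "assignment"             # do5i = 1.10

print(classify_do("do 5 i = 1 , 10"))
print(classify_do("do 5 i = 1 . 10        ! assign 1.10 to do5i"))
```

A comma inside parentheses, as in do 5 i = f(1, 2), does not count, which is why the scan tracks nesting depth.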

In my experience developers don’t break up literals or identifier names with whitespace and so I planned to mostly ignore the whitespace issue (it would simplify things if some adjacent keywords were merged to create a single keyword).

In Fortran the I/O is specified in the language syntax, while in C-like languages it is a runtime library call involving a string whose contents are interpreted at runtime. I decided to ignore I/O statements by skipping to the end of line (Fortran is line oriented).

Then the number of keywords hit me, around 190. Even with the simplifications I had made writing a Fortran lexer looked like it would be a lot of work; some of the keywords only had this status when followed by a = and I kept uncovering new issues. Cutting and pasting somebody else’s lexer would probably also involve a lot of work.

I went back and looked at some of the Fortran front ends I had found on the Internet. The GNU Fortran front-end was a huge beast and would need serious cutting back to be of use. moware was written in Fortran and used the traditional six character abbreviated names seen in ‘old-style’ Fortran source and not a lot of commenting. The Eli project seemed a lot more interested in the formalism side of things and Fortran was just one of the languages they claimed to support.

The Open Fortran Parser looked very interesting. It was designed to be used as a parsing skeleton for building tools that process source, and already contained hooks that produced diagnostic output when each language production was reduced during a parse. Tests showed that it did a good job of parsing the source I had, although there was one vendor extension used quite often (and not documented in their manual). The tool source, in Java, looked straightforward to follow and it was obvious where my code needed to be added. This tool was exactly what I needed :-)

Software maintenance via genetic programming

November 27th, 2009 Derek-Jones

Genetic algorithms have been used to find solutions to a wide variety of problems, including compiler optimizations. It was only a matter of time before somebody applied these techniques to fixing faults in source code.

When I first skimmed the paper “A Genetic Programming Approach to Automated Software Repair” I was surprised at how successful the genetic algorithm was, using as it did such a relatively small amount of cpu resources. A more careful reading of the paper located one very useful technique for reducing the size of the search space; the automated software repair system started by profiling the code to find out which parts of it were executed by the test cases and only considered statements that were executed by these tests for mutation operations (they give a much higher weighting to statements only executed by the failing test case than to statements executed by the other tests; I am a bit surprised that this weighting difference is worthwhile). I hate to think of the amount of time I have wasted trying to fix a bug by looking at code that was not executed by the test case I was running.
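
The weighting idea can be sketched as follows. The function name, the representation of coverage as sets of statement identifiers and the two weight values are all assumptions made for illustration; the paper's actual values and data structures may differ.

```python
def mutation_weights(failing_cov, passing_cov, only_failing=1.0, shared=0.1):
    """Map each executed statement to a mutation weight.

    failing_cov: ids of statements executed by the failing test case.
    passing_cov: ids of statements executed by the other (passing) tests.
    Statements executed by no test get no entry, so they are never mutated.
    The default weights are hypothetical, chosen only for the example.
    """
    weights = {}
    for stmt in failing_cov | passing_cov:
        if stmt in failing_cov and stmt not in passing_cov:
            weights[stmt] = only_failing   # seen only by the failing test
        else:
            weights[stmt] = shared         # also (or only) seen by passing tests
    return weights

w = mutation_weights({1, 2, 3}, {2, 3, 4})
print(sorted(w.items()))
```

The resulting dictionary could then drive a weighted random choice of mutation site, for instance with random.choices(list(w), weights=list(w.values())).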

I learned more about this very interesting system from one of the authors when he gave the keynote at a workshop organized by people associated with a source code analysis group I was a member of.

The search space was further constrained by only performing mutations at the statement level (i.e., expressions and declarations were not touched) and restricting the set of candidate statements for insertion into the code to those statements already contained within the code, such as if (x != NULL) (i.e., new statements were not randomly created and existing statements were not modified in any way). Measurements of existing code show that most uses of a construct are covered by a few simple cases and that most statements are constructed from a small number of commonly used constructs, so it is no surprise that restricting the candidate insertion set to existing code works so well. Of course, no fault fix that depends on using a statement not contained within the source will ever be found.
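
A statement-level mutation restricted to existing statements might look something like the sketch below. It treats a program as a flat list of statement strings, which is a simplifying assumption (the real system works on a syntax tree), and the function name and the choice of delete/insert/swap operators are illustrative rather than taken from the paper.

```python
import random

def mutate(program, rng):
    """Return a mutated copy of program (a non-empty list of statements).

    Only whole statements are deleted, inserted or swapped, and anything
    inserted is a copy of a statement already present in the program.
    """
    variant = list(program)
    target = rng.randrange(len(variant))
    op = rng.choice(["delete", "insert", "swap"])
    if op == "delete":
        del variant[target]
    elif op == "insert":
        variant.insert(target, rng.choice(program))
    else:
        other = rng.randrange(len(variant))
        variant[target], variant[other] = variant[other], variant[target]
    return variant

prog = ["a = 1;", "if (x != NULL)", "b = a;", "return b;"]
print(mutate(prog, random.Random(0)))
```

Because no new statement text is ever created, every variant is built entirely from statements found in the original source.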

There is ongoing work looking at genetic modifications at the expression level. This work shares a problem with GA-driven test coverage algorithms: how to find ‘magic numbers’ (in the case of test coverage the magic numbers are those that will cause a controlling expression to be true or false). Literals in source code, like those on the web, tend to follow a power’ish law but the fit to Benford’s law is not good.

Once mutated source has been generated that correctly processes the previously failing test case, while continuing to pass the other test cases, the code is passed to the final phase of the automated software repair system. Many mutations have no effect on program behavior (the DNA term intron is sometimes applied to them) and the final phase removes any of the added statements that have no effect on test suite output (Westley Weimer said that a reduction from 50 statements to 10 statements is common).
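
One simple way to perform this kind of pruning is to try dropping each added statement in turn and keep the removal whenever the test suite still passes. The sketch below does exactly that; it is not necessarily the algorithm the authors use, and both minimize and the passes_all_tests callback are hypothetical names.

```python
def minimize(changes, passes_all_tests):
    """Drop every change whose removal leaves the test suite passing.

    changes: list of added statements (any comparable objects).
    passes_all_tests: callable taking a list of changes and returning
    True when the program built with just those changes passes all tests.
    """
    kept = list(changes)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if passes_all_tests(trial):
            kept = trial        # change i had no effect; discard it
        else:
            i += 1              # change i is needed; keep it
    return kept

# Toy oracle: the fix works as long as changes "a" and "c" are present.
needed = {"a", "c"}
print(minimize(["a", "b", "c", "d"], lambda cs: needed <= set(cs)))
```

Each candidate removal costs one run of the test suite, so the cost grows linearly with the number of added statements.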

Might the ideas behind this very interesting research system end up being used in ‘live’ software? I think so. There are systems that operate 24/7 where faults cost money. One can imagine a fault being encountered late at night, a genetic based system fixing the fault which then updates the live system, the human developers being informed and deciding what to do later. It does not take much imagination to see the cost advantages driving expensive human input out of the loop in some cases.

An on-going research topic is the extent to which a good quality test suite is needed to ensure that mutated fault fixes don’t introduce new faults. Human written software is known to often be remarkably tolerant to the presence of faults. Perhaps ensuring that software has this characteristic is something that should be investigated.

Using Coccinelle to match if sequences

August 31st, 2009 Derek-Jones

I have been using Coccinelle to obtain measurements of various properties of C if and switch statements. It is rare to find a tool that does exactly what is desired but it is often possible to combine various tools to achieve the desired result.

I am interested in measuring sequences of if-else-if statements and one of the things I wanted to know was how many sequences of a given length occurred. Writing a pattern for each possible sequence was the obvious solution, but what is the longest sequence I should search for? A better solution is to use a pattern that matches short sequences and writes out the position (line/column number) where they occur in the code, as in the following Coccinelle pattern:

@ if_else_if_else @
expression E_1, E_2; 
statement S_1, S_2, S_3;
position p_1, p_2;
@@
if@p_1 (E_1)
   S_1
else if@p_2 (E_2)
   S_2
else
   S_3
@
script:python @ expr_1 << if_else_if_else.E_1;
                expr_2 << if_else_if_else.E_2;
                loc_1 << if_else_if_else.p_1;
                loc_2 << if_else_if_else.p_2;
              @@
print "--- ifelseifelse"
print loc_1[0].line, " ", loc_1[0].column, " ", expr_1
print loc_2[0].line, " ", loc_2[0].column, " ", expr_2

noting that in a sequence of source such as:

if (x == 1)
   stmt_1;
else
   if (x == 2)
      stmt_2;
   else
      if (x == 3)
         stmt_3;

the tokens if (x == 2) will be matched twice, the first time setting the position metavariable p_2 and the second time setting p_1. An awk script was written to read the Coccinelle output and merge together adjacent pairs of matches that were part of a longer if-else-if sequence.
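
The merging step was done in awk; the same idea can be sketched in Python. The sketch assumes each match has already been parsed into a pair of (line, column) positions, and the function name merge_pairs is made up for the example. Two matches are chained when the second position of one equals the first position of the other.

```python
def merge_pairs(pairs):
    """Join overlapping two-element matches into longer sequences.

    pairs: list of (first_pos, second_pos), each position a (line, column)
    tuple. Returns a list of chains, each a list of positions.
    """
    next_of = dict(pairs)
    seconds = {second for _, second in pairs}
    chains = []
    for first, _ in pairs:
        if first in seconds:
            continue                    # in the middle of some other chain
        chain = [first]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        chains.append(chain)
    return chains

matches = [((1, 0), (4, 3)), ((4, 3), (7, 6)), ((20, 0), (22, 5))]
for chain in merge_pairs(matches):
    print(len(chain), chain)
```

The length of each chain is the length of the corresponding if-else-if sequence, which is the quantity being counted.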

The first pattern did not concern itself with the form of the controlling expression, it simply wrote it out. A second set of patterns was used to match those forms of controlling expression I was interested in, but first I had to convert the output into syntactically correct C so that it could be processed by Coccinelle. Again awk came to the rescue, converting the output:

--- ifelseifelse
186   2   op == FFEBLD_opSUBRREF
191   7   op == FFEBLD_opFUNCREF
--- ifelseifelse
1094   3   anynum && commit
1111   8   ( c [ colon + 1 ] == '*' ) && commit

into a separate function for each matched sequence:

void f_1(void) {
// --- ifelseifelse
/* 186   2 */   op == FFEBLD_opSUBRREF ;
/* 191   7 */   op == FFEBLD_opFUNCREF ;
}
void f_2(void) {
// --- ifelseifelse
/* 1094   3 */   anynum && commit ;
/* 1111   8 */   ( c [ colon + 1 ] == '*' ) && commit ;
}
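
The original conversion used awk. As an illustration only, the same transformation can be sketched in Python; the function name to_c_functions is hypothetical, and the sketch assumes every record starts with a "--- " header line followed by lines of the form line-number, column, expression.

```python
def to_c_functions(lines):
    """Wrap each '--- name' record in its own C function.

    Position information is kept as a comment and the controlling
    expression becomes an expression statement.
    """
    out = []
    count = 0
    for line in lines:
        line = line.rstrip()
        if line.startswith("--- "):
            if count:
                out.append("}")                 # close the previous function
            count += 1
            out.append(f"void f_{count}(void) {{")
            out.append(f"// {line}")
        elif line:
            lineno, column, expr = line.split(None, 2)
            out.append(f"/* {lineno}   {column} */   {expr} ;")
    if count:
        out.append("}")
    return out

sample = [
    "--- ifelseifelse",
    "186   2   op == FFEBLD_opSUBRREF",
    "191   7   op == FFEBLD_opFUNCREF",
]
print("\n".join(to_c_functions(sample)))
```

Giving each sequence its own function stops a pattern from matching across the boundary between two unrelated sequences.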

The Coccinelle pattern:

@ if_eq_1 @
expression E_1;
constant C_1, C_2;
position p_1, p_2;
@@
 
E_1 == C_1@p_1 ;
E_1 == C_2@p_2 ;
 
@
script:python @ expr_1<< if_eq_1.E_1;
                const_1 << if_eq_1.C_1;
                const_2 << if_eq_1.C_2;
                loc_1 << if_eq_1.p_1;
                loc_2 << if_eq_1.p_2;
              @@
print loc_1[0].line, " ", loc_1[0].column, " 3 ", expr_1, " == ", const_1
print loc_2[0].line, " ", loc_2[0].column, " 2 ", expr_1, " == ", const_2

matches a sequence of two statements which consist of an expression being compared for equality against a constant, with the expression being identical in both statements. Again positions were written out for post-processing, i.e., joining together matched sequences.

I was interested in any sequence of if-else-if that could be converted to an equivalent switch-statement. An equality test against a constant is just one form of controlling expression that meets this requirement; another is the between operation. Separate patterns could be written and run over the generated C source containing the extracted controlling expressions.

Breaking down the measuring process into smaller steps reduced the amount of time needed to get a final result (with Coccinelle 0.1.19 the first pattern takes around 70 minutes, thanks to Julia Lawall’s work to speed things up, an overhead that only has to occur once) and allows the same controlling expression patterns to be run against the output of both the if-else-if and if-if patterns.

At the end of this process I ended up with a list of information (line numbers in source code and form of controlling expression) on if-statement sequences that could be rewritten as a switch-statement.

Parsing ambiguous grammars (part 1)

March 4th, 2009 Derek-Jones

Parsing a language is often much harder than people think, perhaps because they have only seen examples that use a simple language that has been designed to make explanation easy. Most languages in everyday use contain a variety of constructs that make the life of a parser writer difficult. Yes, there are parser generators, tools like bison, that automate the process of turning a grammar into a parser, and a language’s grammar is often found in the back of its reference manual. However, these grammars are often written to make the life of the programmer easier, not the life of the parser writer.

People may have spotted technical terms like LL(1), LR(1) and LALR(1); what they all have in common is a 1 in brackets, because they all operate by looking one token ahead in the input stream. There is a big advantage to limiting the lookahead to one token: the generated tables are much smaller (back in the days when these tools were first created 64K was considered to be an awful lot of memory, and today simple programs in embedded processors, with limited memory, often use simple grammars to parse communications traffic). Most existing parser generators operate within this limit and rely on compiler writers to sweat over, and contort, grammars to make them fit.

A simple example is provided by PL/1 (most real life examples tend to be more complicated) which did not have keywords, or to be exact did not restrict the spelling of identifiers that could be used to denote a variable, label or procedure. This meant that in the following code:

IF x THEN y = z; ELSE = w;

when the ELSE was encountered the compiler did not know whether it was the start of the alternative arm of the previously seen if-statement or an assignment statement. The token appearing after the ELSE needed to be examined to settle the question.

In days gone-by the person responsible for parsing PL/1 would have gotten up to some jiggery-pokery, such as having the lexer spot that an ELSE had been encountered and process the next token before reporting back what it had found to the syntax analysis.
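
A minimal sketch of that kind of lexer trick is shown below, in Python rather than anything a PL/1 compiler would have used. The tokenizer only recognises names, integers, = and ; and the tag names are invented for the example; a real lexer would also have to cope with cases such as a subscripted variable called ELSE.

```python
import re

def tokenize(src):
    """Split source into names, integers and the punctuators '=' and ';'."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[=;]", src)

def tag_else(tokens):
    """Tag each ELSE as KEYWORD or IDENT by peeking at the next token."""
    tagged = []
    for i, tok in enumerate(tokens):
        if tok.upper() == "ELSE":
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            kind = "IDENT" if following == "=" else "KEYWORD"
            tagged.append((kind, tok))
        else:
            tagged.append(("TOKEN", tok))
    return tagged

print(tag_else(tokenize("IF x THEN y = z; ELSE = w;")))
print(tag_else(tokenize("IF x THEN y = z; ELSE y = w;")))
```

The parser then only ever sees an unambiguous token stream, at the cost of pushing a piece of syntax knowledge down into the lexer.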

A few years ago bison was upgraded to support GLR parsing. Rather than looking ahead at more tokens, a GLR parser detects that there is more than one way to parse the current input and promptly starts parsing each possibility (it is usually implemented by making copies of the appropriate data structures and updating each copy according to the particular parse being followed). The hope is that eventually all but one of these multiple parsers will reach a point where they cannot successfully parse the input tokens and can be killed off, leaving the one true parse (the case where multiple parses continue to exist was discussed a while ago; actually in another context).

Searching code for a specific kind of calculation

December 27th, 2008 Derek-Jones

I am currently involved in a project that requires locating the line(s) of code in a program that calculate(s) the value 3n+1 (and various other constructs associated with coding the 3n+1 problem). Since there are over 4,000 independently written programs, each containing this calculation, the search effort is non-trivial. The obvious solution is to use grep to search for the expression 3*n+1 (the actual regular expression is a bit more complicated since any whitespace needs to be handled and the identifier n might have a different spelling).

This obvious solution works around 80% of the time (based on my manual analysis of the programs; searching for n*3+1 catches another 10%). For many of the authors of these 4,000+ programs simplicity does not seem to be an overriding goal and various alternative forms of the calculation are used (e.g., n+n+n or (n<<1)+n+1 or n+= (n<<1); n++ or even n+=n+++n). It looks like some authors have been unduly concerned with program performance.
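
The two lexical searches can be approximated with a single regular expression. The pattern below is a sketch and not the one actually used: it allows arbitrary whitespace and any identifier spelling, and the names IDENT, THREE_N_PLUS_1 and find_matches are invented for the example.

```python
import re

IDENT = r"[A-Za-z_]\w*"
THREE_N_PLUS_1 = re.compile(
    rf"\b3\s*\*\s*{IDENT}\s*\+\s*1\b"      # 3*n+1
    rf"|\b{IDENT}\s*\*\s*3\s*\+\s*1\b"     # n*3+1
)

def find_matches(lines):
    """Return (line number, text) for every line containing a match."""
    return [(number, text)
            for number, text in enumerate(lines, 1)
            if THREE_N_PLUS_1.search(text)]

source = [
    "n = 3*n+1;",
    "x = num * 3 + 1;",
    "n = n / 2;",
    "y = 3*n+10;",
]
print(find_matches(source))
```

The word-boundary anchors stop the pattern from firing on larger constants such as 13*n+1 or 3*n+10, but forms built from shifts or repeated addition slip straight past it.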

The reason I am doing a manual search is that this problem is way beyond the capabilities of existing code search tools. Existing tools all require that the search pattern be specified in terms that are essentially lexical. This would not be too much of a problem if I had a means of automatically enumerating a reasonable subset of the expressions that evaluate to 3n+1. (The problem of optimizing the sequence of operations needed to multiply a variable by a constant is a well known issue in compiler code generation and very good algorithms are known (papers+code and matrix multiplication)).
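
Any enumeration of candidate expressions needs a way to decide which candidates really compute 3n+1. One cheap filter, sketched below, is to evaluate each candidate for a range of sample values. This is sampling rather than proof, it only works for side-effect-free expressions whose C and Python meanings coincide, and the function name is made up; eval is acceptable here only because the candidate strings are trusted.

```python
def computes_3n_plus_1(expr, samples=range(1, 100)):
    """Return True if expr equals 3*n+1 for every sampled value of n.

    expr must be a side-effect-free expression over the variable n that
    is valid in both C and Python (e.g. '(n<<1)+n+1').
    """
    for n in samples:
        value = eval(expr, {"__builtins__": {}}, {"n": n})
        if value != 3 * n + 1:
            return False
    return True

for candidate in ["3*n+1", "(n<<1)+n+1", "n+n+n+1", "(n<<2)+n+1"]:
    print(candidate, computes_3n_plus_1(candidate))
```

Forms that rely on assignment or increment operators, such as n+= (n<<1); n++, are outside what this filter can check.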

The existing abstract interpretation tools target complete programs, or at least complete functions, and aim to show that certain conditions are met and/or not violated. An abstract interpretation version of grep sounds like an interesting PhD.

After several thousand searches even the most obtuse methods rarely take me more than a few seconds to spot. I can also be easily ‘reprogrammed’ to search just as effectively for other code sequences having some simple result.

Contemplating the major problems that need to be solved before an automatic tool could perform a similar task I am starting to appreciate, once again, the vast gulf that exists between human and computer analysis of code.

C++ goes for too big to fail

December 8th, 2008 Derek-Jones

If you believe the Whorfian hypothesis that language affects thought, even in one of its weaker forms, then major changes to a programming language will affect the shape of the code its users write.

I was at the first International C++ Standard meeting in London during 1991 and coming from a C Standard background I could not believe the number of new constructs being invented (the C committee had a stated principle that a construct be supported by at least one implementation before it be considered for inclusion in the standard; ok, this was not always followed to the letter). The C++ committee members continued to design away, putting in a huge amount of effort, and the document was ratified before the end of the century.

The standard is currently undergoing a major revision and the amount of language design going on puts the original committee to shame. With over 1,300 pages in the latest draft nobody’s favorite construct is omitted. The UK C++ panel has over 10 people actively working on producing comments and may produce over 1,000 on the latest draft.

With so many people committed to the approach being taken in the development of the revised C++ Standard, its current direction is very unlikely to change. The fact that most ‘real world’ developers only understand a fraction of what is contained in the existing standard has not stopped it being very widely used and generally considered a ‘success’. What is the big deal over a doubling of the number of pages in a language definition? The majority of developers will continue to use the small subset that they each individually have used for years.

The large number of syntactic ambiguities makes it very difficult to parse C++ (semantic information is required to resolve the ambiguities and the code to do this is at least an order of magnitude bigger than the lexer+parser). This difficulty is why there are so few source code analysis tools available for C++, compared to C and Java, which are much easier to parse. The difficulty of producing tools means that researchers rarely analyse C++ code and only reasonably well funded efforts are capable of producing worthwhile static analysis tools.

Like many of the active committee members I have mixed feelings about this feature bloat. Yes it is bad, but it will keep us all actively employed on interesting projects for many years to come. As the current financial crisis has shown, one of the advantages of being big and not understood is that you might get to being too big to fail.
