Archive

Posts Tagged ‘syntax’

GLR parsing is the future

August 27th, 2009 No comments

Traditionally parser generators have required that their input grammar be LALR(1) or some close variant (I would include LL(1) in this set). Back when 64k was an unimaginably large amount of memory being able to squeeze parser tables in a few kilobytes was very important; people received PhDs on parser table compression.

There is still a market for compact, fast parsers. Formal language grammars abound in communication protocols and vendors of communications hardware are very interested in keeping down costs by using minimizing the storage needed by their devices.

The trouble with LALR(1) is that value 1. It means that the parser only looks ahead one token in the input stream. This often means that a grammar is flagged as being ambiguous (i.e., it contains shift/reduce or reduce/reduce conflicts) when it is actually just locally ambiguous, i.e., reading tokens further head on the input stream would provide sufficient context to unambiguously specify the appropriate grammar production.

Restructuring a grammar to make it LALR(1) requires a lot of thought and skill and inexperienced users often give up. I once spent a month trying to remove the conflicts in the SQL/2 grammar specified by the SQL ISO standard; I managed to get the number down from over 1,000 to a small number that I decided I could live with.

It has taken a long time for parser generators to break out of the 64k mentality, but over the last few years it has started to happen. There have been two main approaches: 1) LR(n) provides a mechanism to look further ahead than one token, ie, n tokens, and 2) GLR parsing.

I think that GLR parsing is the future for two reasons:

  • It is supported by the most widely used parser generator, bison.
  • It enables working parsers to be created with much less thought and effort than a LALR(1) parser. (I don’t know how it compares against LR(n)).

GLR parsers resolve any language ambiguities by effectively delaying decisions until runtime in the hope that reading enough tokens will resolve local ambiguities. If an ambiguity in the token stream cannot be resolved a runtime error occurs (this is the one big downside of a GLR parser, the parser generated by a LALR(1) parser generator may produce lots of build time warnings but never produces errors when the parser is executed).

One example of a truly ambiguous construct (discussed here a while ago) is:

x * y;

which in C/C++ could be a declaration of y to be a pointer to x, or an expression that multiplies x and y.

Tools that can detect these global ambiguities in a grammar are starting to appear, e.g., DTWA is a bison extension.

I reviewed an early draft of the new O’Reilly book “flex & bison” and tried to get the author to be more upbeat on GLR support in bison; I think I got him to be a bit less cautious.

Share

Parsing ambiguous grammars (part 1)

March 4th, 2009 No comments

Parsing a language is often much harder than people think, perhaps because they have only seen examples that use a simple language that has been designed to make explanation easy. Most languages in everyday use contain a variety of constructs that make the life of a parser writer difficult. Yes, there are parser generators, tools like bison, that automate the process of turning a grammar into a parser and a language’s grammar is often found in the back of its reference manual. However, these grammars are often written to make the life of the programmer easier, not the life of the parse writer.

People may have spotted technical term like LL(1), LR(1) and LALR(1); what they all have in common is a 1 in brackets, because they all operate by looking one token ahead in the input stream. There is a big advantage to limiting the lookahead to one token, the generated tables are much smaller (back in the days when these tools were first created 64K was considered to be an awful lot of memory and today simple programs in embedded processors, with limited memory, often use simple grammars to parse communication’s traffic). Most existing parser generators operate within this limit and rely on compiler writers to sweat over, and contort, grammars to make them fit.

A simple example is provided by PL/1 (most real life examples tend to be more complicated) which did not have keywords, or to be exact did not restrict the spelling of identifiers that could be used to denote a variable, label or procedure. This meant that in the following code:

IF x THEN y = z; ELSE = w;

when the ELSE was encountered the compiler did not know whether it was the start of the alternative arm of the previously seen if-statement or an assignment statement. The token appearing after the ELSE needed to be examined to settle the question.

In days gone-by the person responsible for parsing PL/1 would have gotten up to some jiggery-pokery, such as having the lexer spot that an ELSE had been encountered and process the next token before reporting back what it had found to the syntax analysis.

A few years ago bison was upgraded to support GLR parsing. Rather than lookahead at more tokens a GLR parser detects that there is more than one way to parse the current input and promptly starts parsing each possibility (it is usually implemented by making copies of the appropriate data structures and updating each copy according to the particular parse being followed). The hope is that eventually all but one of these multiple parsers will reach a point where they cannot successfully parse the input tokens and can be killed off, leaving the one true parse (the case where multiple parses continue to exist was discussed a while ago; actually in another context).

Share

Parsing without a symbol table

December 19th, 2008 No comments

When processing C/C++ source for the first time through a compiler or static analysis tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names. One solution to this configuration problem is to be able to process source without handling preprocessing directives (e.g., skipping them, such as not reading the contents of header files or working out which arm of a conditional directive is applicable). Developers can do it, why not machines?

A few years ago GLR support was added to Bison, enabling it to process ambiguous grammars, and I decided to create a C parser that simply skipped all preprocessing directives. I knew that at least one reasonably common usage would generate a syntax error:

func_call(a,
#if SOME_FLAG
b_1);
#else
b_2);
#endif

c);
and wanted to minimize its consequences (i.e., cascading syntax errors to the end of the file). The solution chosen was to parse the source a single statement or declaration at a time, so any syntax error would be localized to a single statement or declaration.

Systems for parsing ambiguous grammars work on the basis that while the input may be locally ambiguous, once enough tokens have been seen the number of possible parses will be reduced to one. In C (and even more so in C++) there are some situations where it is impossible to resolve which of several possible parses apply without declaration information on one or more of the identifiers involved (a traditional parser would maintain a symbol table where this information could be obtained when needed). For instance, x * y; could be a declaration of the identifier y to have type x or an expression statement that multiplies x and y. My parser did not have a symbol table and even if it did the lack of header file processing meant that its contents would only contain a partial set of the declared identifiers. The ambiguity resolution strategy I adopted was to pick the most likely case, which in the example is the declaration parse.

Other constructs where the common case (chosen by me and I have yet to get around to actually verifying via measurement) was used to resolve an ambiguity deadlock included:

f(p);      // Very common, 
            // confidently picked function call as the common case
(m)*p;   // Not rare,
            // confidently picked multiplication as the common case
(s) - t;      // Quiet rare,
               // picked binary operator as the common case
(r) + (s) - t; // Very rare,
                  //an iteration on the case above

At the moment I am using the parser to measure language usage, so less than 100% correctness can be tolerated. Some of the constructs that cause a syntax error to be generated every few hundred statement/declarations include:

offsetof(struct tag, field_name)  // Declarators cannot be 
                                            //function arguments
int f(p, q)
int p;     // Tries to reduce this as a declaration without handling
char q;   // it as part of an old style function definition
{
 
MACRO(+); // Preprocessing expands to something meaningful

Some of these can be handled by extensions to the grammar, while others could be handled by an error recovery mechanism that recognized likely macro usage and inserted something appropriate (e.g., a dummy expression in the MACRO(x) case).

Share
www.wenn.com
www.tinynibbles.com