<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Shape of Code &#187; ambiguity</title>
	<atom:link href="http://shape-of-code.coding-guidelines.com/tag/ambiguity/feed/" rel="self" type="application/rss+xml" />
	<link>http://shape-of-code.coding-guidelines.com</link>
	<description></description>
	<lastBuildDate>Sun, 12 Feb 2012 20:42:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Parsing Fortran 95</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/#comments</comments>
		<pubDate>Sun, 20 Dec 2009 12:59:58 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[Climategate]]></category>
		<category><![CDATA[dimensional analysis]]></category>
		<category><![CDATA[Fortran]]></category>
		<category><![CDATA[lexing]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[whitespace]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=140</guid>
		<description><![CDATA[I have been looking at doing some dimensional analysis of the Climategate code and so needed a Fortran parser. The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03. I [...]]]></description>
			<content:encoded><![CDATA[<p>I have been looking at doing some <a href="http://shape-of-code.coding-guidelines.com/2009/05/dimensional-analysis-of-source-code/">dimensional analysis</a> of the <a href="http://shape-of-code.coding-guidelines.com/2009/11/does-the-climategate-code-produce-reliable-output/">Climategate code</a> and so needed a <a href="http://en.wikipedia.org/wiki/Fortran">Fortran</a> parser.</p>
<p>The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03.  I decided to take the opportunity to learn something about the new features by writing a Fortran <a href="http://shape-of-code.coding-guidelines.com/2008/12/parsing-without-a-symbol-table">parser that did not require a symbol table</a>.</p>
<p>The <a href="http://eli-project.sourceforge.net/">Eli project</a> had a <a href="http://eli-project.sourceforge.net/fortran_html/Parse.html">Fortran 90 grammar</a> that was close to having a form acceptable to <a href="http://en.wikipedia.org/wiki/GNU_bison">bison</a> and a few hours of editing and debugging got me a grammar containing 6 shift/reduce conflicts and 1 reduce/reduce conflict.  These conflicts looked like they could all be handled using <a href="http://shape-of-code.coding-guidelines.com/2009/08/glr-parsing-is-the-future/">GLR parsing</a>. The grammar contained 922 productions, which is somewhat large, but I was only interested in actively making use of parts of it.</p>
<p>For my lexer I planned to cut and paste an existing C/C++/Java lexer I have used for many projects.  Now this sounds like a fundamental mistake: these languages treat whitespace as being significant while Fortran does not.  This important difference is illustrated by the well-known situation where a Fortran lexer needs to look ahead in the character stream to decide whether the next token is the keyword <code>do</code> or the identifier <code>do5i</code> (if the <code>1</code> is followed by a comma it must be the keyword):</p>

<div class="wp_syntax"><div class="code"><pre class="fortran" style="font-family:monospace;">      <span style="color: #b1b100;">do</span> <span style="color: #cc66cc;">5</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span> , <span style="color: #cc66cc;">10</span>
      <span style="color: #b1b100;">do</span> <span style="color: #cc66cc;">5</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span> . <span style="color: #cc66cc;">10</span>        <span style="color: #666666; font-style: italic;">! assign 1.10 to do5i</span>
<span style="color: #cc66cc;">5</span>     <span style="color: #b1b100;">continue</span></pre></div></div>
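
<p>A minimal sketch of that lookahead, written in C and resting on several assumptions: the function name <code>is_counted_do</code> is hypothetical, the whole statement is already available as one string, and character literals, continuation lines and the Fortran 90 <code>do while</code> form are ignored.  With blanks squeezed out, the statement is treated as a loop header only if a comma appears outside parentheses after the <code>=</code>:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;ctype.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical helper: returns 1 if a fixed-form statement looks like a
   counted DO loop header, 0 if it does not (e.g., it is an assignment). */
int is_counted_do(const char *stmt)
{
    char buf[256];
    size_t n = 0;

    /* Blanks are not significant: squeeze them out, fold to lower case
       and stop at a '!' comment (assumes no '!' inside a literal). */
    for (const char *p = stmt; *p != '\0' &amp;&amp; *p != '!' &amp;&amp; n &lt; sizeof buf - 1; p++)
        if (!isspace((unsigned char)*p))
            buf[n++] = (char)tolower((unsigned char)*p);
    buf[n] = '\0';

    if (buf[0] != 'd' || buf[1] != 'o')
        return 0;

    /* Look ahead: a comma outside parentheses after the '=' can only
       occur in a loop header, never in an assignment. */
    int depth = 0, seen_equals = 0;
    for (size_t i = 2; i &lt; n; i++) {
        char c = buf[i];
        if (c == '(')
            depth++;
        else if (c == ')')
            depth--;
        else if (depth == 0 &amp;&amp; c == '=')
            seen_equals = 1;
        else if (depth == 0 &amp;&amp; c == ',' &amp;&amp; seen_equals)
            return 1;
    }
    return 0;
}</pre></div></div>

<p>Called on the first two statements above, this sketch returns 1 for the loop header and 0 for the assignment to <code>do5i</code>; a real lexer would also have to cope with the cases listed as ignored.</p>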

<p>In my experience developers don&#8217;t break up literals or identifier names with whitespace and so I planned to mostly ignore the whitespace issue (it would simplify things if some adjacent keywords were merged to create a single keyword).</p>
<p>In Fortran the I/O is specified in the language syntax while in C like languages it is a runtime library call involving a string whose contents are interpreted at runtime.  I decided to ignore I/O statements by skipping to the end of line (Fortran is line oriented).</p>
<p>Then the number of keywords hit me, around 190.  Even with the simplifications I had made, writing a Fortran lexer looked like it would be a lot of work; some of the keywords only had this status when followed by an <code>=</code> and I kept uncovering new issues.  Cutting and pasting somebody else&#8217;s lexer would probably also involve a lot of work.</p>
<p>I went back and looked at some of the Fortran front ends I had found on the Internet.  The <a href="http://en.wikipedia.org/wiki/Gfortran">GNU Fortran front-end</a> was a huge beast and would need serious cutting back to be of use.  <a href="http://www.ifremer.fr//ditigo/molagnon/fortran90/contenu.html">moware</a> was written in Fortran and used the traditional six character abbreviated names seen in &#8216;old-style&#8217; Fortran source and not a lot of commenting.  The Eli project seemed a lot more interested in the formalism side of things and Fortran was just one of the languages they claimed to support.</p>
<p>The <a href="http://fortran-parser.sourceforge.net/">Open Fortran Parser</a> looked very interesting.  It was designed to be used as a parsing skeleton for building tools that process source, and already contained hooks that produced diagnostic output when each language production was reduced during a parse.  Tests showed that it did a good job of parsing the source I had, although there was one vendor extension used quite often (and not documented in their manual).  The tool source, in Java, looked straightforward to follow and it was obvious where my code needed to be added.  This tool was exactly what I needed <img src='http://shape-of-code.coding-guidelines.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GLR parsing is the future</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/08/27/glr-parsing-is-the-future/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/08/27/glr-parsing-is-the-future/#comments</comments>
		<pubDate>Thu, 27 Aug 2009 15:54:23 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[grammar]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[runtime error]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[syntax]]></category>
		<category><![CDATA[the future]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=113</guid>
		<description><![CDATA[Traditionally parser generators have required that their input grammar be LALR(1) or some close variant (I would include LL(1) in this set). Back when 64k was an unimaginably large amount of memory being able to squeeze parser tables in a few kilobytes was very important; people received PhDs on parser table compression. There is still [...]]]></description>
			<content:encoded><![CDATA[<p>Traditionally parser generators have required that their input grammar be LALR(1) or some close variant (I would include LL(1) in this set).  Back when 64k was an unimaginably large amount of memory being able to squeeze parser tables in a few kilobytes was very important; people received PhDs on parser table compression.</p>
<p>There is still a market for compact, fast parsers.  Formal language grammars abound in communication protocols and vendors of communications hardware are very interested in keeping down costs by minimizing the storage needed by their devices.</p>
<p>The trouble with LALR(1) is that value 1.  It means that the parser only looks ahead one token in the input stream.  This often means that a grammar is flagged as being ambiguous (i.e., it contains shift/reduce or reduce/reduce conflicts) when it is actually just locally ambiguous, i.e., reading tokens further ahead in the input stream would provide sufficient context to unambiguously select the appropriate grammar production.</p>
<p>Restructuring a grammar to make it LALR(1) requires a lot of thought and skill and inexperienced users often give up.  I once spent a month trying to remove the conflicts in the SQL/2 grammar specified by the SQL ISO standard; I managed to get the number down from over 1,000 to a small number that I decided I could live with.</p>
<p>It has taken a long time for parser generators to break out of the 64k mentality, but over the last few years it has started to happen.  There have been two main approaches: 1) LR(n) provides a mechanism to look further ahead than one token, i.e., <em>n</em> tokens, and 2) <a href="http://en.wikipedia.org/wiki/GLR_parser">GLR</a> parsing.</p>
<p>I think that GLR parsing is the future for two reasons:</p>
<ul>
<li>It is supported by the most widely used parser generator, <a href="http://www.gnu.org/software/bison/">bison</a>.</li>
<li>It enables working parsers to be created with much less thought and effort than a LALR(1) parser.  (I don&#8217;t know how it compares against LR(n)).</li>
</ul>
<p>GLR parsers resolve any language ambiguities by effectively delaying decisions until runtime in the hope that reading enough tokens will resolve local ambiguities.  If an ambiguity in the token stream cannot be resolved a runtime error occurs.  This is the one big downside of a GLR parser; the parser generated by a LALR(1) parser generator may produce lots of build time warnings but never produces ambiguity errors when the parser is executed.</p>
<p>One example of a truly ambiguous construct (discussed <a href="http://shape-of-code.coding-guidelines.com/2008/12/parsing-without-a-symbol-table">here</a> a while ago) is:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">x <span style="color: #339933;">*</span> y<span style="color: #339933;">;</span></pre></div></div>

<p>which in C/C++ could be a declaration of <code>y</code> to be a pointer to <code>x</code>, or an expression that multiplies <code>x</code> and <code>y</code>.</p>
<p>Tools that can detect these global ambiguities in a grammar are starting to appear, e.g., <a href="http://www.lsv.ens-cachan.fr/~schmitz/software">DTWA</a> is a bison extension.</p>
<p>I reviewed an early draft of the new O&#8217;Reilly book &#8220;flex &#038; bison&#8221; and tried to get the <a href="http://www.johnlevine.com/">author</a> to be more upbeat on GLR support in bison; I think I got him to be a bit less cautious.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/08/27/glr-parsing-is-the-future/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Parsing without a symbol table</title>
		<link>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/#comments</comments>
		<pubDate>Fri, 19 Dec 2008 01:28:09 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Datatypes]]></category>
		<category><![CDATA[empirical]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[common case]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[preprocessing]]></category>
		<category><![CDATA[syntax]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=34</guid>
		<description><![CDATA[When processing C/C++ source for the first time through a compiler or static analysis tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names. One solution to this configuration problem is to be able to process source without [...]]]></description>
			<content:encoded><![CDATA[<p>When processing C/C++ source for the first time through a compiler or <a href="http://en.wikipedia.org/wiki/Static_code_analysis">static analysis</a> tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names.  One solution to this configuration problem is to be able to process source without handling preprocessing directives (e.g., skipping them, such as not reading the contents of header files or working out which arm of a conditional directive is applicable).  Developers can do it, why not machines?</p>
<p>A few years ago <a href="http://en.wikipedia.org/wiki/GLR_parser">GLR</a> support was added to <a href="http://en.wikipedia.org/wiki/GNU_Bison">Bison</a>, enabling it to process ambiguous grammars, and I decided to create a C parser that simply skipped all preprocessing directives.  I knew that at least one reasonably common usage would generate a syntax error:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">func_call<span style="color: #009900;">&#40;</span>a<span style="color: #339933;">,</span>
<span style="color: #339933;">#if SOME_FLAG</span>
b_1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #339933;">#else</span>
b_2<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #339933;">#endif</span></pre></div></div>

<p>and wanted to minimize its consequences (i.e., cascading syntax errors to the end of the file).  The solution chosen was to parse the source a single statement or declaration at a time, so any syntax error would be localized to a single statement or declaration.</p>
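<p>One way to picture the statement-at-a-time idea is a splitter that hands the parser one chunk of source per call.  The C below is only a sketch under stated assumptions and not necessarily how my parser does it: the name <code>next_chunk</code> is hypothetical, directive lines are dropped whole, a chunk ends at a <code>;</code>, <code>{</code> or <code>}</code> outside parentheses, comments and string literals are not handled, and <code>cap</code> must be at least 1.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;stddef.h&gt;

/* Hypothetical helper: copies the next statement/declaration sized chunk
   of src, starting at *pos, into out (at most cap-1 characters), skipping
   preprocessing directive lines.  Returns 1 if a chunk was produced and
   0 at the end of the input. */
int next_chunk(const char *src, size_t *pos, char *out, size_t cap)
{
    size_t i = *pos, n = 0;
    int parens = 0;
    int at_line_start = (i == 0 || src[i - 1] == '\n');

    while (src[i] != '\0') {
        char c = src[i];
        if (at_line_start &amp;&amp; c == '#') {      /* skip the whole directive line */
            while (src[i] != '\0' &amp;&amp; src[i] != '\n')
                i++;
            continue;
        }
        i++;
        if (c == '\n')
            at_line_start = 1;
        else if (c != ' ' &amp;&amp; c != '\t')
            at_line_start = 0;
        if (c == '\n' || c == '\t')
            c = ' ';
        /* Skip leading white space and collapse runs of it. */
        if (c == ' ' &amp;&amp; (n == 0 || out[n - 1] == ' '))
            continue;
        if (n + 1 &lt; cap)
            out[n++] = c;
        if (c == '(')
            parens++;
        else if (c == ')' &amp;&amp; parens &gt; 0)
            parens--;
        /* A chunk ends at ';', '{' or '}' outside parentheses. */
        if (parens == 0 &amp;&amp; (c == ';' || c == '{' || c == '}'))
            break;
    }
    out[n] = '\0';
    *pos = i;
    return n &gt; 0;
}</pre></div></div>

<p>Fed the example above, the sketch yields <code>func_call(a, b_1);</code> followed by <code>b_2);</code>, so only the second chunk fails to parse and the error does not cascade into later statements.</p>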
<p>Systems for parsing ambiguous grammars work on the basis that while the input may be locally ambiguous, once enough tokens have been seen the number of possible parses will be reduced to one.  In C (and even more so in C++) there are some situations where it is impossible to resolve which of several possible parses applies without declaration information on one or more of the identifiers involved (a traditional parser would maintain a symbol table where this information could be obtained when needed).  For instance, <code>x * y;</code> could be a declaration of the identifier <code>y</code> as a pointer to type <code>x</code> or an expression statement that multiplies <code>x</code> and <code>y</code>.  My parser did not have a symbol table and even if it did the lack of header file processing meant that it would hold only a partial set of the declared identifiers.  The ambiguity resolution strategy I adopted was to pick the most likely case, which in the example is the declaration parse.</p>
<p>Other constructs where the common case was used to resolve an ambiguity deadlock (the common case was chosen by me; I have yet to get around to verifying it via measurement) included:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">f<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>      <span style="color: #666666; font-style: italic;">// Very common, </span>
            <span style="color: #666666; font-style: italic;">// confidently picked function call as the common case</span>
<span style="color: #009900;">&#40;</span>m<span style="color: #009900;">&#41;</span><span style="color: #339933;">*</span>p<span style="color: #339933;">;</span>   <span style="color: #666666; font-style: italic;">// Not rare,</span>
            <span style="color: #666666; font-style: italic;">// confidently picked multiplication as the common case</span>
<span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> t<span style="color: #339933;">;</span>      <span style="color: #666666; font-style: italic;">// Quiet rare,</span>
               <span style="color: #666666; font-style: italic;">// picked binary operator as the common case</span>
<span style="color: #009900;">&#40;</span>r<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> t<span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Very rare,</span>
                  <span style="color: #666666; font-style: italic;">//an iteration on the case above</span></pre></div></div>

<p>At the moment I am using the parser to measure language usage, so less than 100% correctness can be tolerated.  Some of the constructs that cause a syntax error to be generated every few hundred statement/declarations include:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">offsetof<span style="color: #009900;">&#40;</span><span style="color: #993333;">struct</span> tag<span style="color: #339933;">,</span> field_name<span style="color: #009900;">&#41;</span>  <span style="color: #666666; font-style: italic;">// Declarators cannot be </span>
                                            <span style="color: #666666; font-style: italic;">//function arguments</span>
<span style="color: #993333;">int</span> f<span style="color: #009900;">&#40;</span>p<span style="color: #339933;">,</span> q<span style="color: #009900;">&#41;</span>
<span style="color: #993333;">int</span> p<span style="color: #339933;">;</span>     <span style="color: #666666; font-style: italic;">// Tries to reduce this as a declaration without handling</span>
<span style="color: #993333;">char</span> q<span style="color: #339933;">;</span>   <span style="color: #666666; font-style: italic;">// it as part of an old style function definition</span>
<span style="color: #009900;">&#123;</span>
&nbsp;
MACRO<span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Preprocessing expands to something meaningful</span></pre></div></div>

<p>Some of these can be handled by extensions to the grammar, while others could be handled by an error recovery mechanism that recognized likely macro usage and inserted something appropriate (e.g., a dummy expression in the <code>MACRO(+)</code> case).</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

