<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Shape of Code &#187; parsing</title>
	<atom:link href="http://shape-of-code.coding-guidelines.com/tag/parsing/feed/" rel="self" type="application/rss+xml" />
	<link>http://shape-of-code.coding-guidelines.com</link>
	<description></description>
	<lastBuildDate>Sun, 29 Jan 2012 23:49:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>SQL usage: schema evolution</title>
		<link>http://shape-of-code.coding-guidelines.com/2011/01/30/sql-usage-schema-evolution/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2011/01/30/sql-usage-schema-evolution/#comments</comments>
		<pubDate>Sun, 30 Jan 2011 23:35:21 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[source code]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[validation suite]]></category>
		<category><![CDATA[Wikipedia]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=356</guid>
		<description><![CDATA[My first serious involvement with SQL, about 15 years ago, was writing a parser for the grammar specified in the ISO SQL-92 Standard. One of the things that surprised me about SQL was how little source code was generally available (for testing) and the almost complete lack of any published papers on SQL usage (its [...]]]></description>
			<content:encoded><![CDATA[<p>My first serious involvement with SQL, about 15 years ago, was writing a parser for the grammar specified in the <a href="http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt">ISO SQL-92 Standard</a>.  One of the things that surprised me about SQL was how little source code was generally available (for testing) and the almost complete lack of any published papers on SQL usage (it&#8217;s always better to find out about where the pot-holes are from other people&#8217;s experience).</p>
<p>The source code availability surprise is largely answered by the very close coupling between source and data that occurs with SQL; most SQL source is closely tied to a <a href="http://en.wikipedia.org/wiki/Database_schema">database schema</a> and unless you have a need to process exactly the same kind of data you are unlikely to have any interest in having access to the corresponding SQL source. The growth in usage of MySQL means that these days it is much easier to get hold of large amounts of SQL (large is a relative term here, I suspect that there are probably many orders of magnitude fewer lines of SQL in existence than there are of other popular languages).</p>
<p>In my case I was fortunate in that <a href="http://www.itl.nist.gov/div897/ctg/sql-testing/sqlman60.htm">NIST released their SQL validation suite</a> for beta testing just as I started to test my parser (it had taken me a month to get the grammar into a manageable shape).</p>
<p>Published research on SQL usage continues to be thin on the ground and I was pleased to recently discover a paper combining empirical work on SQL usage with another rarely researched topic, declaration usage e.g., variables and types or in this case <a href="http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf">schema evolution</a> (for instance, changes in the table columns over time).</p>
<p>The researchers only analyzed one database, the 171 releases of the schema used by Wikipedia between April 2003 and November 2007, but they also made their <a href="http://yellowstone.cs.ucla.edu/schema-evolution/index.php/Benchmark_Downloadables">scripts available for download</a> and hopefully the results of applying them to lots of other databases will be published.</p>
<p>Not being an experienced database person I don&#8217;t know how representative the Wikipedia figures are; the number of tables increased from 17 to 34 (100% increase) and the number of columns from 100 to 242 (142%).  A factor of two increase sounds like a lot but I suspect that all but one of these columns occupy a tiny fraction of the <a href="http://stats.wikimedia.org/EN/TablesDatabaseSize.htm">14GB that is the current English Wikipedia</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2011/01/30/sql-usage-schema-evolution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Brief history of syntax error recovery</title>
		<link>http://shape-of-code.coding-guidelines.com/2010/04/19/brief-history-of-syntax-error-recovery/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2010/04/19/brief-history-of-syntax-error-recovery/#comments</comments>
		<pubDate>Mon, 19 Apr 2010 20:19:44 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[compiling]]></category>
		<category><![CDATA[error recovery]]></category>
		<category><![CDATA[mainframe]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[syntax error]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=194</guid>
		<description><![CDATA[Good recovery from syntax errors encountered during compilation is hard to achieve. The two most common strategies are to insert one or more tokens or to delete one or more tokens. Make the wrong decision and a second syntax error will occur, often leading to another and soon the developer is flooded by a nonsensical [...]]]></description>
			<content:encoded><![CDATA[<p>Good recovery from syntax errors encountered during compilation is hard to achieve.  The two most common strategies are to insert one or more tokens or to delete one or more tokens.  Make the wrong decision and a second syntax error will occur, often leading to another and soon the developer is flooded by a nonsensical list of error messages.  Compiler writers soon learn that their first priority is ensuring that syntax error recovery does not result in lots of cascading errors.  In languages that use a delimiter to indicate end of statement/declaration, usually a semicolon, the error recovery strategy of deleting all tokens until this delimiter is next encountered is remarkably effective.</p>
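<p>As a rough illustration of the skip-to-delimiter strategy, here is a minimal Python sketch.  The toy grammar (every statement has the form <code>NAME = NUMBER ;</code>) and the function name <code>parse_statements</code> are assumptions made for this example, not taken from any real compiler.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"># Hypothetical toy grammar: every statement is  NAME = NUMBER ;
def parse_statements(tokens):
    """Return (statements, errors); on an error skip to the next ';'."""
    statements, errors = [], []
    pos = 0
    while pos != len(tokens):
        stmt = tokens[pos:pos + 4]
        if (len(stmt) == 4 and stmt[0].isidentifier() and stmt[1] == "="
                and stmt[2].isdigit() and stmt[3] == ";"):
            statements.append((stmt[0], int(stmt[2])))
            pos += 4
            continue
        errors.append(pos)          # one diagnostic for this statement
        # Panic mode: delete tokens up to and including the next ';'
        while pos != len(tokens) and tokens[pos] != ";":
            pos += 1
        if pos != len(tokens):
            pos += 1                # step over the ';' itself
    return statements, errors

tokens = "a = 1 ; b = = 2 ; c = 3 ;".split()
statements, errors = parse_statements(tokens)</pre></div></div>

<p>The malformed middle statement produces a single diagnostic and parsing resumes at <code>c = 3 ;</code>, which is the property that keeps cascading errors away.</p>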
<p>The era of very good syntax error recovery was the 1970s and early 1980s.  Developers working on mainframes might only be able to achieve one or two compilations per day on a batch oriented mainframe and they were not happy if a misplaced comma or space resulted in a whole day being wasted.  Most compilers were rented for lots of money and customer demand resulted in some very fancy error recovery strategies.</p>
<p>Borland&#8217;s <a href="http://en.wikipedia.org/wiki/Turbo_Pascal">Turbo Pascal</a> had a very different approach to handling errors in code: it stopped processing the source as soon as one was detected.  The combination of amazing compilation rates and an interactive environment (MS-DOS running on the machine in front of the developer) made this approach hugely attractive.</p>
<p>To a large extent syntax error recovery has been driven by the methods commonly used to write parsers.  Many compilers use a table driven approach to syntax analysis with the tables being generated by parser generator tools such as <a href="http://en.wikipedia.org/wiki/Yacc">Yacc</a>.  During the 1970s and 80s a lot of the research on parser generators was aimed at reducing the size of the generated tables.  A table of 10k bytes was a significant percentage of available storage for machines that supported a maximum of 64k of memory.  Some parser table compression techniques involve assuming the default behavior and then handling any special cases when these defaults are found not to apply, but one consequence is that context information needed for good error recovery is often not available when an error is detected.  The last major release of Yacc from AT&#038;T in the early 1990s managed another reduction in table size, just as typical storage sizes were getting into the tens of megabytes, but at the expense of increasing the difficulty of doing good error recovery.</p>
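<p>A minimal Python sketch of the default-action idea, and of why it hurts error recovery, follows.  The row contents, the action strings and the function names are hypothetical; real parser generators use far more elaborate table encodings.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">from collections import Counter

def compress_row(row):
    """Replace one parser-table row by (default_action, exceptions).

    row maps a lookahead token to an action; tokens missing from row
    are syntax errors in the uncompressed table.
    """
    reduces = [act for act in row.values() if act.startswith("reduce")]
    if not reduces:
        return "error", dict(row)   # nothing worth making the default
    default = Counter(reduces).most_common(1)[0][0]
    exceptions = {tok: act for tok, act in row.items() if act != default}
    return default, exceptions

def lookup(compressed, token):
    """Return the action for token, falling back on the row default."""
    default, exceptions = compressed
    return exceptions.get(token, default)

# Hypothetical state: reduce by rule 3 on most tokens, shift on '('.
row = {";": "reduce 3", ")": "reduce 3", ",": "reduce 3", "(": "shift 7"}
compressed = compress_row(row)</pre></div></div>

<p>In the uncompressed row an unexpected <code>]</code> is an immediate error; after compression the same lookup returns the default reduction, so the parser has already popped entries off its stack by the time the error is eventually noticed.</p>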
<p>While there are still some application areas where the amount of storage occupied by parser tables is still a big issue, e.g., the embedded market, developers of parser generators such as <a href="http://en.wikipedia.org/wiki/GNU_bison">Bison</a> ought to start addressing the needs of users wanting to do good error recovery and who are willing to accept larger tables.</p>
<p>I am pleased to see that the LLVM project is making an <a href="http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html">effort to provide good syntax error recovery</a>.  A frustrating barrier to providing better error recovery is lack of information on the kinds of syntax errors commonly made by developers; there are a few papers and reports containing small scale measurements of errors made by students.  Perhaps the LLVM developers will provide a mechanism for automatically collecting compilation errors and providing users with the option to send the results to the LLVM project.</p>
<p>One of my favorite syntax error recovery techniques (implemented in a PL/1 mainframe compiler; I have never been able to justify implementing it on any project I worked on) is the following:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// Use of an undeclared identifier is a syntax error in C and some other</span>
<span style="color: #666666; font-style: italic;">// languages, while in other languages it is a semantic error.</span>
&nbsp;
<span style="color: #666666; font-style: italic;">// no identifier with name result visible here</span>
&nbsp;
   <span style="color: #009900;">&#123;</span>
   <span style="color: #993333;">int</span> result<span style="color: #339933;">;</span>
   ...
   <span style="color: #202020;">result</span><span style="color: #339933;">=</span>...
   ...
   <span style="color: #009900;">&#125;</span>
...
<span style="color: #202020;">calc</span><span style="color: #339933;">=</span>result<span style="color: #339933;">*</span><span style="color: #0000dd;">2</span><span style="color: #339933;">;</span>  <span style="color: #666666; font-style: italic;">// Error reported by most compilers is use of an undeclared variable</span></pre></div></div>

<p>The &#8216;real&#8217; error is probably the misplaced closing bracket.  Other possibilities include <code>result</code> being a misspelled version of another variable or the assignment to <code>calc</code> being in the wrong place.</p>
<p>There seems to be a trend over the last 20 years to create languages that require more and more semantic information during parsing.  Deciphering a syntax error today can involve a lot more than figuring out which surrounding tokens have been omitted or misplaced; information on which types are in scope and visible (oh for the days when that meant the same thing) and where they might be found in the umpteen thousand lines of included source has to be distilled and presented to the developer in a helpful message.</p>
<p>For a long time compilers have primarily been benchmarked on the quality of their code.  With ever diminishing returns from improved optimization, the increasing complexity of languages and the increasing volume of header code pulled in during compilation, perhaps the quality of syntax error recovery will grow in importance.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2010/04/19/brief-history-of-syntax-error-recovery/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Parsing Fortran 95</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/#comments</comments>
		<pubDate>Sun, 20 Dec 2009 12:59:58 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[Climategate]]></category>
		<category><![CDATA[dimensional analysis]]></category>
		<category><![CDATA[Fortran]]></category>
		<category><![CDATA[lexing]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[whitespace]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=140</guid>
		<description><![CDATA[I have been looking at doing some dimensional analysis of the Climategate code and so needed a Fortran parser. The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03. I [...]]]></description>
			<content:encoded><![CDATA[<p>I have been looking at doing some <a href="http://shape-of-code.coding-guidelines.com/2009/05/dimensional-analysis-of-source-code/">dimensional analysis</a> of the <a href="http://shape-of-code.coding-guidelines.com/2009/11/does-the-climategate-code-produce-reliable-output/">Climategate code</a> and so needed a <a href="http://en.wikipedia.org/wiki/Fortran">Fortran</a> parser.</p>
<p>The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03.  I decided to take the opportunity to learn something about the new features by writing a Fortran <a href="http://shape-of-code.coding-guidelines.com/2008/12/parsing-without-a-symbol-table">parser that did not require a symbol table</a>.</p>
<p>The <a href="http://eli-project.sourceforge.net/">Eli project</a> had a <a href="http://eli-project.sourceforge.net/fortran_html/Parse.html">Fortran 90 grammar</a> that was close to having a form acceptable to <a href="http://en.wikipedia.org/wiki/GNU_bison">bison</a> and a few hours of editing and debugging got me a grammar containing 6 shift/reduce conflicts and 1 reduce/reduce conflict.  These conflicts looked like they could all be handled using <a href="http://shape-of-code.coding-guidelines.com/2009/08/glr-parsing-is-the-future/">GLR parsing</a>. The grammar contained 922 productions, somewhat large but I was only interested in actively making use of parts of it.</p>
<p>For my lexer I planned to cut and paste an existing C/C++/Java lexer I have used for many projects.  Now this sounds like a fundamental mistake: these languages treat whitespace as being significant while Fortran does not.  This important difference is illustrated by the well known situation where a Fortran lexer needs to look ahead in the character stream to decide whether the next token is the keyword <code>do</code> or the identifier <code>do5i</code> (if <code>1</code> is followed by a comma it must be a keyword):</p>

<div class="wp_syntax"><div class="code"><pre class="fortran" style="font-family:monospace;">      <span style="color: #b1b100;">do</span> <span style="color: #cc66cc;">5</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span> , <span style="color: #cc66cc;">10</span>
      <span style="color: #b1b100;">do</span> <span style="color: #cc66cc;">5</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span> . <span style="color: #cc66cc;">10</span>        <span style="color: #666666; font-style: italic;">! assign 1.10 to do5i</span>
<span style="color: #cc66cc;">5</span>     <span style="color: #b1b100;">continue</span></pre></div></div>
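
<p>One way a lexer might make this decision, sketched in Python: strip the blanks and look for a comma outside parentheses after the <code>=</code>.  The function name <code>classify</code> is hypothetical, and the sketch assumes that comments have already been removed and that the statement contains no character literals.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def classify(statement):
    """Classify a fixed-form Fortran statement that starts with 'do'."""
    # Blanks are not significant, so remove them before looking at anything.
    text = statement.replace(" ", "").lower()
    if not text.startswith("do") or "=" not in text:
        return "other"
    depth = 0
    for ch in text[text.index("=") + 1:]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "do-loop"      # comma outside parentheses: loop bounds
    return "assignment"           # e.g. do5i = 1.10

loop_kind = classify("do 5 i = 1 , 10")
assign_kind = classify("do 5 i = 1 . 10")</pre></div></div>

<p>The parenthesis depth count is there so that a comma inside an argument list or subscript, as in <code>do 5 i = f(1, 2)</code>, does not get mistaken for the comma separating loop bounds.</p>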

<p>In my experience developers don&#8217;t break up literals or identifier names with whitespace and so I planned to mostly ignore the whitespace issue (it would simplify things if some adjacent keywords were merged to create a single keyword).</p>
<p>In Fortran the I/O is specified in the language syntax while in C like languages it is a runtime library call involving a string whose contents are interpreted at runtime.  I decided to ignore I/O statements by skipping to the end of line (Fortran is line oriented).</p>
<p>Then the number of keywords hit me, around 190.  Even with the simplifications I had made writing a Fortran lexer looked like it would be a lot of work; some of the keywords only had this status when followed by a <code>=</code> and I kept uncovering new issues.  Cutting and pasting somebody else&#8217;s lexer would probably also involve a lot of work.</p>
<p>I went back and looked at some of the Fortran front ends I had found on the Internet.  The <a href="http://en.wikipedia.org/wiki/Gfortran">GNU Fortran front-end</a> was a huge beast and would need serious cutting back to be of use.  <a href="http://www.ifremer.fr//ditigo/molagnon/fortran90/contenu.html">moware</a> was written in Fortran and used the traditional six character abbreviated names seen in &#8216;old-style&#8217; Fortran source and not a lot of commenting.  The Eli project seemed a lot more interested in the formalism side of things and Fortran was just one of the languages they claimed to support.</p>
<p>The <a href="http://fortran-parser.sourceforge.net/">Open Fortran Parser</a> looked very interesting.  It was designed to be used as a parsing skeleton that could be used to produce tools that processed source and already contained hooks that output diagnostic output when each language production was reduced during a parse.  Tests showed that it did a good job of parsing the source I had, although there was one vendor extension used quite often (and not documented in their manual).  The tool source, in Java, looked straightforward to follow and it was obvious where my code needed to be added.  This tool was exactly what I needed <img src='http://shape-of-code.coding-guidelines.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GLR parsing is the future</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/08/27/glr-parsing-is-the-future/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/08/27/glr-parsing-is-the-future/#comments</comments>
		<pubDate>Thu, 27 Aug 2009 15:54:23 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[grammar]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[runtime error]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[syntax]]></category>
		<category><![CDATA[the future]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=113</guid>
		<description><![CDATA[Traditionally parser generators have required that their input grammar be LALR(1) or some close variant (I would include LL(1) in this set). Back when 64k was an unimaginably large amount of memory being able to squeeze parser tables in a few kilobytes was very important; people received PhDs on parser table compression. There is still [...]]]></description>
			<content:encoded><![CDATA[<p>Traditionally parser generators have required that their input grammar be LALR(1) or some close variant (I would include LL(1) in this set).  Back when 64k was an unimaginably large amount of memory being able to squeeze parser tables in a few kilobytes was very important; people received PhDs on parser table compression.</p>
<p>There is still a market for compact, fast parsers.  Formal language grammars abound in communication protocols and vendors of communications hardware are very interested in keeping down costs by minimizing the storage needed by their devices.</p>
<p>The trouble with LALR(1) is that value 1.  It means that the parser only looks ahead one token in the input stream.  This often means that a grammar is flagged as being ambiguous (i.e., it contains shift/reduce or reduce/reduce conflicts) when it is actually just locally ambiguous, i.e., reading tokens further ahead in the input stream would provide sufficient context to unambiguously specify the appropriate grammar production.</p>
<p>Restructuring a grammar to make it LALR(1) requires a lot of thought and skill and inexperienced users often give up.  I once spent a month trying to remove the conflicts in the SQL/2 grammar specified by the SQL ISO standard; I managed to get the number down from over 1,000 to a small number that I decided I could live with.</p>
<p>It has taken a long time for parser generators to break out of the 64k mentality, but over the last few years it has started to happen.  There have been two main approaches: 1) LR(n) provides a mechanism to look further ahead than one token, i.e., <equ>n</equ> tokens, and 2) <a href="http://en.wikipedia.org/wiki/GLR_parser">GLR</a> parsing.</p>
<p>I think that GLR parsing is the future for two reasons:</p>
<ul>
<li>It is supported by the most widely used parser generator, <a href="http://www.gnu.org/software/bison/">bison</a>.</li>
<li>It enables working parsers to be created with much less thought and effort than a LALR(1) parser.  (I don&#8217;t know how it compares against LR(n)).</li>
</ul>
<p>GLR parsers resolve any language ambiguities by effectively delaying decisions until runtime in the hope that reading enough tokens will resolve local ambiguities.  If an ambiguity in the token stream cannot be resolved a runtime error occurs (this is the one big downside of a GLR parser, the parser generated by a LALR(1) parser generator may produce lots of build time warnings but never produces errors when the parser is executed).</p>
<p>One example of a truly ambiguous construct (discussed <a href="http://shape-of-code.coding-guidelines.com/2008/12/parsing-without-a-symbol-table">here</a> a while ago) is:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">x <span style="color: #339933;">*</span> y<span style="color: #339933;">;</span></pre></div></div>

<p>which in C/C++ could be a declaration of <code>y</code> to be a pointer to <code>x</code>, or an expression that multiplies <code>x</code> and <code>y</code>.</p>
<p>Tools that can detect these global ambiguities in a grammar are starting to appear, e.g., <a href="http://www.lsv.ens-cachan.fr/~schmitz/software">DTWA</a> is a bison extension.</p>
<p>I reviewed an early draft of the new O&#8217;Reilly book &#8220;flex &#038; bison&#8221; and tried to get the <a href="http://www.johnlevine.com/">author</a> to be more upbeat on GLR support in bison; I think I got him to be a bit less cautious.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/08/27/glr-parsing-is-the-future/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Parsing ambiguous grammars (part 1)</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/03/04/parsing-ambiguous-grammars-part-1/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/03/04/parsing-ambiguous-grammars-part-1/#comments</comments>
		<pubDate>Wed, 04 Mar 2009 03:06:53 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguous grammar]]></category>
		<category><![CDATA[embedded processor]]></category>
		<category><![CDATA[language grammar]]></category>
		<category><![CDATA[lexer]]></category>
		<category><![CDATA[limited memory]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[PL/1]]></category>
		<category><![CDATA[syntax]]></category>
		<category><![CDATA[tools]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=76</guid>
		<description><![CDATA[Parsing a language is often much harder than people think, perhaps because they have only seen examples that use a simple language that has been designed to make explanation easy. Most languages in everyday use contain a variety of constructs that make the life of a parser writer difficult. Yes, there are parser generators, tools [...]]]></description>
			<content:encoded><![CDATA[<p>Parsing a language is often much harder than people think, perhaps because they have only seen examples that use a simple language that has been designed to make explanation easy.  Most languages in everyday use contain a variety of constructs that make the life of a parser writer difficult.  Yes, there are <a href="http://en.wikipedia.org/wiki/Parser_generator">parser generators</a>, tools like <a href="http://en.wikipedia.org/wiki/GNU_Bison">bison</a>, that automate the process of turning a grammar into a parser and a language&#8217;s grammar is often found in the back of its reference manual.  However, these grammars are often written to make the life of the programmer easier, not the life of the parser writer.</p>
<p>People may have spotted technical terms like <a href="http://en.wikipedia.org/wiki/LL_parser">LL</a>(1), <a href="http://en.wikipedia.org/wiki/LR(0)_parser">LR</a>(1) and <a href="http://en.wikipedia.org/wiki/LALR">LALR</a>(1); what they all have in common is a 1 in brackets, because they all operate by looking one token ahead in the input stream.  There is a big advantage to limiting the lookahead to one token: the generated tables are much smaller (back in the days when these tools were first created 64K was considered to be an awful lot of memory and today simple programs in embedded processors, with limited memory, often use simple grammars to parse communications traffic).  Most existing parser generators operate within this limit and rely on compiler writers to sweat over, and contort, grammars to make them fit.</p>
<p>A simple example is provided by <a href="http://en.wikipedia.org/wiki/PL/1">PL/1</a> (most real life examples tend to be more complicated) which did not have keywords, or to be exact did not restrict the spelling of identifiers that could be used to denote a variable, label or procedure.  This meant that in the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">IF x THEN y <span style="color: #339933;">=</span> z<span style="color: #339933;">;</span> ELSE <span style="color: #339933;">=</span> w<span style="color: #339933;">;</span></pre></div></div>

<p>when the <code>ELSE</code> was encountered the compiler did not know whether it was the start of the alternative arm of the previously seen if-statement or an assignment statement.  The token appearing after the <code>ELSE</code> needed to be examined to settle the question.</p>
<p>In days gone-by the person responsible for parsing PL/1 would have gotten up to some jiggery-pokery, such as having the lexer spot that an <code>ELSE</code> had been encountered and process the next token before reporting back what it had found to the syntax analysis.</p>
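<p>The lexer trick might look something like the following Python sketch, which peeks at the token after <code>ELSE</code> before deciding what to report.  The token list representation and the name <code>tag_else</code> are assumptions for illustration; a real PL/1 lexer has many more cases to handle (e.g., subscripted variables and labels).</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def tag_else(tokens):
    """Return (token, kind) pairs, deciding each ELSE by one-token lookahead."""
    tagged = []
    for index, tok in enumerate(tokens):
        if tok.upper() != "ELSE":
            tagged.append((tok, "other"))
            continue
        # Peek at the following token without consuming it.
        following = tokens[index + 1] if index + 1 != len(tokens) else ""
        kind = "identifier" if following == "=" else "keyword"
        tagged.append((tok, kind))
    return tagged

assignment = tag_else("IF x THEN y = z ; ELSE = w ;".split())
else_arm = tag_else("IF x THEN y = z ; ELSE y = w ;".split())</pre></div></div>

<p>In the first token stream <code>ELSE</code> is reported as an identifier because an <code>=</code> follows it; in the second it is reported as a keyword, so the syntax analysis never sees the ambiguity.</p>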
<p>A few years ago bison was upgraded to support <a href="http://en.wikipedia.org/wiki/GLR_parser">GLR</a> parsing.  Rather than looking ahead at more tokens a GLR parser detects that there is more than one way to parse the current input and promptly starts parsing each possibility (it is usually implemented by making copies of the appropriate data structures and updating each copy according to the particular parse being followed).  The hope is that eventually all but one of these multiple parsers will reach a point where they cannot successfully parse the input tokens and can be killed off, leaving the one true parse (the case where <a href="http://shape-of-code.coding-guidelines.com/2008/12/parsing-without-a-symbol-table">multiple parses continue to exist was discussed</a> a while ago; actually in another context).</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/03/04/parsing-ambiguous-grammars-part-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using local context to disambiguate source</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/02/12/using-local-context-to-disambiguate-source/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/02/12/using-local-context-to-disambiguate-source/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 01:23:45 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguous]]></category>
		<category><![CDATA[context information]]></category>
		<category><![CDATA[declarations missing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[naming]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=67</guid>
		<description><![CDATA[Developers can often do a remarkably good job of figuring out what a snippet of code does without seeing (i.e., knowing anything about) most of the declarations of the identifiers involved. In a previous post I discussed how frequency of occurrence information could be used to help parse C without using a symbol table. Other [...]]]></description>
			<content:encoded><![CDATA[<p>Developers can often do a remarkably good job of figuring out what a snippet of code does without seeing (i.e., knowing anything about) most of the declarations of the identifiers involved.  In a previous post I discussed how frequency of occurrence information could be used to help <a href="http://shape-of-code.coding-guidelines.com/2008/12/parsing-without-a-symbol-table">parse C without using a symbol table</a>.  Other information that could be used is the context in which particular identifiers occur.  For instance, in:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">f<span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
y <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span>f<span style="color: #009900;">&#41;</span>z<span style="color: #339933;">;</span></pre></div></div>

<p>while the code <code>f(x);</code> on its own is probably a function call, the use of <code>f</code> as the type in a cast means that <code>f(x)</code> is actually a definition of an object <code>x</code> having type <code>f</code>.</p>
<p>A project investigating the <a href="http://www.sable.mcgill.ca/publications/techreports/2008-2/sable-tr-2008-2.pdf">analysis of partial Java programs</a> uses this context information as its sole means of disambiguating Java source (while they do build a symbol table they do not analyze the source of any packages that might be imported).  Compared to C parsers, Java parsers have it easy, but Java&#8217;s richer type system means that semantic analysis can be much more complicated.</p>
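<p>A minimal sketch of how this kind of local context might be applied, assuming the source is available as plain text.  The names <code>used_as_cast_type</code> and <code>classify_paren_statement</code> are hypothetical, and the pattern matching is deliberately crude (it only looks for <code>(name)</code> followed directly by an identifier and not preceded by an identifier or closing parenthesis); it is not the method used by any of the tools mentioned here.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum parse_kind { FUNCTION_CALL, DECLARATION };

static int is_ident_char(int c) { return isalnum(c) || c == '_'; }

/* Hypothetical helper: returns 1 if `name` occurs as "(name)" directly
   followed by an identifier, and not preceded by an identifier or ')'.
   Only spaces are skipped, so this is a sketch rather than a lexer. */
static int used_as_cast_type(const char *src, const char *name)
{
    size_t len = strlen(name);
    assert(len > 0);
    for (const char *p = strchr(src, '('); p != NULL; p = strchr(p + 1, '(')) {
        /* Reject "if (f) z" and "g(f) z": something callable or a keyword precedes. */
        const char *b = p;
        while (b > src && b[-1] == ' ') b--;
        if (b > src && (is_ident_char((unsigned char)b[-1]) || b[-1] == ')'))
            continue;

        const char *q = p + 1;
        while (*q == ' ') q++;
        if (strncmp(q, name, len) != 0 || is_ident_char((unsigned char)q[len]))
            continue;
        q += len;
        while (*q == ' ') q++;
        if (*q != ')')
            continue;
        q++;
        while (*q == ' ') q++;
        if (isalpha((unsigned char)*q) || *q == '_')
            return 1;               /* looks like a cast: (name)operand */
    }
    return 0;
}

/* Decide how to read "name(x);" given the surrounding source text. */
enum parse_kind classify_paren_statement(const char *src, const char *name)
{
    return used_as_cast_type(src, name) ? DECLARATION : FUNCTION_CALL;
}
```

<p>Calling <code>classify_paren_statement("f(x);\ny = (f)z;", "f")</code> selects the declaration reading, while the same call on source containing no cast involving <code>f</code> falls back to the function call reading.</p>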
<p>On a set of benchmarks the <a href="http://bart.prologique.com/projects/ppa">researchers</a> obtained a very reasonable 91.2% accuracy in deducing the type of identifiers.</p>
<p>There are other kinds of information that developers probably use to disambiguate source: the operation that the code is intended to perform and the identifier names.  Figuring out the &#8216;high level&#8217; operation that code performs is a very difficult problem, but the names of Java identifiers have been used to <a href="http://shape-of-code.coding-guidelines.com/2008/12/naming-used-to-predict-object-lifetime/">predict object lifetime</a> and appear to be used to help deduce <a href="http://www.knosof.co.uk/cbook/accu07.html">operator precedence</a>.  Parsing source by just looking at the identifiers (i.e., treating all punctuators and operators as whitespace) has been on my list of interesting projects to do for some time, but projects that are likely to provide a more immediate interesting result keep getting in the way.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/02/12/using-local-context-to-disambiguate-source/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The 30% of source that is ignored</title>
		<link>http://shape-of-code.coding-guidelines.com/2009/01/03/the-30-of-source-that-is-ignored/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2009/01/03/the-30-of-source-that-is-ignored/#comments</comments>
		<pubDate>Sat, 03 Jan 2009 00:21:43 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[comments]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[faults]]></category>
		<category><![CDATA[inconsistency]]></category>
		<category><![CDATA[measure]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=48</guid>
		<description><![CDATA[Approximately 30% of source code is not checked for correct syntax (developers can make up any rules they like for its internal syntax), semantic accuracy or consistency; people are content to shrug their shoulders at this state of affairs and are generally willing to let it pass. I am of course talking about comments; [...]]]></description>
			<content:encoded><![CDATA[<p>Approximately 30% of source code is not checked for correct syntax (developers can make up any rules they like for its internal syntax), semantic accuracy or consistency; people are content to shrug their shoulders at this state of affairs and are generally willing to let it pass.  I am of course talking about comments; the 30% figure comes from my own measurements, with other published measurements falling within a similar ballpark.</p>
<p>Part of the problem is that comments often contain lots of natural language (i.e., human not computer language) and this is known to be very difficult to parse and is thought to be unusable without all sorts of semantic knowledge that is not currently available in machine processable form.</p>
<p>People are good at spotting patterns in ambiguous human communication and deducing possible meanings from it, and this has helped to keep comment usage alive.  It also helps that the information comments provide is not usually available elsewhere, that they are right there in front of the person reading the code and, of course, that management loves them as a measurable attribute that is cheap to produce and not easily checkable (and what difference does it make if they <a href="http://seal.ifi.uzh.ch/pax/web/uploads/pdf/publication/703/fluri-wcre2007.pdf">don&#8217;t stay in sync with the code</a>).</p>
<p>One study that did attempt to <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2366">parse English sentences in comments</a> found that 75% of sentence-style comments were in the past tense, with 55% being some kind of operational description (e.g., &#8220;This routine reads the data.&#8221;) and 44% having the style of a definition (e.g., &#8220;General matrix&#8221;).</p>
<p>There is a growing collection of <a href="http://opennlp.sourceforge.net/">tools for processing natural language</a> (well, at least for English).  However, given the traditionally poor punctuation used in comments, the use of variable names and very domain-specific terminology, full-blown English parsing is likely to be very difficult.  Some recent research has found that useful information can be extracted using something only a little more linguistically sophisticated than <a href="http://en.wikipedia.org/wiki/Word_sense_disambiguation">word sense disambiguation</a>.</p>
<p>The designers of the <a href="http://www.sosp2007.org/papers/sosp054-tan.pdf">iComment system</a> sensibly limited the analysis domain (to memory/file lock related activities), simplified the parsing requirements (to looking for limited forms of requirements wording) and kept developers in the loop for some of the processing (e.g., listing lock related function names).  The aim was to find inconsistencies between the requirements expressed in comments and what the code actually did.  Within the Linux/Mozilla/Wine/Apache sources they found 33 faults in the code and 27 in the comments, claiming a 38.8% false positive rate.</p>
<p>If these impressive figures can be replicated for other kinds of coding constructs then comment contents will start to leave the <a href="http://en.wikipedia.org/wiki/Dark_Ages#Modern_popular_use">dark ages</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2009/01/03/the-30-of-source-that-is-ignored/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing without a symbol table</title>
		<link>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/#comments</comments>
		<pubDate>Fri, 19 Dec 2008 01:28:09 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Datatypes]]></category>
		<category><![CDATA[empirical]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[common case]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[preprocessing]]></category>
		<category><![CDATA[syntax]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=34</guid>
		<description><![CDATA[When processing C/C++ source for the first time through a compiler or static analysis tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names. One solution to this configuration problem is to be able to process source without [...]]]></description>
			<content:encoded><![CDATA[<p>When processing C/C++ source for the first time through a compiler or <a href="http://en.wikipedia.org/wiki/Static_code_analysis">static analysis</a> tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names.  One solution to this configuration problem is to be able to process source without handling preprocessing directives (i.e., skipping them, which means not reading the contents of header files and not working out which arm of a conditional directive is applicable).  Developers can do it, why not machines?</p>
<p>A few years ago <a href="http://en.wikipedia.org/wiki/GLR_parser">GLR</a> support was added to <a href="http://en.wikipedia.org/wiki/GNU_Bison">Bison</a>, enabling it to process ambiguous grammars, and I decided to create a C parser that simply skipped all preprocessing directives.  I knew that at least one reasonably common usage would generate a syntax error:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">func_call<span style="color: #009900;">&#40;</span>a<span style="color: #339933;">,</span>
<span style="color: #339933;">#if SOME_FLAG</span>
b_1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #339933;">#else</span>
b_2<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #339933;">#endif</span></pre></div></div>

<p>
and wanted to minimize its consequences (i.e., cascading syntax errors to the end of the file).  The solution chosen was to parse the source a single statement or declaration at a time, so any syntax error would be localized to a single statement or declaration.</p>
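<p>One way to picture the statement-at-a-time approach is the sketch below, which is an illustration under simplifying assumptions rather than the parser described here: the hypothetical function <code>next_chunk</code> drops any line starting with <code>#</code> and hands back the text up to the next <code>;</code>.  A real implementation would also have to cope with braces, comments and string literals.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the next ';'-terminated chunk of src into out,
   skipping lines that start with '#' and collapsing newlines and runs of
   spaces into a single space.  Returns a pointer just past the chunk, or
   NULL when no further text remains. */
const char *next_chunk(const char *src, char *out, size_t out_size)
{
    size_t n = 0;
    int at_line_start = 1;

    assert(src != NULL && out != NULL && out_size > 0);
    while (*src != '\0') {
        if (at_line_start && *src == '#') {      /* preprocessing directive */
            while (*src != '\0' && *src != '\n')
                src++;
            continue;
        }
        char c = *src++;
        at_line_start = (c == '\n');
        if (c == '\n')
            c = ' ';
        if (c == ' ' && (n == 0 || out[n - 1] == ' '))
            continue;                            /* no leading/double spaces */
        if (n + 1 < out_size)
            out[n++] = c;
        if (c == ';') {                          /* end of this chunk */
            out[n] = '\0';
            return src;
        }
    }
    out[n] = '\0';
    return n > 0 ? src : NULL;
}
```

<p>Applied to the <code>func_call</code> example above, the first chunk is <code>func_call(a, b_1);</code>, which parses, and the second is <code>b_2);</code>, which does not; the syntax error stays confined to that one chunk instead of cascading through the rest of the file.</p>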
<p>Systems for parsing ambiguous grammars work on the basis that while the input may be locally ambiguous, once enough tokens have been seen the number of possible parses will be reduced to one.  In C (and even more so in C++) there are some situations where it is impossible to resolve which of several possible parses applies without declaration information on one or more of the identifiers involved (a traditional parser would maintain a symbol table where this information could be obtained when needed).  For instance, <code>x * y;</code> could be a declaration of the identifier <code>y</code> to have type pointer to <code>x</code> or an expression statement that multiplies <code>x</code> and <code>y</code>.  My parser did not have a symbol table and even if it did the lack of header file processing meant that it would contain only a partial set of the declared identifiers.  The ambiguity resolution strategy I adopted was to pick the most likely case, which in the example is the declaration parse.</p>
<p>Other constructs where the common case (chosen by me; I have yet to get around to verifying the choices via measurement) was used to resolve an ambiguity deadlock included:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">f<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>      <span style="color: #666666; font-style: italic;">// Very common, </span>
            <span style="color: #666666; font-style: italic;">// confidently picked function call as the common case</span>
<span style="color: #009900;">&#40;</span>m<span style="color: #009900;">&#41;</span><span style="color: #339933;">*</span>p<span style="color: #339933;">;</span>   <span style="color: #666666; font-style: italic;">// Not rare,</span>
            <span style="color: #666666; font-style: italic;">// confidently picked multiplication as the common case</span>
<span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> t<span style="color: #339933;">;</span>      <span style="color: #666666; font-style: italic;">// Quiet rare,</span>
               <span style="color: #666666; font-style: italic;">// picked binary operator as the common case</span>
<span style="color: #009900;">&#40;</span>r<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> t<span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Very rare,</span>
                  <span style="color: #666666; font-style: italic;">// an iteration on the case above</span></pre></div></div>

<p>At the moment I am using the parser to measure language usage, so less than 100% correctness can be tolerated.  Some of the constructs that cause a syntax error to be generated every few hundred statements/declarations include:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">offsetof<span style="color: #009900;">&#40;</span><span style="color: #993333;">struct</span> tag<span style="color: #339933;">,</span> field_name<span style="color: #009900;">&#41;</span>  <span style="color: #666666; font-style: italic;">// Declarators cannot be </span>
                                            <span style="color: #666666; font-style: italic;">// function arguments</span>
<span style="color: #993333;">int</span> f<span style="color: #009900;">&#40;</span>p<span style="color: #339933;">,</span> q<span style="color: #009900;">&#41;</span>
<span style="color: #993333;">int</span> p<span style="color: #339933;">;</span>     <span style="color: #666666; font-style: italic;">// Tries to reduce this as a declaration without handling</span>
<span style="color: #993333;">char</span> q<span style="color: #339933;">;</span>   <span style="color: #666666; font-style: italic;">// it as part of an old style function definition</span>
<span style="color: #009900;">&#123;</span>
&nbsp;
MACRO<span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Preprocessing expands to something meaningful</span></pre></div></div>

<p>Some of these can be handled by extensions to the grammar, while others could be handled by an error recovery mechanism that recognized likely macro usage and inserted something appropriate (e.g., a dummy expression in the <code>MACRO(+)</code> case).</p>
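<p>A rough sketch of what such a recovery step could look like, under two loudly stated assumptions: that an all-uppercase name signals a macro (a common convention, not a rule of the language), and that substituting the dummy expression <code>0</code> for the whole argument list is acceptable.  The function names are hypothetical and this has not been tried inside the parser.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumption: a name made only of uppercase letters, digits and '_'
   (with at least one uppercase letter) is probably a macro. */
static int looks_like_macro(const char *s, size_t len)
{
    int seen_upper = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isupper(c))
            seen_upper = 1;
        else if (!(isdigit(c) || c == '_'))
            return 0;
    }
    return seen_upper;
}

/* If stmt has the shape NAME(...)tail and NAME looks like a macro, write
   NAME(0)tail to out and return 1.  Otherwise copy stmt unchanged and
   return 0. */
int recover_macro_call(const char *stmt, char *out, size_t out_size)
{
    assert(stmt != NULL && out != NULL && out_size > 0);
    const char *open = strchr(stmt, '(');
    const char *close = strrchr(stmt, ')');

    if (open != NULL && close != NULL && open < close && open > stmt
        && looks_like_macro(stmt, (size_t)(open - stmt))) {
        snprintf(out, out_size, "%.*s(0)%s",
                 (int)(open - stmt), stmt, close + 1);
        return 1;
    }
    snprintf(out, out_size, "%s", stmt);
    return 0;
}
```

<p>With this, <code>MACRO(+);</code> is rewritten to <code>MACRO(0);</code>, which parses as a function call, while <code>f(p);</code> is left untouched.</p>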
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>C++ goes for too big to fail</title>
		<link>http://shape-of-code.coding-guidelines.com/2008/12/08/c-goes-for-too-big-to-fail/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2008/12/08/c-goes-for-too-big-to-fail/#comments</comments>
		<pubDate>Mon, 08 Dec 2008 23:58:36 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[Standard]]></category>
		<category><![CDATA[tools]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=26</guid>
		<description><![CDATA[If you believe the Whorfian hypothesis that language affects thought, even in one of its weaker forms, then major changes to a programming language will affect the shape of the code its users write. I was at the first International C++ Standard meeting in London during 1991 and coming from a C Standard background I [...]]]></description>
			<content:encoded><![CDATA[<p>If you believe the <a href="http://en.wikipedia.org/wiki/Sapir-Whorf_hypothesis">Whorfian hypothesis</a> that language affects thought, even in one of its <a href="http://www-psych.stanford.edu/~lera/papers/mandarin.pdf">weaker forms</a>, then major changes to a programming language will affect the shape of the code its users write.</p>
<p>I was at the first International <a href="http://www.open-std.org/jtc1/sc22/wg21/">C++ Standard</a> meeting in London during 1991 and coming from a <a href="http://www.open-std.org/jtc1/sc22/wg14/">C Standard</a> background I could not believe the number of new constructs being invented (the C committee had a stated principle that a construct be supported by at least one implementation before it be considered for inclusion in the standard; ok, this was not always followed to the letter).  The C++ committee members continued to design away, putting in a huge amount of effort, and the document was ratified before the end of the century.</p>
<p>The standard is currently undergoing a major revision and the amount of language design going on puts the original committee to shame. With over 1,300 pages in the <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2798.pdf">latest draft</a>, nobody&#8217;s favorite construct is omitted.  The <a href="www.justsoftwaresolutions.co.uk/bsi">UK C++ panel</a> has over 10 people actively working on producing comments and may produce over 1,000 on the latest draft.</p>
<p>With so many people committed to the approach being taken in the development of the revised C++ Standard, its current direction is very unlikely to change.  The fact that most &#8216;real world&#8217; developers only understand a fraction of what is contained in the existing standard has not stopped it being very widely used and generally considered a &#8216;success&#8217;.  What is the big deal over a doubling of the number of pages in a language definition?  The majority of developers will continue to use the small subset that they each individually have used for years.</p>
<p>The large number of syntactic ambiguities makes it very difficult to parse C++ (semantic information is required to <a href="http://www.cs.clemson.edu/~malloy/papers/tanton/paper.ps">resolve the ambiguities</a> and the code to do this is at least an order of magnitude bigger than the lexer+parser).  This difficulty is why there are <a href="http://os.inf.tu-dresden.de/vfiasco/related.html#parsing">so few source code analysis tools</a> available for C++, compared to C and Java, which are much, much easier to parse.  The difficulty of producing tools means that researchers rarely analyse C++ code and only reasonably well funded efforts are capable of producing worthwhile static analysis tools.</p>
<p>Like many of the active committee members I have mixed feelings about this feature bloat.  Yes it is bad, but it will keep us all actively employed on interesting projects for many years to come.  As the current financial crisis has shown, one of the advantages of being big and not understood is that you might get to be too big to fail.</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2008/12/08/c-goes-for-too-big-to-fail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

