<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Shape of Code &#187; Datatypes</title>
	<atom:link href="http://shape-of-code.coding-guidelines.com/category/datatypes/feed/" rel="self" type="application/rss+xml" />
	<link>http://shape-of-code.coding-guidelines.com</link>
	<description></description>
	<lastBuildDate>Sun, 12 Feb 2012 20:42:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Parsing without a symbol table</title>
		<link>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/#comments</comments>
		<pubDate>Fri, 19 Dec 2008 01:28:09 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Datatypes]]></category>
		<category><![CDATA[empirical]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[common case]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[preprocessing]]></category>
		<category><![CDATA[syntax]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=34</guid>
		<description><![CDATA[When processing C/C++ source for the first time through a compiler or static analysis tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names. One solution to this configuration problem is to be able to process source without [...]]]></description>
			<content:encoded><![CDATA[<p>When processing C/C++ source for the first time through a compiler or <a href="http://en.wikipedia.org/wiki/Static_code_analysis">static analysis</a> tool there are invariably errors caused by missing header files (often because the search path has not been set) or incorrectly defined, or not defined, macro names.  One solution to this configuration problem is to be able to process source without handling preprocessing directives (e.g., skipping them, such as not reading the contents of header files or working out which arm of a conditional directive is applicable).  Developers can do it, why not machines?</p>
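<p>A minimal sketch of what skipping directives can look like, assuming a directive is any line whose first non-blank character is <code>#</code>. The function name <code>blank_directives</code> is invented for illustration and this is not the code used by the parser described here. Directive lines are replaced by empty lines so that line numbers in later diagnostics still match the original file.</p>

```c
#include <stddef.h>

/* Copy src to dst, replacing every preprocessing directive line with an
 * empty line so that later line numbers still match the original.
 * A directive is taken to be a line whose first non-blank character is '#';
 * backslash-newline continuations of a directive are blanked as well.
 * Simplification: comments and string literals are not tracked, so a '#'
 * starting a line inside a multi-line comment is also blanked.
 * dst must have room for at least strlen(src) + 1 characters. */
void blank_directives(const char *src, char *dst)
{
    while (*src != '\0') {
        /* Decide whether the logical line starting here is a directive. */
        const char *p = src;
        while (*p == ' ' || *p == '\t')
            p++;
        int is_directive = (*p == '#');

        int continued;
        do {
            /* Find the end of the current physical line. */
            const char *end = src;
            while (*end != '\0' && *end != '\n')
                end++;
            continued = is_directive && end > src
                        && end[-1] == '\\' && *end == '\n';
            if (!is_directive) {
                while (src < end)
                    *dst++ = *src++;
            }
            src = end;
            if (*src == '\n')
                *dst++ = *src++;    /* keep the newline itself */
        } while (continued);
    }
    *dst = '\0';
}
```

<p>Because every arm of a conditional directive is kept, source that splits a construct across <code>#if</code>/<code>#else</code> arms will still not parse after this step.</p>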
<p>A few years ago <a href="http://en.wikipedia.org/wiki/GLR_parser">GLR</a> support was added to <a href="http://en.wikipedia.org/wiki/GNU_Bison">Bison</a>, enabling it to process ambiguous grammars, and I decided to create a C parser that simply skipped all preprocessing directives.  I knew that at least one reasonably common usage would generate a syntax error:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">func_call<span style="color: #009900;">&#40;</span>a<span style="color: #339933;">,</span>
<span style="color: #339933;">#if SOME_FLAG</span>
b_1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #339933;">#else</span>
b_2<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #339933;">#endif</span></pre></div></div>

<p>and wanted to minimize its consequences (i.e., syntax errors cascading to the end of the file).  The solution chosen was to parse the source a single statement or declaration at a time, so that any syntax error would be localized to a single statement or declaration.</p>
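<p>One simple way to keep an error local is to resume parsing just past the next semicolon. The sketch below assumes that approach; the name <code>resync_after_error</code> is hypothetical and the post does not say how its parser finds statement boundaries. It ignores semicolons inside string literals, comments and <code>for</code> headers, which a real tool would have to handle.</p>

```c
#include <stddef.h>

/* Return the index just past the next ';' at or after pos, or the index of
 * the terminating '\0' if no ';' remains.  pos must not exceed strlen(src).
 * A parser that hits a syntax error can restart from the returned index,
 * so the damage is limited to the statement that contained the error. */
size_t resync_after_error(const char *src, size_t pos)
{
    while (src[pos] != '\0') {
        if (src[pos++] == ';')
            return pos;
    }
    return pos;
}
```

<p>Applied to the text <code>f(a, b_1); b_2); g(x);</code> the second chunk fails to parse, and the restart point after it is the start of <code>g(x);</code>.</p>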
<p>Systems for parsing ambiguous grammars work on the basis that while the input may be locally ambiguous, once enough tokens have been seen the number of possible parses will be reduced to one.  In C (and even more so in C++) there are some situations where it is impossible to resolve which of several possible parses applies without declaration information on one or more of the identifiers involved (a traditional parser would maintain a symbol table where this information could be obtained when needed).  For instance, <code>x * y;</code> could be a declaration of the identifier <code>y</code> to have type pointer to <code>x</code> or an expression statement that multiplies <code>x</code> and <code>y</code>.  My parser did not have a symbol table, and even if it did, the lack of header file processing meant that it would contain only a partial set of the declared identifiers.  The ambiguity resolution strategy I adopted was to pick the most likely case, which in the example is the declaration parse.</p>
<p>Other constructs where an ambiguity deadlock was resolved by picking the common case (the common case being my choice; I have yet to get around to verifying it by measurement) included:</p>
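<p>The two readings of <code>x * y;</code> can be seen in ordinary C, where the earlier declarations decide which one the compiler picks. The function names below are made up for illustration, and a compiler is likely to warn that the expression statement has no effect.</p>

```c
typedef int x;              /* at file scope, x names a type */

int as_declaration(void)
{
    x * y;                  /* declaration: y has type pointer to x (int *) */
    x value = 42;
    y = &value;
    return *y;
}

int as_expression(void)
{
    int x = 6, y = 7;       /* these objects hide the typedef name x */
    int product;
    x * y;                  /* expression statement: multiply, discard result */
    product = x * y;
    return product;
}
```

<p>The token sequence is identical in both functions; only the information a symbol table would hold tells the two parses apart.</p>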

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">f<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>      <span style="color: #666666; font-style: italic;">// Very common, </span>
            <span style="color: #666666; font-style: italic;">// confidently picked function call as the common case</span>
<span style="color: #009900;">&#40;</span>m<span style="color: #009900;">&#41;</span><span style="color: #339933;">*</span>p<span style="color: #339933;">;</span>   <span style="color: #666666; font-style: italic;">// Not rare,</span>
            <span style="color: #666666; font-style: italic;">// confidently picked multiplication as the common case</span>
<span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> t<span style="color: #339933;">;</span>      <span style="color: #666666; font-style: italic;">// Quite rare,</span>
               <span style="color: #666666; font-style: italic;">// picked binary operator as the common case</span>
<span style="color: #009900;">&#40;</span>r<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> t<span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Very rare,</span>
                  <span style="color: #666666; font-style: italic;">//an iteration on the case above</span></pre></div></div>

<p>At the moment I am using the parser to measure language usage, so less than 100% correctness can be tolerated.  Some of the constructs that cause a syntax error to be generated every few hundred statements or declarations include:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">offsetof<span style="color: #009900;">&#40;</span><span style="color: #993333;">struct</span> tag<span style="color: #339933;">,</span> field_name<span style="color: #009900;">&#41;</span>  <span style="color: #666666; font-style: italic;">// Declarators cannot be </span>
                                            <span style="color: #666666; font-style: italic;">//function arguments</span>
<span style="color: #993333;">int</span> f<span style="color: #009900;">&#40;</span>p<span style="color: #339933;">,</span> q<span style="color: #009900;">&#41;</span>
<span style="color: #993333;">int</span> p<span style="color: #339933;">;</span>     <span style="color: #666666; font-style: italic;">// Tries to reduce this as a declaration without handling</span>
<span style="color: #993333;">char</span> q<span style="color: #339933;">;</span>   <span style="color: #666666; font-style: italic;">// it as part of an old style function definition</span>
<span style="color: #009900;">&#123;</span>
&nbsp;
MACRO<span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// Preprocessing expands to something meaningful</span></pre></div></div>

<p>Some of these can be handled by extensions to the grammar, while others could be handled by an error recovery mechanism that recognized likely macro usage and inserted something appropriate (e.g., a dummy expression in the <code>MACRO(x)</code> case).</p>
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2008/12/19/parsing-without-a-symbol-table/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Average distance between two fields</title>
		<link>http://shape-of-code.coding-guidelines.com/2008/12/02/average-distance-between-two-fields/</link>
		<comments>http://shape-of-code.coding-guidelines.com/2008/12/02/average-distance-between-two-fields/#comments</comments>
		<pubDate>Wed, 03 Dec 2008 00:39:47 +0000</pubDate>
		<dc:creator>Derek-Jones</dc:creator>
				<category><![CDATA[Datatypes]]></category>
		<category><![CDATA[average]]></category>
		<category><![CDATA[datatype]]></category>
		<category><![CDATA[distance]]></category>

		<guid isPermaLink="false">http://shape-of-code.coding-guidelines.com/?p=12</guid>
		<description><![CDATA[If I randomly pick two fields from an aggregate type definition containing N fields what will be the average distance between them (adjacent fields have distance 1, if separated by one field they have distance 2, separated by two fields they have distance 3 and so on)? For example, a struct containing five fields has [...]]]></description>
			<content:encoded><![CDATA[<p>If I randomly pick two fields from an aggregate type definition containing N fields, what will be the average distance between them (adjacent fields have distance 1, if separated by one field they have distance 2, separated by two fields they have distance 3 and so on)?</p>
<p>For example, a <code>struct</code> containing five fields has four field pairs having distance 1 from each other, three having distance 2, two having distance 3, and one field pair having distance 4; the average is 2.</p>
<p>The surprising answer, to me at least, is (N+1)/3.</p>
<p><strong>Proof</strong>: The average distance can be obtained by summing the distances between all possible field pairs and dividing this value by the number of possible different pairs.</p>
<pre>                  Distance 1  2  3  4  5  6
Number of fields
            4              3  2  1
            5              4  3  2  1
            6              5  4  3  2  1
            7              6  5  4  3  2  1</pre>
<p>The above table shows the pattern that occurs as the number of fields in a definition increases.</p>
<p>In the case of a definition containing five fields the sum of the distances of all field pairs is: (4*1 + 3*2 + 2*3 + 1*4) = 20 and the number of different pairs is: (4+3+2+1) = 10. Dividing these two values gives the average distance between two randomly chosen fields, i.e., 2.</p>
<p>Summing the distance over every field pair for a definition containing 2, 3, 4, 5, 6, 7, &#8230; fields gives the sequence: 1, 4, 10, 20, 35, 56, &#8230; This is sequence <a href="http://www.research.att.com/~njas/sequences/A000292">A000292</a> in the On-Line Encyclopedia of Integer Sequences and is given by the formula n*(n+1)*(n+2)/6 (where n = N − 1, i.e., the number of fields minus 1).</p>
<p>Counting the number of different field pairs for definitions containing 2, 3, 4, 5, 6, 7, 8, &#8230; fields gives the sequence: 1, 3, 6, 10, 15, 21, 28, &#8230; This is sequence <a href="http://www.research.att.com/~njas/sequences/A000217">A000217</a> and is given by the formula n*(n + 1)/2, with the same n = N − 1.</p>
<p>Dividing the first formula by the second gives (n + 2)/3, and substituting n = N − 1 yields (N + 1)/3.</p>
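<p>The result is easy to check by brute force. The sketch below enumerates every field pair and compares the totals against (N + 1)/3; the function names are invented for illustration, and the comparison is cross-multiplied so that it stays in integer arithmetic.</p>

```c
/* Sum of the distances between all unordered pairs of fields, with fields
 * numbered 0 .. n_fields-1 and distance defined as the index difference. */
long total_distance(int n_fields)
{
    long total = 0;
    for (int i = 0; i < n_fields; i++)
        for (int j = i + 1; j < n_fields; j++)
            total += j - i;
    return total;
}

/* Number of different unordered field pairs: N*(N-1)/2. */
long pair_count(int n_fields)
{
    return (long)n_fields * (n_fields - 1) / 2;
}

/* Nonzero if total/pairs equals (N+1)/3, tested without division as
 * 3*total == (N+1)*pairs. */
int matches_closed_form(int n_fields)
{
    return 3 * total_distance(n_fields)
           == (long)(n_fields + 1) * pair_count(n_fields);
}
```

<p>For five fields <code>total_distance</code> gives 20 and <code>pair_count</code> gives 10, matching the worked example above.</p>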
]]></content:encoded>
			<wfw:commentRss>http://shape-of-code.coding-guidelines.com/2008/12/02/average-distance-between-two-fields/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

