Posts Tagged ‘C’

pycparser: a serious entry in the not-written-in-C C-parser category

February 8th, 2016 No comments

Sometimes it feels like everybody and his dog has had a go at writing a C parser. On the whole these parsers are small subsets and the choice of subset seems to be driven by the developer’s taste, what they find easy to do and how much time they are willing to put into the project.

Some time ago a blog post discussing parsing C type declarations made me think that pycparser might be a cut above the usual learning experience projects and a quick look showed it was quite good. I recently tried it out again on the examples from my C book and it did a surprisingly good job of handling this rather weird set of edge cases (it failed to handle the code in 20 out of 957 files).

There is a type of person who insists that the C parser used by the C source analysis project they are working on be written in a ‘high-level’ language, i.e., they don’t want to use one of the perfectly adequate (and correct) parsers written in C/C++. I’m not sure whether this is because having to actually use C would expose the poor state of their knowledge of the language, language snobbery (it’s ok to analyse C source, but write it, heaven forbid) or they are members of a One True Language guild.

Up until now the parser of choice, for people not wanting to use C/C++, was CIL (a slightly more up to date version on Github); used by Coccinelle and many other tools.

If you really don’t want to parse C source using a parser written in C/C++, I think pycparser has now reached the stage where it is worth considering, along with CIL.

Maximizing profit selling C compilers

January 22nd, 2016 No comments

Upgrades are the lifeblood of established software companies. I recently came across the paper Information Goods Upgrades: Theory and Evidence and what caught my attention was one of the datasets the author had collected, first purchase and upgrade price of various PC C/C++ compilers between 1987 and 1997. What’s more the author still had the data and was willing to share it, yay!

By the early 1990s I was no longer actively involved in C compilers, but was involved in C static analysis on non-PC platforms. So my view of the 1990s C compiler market is a bit sketchy.

Compiler companies, like other companies, want to maximize their revenue and THE decision that has to be made is the price to charge for a compiler (compiler writers are also developers and hate high prices for compilers and those that failed to charge enough for their product soon went bust). My recollection is that compiler pricing was based around the spending authority of a senior development engineer and also what other companies were charging. Just under £500 was common, with a few companies failing to make a go of selling around the £100 mark. Zorland (later renamed to Zortech) gained huge market share in the mid/late 1980s selling a great C compiler for £29, but a few years later were selling a C++ compiler for a lot more.

To some extent each compiler vendor operates in a monopoly market; developers write code that depends on the features supported by the compiler used and it can be very expensive to port code to a different compiler. How much can vendors charge for a compiler upgrade? Selling the product at a high price provides a rationale for higher priced upgrades (the percentage discount will look good). I wonder how many vendors continued to advertise a high price product just to justify a high upgrade price.

Management always feel an affinity for the OS vendor and Microsoft sold a C compiler and later a C++ compiler. They were both awful and easy, product quality wise, to compete against. Microsoft had to have their own compiler for strategic internal use, with sales to developers being insignificant compared to sales of Word and Excel (Microsoft compiler people I talked to at the time said they had thought of giving the compiler away for free and later it was possible to essentially get the compiler for free by joining the various developer programs). Over time Microsoft improved and compiler companies found easier ways to make money, so the number of compiler vendors dropped to almost one (a company selling C compiler validation suites once told me in the late 1990s that they had sold over 150 copies; someone has to be serious about their compiler to shell out $5,000-$10,000 for software to test it).

By the late 1980s the C compiler market was quite saturated and vendors needed something else to sell. IDEs and debuggers were popular choices. Then along came C++. Yay! A new language meant a new compiler to sell. Compiler vendors’ need for a new compiler to sell is a significantly underestimated factor in C++ gaining traction in developer mind share.

A rarely talked about compiler revenue stream is being paid to port a compiler to a new platform (either because there is an important application that depends on it or because the platform does not yet have a C compiler). This is the market where gcc had its first successes. It’s hard to say whether gcc spread because these niche platforms spread or because gcc cut off revenue to compiler vendors, making remaining in the compiler market unattractive to them.

I don’t have any sales figures for any ‘mass’ market C compilers or compilers for any languages. Can any readers help out? In fact any data on compiler sales would be most welcome.

Recent formal methods and C papers (Sep 2015)

September 14th, 2015 2 comments

I have been catching up on my reading of papers from this year’s Programming Language Design and Implementation conference (whose organizers have not yet figured out that linking to pdfs of the papers might be useful).

Needless to say there are a few papers on formal methods and C:

  • “A Formal C Memory Model Supporting Integer-Pointer Casts” is a truly awful paper. It starts out: “The ISO C standard famously does not give semantics to a significant subset of syntactically valid C programs.” and goes downhill from there. As far as I know only one language, Algol 68, defines semantic requirements using syntax, all other languages specify a syntax which is a very large superset of the set of semantically valid programs. The paper goes on to define a C-like language that is also Java-like, C#-like and *-like most languages created in the last 20 years. I have no idea why this paper got accepted, is PLDI now a third tier conference?
  • Defining the undefinedness of C from the C-semantics guys. I could only find a version from 2012 online. Come on guys, you’re letting down one of your cheer-leaders. Update: pdf now available.
  • A Formal C Memory Model for Separation Logic (not at PLDI, but popped up on arXiv today). This is one of those annoying papers that could have been great, but shoots itself in the foot. The first 20 pages show that the author is aware of some of the complications involved in modeling C’s behavior. This is followed by pages and pages of definitions, a scattering of lemmas and Facts; at page 51 The Theorems start, blah, blah, blah. Then we are almost done, there is a discussion of related work.

    Where is it shown that any of this stuff is connected to the requirements contained in the C Standard? The source of the implementation is provided, let’s look at that; hmmm, no cross references to the C Standard here (in fact it is almost comment free). What about testing, processing source code to see what happens? The only mention of testing appears while discussing what the competition do (C-semantics; those pesky Americans again, not only not using Coq but testing their formal tools, don’t they know that anything written using mathematics must be correct).

    The author’s draft PhD thesis says something about testing; but I get the feeling that he only says something about it because the competition does, even mostly using their+others tests rather than coming up with lots of his own.

    While this work (part of the CH2O project) has clearly created a system that handles a chunk of real C, I don’t think it is anywhere close to being a very accurate model of C semantics. The author appears to be so much more interested in doing interesting mathematical stuff and finds it rather tiresome that the realities of C semantics disrupt the idealism.

Showing that they have clearly not learned how things are done in the formal semantics community, those pesky Americans have gone and produced a formal semantics for Javascript and tested it against the ECMAScript 5.1 conformance test suite (passing all 2,782 core language tests, Chrome V8 is the only other implementation that does this).

Actively maintained production compilers for middle-age languages

September 1st, 2015 10 comments

The owners of the Borland C++ compiler have stopped maintaining it. So we are now down to, by my counting, four different production quality C++ compilers still being actively maintained (Visual C++ {the command line cl.exe, not the interactive IDE compiler}, GCC, LLVM and EDG); lots of companies repackage EDG and don’t talk about it.

How many production compilers for other middle-age languages are still being actively maintained?

Ada I think is now down to one (GNAT; I’m not sure of the status of what was the Intermetrics compiler).

Cobol has two+ (I’m not sure how many internal compilers IBM has, some of which are really Microfocus) that I know of (Microfocus and Fujitsu {was ACUCobol}).

Fortran probably needs more than one hand to count its compilers. Nothing like having large engineering applications using the language features supported by your compiler to keep the maintenance fees rolling in.

C still has lots of compilers (a C validation suite vendor told me many years ago that they had over 150 customers). Embedded processors can be a very tough target for the general purpose algorithms used in GCC and LLVM, so vendors with hand crafted compilers can still eke out a living.

Perl has one (which I find surprising).

R has one, but like Cobol it is not a fashionable language in compiler writing circles. Over the last couple of years there have been a few ‘play’ implementations and rumors of people creating a new production quality implementation.

Lisp has one or millions, depending on how you view dialects or there could be a million people with a different view on the identity of the 1.

Snobol-4 still has one (yes, I am a fan of this language).

There are lots of languages which have not yet reached middle-age, so it’s too soon to start counting how many actively supported compilers they still have in production use.

How will C code in 2045 look different from today?

August 5th, 2015 No comments

What constructs will be in such common use in C source code written by developers in 2045 that people looking at C written in 2015 will know it comes from a much earlier era (a previous post looked back at C written in 1986)?

C is a high level language that allows developers to get close to the hardware, so to get some idea of what everyday C might be like in 2045 we have to ask what everyday hardware will be like 10-20 years from now (the C standard committee waits for hardware features to become established before adding language features to support them).

I think the following hardware trends will have a big impact on the future appearance of C source code:

Power consumption: Runtime performance is an integral part of the design of C. In the past performance has been about program execution time and/or memory usage; the spread of mobile computing has created a third strand: electrical power consumption. A variety of techniques have been proposed for reducing program power consumption, including: type specifiers that enable developers to tell the compiler accuracy can be traded off against power in calculations involving a given variable and scaling cpu voltage/frequency in non-time critical code (researchers are currently trying to do this without developer involvement, but a storage/type specifier like register or inline would provide useful information to the compiler).

Unreliable hardware: running hardware at lower voltages (to reduce power consumption) increases the probability of noise having an effect on program output, as does use of smaller line widths in cpu fabrication (more chips per wafer increases manufacturer profits). Proposed solutions include adding type specifiers to variables that can tolerate holding approximate values or making probabilistic assertions.

Non-volatile memory: Like most languages C has an implicit model of programs sitting on a slow storage device, e.g., hard disk, and being loaded into very fast storage for execution. Non-volatile storage could have a very dramatic impact on this view of the world. For years gaming consoles have stored code+data as a memory image in ROM for rapid loading, but being able to write to storage that is only an order of magnitude slower than main memory opens up all sorts of interesting opportunities. The concept of named address spaces defined in Programming languages – C – Extensions to support embedded processors is waiting to expand out of its current niche of C on embedded processors.

There is at least one language construct that is likely to be rarely seen by developers working in 2045: inline. The reason that today’s developers have been given the ability to define functions inline is that compilers are not yet good enough to reliably make good function inlining decisions, rather like they were not good enough to reliably make good register allocation decisions 30 years ago (ok, register can still be useful for developers using weird and wonderful processor architectures or brain dead compilers).

I have not yet said anything about parallel processing or multiprocessor hardware. The C11 Standard updated C99 to provide generic support (i.e., _Atomic plus associated sequence point wording updates and the threads library) for this kind of hardware. Support for a specific parallel/multiprocessor model will happen if a specific model becomes the industry standard (rather like IEEE floating-point not being anointed by C90 because it was not yet what every hardware vendor used; other formats were on their last legs and by C99 could be treated as dead).

2015: A new C semantics research group

June 30th, 2015 3 comments

A very new PhD student research group working on C semantics has just appeared on the horizon. You can tell they are very new to C semantics by the sloppy wording in their survey of C users (what is a ‘normal’ compiler and how does it differ from the ‘current mainstream’ compiler referred to in some questions? I’m surprised the outcome appeared clear to the authors, given the jumble of multiple choice options given to respondents).

Over the years a number of these groups have appeared, existed until their members received a PhD and then disappeared. In some cases one of the group members does something that shows a lot of potential (e.g., the C-semantics work), but the nature of academic research means that either the freshly minted PhD moves to industry or else moves on to another research area. Unfortunately most groups are overwhelmed by the task and pivot into meaningless subsets or into concentrating on mathematical organisms. Very, very occasionally interesting work gets supported once the PhD is out of the way, Coccinelle being the stand-out example for C.

It takes implementing a full compiler (as part of a PhD or otherwise) to learn C semantics well enough to do meaningful research on it. The world seems to be stuck in a loop of using research to educate know-nothings until they know-something and then sending them off on another track. This is why C language researchers keep repeating themselves every 10 years or so.

Will anybody in this new group do any interesting work? Alan Mycroft set the bar very high for Cambridge by submitting a 100 page comment document on the draft C89 standard that listed almost as much ambiguous wording as everybody else put together found (but he was implementing a compiler in his spare time and not doing it for a PhD, so perhaps he does not count).

One suggestion I would make to this new group is that if they really are interested in actual usage they should measure actual usage; developer beliefs about compiler behavior are rarely very accurate and always heavily tainted by experiences from when they first started out.

A checklist for evaluating compiler semantic research.

C code is 90% unspecified behavior: more uninformed scare mongering

March 19th, 2015 1 comment

Another C coding guidelines document, another clueless blanket ban on use of code containing unspecified behavior (no link so its visibility is not increased; the 90% is a back of the envelope calculation, knock yourself out here).

The C Standard defines unspecified behavior as “… provides two or more possibilities and imposes no further requirements on which is chosen in any instance.” Given this one item of information a ban on using constructs that contain unspecified behavior appears to be a good idea (writing code where the compiler gets to choose among several possible choices of behavior does not sound like a recipe for consistent program behavior).

What most people lack when thinking about unspecified behavior is an understanding of the design aims for the production of the C Standard; the aim was to be concise. An example of this conciseness is the wording for the order of evaluation of subexpressions “… the order in which side effects take place are both unspecified.”

Consider the subexpression x+y; should the compiler evaluate x first (putting its value in a register) and then y (putting its value in another register), or should it evaluate y followed by x? In most situations the final result does not depend on the choice of evaluation order and the Standard gives the compiler the freedom to choose the order that produces the best quality code.

A coding guideline that bans the use of code containing unspecified behavior bans the use of any binary operator (assignment is a binary operator in C, ruling out use of the statement z=0;). The only executable statements that could be written, following this guideline, would be calls of functions containing zero or one argument (order of evaluation is unspecified, which rules out calls containing two arguments) or global variables appearing on their own in an expression statement.

One case where operand evaluation order matters is printf("Hello")+printf("World"), which can result in either HelloWorld or WorldHello being printed (printf returns the number of characters written). This is an example of the kind of usage that the authors of coding guidelines want to ban.
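The printf example can be sketched in C++ without relying on captured console output. This is only an illustration under stated assumptions: emit() is a hypothetical stand-in for printf (it appends to a string rather than writing to stdout and returns the number of characters appended), and the function names are invented for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for printf: appends text to a log and returns
// the number of characters appended (side effect plus return value).
static int emit(std::string& log, const char* text) {
    std::string s(text);
    log += s;
    return static_cast<int>(s.size());
}

// The order in which the two operands of + are evaluated is unspecified,
// so log may end up holding "HelloWorld" or "WorldHello".
int sum_unspecified_order(std::string& log) {
    return emit(log, "Hello") + emit(log, "World");
}

// Putting the calls in separate statements fixes the order;
// the arithmetic result is the same either way.
int sum_fixed_order(std::string& log) {
    int first = emit(log, "Hello");
    int second = emit(log, "World");
    return first + second;
}

// Small self-check: the sum is always 10, only the log order can vary.
void check_evaluation_order() {
    std::string log;
    assert(sum_unspecified_order(log) == 10);
    assert(log == "HelloWorld" || log == "WorldHello");
}
```

The value returned is 10 in both functions; only the externally visible side effect depends on the compiler's choice, which is the distinction a guideline would need to capture.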

Coming up with guideline wording that delineates the undesirable unspecified behaviors from the harmless ones is hard. Requiring that the external behavior of code does not depend on the compiler’s choice of unspecified behavior is one possibility (now that power consumption can be an external behavior of note, this framing could be too narrow). The wording used by MISRA C is “No reliance shall be placed on … unspecified behavior”; this raises the flag that it is possible to rely on unspecified behavior and leaves it up to others to fill in the details.

C++14 is now in, C++11 is out and C++17 is on the horizon

August 18th, 2014 No comments

C++11 is now so yesterday; ISO have just ratified C++14 as the new C++ standard. However, don’t let the sudden halt to the exponential growth in page count with each revision (1334 pages in C++11 to 1366 in C++14) lull you into thinking that the size of C++ has stabilized. These days the page growth market is Technical Reports (e.g., ISO/IEC TR 18015 – C++ Performance and TR 19768 – C++ Library Extensions).

What next, are the C++ committee taking a well earned rest from their twice yearly (only recently reduced from four times a year) jetset around the world to attend week long meetings with 100+ other like-minded folk?

Of course not, they are having too much fun; the world needs C++17 (yes, work has already started). And let’s not forget the economy, which is still limping along. Can we risk the economic consequences of lots of highly paid consultants being unemployed, of compiler writers running out of new features to implement, of hotels having no more “Latest features in C++” seminars/workshops/conferences to host?

Is there really enough work for everybody to do revising C++14? Better be safe and request permission from ISO to start work on new Technical Reports covering: C++ Extensions for Transactional Memory, C++ Extensions for Library Fundamentals and C++ Extensions for Parallelism (there is ongoing work/talk of others, such as C++ — File System Technical Specification, C++ Extensions for Concurrency and C++ Extensions for Concepts).

If the number of new things to add does start to run low, the known bugs in the existing documents could always do with some attention: Core Language Active Issues and the Standard Library Issues List.

An ISO Standard for R (just kidding)

July 24th, 2014 4 comments

IST/5, the British Standards’ committee responsible for programming languages in the UK, has a new(ish) committee secretary and like all people in a new role wants to see a vision of the future; IST/5 members have been emailed asking us what we see happening in the programming language standards’ world over the next 12 months.

The answer is, of course, that the next 12 months in programming language standards is very likely to be the same as the previous 12 months and the previous 12 before that. Programming language standards move slowly, you don’t want existing code broken by new features and it would be a huge waste of resources creating a standard for every popular today/forgotten tomorrow language.

While true the above is probably not a good answer to give within an organization that knows its business intrinsically works this way, but pines for others to see it as doing dynamic, relevant, even trendy things. What could I say that sounded plausible and new? Big data was the obvious bandwagon waiting to be jumped on and there is no standard for R, so I suggested that work on this exciting new language might start in the next 12 months.

I am not proposing that anybody start work on an ISO standard for R, in fact at the moment I think it would be a bad idea; the purpose of suggesting the possibility is to create some believable buzz to suggest to those sitting on the committees above IST/5 that we have our finger on the pulse of world events.

The purpose of a standard is to create agreement around one way of doing things and thus save lots of time/money that would otherwise be wasted on training/tools to handle multiple language dialects. One language for which this worked very well is C, for which there were 100+ incompatible compilers in the early 1980s (it was a nightmare); with the publication of the C Standard users finally had a benchmark that they could require their suppliers to meet (it took 4-5 years for the major suppliers to get there).

R is not suffering from a proliferation of implementations (incompatible or otherwise), there is no problem for an R standard to solve.

Programming language standards do get created for reasons other than being generally useful. The ongoing work on C++ is a good example of consultant driven standards development; consultants who make their living writing and giving seminars about the latest new feature of C++ require a steady stream of new features to talk about and have an obvious need to keep new versions of the standard rolling down the production line. Feeling that a language is unappreciated is another reason for creating an ISO Standard; the Modula-2 folk told me that once it became an ISO Standard the use of Modula-2 would take off. R folk seem to have a reasonable grip on reality, or have I missed a lurking distorted view of reality that will eventually give people the drive to spend years working their fingers to the bone to create a standard that nobody is really that interested in?

C++ vs. Ada: Which language is more strongly typed?

April 17th, 2014 No comments

Programming languages are sometimes categorized as being either weakly or strongly typed. I’m not going to join the often rabid debates over which category a particular language belongs to, but rather discuss the relative type strengths of two languages, C++ and Ada, to see if it is possible to claim that one of them is more strongly typed than the other.

Most programming languages support variables having more than one type (e.g., integer and float are two very common types) and have rules specifying which combinations of differently typed values/variables are permitted to occur in a given context, e.g., C++ allows a value of type int to be assigned to a variable of type float (an implicit conversion is performed), but Ada does not perform this implicit conversion and the integer value has to be explicitly converted to float before it can be assigned (otherwise a compile time error will be generated).
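The C++ side of that comparison can be sketched as follows. The function names are invented for this illustration, and the static_cast version is offered only as the nearest C++ analogue of the explicit conversion Ada would require, not as something the language forces on the developer.

```cpp
#include <cassert>

// Implicit conversion: C++ accepts the int-to-float assignment silently.
float assign_implicit(int count) {
    float f = count;  // int converted to float without any cast
    return f;
}

// Explicit conversion: the spelling Ada would insist on, written in C++.
float assign_explicit(int count) {
    return static_cast<float>(count);
}

// Small self-check: both routes produce the same value for a small integer.
void check_conversions() {
    assert(assign_implicit(23) == 23.0f);
    assert(assign_explicit(23) == assign_implicit(23));
}
```

Both functions compile and behave identically in C++; the difference between the languages is that only the second form would be accepted if the same assignment were written in Ada.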

The more implicit type conversions a language supports the weaker its type system is said to be.

C++ supports more implicit conversions than Ada (others include boolean/int and char/int) and loses type strength points because of this (there is plenty of scope for debate about whether some implicit conversions are more evil than others, but cost/benefit debates are harder to come by).

While C++/Ada differ in their support for implicit conversions they are pretty equal in their support for explicit conversions (e.g., in Ada the code float(23) would convert the integer 23 to a float type). In some cases Ada requires that various hoops be jumped through to make the conversion happen (representation clauses are a great topic to bring up when being lectured about how type safe Ada is, a bit like telling somebody being snobbish that they go to the bathroom like everybody else).

The underlying idea is that the compilation errors generated by these ‘undesirable’ attempted implicit conversions alert the developer to a possible mistake in what they have written. These kinds of messages from the compiler have certainly caught errors in my code, but often the error has been a failure to write the required explicit conversion; every now and again a ‘real’ error is flagged. But I digress, this discussion is about what weak/strong typing is, not about what its costs and benefits might be.

Does Ada have any other feature that increases its type strength with respect to C++?

Both languages allow names to be given to existing types: typedef int length_t; in C++ and subtype length_t is integer; in Ada. Both define length_t to be a synonym for the integer type, without resulting in any extra type checks occurring. However, Ada supports a kind of type definition mechanism that does result in extra checks being made by the compiler. In the following code:

subtype length_t is integer;
type time_t is new integer;
 
a : integer;
b : length_t;
c : time_t;
 
begin
a := b;          -- OK
a := c;          -- Error, type mismatch
a := integer(c); -- OK, explicit conversion

time_t is defined to have a type that is not compatible with integer, even though its underlying representation is the same as the integer type. Mixing variables having types integer and time_t results in a compile time error.

The intended purpose for defining a ‘new’ type and creating variables having that type is to restrict operations on those variables to being with other variables having the same type, e.g., assignment and addition between any variables having type time_t is fine but involving other types results in a compile time error (there are special rules that allow integer literals to generally get mixed in). I find that the errors flagged by this kind of checking are mostly very useful.

It is also possible to achieve the same kind of type checking in C++ using template metaprogramming, e.g., the SIunits library. In fact using this technique enables C++ to support a much more general and user friendly range of ‘strong type’ functionality than is supported by the built-in Ada functionality (it is also possible to use general language functionality in Ada to implement the kind of checking possible in C++, however prior to the 2012 Ada standard the checks occurred at runtime but it now looks like there is a mechanism for doing them at compile time {because it is often possible to switch off runtime checks some people consider them to be weaker than compile time checks}).
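A very small sketch of the idea, far simpler than what a units library does: a class template whose tag parameter exists only to make each instantiation a distinct type, so that mixing a time with a length or a plain int fails to compile. StrongInt, TimeTag, LengthTag, Time, Length and total are all names invented for this illustration; this is not how SIunits is implemented.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical wrapper: Tag is never used at runtime, it only makes
// StrongInt<TimeTag> and StrongInt<LengthTag> different types.
template <typename Tag>
class StrongInt {
public:
    explicit constexpr StrongInt(int v) : value_(v) {}
    constexpr int get() const { return value_; }

    // Addition is only defined between values carrying the same tag.
    friend constexpr StrongInt operator+(StrongInt a, StrongInt b) {
        return StrongInt(a.value_ + b.value_);
    }

private:
    int value_;
};

struct TimeTag {};
struct LengthTag {};
using Time = StrongInt<TimeTag>;
using Length = StrongInt<LengthTag>;

// Compile time checks mirroring the Ada errors: no implicit mixing.
static_assert(!std::is_same<Time, Length>::value, "distinct types");
static_assert(!std::is_convertible<int, Time>::value, "int to Time needs a cast");
static_assert(!std::is_convertible<Time, int>::value, "Time to int needs get()");
static_assert(!std::is_assignable<Time&, Length>::value, "Length cannot be assigned to Time");

// Same-type arithmetic is fine.
Time total(Time a, Time b) { return a + b; }

// Small self-check of the permitted operations.
void check_strong_types() {
    assert(total(Time(2), Time(3)).get() == 5);
    assert(Time(7).get() == 7);  // explicit conversion in, explicit out
}
```

The static_assert lines play the role of the "Error, type mismatch" comment in the Ada fragment: the rejected combinations are diagnosed by the compiler rather than at runtime.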

Fans of subranges (I dearly miss this feature when using C-like languages) can get their fix here.

Is there a rule that extra type strength points are given if a language contains explicit type creation syntax (Ada contains a menagerie of syntax and semantics for doing this kind of stuff), compared to languages that require the use of constructs having many other uses? I don’t see why there should be. The fact that template metaprogramming makes a lot of C++ developers’ heads hurt means that many will limit themselves to using what others have created, rather than growing project specific libraries; but since when have usability and frequency of use been a major issue in the weak/strong type debate?

The score so far is that C++ has lost points to Ada because of its greater support for implicit type conversions, but has held its ground everywhere else.

Can either language pick up any more points?

There is the culture angle. Ada developers have a culture of making use of the type checking functionality provided by the language; this is a skill that has to be learned, you get some type checking for free but the rest has to be designed into the code. C++ developers also have a culture of making use of the type checking functionality provided by the language, i.e., most do not use add-on packages like SIunits.

I am not aware of any studies that have investigated the extent to which developers make use of type checking functionality; pointers to such studies welcome. If there is more ‘strongly typed’ C++ than Ada code out there it is only because there is a lot more C++ code out there.

It is my experience that culture and existing code do color developers’ position on where to draw the line in the weak/strong debate, but don’t affect relative language orderings.

The conclusion is that Ada is more strongly typed than C++, but how much more strongly typed remains an open question. Both languages require effort from the developer to make full use of the typing functionality that is available.