
Posts Tagged ‘C Standard’

2023 in the programming language standards’ world

March 5, 2023 No comments

Two weeks ago I was in a virtual meeting of IST/5, the committee responsible for programming language standards in the UK. IST/5 has a new chairman, Guy Davidson, whose efficiency is very un-standards-like.

It’s been 18 months since I last reported on the programming language standards’ world; what has been going on?

2023 is going to be a bumper year for the publication of revised Standards of long-established programming languages: COBOL, Fortran, C, and C++ (a revised Standard for Ada was published last year).

Yes, COBOL; a new COBOL Standard was published in January. Reports of its death were premature, e.g., my 2014 post suggesting that the latest version would be the last version of the Standard, and the closing of PL22.4, the US Cobol group, in 2017. There has even been progress on the COBOL front end for gcc, which now supports COBOL 85.

The size of the COBOL Standard has leapt from 955 to 1,229 pages (around 200 new pages in the normative text, 100 in the annexes). Comparing the 2014/2023 documents, I could not see any major additions, just lots of small changes spread throughout the document.

Every Standard has a project editor, the person tasked with creating a document that reflects the wishes/votes of its committee; the project editor sends the agreed-upon document to ISO to be published as the official ISO Standard. The ISO editors would invariably request that the project editor make tiresome organizational changes to the document, and then add a front page and ISO copyright notice; from time to time an ISO editor took it upon themselves to reformat a document, sometimes completely mangling its contents. The latest diktat from ISO requires that submitted documents use the Cambria font. Why Cambria? What else, other than that it is the font used by the Microsoft Word template promoted by ISO as the standard format for Standards documents.

All project editors have stories to tell about shepherding their document through the ISO editing process. With three Standards (COBOL lives in a disjoint ecosystem) up for publication this year, ISO editorial issues have become a widespread topic of discussion in the bubble that is language standards.

Traditionally, anybody wanting to be actively involved with a language standard in the UK had to find the contact details of the convenor of the corresponding language panel, and then ask to be added to the panel mailing list. My, and others’, understanding was that provided the person was a UK citizen or worked for a UK domiciled company, their application could not be turned down (not that people were/are banging on the door to join). BSI have slowly been computerizing everything, and, as of a few years ago, people can apply to join a panel via a web page; panel members are emailed the CV of applicants and asked if “… applicant’s knowledge would be beneficial to the work programme and panel…”. In the US, people pay an annual fee for membership of a language committee ($1,340/$2,275). Nobody seems to have asked whether the criteria for being accepted as a panel member have changed. Given that BSI had recently rejected somebody’s application to join the C++ panel, the C++ panel convenor accepted the action to find out if the rules have changed.

In December, BSI emailed language panel members asking them to confirm that they were actively participating. One outcome of this review of active panel membership was the disbanding of panels with ‘few’ active members (‘few’ might be one or two, IST/5 members were not sure). The panels that I know to have survived this cull are: Fortran, C, Ada, and C++. I did not receive any email relating to two panels that I thought I was a member of; one or more panel convenors may be appealing their panel being culled.

Some language panels have been moribund for years, being little more than bullet points on the IST/5 agenda (those involved having retired or otherwise moved on).

How indeterminate is an indeterminate value?

June 18, 2017 3 comments

One of the unwritten design aims of the C Standard is that it should be possible to fully implement the C Standard library in conforming C. It turned out that this was not possible in C90; the problem was implementing the memcpy function when the object being copied was an object having a struct type containing one or more padding bytes. The memcpy library function copies the bytes in one object to another object. The padding bytes might be uninitialized (they have an indeterminate value), which means accessing them is undefined behavior (in C90), i.e., use of memcpy for copying structs containing padding results in a non-conforming program.

struct {
        char c; // Occupies 1 byte
        // Possible padding bytes here
        int i;  // A 2/4-byte int sometimes has to be aligned on a 2/4-byte storage boundary
       };

Padding bytes could be set to a known value by, for instance, using memset to zero the storage; requiring this usage was thought to be excessive, and a hefty chunk of new words was added in C99 (some of the issues raised by this problem also cropped up elsewhere, which contributed to the will to do this).
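As a rough sketch of the zeroing approach, the following gives every byte of a struct object, padding included, a known value by calling memset. The names struct padded, padding_bytes and zero_and_check are made up for illustration, and the amount of padding (possibly none) depends on the implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical struct, named here so it can be passed around */
struct padded {
        char c; /* Occupies 1 byte */
        int i;  /* May be preceded by padding bytes */
};

/* Bytes in the object not occupied by either member (may be 0) */
size_t padding_bytes(void)
{
        return sizeof(struct padded) - sizeof(char) - sizeof(int);
}

/* Zero every byte of *p, then read each byte back as an unsigned char.
   Returns 1 if all bytes are zero. Note that a later store to a member
   is permitted to leave the padding bytes with unspecified values again. */
int zero_and_check(struct padded *p)
{
        const unsigned char *bytes = (const unsigned char *)p;
        size_t n;

        memset(p, 0, sizeof *p);
        for (n = 0; n < sizeof *p; n++)
                if (bytes[n] != 0)
                        return 0;
        return 1;
}
```

Calling zero_and_check on a freshly declared struct padded should return 1 on any implementation; what padding_bytes returns is implementation-defined.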

One consequence of the new wording is that objects having type unsigned char are special: while their uninitialized value is still indeterminate, the possible set of values excludes a trap representation; they have an unspecified value, making accesses unspecified behavior (which conforming programs can contain). The uninitialized value of objects having other types can be a trap representation; it’s the possibility of a value being a trap representation that makes accessing such uninitialized objects undefined behavior.

All well and good, memcpy can now be implemented in conforming C(99) by copying unsigned chars.
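A minimal sketch of such an implementation might look like the following; the name my_memcpy is hypothetical (chosen to avoid clashing with the library function), and no attempt is made at the word-at-a-time tricks real libraries use.

```c
#include <stddef.h>

/* Copy n bytes from src to dest, one unsigned char at a time.
   Every byte, including any padding byte, is only ever accessed
   through an lvalue of type unsigned char. */
void *my_memcpy(void *dest, const void *src, size_t n)
{
        unsigned char *d = dest;
        const unsigned char *s = src;

        while (n-- > 0)
                *d++ = *s++;
        return dest;
}
```

Like the library function, this sketch assumes the source and destination objects do not overlap.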

Having made it possible for a conforming program to access an uninitialized object (having type unsigned char), questions about its actual value can be asked. Its value is indeterminate, you say; the clue is in the term indeterminate value. Ok, what does the following function return?

unsigned char random(void)
{
        unsigned char x;

        return x ^ x;
}

Exclusive-oring a value with itself always produces zero. An unsigned char taking, say, values 0 to 255, pick one and you always get zero; case closed. But where does it say that an indeterminate value is always the same value? There is no wording preventing an indeterminate value being different every time it is accessed. The sound of people not breathing could be heard when this was pointed out to WG14 (the C Standard’s committee), followed by furious argument on one side or the other.
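The “pick one and you always get zero” half of the argument is easy to check for initialized objects, where both reads must see the same value. The sketch below (the function name is made up) runs through every value an unsigned char can hold; it says nothing about the uninitialized case, which is exactly the point of contention.

```c
#include <limits.h>

/* Returns 1 if x ^ x is zero for every value an unsigned char can hold.
   Here x is initialized, so both reads of x yield the same value. */
int xor_self_always_zero(void)
{
        unsigned int v;

        for (v = 0; v <= UCHAR_MAX; v++) {
                unsigned char x = (unsigned char)v;

                if ((x ^ x) != 0)
                        return 0;
        }
        return 1;
}
```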

The following illustrates one situation where the value of padding bytes could change with every access. The volatile qualifier specifies that the value of c could change between two accesses (e.g., it represents the storage layout of some memory mapped I/O device). Perhaps any padding bytes following it are also effectively volatile-qualified.

struct {
        volatile char c; // A changeable 1 byte
        // Possible padding bytes may be volatile
        int i;  // No volatility here
       };

The local object x, above, is not associated with a volatile-qualified object. But, so what? Another unwritten design aim of the C Standard is to keep the wording simple, so edge cases are not called out and the behavior intended to handle padding bytes gets applied to local unsigned chars.

A compiler could decide that calls to random always return zero, based on the assumption that while indeterminate values may not be known, they are not time varying.

C Standard meeting, April 2016

April 15, 2016 No comments

I was at the ISO C Standard’s meeting in London this week; it has been five years since I last attended a WG14 meeting, when it was last in London (my jet setting standard’s meeting days are long gone). Around 20 people attended, of which slightly more than half I knew from previous meetings. Given how unchanging the membership was for so long, this is a large change and it’s great to see so many new people being interested in C (including an open source vendor, Red Hat). There is also a change of convener since my last meeting; David Keaton is a long standing member and as meeting chair he kept things motoring along.

The format of each day, after the first morning, was to spend an hour at the start of each morning and afternoon working on Defect Reports, break and then work through documents in the pre-meeting mailing.

The topic of note on Monday afternoon was a proposal to add support for the type short float in C2X. There is a lot of hardware support for 16 bit floating-point operations (e.g., SSE instructions) and C is behind the curve on this. There was consensus to move forward on this proposal.

Tuesday was taken up by discussing proposals under the general heading of clarifying the C memory object model; various papers by a formal methods group at Cambridge University that I have written about before. I had misunderstood the intent behind the papers; the Prof running the project wanted to fix the programming world by changing the C Standard (I thought he just wanted clarification of what the standard said). While fixing the programming world is a commendable goal, messy reality and very strong interests for not changing existing behavior are likely to maintain the status quo. Talking to the post grad working on the project, they seem to be doing all the right things, so we could be seeing some very interesting results (a major threat to success is the sheer volume of material that has to be covered).

Wednesday covered the charter for revising C, various proposals for new features in C2X (mostly lots of thread based stuff), and conversion of the document to LaTeX (currently in nroff/groff; there was no sentiment to follow C++ and put the draft on a public GitHub repo). When C89 became an ANSI standard, before C90 became an ISO standard, Rex Jaeschke handed out a floppy of the C89 nroff sources to those attending one of the meetings (I forget which). Unless you happen to have an AT&T 3B2 and know which options to give nroff, you are very unlikely to be able to generate something that looks like C89.

Thursday covered another C2X proposal, closures using syntax and semantics supported by C on Apple (Borland got there first by supporting the __closure qualifier on pointers). In the afternoon we had a presentation of the latest C binding to the guidance on avoiding vulnerabilities in programming languages work going on in WG23. WG23 wanted WG14 to endorse this document and take ownership of it; lots of push back on this and all they got was a request to WG14 members to send any suggested improvements to WG23.

The next WG14 meeting is during October in Pittsburgh and I have no idea when the next meeting will be held in the UK (unlikely to be within three years).


Arm waver or expert?

May 30, 2013 No comments

I was at a workshop yesterday where one of the speakers claimed his tool supported all of Standard C; needless to say I went over to chat during one of the breaks. Now a lot of the time speakers will immediately admit to implementing a subset when chatting over coffee, but this guy was claiming all of Standard C and had a very favorable opinion of his own expertise. It is impolite and somewhat confrontational to stand in front of somebody with a checklist of questions; what topic should be worked into the conversation to best gauge whether a person really does have a detailed grasp of C?

I tossed the phrase strictly conforming into the conversation and a bit later integer promotions; these resulted in just more huff and puff from him. Now people with a detailed knowledge of C are thin on the ground and an encounter with one usually results in a warm mutual exchange of war stories on the problems encountered during the implementation of some feature or other. Perhaps this guy’s tool had been implemented by a student and this was not part of the message; perhaps this was his first encounter with somebody who also had a detailed knowledge of the language and he did not know how to react (I imagine such people are even rarer in academia than industry).

Thinking about it on the train home I decided that “sequenced before” is the phrase I should have tossed into the conversation. The concept of sequence points existed in C90 and C99, but was replaced by “sequenced before” and unsequenced in C11 (a more complicated memory ordering model was necessitated by the newly added support for sharing objects between threads). Yesterday’s speaker was not there today, so I was not able to reinvestigate whether his knowledge was pre/post C11 or just bluster.
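For readers who have not met the phrase, here is a small sketch of what “sequenced before” buys you within a single thread; the function names are invented for illustration. An expression such as i = i++ + 1 has undefined behavior because the two side effects on i are unsequenced, whereas the constructs below are well-defined.

```c
/* The store into a[*i] and the increment of *i are in separate full
   expressions, so the store is sequenced before the increment. */
int next_slot(int *a, int *i, int value)
{
        a[*i] = value;
        (*i)++;
        return *i;
}

/* Evaluation of the left operand of && is sequenced before evaluation
   of the right operand, and the right operand is not evaluated at all
   when the left one is false, so the divide never sees a zero den. */
int safe_nonzero_quotient(int num, int den)
{
        return den != 0 && num / den != 0;
}
```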

The C++ phrase to toss into a conversation used to be One definition rule, but I don’t know if this is still true today. I once saw an email exchange where a supposed expert had never heard about the “one definition” rule and jokingly asked if it was connected to the “one ring” in Lord of the Rings; oops, he had obviously never read the C++ Standard.

What about other languages? Suggestions for phrases I might use to gauge whether somebody really is an expert in some other language welcome (this would be pure bluff on my part and I accept the consequences of using any suggestion).

Undefined behavior can travel back in time

July 12, 2012 4 comments

The committee that produced the C Standard tried to keep things simple and sometimes made very short general statements that relied on compiler writers interpreting them in a ‘reasonable’ way. One example of this reliance on ‘reasonable’ behavior is the definition of undefined behavior; “… erroneous program construct or of erroneous data, for which this International Standard imposes no requirements”. The wording in the Standard permits a compiler to process the following program:

int main(int argc, char **argv)
{
        // lots of code that prints out useful information

        1 / 0;  // divide by zero, undefined behavior
}

to produce an executable that prints out “yah boo sucks”. Such behavior would probably be surprising to the developer who expected the code printing the useful information to be executed before the divide by zero was encountered. The phrase quality of implementation is heard a lot in committee discussions of this kind of topic, but this phrase does not appear in any official document.

A modern compiler is essentially a sophisticated domain specific data miner that happens to produce machine code as output, and compiler writers are constantly looking for ways to use the information extracted to minimise the code they generate (minimal number of instructions or minimal amount of runtime). The following code is from PostgreSQL (ereport and PG_RETURN_INT32 are PostgreSQL macros) and its authors were surprised to find that the “division by zero” messages did not appear when arg2 was 0; in fact the entire if-statement did not appear in the generated code. Based on my earlier example you can probably guess what the compiler has done:

if (arg2 == 0)
   ereport(ERROR, (errcode(ERRCODE_DIVISION_BY_ZERO),
                                             errmsg("division by zero")));
/* No overflow is possible */
PG_RETURN_INT32((int32)arg1 / arg2);

Yes, it figured out that when arg2 == 0 the divide in the call to PG_RETURN_INT32 results in undefined behavior and took the decision that the actual undefined behavior in this instance would not include making the call to ereport, which in turn made the if-statement redundant (smaller+faster code, way to go!).

There is/was a bug in PostgreSQL because of this compiler behavior. The finger of blame could be pointed at:

  • the developers for not specifying that the function ereport does not return (this would enable the compiler to deduce that there is no undefined behavior because the divide is never executed when arg2 == 0),
  • the C Standard committee for not specifying a timeline for undefined behavior, e.g., program behavior does not become undefined until the statement containing the offending construct is encountered during program execution,
  • the compiler writers for not being ‘reasonable’.
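The first of these points can be sketched in C11 using the _Noreturn function specifier. The names report_fatal and checked_div are hypothetical stand-ins, not the code from the project in question, and the sketch ignores the other way an int divide can go wrong (INT_MIN / -1).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error reporter. _Noreturn tells the compiler that
   control never comes back from this call. */
_Noreturn void report_fatal(const char *msg)
{
        fprintf(stderr, "%s\n", msg);
        exit(EXIT_FAILURE);
}

/* Because report_fatal is known not to return, the compiler can
   deduce that the divide is only reached when arg2 != 0, so it has
   no undefined behavior on which to base removal of the check. */
int checked_div(int arg1, int arg2)
{
        if (arg2 == 0)
                report_fatal("division by zero");
        return arg1 / arg2;
}
```

Calling checked_div(10, 2) returns 5; calling it with a zero divisor prints the message and terminates the program rather than reaching the divide.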

In the coming years more and more developers are likely to encounter this kind of unexpected behavior in their programs as compilers do more and more data mining and are pushed to improve performance. Other examples of this kind of behavior are given in the paper Undefined Behavior: Who Moved My Code?

What might be done to reduce the economic cost of the fallout from this developer ignorance/standard wording/compiler behavior interaction? Possibilities include:

  • developer education: few developers are aware that a statement containing undefined behavior can have an impact on the execution of code that occurs before that statement is executed,
  • change the wording in the Standard: for many cases there is no reason why the undefined behavior should be allowed to reach back in time to before when the statement executing it is executed; this does not mean that any program output is guaranteed to occur, e.g., the host OS might delete any pending output when a divide by zero exception occurs,
  • paying gcc/llvm developers to do front end stuff: nearly all gcc funding is to do code generation work (I don’t know anything about llvm funding) and if the US Department of Homeland Security is interested in software security they should fund related front end work in gcc and llvm (e.g., providing developers with information about suspicious usage in the code being compiled; the existing -Wall is a start).