October 27, 2024 Derek Jones 4 comments

How can developers check that a compiler correctly implements all the behavior requirements contained in the corresponding language specification?

An obvious approach is to write lots of test cases for each distinct behavior; such a collection of tests is known as a validation suite, when used by a standard’s organization to test compilers/OS interfaces/etc. The extent to which a compiler’s behavior, when fed these tests, matches that listed in the language specification is a measure of its conformance.

In a world of many compilers with significant differences in behavior (i.e., pre-Open source), it makes economic sense for governments to sponsor the creation of validation suites, and/or companies to offer such suites commercially (mainly for C and C++). The spread of Open source compilers decimated compiler diversity, and compiler validation is fading into history.

New features continue to be added to Cobol, Fortran, C, and C++ by their respective ISO Standard’s committee. If governments are no longer funding updates to validation suites and the cost of commercial suites is too high for non-vendors (my experience is that compiler vendors find them to be cost-effective), how can developers check that a compiler conforms to the behavior specified by the Standard?

How much effort is required to create some minimal set of compiler conformance tests?

C is the language whose requirements I am most familiar with. The C Standard specifies that a conforming compiler issue a diagnostic for a violation of a requirement appearing in a Constraint clause, e.g., “For addition, either both operands shall have arithmetic type, or …”

There are 80 such clauses, containing around 530 non-blank lines, in N3301, the June 2024 draft. Let’s say 300+ distinct requirements, requiring a minimum of one test each. Somebody very familiar with the C Standard might take, say, 10 minutes per test, which is 3,000 minutes, or 50 hours, or 6.7 days; somebody slightly less familiar might take, say, at least an hour, which is 300+ hours, or 40+ days.

Lots of developers are using LLMs to generate source code from a description of what is needed. Given Constraint requirements in the C Standard, can an LLM generate tests that do a good enough job checking a compiler’s conformance to the C Standard?

Simply feeding the 157 pages from the Language chapter of the C Standard into an LLM, and asking it to generate tests for each Constraint requirement does not seem practical with the current state of the art; I’m happy to be proved wrong. A more focused approach might produce the desired tests.

Negative tests are likely to be the most challenging for an LLM to generate, because most publicly available source deals with positive cases, i.e., it is syntactically/semantically correct. The wording of Constraints sometimes specifies what usage is not permitted (e.g., clause 6.4.5.3 “A floating suffix df, dd, dl, DF, DD, or DL shall not be used in a hexadecimal floating literal.”), other times specifies what usage is permitted (e.g., clause 6.5.3.4 “The first operand of the . operator shall have an atomic, qualified, or unqualified structure or union type, and the second operand shall name a member of that type.”), or simply specifies a requirement (e.g., clause 6.7.3.2 “A member declaration that does not declare an anonymous structure or anonymous union shall contain a member declarator list.”).

I took the text from the 80 Constraint clauses, removed footnote numbers and rejoined words split at line-breaks. The plan was to prefix the text of each Constraint with instructions on the code requires. After some experimentation, the instructions I settled on were:

Write a sequence of very short programs which tests that a
C compiler correctly flags each violation of the requirements
contained in the following excerpt from the latest draft of the
C Standard:

Initially, excerpt was incorrectly spelled as except, but this did not seem to have any effect. Perhaps this misspelling is sufficiently common in the training data, that LLM weights support the intended association.

Experiments using Grok and ChatGPT 4o showed that both generated technically correct tests, but Grok generated code that was intended to be run (and was verbose), while the ChatGPT 4o output was brief and to the point; it did such a good job that I did not try any other LLMs. For this extended test, use of the web interface proved to be an effective approach. Interfacing via the API is probably more practical for larger numbers of requirements.

After some experimentation, I submitted the text from 31 Constraint clauses (I picked the non-trivial ones). The complete text of the questions and ChatGPT 4o responses (text files).

ChatGPT sometimes did not generate tests for all the requirements, when these were presented as they appeared in the Constraint, but did generate tests when the containing sentence was presented in isolation from other requirement sentences. For instance, the following sentence from clause 6.5.5 Cast Operators:

Conversions that involve pointers, other than where permitted by
the constraints of 6.5.17.2, shall be specified by means of an
explicit cast.

was ignored when included as part of the complete Constraint, but when presented in isolation, reasonable tests were generated.

The responses never contained more than 10 test cases. I am guessing that this is the result of limits on response cpu time/length. Dividing the text of longer Constraints should solve this issue.

Some assumptions made by ChatGPT 4o about the implementation can be deduced from its responses, e.g., it appears to treat the type short as containing fewer than 32-bits (it assumes that a bit-field defined as a short containing 32-bits will be treated as a Constraint violation). This is not surprising, given the volume of public C source targeting the Intel x86.

I was impressed by the quality of the 242 test cases generated by ChatGPT 4o, which often included multiple tests for the same requirement (text files).

While it sometimes failed to produce a test for a requirement, I did not spot any incorrect tests (as in, not correctly testing for a violation of a listed requirement); the subset of tests feed through behaved as claimed), and I eventually found a prompt that appears to be creating a downloadable zip file of all the tests (most prompts resulted in a zip file containing some collection of 10 tests); the creation process is currently waiting for available cpu time. I now know that downloading a zip file containing one file per test, after each user prompt, is the more reliable option.

Categories: Uncategorized Tags: C Standard, ChatGPT, compiler testing, LLM, requirements, validation suite

2023 in the programming language standards’ world

March 5, 2023 Derek Jones No comments

Two weeks ago I was on a virtual meeting of IST/5, the committee responsible for programming language standards in the UK. IST/5 has a new chairman, Guy Davidson, whose efficiency is very unstandard’s like.

It’s been 18 months since I last reported on the programming language standards’ world, what has been going on?

2023 is going to be a bumper year for the publication of revised Standards of long-established programming language: COBOL, Fortran, C, and C++ (a revised Standard for Ada was published last year).

Yes, COBOL; a new COBOL Standard was published in January. Reports of its death were premature, e.g., my 2014 post suggesting that the latest version would be the last version of the Standard, and the closing of PL22.4, the US Cobol group, in 2017. There has even been progress on the COBOL front end for gcc, which now supports COBOL 85.

The size of the COBOL Standard has leapt from 955 to 1,229 pages (around new 200 pages in the normative text, 100 in the annexes). Comparing the 2014/2023 documents, I could not see any major additions, just lots of small changes spread throughout the document.

Every Standard has a project editor, the person tasked with creating a document that reflects the wishes/votes of its committee; the project editor sends the agreed upon document to ISO to be published as the official ISO Standard. The ISO editors would invariably request that the project editor make tiresome organizational changes to the document, and then add a front page and ISO copyright notice; from time to time an ISO editor took it upon themselves to reformat a document, sometimes completely mangling its contents. The latest diktat from ISO requires that submitted documents use the Cambria font. Why Cambria? What else other than it is the font used by the Microsoft Word template promoted by ISO as the standard format for Standard’s documents.

All project editors have stories to tell about shepherding their document through the ISO editing process. With three Standards (COBOL lives in a disjoint ecosystem) up for publication this year, ISO editorial issues have become a widespread topic of discussion in the bubble that is language standards.

Traditionally, anybody wanting to be actively involved with a language standard in the UK had to find the contact details of the convenor of the corresponding language panel, and then ask to be added to the panel mailing list. My, and others, understanding was that provided the person was a UK citizen or worked for a UK domiciled company, their application could not be turned down (not that people were/are banging on the door to join). BSI have slowly been computerizing everything, and, as of a few years ago, people can apply to join a panel via a web page; panel members are emailed the CV of applicants and asked if “… applicant’s knowledge would be beneficial to the work programme and panel…”. In the US, people pay an annual fee for membership of a language committee ($1,340/$2,275). Nobody seems to have asked whether the criteria for being accepted as a panel member has changed. Given that BSI had recently rejected somebodies application to join the C++ panel, the C++ panel convenor accepted the action to find out if the rules have changed.

In December, BSI emailed language panel members asking them to confirm that they were actively participating. One outcome of this review of active panel membership was the disbanding of panels with ‘few’ active members (‘few’ might be one or two, IST/5 members were not sure). The panels that I know to have survived this cull are: Fortran, C, Ada, and C++. I did not receive any email relating to two panels that I thought I was a member of; one or more panel convenors may be appealing their panel being culled.

Some language panels have been moribund for years, being little more than bullet points on the IST/5 agenda (those involved having retired or otherwise moved on).

Categories: Uncategorized Tags: C Standard, Cobol, committee, ISO Standard, IST/5, language standard, programming language

How indeterminate is an indeterminate value?

June 18, 2017 Derek Jones 3 comments

One of the unwritten design aims of the C Standard is that it should be possible to fully implement the C Standard library in conforming C. It turned out that this was not possible in C90; the problem was implementing the memcpy function when the object being copied was an object having a struct type containing one or more padding bytes. The memcpy library function copies the bytes in one object to another object. The padding bytes might be uninitialized (they have an indeterminate value), which means accessing them is undefined behavior (in C90), i.e., use of memcpy for copying structs containing padding results in a non-conforming program.

struct {
        char c; // Occupies 1 byte
        // Possible padding bytes here
        int i;  // A 2/4-byte int sometimes has to be aligned on a 2/4-byte storage boundary
       };

Padding bytes could be set to a known value by, for instance, using memcpy to zero the storage; requiring this usage was thought to be excessive, and a hefty chunk of new words was added in C99 (some of the issues raised by this problem also cropped up elsewhere, which contributed to the will to do this).

One consequence of the new wording is that objects having type unsigned char are special in that while their uninitialized value is still indeterminate, the possible set of values excludes a trap representation, they have an unspecified value making accesses unspecified behavior (which conforming programs can contain). The uninitialized value of objects having other types can be a trap representation; it’s the possibility of a value being a trap representation that makes accessing such uninitialized objects undefined behavior.

All well and good, memcpy can now be implemented in conforming C(99) by copying unsigned chars.

Having made it possible for a conforming program to access an uninitialized object (having type unsigned char), questions about it actual value can be asked. Its value is indeterminate you say, the clue is in the term indeterminate value. Ok, what does the following value function return?

unsigned char random(void)
{
unsigned char x;
 
return x ^ x;
}

Exclusiving-oring a value with itself always produces zero. An unsigned char taking, say, values 0 to 255, pick one and you always get zero; case closed. But where does it say that an indeterminate value is always the same value? There is no wording preventing an indeterminate value being different every time it is accessed. The sound of people not breathing could be heard when this was pointed out to WG14 (the C Standard’s committee), followed by furious argument on one side or the other.

The following illustrates one situation where the value of padding bytes could change with every access. That volatile qualifier specifies that the value of c could change between two accesses (e.g., it represents the storage layout of some memory mapped I/O device). Perhaps any padding bytes following it are also effectively volatile-qualified.

struct {
        volatile char c; // A changeable 1 byte
        // Possible padding bytes may be volatile
        int i;  // No volatility here
       };

The local object x, above, is not associated with a volatile-qualified object. But, so what? Another unwritten design aim of the C Standard is to keep the wording simple, so edge cases are not called out and the behavior intended to handle padding bytes gets applied to local unsigned chars.

A compiler could decide that calls to random always return zero, based on the assumption that while indeterminate values may not be known, they are not time varying.

Categories: Uncategorized Tags: C Standard, padding, uninitialized, unsigned char, unspecified

C Standard meeting, April 2016

April 15, 2016 Derek Jones No comments

I was at the ISO C Standard’s meeting in London this week; it has been five years since I last attended a WG14 meeting, when it was last in London (my jet-setting standard’s meeting days are long gone). Around 20 people attended, of which slightly more than half I knew from previous meetings. Given how unchanging the membership was for so long, this is a large change and it’s great to see so many new people being interested in C (including and open source vendor, RedHat). There is also a change of convener since my last meeting; David Keaton is a long-standing member and as meeting chair he kept things motoring along.

The format of each day, after the first morning, was to spend an hour at the start of each morning and afternoon working on Defect Reports, break and then work through documents in the pre-meeting mailing.

The topic of note on Monday afternoon was a proposal to add support for the type short float in C2X. There is a lot of hardware support for 16 bit floating-point operations (e.g., SSE instructions) and C is behind the curve on this. There was consensus to move forward on this proposal.

Tuesday was taken up by discussing proposals under the general heading of clarifying the C memory object model; various papers by a formal methods group at Cambridge University that I have written about before. I had misunderstood the intent behind the papers; the Prof running the project wanted to fix the programming world by changing the C Standard (I thought he just wanted clarification of what the standard said). While fixing the programming world is a commendable goal, messy reality and very strong interests for not changing existing behavior are likely to maintain the status quo. Talking to the post grad working on the project, they seem to be doing all the right things, so we could be seeing some very interesting results (a major threat to success is the sheer volume of material that has to be covered).

Wednesday covered the charter for revising C, various proposals for new features in C2X (mostly lots of thread based stuff), conversion of the document to LaTeX (currently in nroff/groff; there was no sentiment to follow C++ and put the draft on a public GitHub repo). When C89 became an ANSI standard, before C90 became an ISO standard, Rex Jaeschke handed out a floppy of the C89 nroff sources to those attending one of the meetings (I forget which). Unless you happen to have an AT&T 3b2 and know which options to give nroff, you are very unlikely to be able to generate something that looks like C89.

Thursday covered another C2X proposal, closures using syntax and semantics supported by C on Apple (Borland got there first by supporting the __closure qualifier on pointers). In the afternoon, we had a presentation of the latest C binding to the guidance on avoiding vulnerabilities in programming languages work going on in WG23. WG23 wanted WG14 to endorse this document and take ownership of it; lots of push back on this, and all they got was a request to WG14 members to send any suggested improvements to WG23.

The next WG14 meeting is during October in Pittsburgh, and I have no idea when the next meeting will be held in the UK (unlikely to be within three years).

Categories: Uncategorized Tags: C Standard, WG14

Arm waver or expert?

May 30, 2013 Derek Jones No comments

I was at a workshop yesterday where one of the speakers claimed his tool supported all of Standard C; needless to say I went over to chat during one of the breaks. Now a lot of the time speakers will immediately admit to implementing a subset when chatting over coffee, but this guy was claiming all of Standard C and had a very favorable opinion of his own expertise. It is impolite and somewhat confrontational to stand in front of somebody with a checklist of questions; what topic should be worked into the conversation to best gauge whether a person really does have a detailed grasp of C?

I tossed the phrase strictly conforming into the conversation and a bit later integer promotions, these resulted in just more huff and puff from him. Now people with a detailed knowledge of C are thin on the ground and an encounter with one usually results in a warm mutual exchange of war stories on the problems encountered during the implementation of some feature or other. Perhaps this guy’s tool had been implemented by a student and this was not part of the message; perhaps this was his first encountered with somebody who also had a detailed knowledge of the language and he did not know how to react (I imagine such people are even rarer in academia than industry).

Thinking about it on the train home I decided that “sequenced before” is the phrase I should have tossed into the conversation. The concept of sequence points existed in C90 and C99, but was replaced by “sequenced before” and unsequenced in C11 (a more complicated memory ordering model was necessitated by the newly added support for sharing objects between processes). Yesterday’s speaker was not there today, so I was not able to reinvestigate whether his knowledge was pre/post C11 or just bluster.

The C++ phrase to toss into a conversation used to be One definition rule, but I don’t know if this is still true today. I once saw an email exchange where a supposed expert had never heard about the “one definition” rule and jokingly asked if it was connected to the “one ring” in Lord of the Rings; oops, he had obviously never read the C++ Standard.

What about other languages? Suggestions for phrases I might use to gauge whether somebody really is an expert in some other language welcome (this would be pure bluff on my part and I accept the consequences of using any suggestion).

Categories: Uncategorized Tags: C, C Standard, expert, one definition rule, sequence point, sequenced before

Undefined behavior can travel back in time

July 12, 2012 Derek Jones 4 comments

The committee that produced the C Standard tried to keep things simple and sometimes made very short general statements that relied on compiler writers interpreting them in a ‘reasonable’ way. One example of this reliance on ‘reasonable’ behavior is the definition of undefined behavior; “… erroneous program construct or of erroneous data, for which this International Standard imposes no requirements”. The wording in the Standard permits a compiler to process the following program:

int main(int argc, char **argv)
{
// lots of code that prints out useful information
 
1 / 0;  // divide by zero, undefined behavior
}

to produce an executable that prints out “yah boo sucks”. Such behavior would probably be surprising to the developer who expected the code printing the useful information to be executed before the divide by zero was encountered. The phrase quality of implementation is heard a lot in committee discussions of this kind of topic, but this phrase does not appear in any official document.

A modern compiler is essentially a sophisticated domain specific data miner that happens to produce machine code as output and compiler writers are constantly looking for ways to use the information extracted to minimise the code they generate (minimal number of instructions or minimal amount of runtime). The following code is from the Linux kernel and its authors were surprised to find that the “division by zero” messages did not appear when arg2 was 0, in fact the entire if-statement did not appear in the generated code; based on my earlier example you can probably guess what the compiler has done:

if (arg2 == 0)
   ereport(ERROR, (errcode(ERRCODE_DIVISION_BY_ZERO),
                                             errmsg("division by zero")));
/* No overflow is possible */
PG_RETURN_INT32((int32)arg1 / arg2);

Yes, it figured out that when arg2 == 0 the divide in the call to PG_RETURN_INT32 results in undefined behavior and took the decision that the actual undefined behavior in this instance would not include making the call to ereport which in turn made the if-statement redundant (smaller+faster code, way to go!)

There is/was a bug in Linux because of this compiler behavior. The finger of blame could be pointed at:

the developers for not specifying that the function ereport does not return (this would enable the compiler to deduce that there is no undefined behavior because the divide is never execute when arg2 == 0),
the C Standard committee for not specifying a timeline for undefined behavior, e.g., program behavior does not become undefined until the statement containing the offending construct is encountered during program execution,
the compiler writers for not being ‘reasonable’.

In the coming years more and more developers are likely to encounter this kind of unexpected behavior in their programs as compilers do more and more data mining and are pushed to improve performance. Other examples of this kind of behavior are given in the paper Undefined Behavior: Who Moved My Code?

What might be done to reduce the economic cost of the fallout from this developer ignorance/standard wording/compiler behavior interaction? Possibilities include:

developer education: few developers are aware that a statement containing undefined behavior can have an impact on the execution of code that occurs before that statement is executed,
change the wording in the Standard: for many cases there is no reason why the undefined behavior be allowed to reach back in time to before when the statement executing it is executed; this does not mean that any program output is guaranteed to occur, e.g., the host OS might delete any pending output when a divide by zero exception occurs.
paying gcc/llvm developers to do front end stuff: nearly all gcc funding is to do code generation work (I don’t know anything about llvm funding) and if the US Department of Homeland security are interested in software security they should fund related front end work in gcc and llvm (e.g., providing developers with information about suspicious usage in the code being compiled; the existing -Wall is a start).

Categories: Uncategorized Tags: C Standard, compiler writer, faults, gcc, if statement, llvm, undefined behavior

The Shape of Code

Archive

C compiler conformance testing: with ChatGPT assistance

2023 in the programming language standards’ world

How indeterminate is an indeterminate value?

C Standard meeting, April 2016

Arm waver or expert?

Undefined behavior can travel back in time

Recent Posts

Recent Comments

Archives

Meta