May 17, 2019 Derek Jones No comments

DR 260 is a defect report submitted to WG14, the C Standards’ committee, in 2001 that was never resolved, then generally ignored for 10-years, then caught the attention of a research group a few years ago, and is now back on WG14’s agenda. The following discussion covers two of the three questions raised in the DR.

Consider the following fragment of code:

int *p, *q;
 
    p = malloc (sizeof (int)); assert (p != NULL);  // Line A
    (free)(p);                                      // Line B
    // more code
    q = malloc (sizeof (int)); assert (q != NULL);  // Line C
    if (memcmp (&p, &q, sizeof p) == 0)             // Line D
       {*p = 42;                                    // Line E
        *q = 43;}                                   // Line F

Section 6.2.4p2 of the C Standard says:
“The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime.”

The call to free, on line B, ends the lifetime of the storage (allocated on line A) pointed to by p.

There are two proposed interpretations of the sentence, in 6.2.4p2.

“becomes indeterminate” is treated as effectively storing a value in the pointer, i.e., some bit pattern denoting an indeterminate value. This interpretation requires that any other variables that had been assigned p‘s value, prior to the free, also have an indeterminate value stored into them,
the value held in the pointer is to be treated as an indeterminate value (for instance, a memory management unit may prevent any access to the corresponding storage).

What are the practical implications of the two options?

The call to malloc, on line C, could return a pointer to a location that is identical to the pointer returned by the first call to malloc, i.e., the second call might immediately reuse the free‘ed storage.

Effectively storing a value in the pointer, in response to the call to free means the subsequent call to memcmp would always return a non-zero value, and the questions raised below do not apply; it would be a nightmare to implement, especially in a multi-process environment.

If the sentence in section 6.2.4p2 is interpreted as treating the pointer value as indeterminate, then the definition of malloc needs to be updated to specify that all returned values are determinate, i.e., any indeterminacy that may exist gets removed before a value is returned (the memory management unit must allow read/write access to the storage).

The memcmp, on line D, does a byte-wise compare of the pointer values (a byte-wise compare side-steps indeterminate value issues). If the comparison is exact, an assignment is made via p, line E, and via q, line F.

Does the assignment via p result in undefined behavior, or is the conformance status of the code unaffected by its presence?

Nobody is impuning the conformance status of the assignment via q, on line F.

There are people who think that the assignment via p, on line E, should be treated as undefined behavior, despite the fact that the values of p and q are byte-wise identical. When this issue was first raised (by those trouble makers in the UK ;-), yours truly was less than enthusiastic, but there were enough knowledgeable people in the opposing camp to keep the ball rolling for a while.

The underlying issue some people have with some subsequent uses of p is its provenance, the activities it has previously been associated with.

Provenance can be included in the analysis process by associating a unique number with the address of every object, at the start of its lifetime; these p-numbers are not reused.

The value returned by the call to malloc, on line A, would include a pointer to the allocated storage, plus an associated p-number; the call on line C could return a pointer having the same value, but its p-number is required to be different. Implementations are not required to allocate any storage for p-numbers, treating them purely as conceptual quantities. Your author knows of two implementations that do allocate storage for p-numbers (in a private area), and track usage of p-numbers; the Model Implementation C Checker was validated as handling all of C90, and Cerberus which handles a substantial subset of C11, and I don’t believe that the other tools that check array bounds and use after free are based on provenance (corrections welcome).

If provenance is included as part of a pointer’s value, the behavior of operators needs to be expanded to handle the p-number (conceptual or not) component of a pointer.

The rules might specify that p-numbers are conceptually compared by the call to memcmp, on line C; hence p and q are considered to never compare equal. There is an existing practice of regarding byte compares as just that, i.e., no magic ever occurs when comparing bytes (otherwise known as objects having type unsigned char).

Having p-numbers be invisible to memcmp would be consistent with existing practice. The pointer indirection operation on line E (generating undefined behavior) is where p-numbers get involved and cause the undefined behavior to occur.

There are other situations where pointer values, that were once indeterminate, can appear to become ‘respectable’.

For a variable, defined in a function, “… its lifetime extends from entry into the block with which it is associated until execution of that block ends in any way.”; section 6.2.4p3.

In the following code:

int x;
static int *p=&x;
 
void f(int n)
{
   int *q = &n;
   if (memcmp (&p, &q, sizeof p) == 0)
      *p = 0;
   p = &n; // assign an address that will soon cease to exist.
} // Lifetime of pointed to object, n, terminates here
 
int main(void)
{
   f(1); // after this call, p has an indeterminate value
   f(2);
}

the pointer p has an indeterminate value after any call to f returns.

In many implementations, the second call to f will result in n having the same address it had on the first call, and memcmp will return zero.

Again, there are people who have an issue with the assignment involving p, because of its provenance.

One proposal to include provenance contains substantial changes to existing word in the C Standard. The rationale for is proposals looks more like a desire to change wording to make things clearer for those making the change, than a desire to address DR 260. Everybody thinks their proposed changes make the wording clearer (including yours truly), such claims are just marketing puff (and self-delusion); confirmation from the results of an A/B test would add substance to such claims.

It is probably possible to explicitly include support for provenance by making a small number of changes to existing wording.

Is the cost of supporting provenance (i.e., changing existing wording may introduce defects into the standard, the greater the amount of change the greater the likelihood of introducing defects), worth the benefits?

What are the benefits of introducing provenance?

Provenance makes it possible to easily specify that the uses of p, in the two previous examples (and a third given in DR 260), are undefined behavior (if that is WG14’s final decision).

Provenance also provides a model that might make it easier to reason about programs; it’s difficult to say one way or the other, without knowing what the model is.

Supporters claim that provenance would enable tool vendors to flag various snippets of code as suspicious. Tool vendors can already do this, they don’t need permission from the C Standard to flag anything they fancy.

The C Standard requires a conforming implementation to diagnose certain constructs. A conforming implementation can issue as many messages as it likes, for any other construct, e.g., for line A in the first example, a compiler might print “This is the 1,000,000’th call to malloc I have translated, ring this number to claim your prize!”

Before any changes are made to wording in the C Standard, WG14 needs to decide what the behavior should be for these examples; it could decide to continue ignoring them for another 20-years.

Once a decision is made, the next question is how to update wording in the standard to specify the behavior that has been decided on.

While provenance is an interesting idea, the benefits it provides appear to be not worth the cost of changing the C Standard.

Categories: Uncategorized Tags: defect report, pointer, provenance, WG14

A trip down memory lane with Microsoft Word 1.1

March 31, 2014 Derek Jones 3 comments

Microsoft has donated the source code of Word for Windows 1.1a to the Computer history Museum and I have been rummaging around in this C code from the last 1980s.

What immediately struck me is how un-Microsoft Windows like the code looks, in some ways it looks more like Unix code. There are:

very few near/far/huge pointer declarations. The Intel x86 architecture is based on segmented addressing of memory, a fact that most developers are blissfully unaware of because 32-bit segments are big enough to be able to ignore this fact; back in the day we had 16-bit segments and there were near pointers that could only point within a segment, far pointers that could point to the contents of other segments (there were alignment issues associated with these) and huge pointers which are essentially 32-bit pointers. Many developers obsessed over saving time/space and did their utmost to only use near pointers (who is ever going to need a data structure larger than 64k?) and Windows code was often littered with different kinds of pointer usage,
very few #if/#endif. Why do developers use #if/#endif? They want to port their code to different versions of MS-DOS, use different vendor compilers and to use different third-party libraries. Microsoft were well known for not being backwards compatible with themselves (they have improved somewhat on that score) and obviously had no interest in using other vendor compilers and third-party libraries. Unix code of the day and today is of course packed with #if/#endif.

For any developer who has only used C since 2000 the most obvious code ‘quirk’ is probably the use of K&R style function definitions. This used to be the only way of defining functions until the first C Standard, in 1989, introduced function prototypes and the mass migration of the 1990’s occurred. The K&R style function definition looks like no big deal:

f()
int a;
struct T b;
{
/* blah blah */
}

instead of:

f(int a, struct T b)
{
/* blah blah */
}

until you find out that in the first case, when f is called, there is no checking of the arguments against the parameters appearing in the definition. This lack of checking was/is a constant source of annoying faults that could/can take hours to track down; once a developer has experienced the benefits of automatic argument/parameter checking they soon convert to the prototype form.

The makefile included with the source contains compiler options that won’t be familiar to Windows developers. This is because a special purpose C compiler was used, one that generated P-code (the P here is for Packed not Portable). In the late 1980s, when this code was written, a 256K was considered a lot of memory for your average person’s computer and at +180K lines of code Word 1.1 was enormous. Word processors are on the whole not cpu intensive tasks and trading off a factor of 10 (I’m guessing, probably more) for a memory footprint saving of 2.5-4 (not so much of a guess) was obviously considered worthwhile (but then Microsoft apps have never been known for being nimble).

The extensive use of bit-fields is the other clue that memory is tight. Do this ever get used outside of comms software and embedded systems these days?

I was surprised at how clean and presentable this source code is. I should not have judged Microsoft developers by the horrors perpetrated by many developers targeting their platform.

Categories: Uncategorized Tags: K&R, Microsoft Word, pointer, segmented architecture

The Shape of Code

Archive

Background checks on pointer values being considered for C

A trip down memory lane with Microsoft Word 1.1

Recent Posts

Recent Comments

Archives

Meta