January 3, 2009 Derek Jones No comments

Approximately 30% of source code is not checked for correct syntax (developers can make up any rules they like for its internal syntax), semantic accuracy or consistency; people are content to shrug their shoulders at this this state of affairs and are generally willing to let it pass. I am of course talking about comments; the 30% figure comes from my own measurements with other published measurements falling within a similar ballpark.

Part of the problem is that comments often contain lots of natural language (i.e., human not computer language) and this is known to be very difficult to parse and is thought to be unusable without all sorts of semantic knowledge that is not currently available in machine processable form.

People are good at spotting patterns in ambiguous human communication and deducing possible meanings from it, and this has helped to keep comment usage alive, along with the fact that the information they provide is not usually available elsewhere and comments are right there in front of the person reading the code and of course management loves them as a measurable attribute that is cheap to do and not easily checkable (and what difference does it make if they don’t stay in sync with the code).

One study that did attempt to parse English sentences in comments found that 75% of sentence-style comments were in the past tense, with 55% being some kind of operational description (e.g., “This routine reads the data.”) and 44% having the style of a definition (e.g., “General matrix”).

There is a growing collection of tools for processing natural language (well at least for English). However, given the traditionally poor punctuation used in comments, the use of variable names and very domain specific terminology, full blown English parsing is likely to be very difficult. Some recent research has found that useful information can be extracted using something only a little more linguistically sophisticated than word sense disambiguation.

The designers of the iComment system sensibly limited the analysis domain (to memory/file lock related activities), simplified the parsing requirements (to looking for limited forms of requirements wording) and kept developers in the loop for some of the processing (e.g., listing lock related function names). The aim was to find inconsistencies between the requirements expressed in comments and what the code actually did. Within the Linux/Mozilla/Wine/Apache sources they found 33 faults in the code and 27 in the comments, claiming a 38.8% false positive rate.

If these impressive figures can be replicated for other kinds of coding constructs then comment contents will start to leave the dark ages.

Categories: Uncategorized Tags: comments, English, faults, inconsistency, measure, parsing

The Shape of Code

Archive

The 30% of source that is ignored

Recent Posts

Recent Comments

Archives

Meta