June 6, 2021 Derek Jones 2 comments

When creating a variable name, to what extent are developers influenced by their native human language?

There is lots of evidence that variable names are either English words, abbreviations of English words, or some combination of these two. Source code containing a large percentage of identifiers using words from other languages does exist, but it requires effort to find; there is a widely expressed view that source should be English based (based on my experience of talking to non-native English speakers, and even the odd paper discussing the issue, e.g., Language matters).

Given that variable names can provide information that reduces the effort needed to understand code, and that most code is only ever read by the person who wrote it, developers should make the most of their expertise in using their native language.

To what extent do non-native English-speaking developers make use of their non-English native language?

I have found it very difficult to even have a discussion around this question. When I broach the subject with non-native English speakers, the response is often along the lines of “our developers speak good English.” I am careful to set the scene by telling them of my interest in naming, and that I think there are benefits for developers to make use of their native language. The use of non-English languages in software development is not yet a subject that is open for discussion.

I knew that sooner or later somebody would run an experiment…

How Developers Choose Names is another interesting experiment involving Dror Feitelson (the paper rather confusingly refers to it as a survey; see my post on an earlier experiment).

What makes this experiment interesting is that bilingual subjects (English and Hebrew) were used, and the questions were in English or Hebrew. The 230 subjects (some professional, some student) were given a short description and asked to provide an appropriate variable/function/data-structure name; English was used for 26 of the questions, and Hebrew for the other 21 questions, and subjects answered a random subset.

What patterns of Hebrew usage are present in the variable names?

Out of 2017 answers, 14 contained Hebrew characters, i.e., not enough for statistical analysis. This does not mean that all the other variable names were derived only from English words; in some cases Hebrew words appeared via transcription using the 26 English letters. For instance, using “pinuk” for the Hebrew word that means “benefit” in English. Some variables were created from a mixture of Hebrew and English words, e.g., deservedPinuks and pinuksUsed.
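
The two mechanical checks involved here (does a name contain Hebrew characters, and which words is a mixed name built from) can be sketched in a few lines of Python. This is only an illustration: the function names, the word-splitting rule, and the choice of the Unicode Hebrew block (U+0590 to U+05FF) are assumptions made for this sketch, not the method used in the paper.

```python
import re


def contains_hebrew(name):
    """True if any character lies in the Unicode Hebrew block (U+0590-U+05FF)."""
    return any("\u0590" <= ch <= "\u05ff" for ch in name)


def split_words(name):
    """Split a camelCase or snake_case name into lower-case words.

    Hypothetical helper: the splitting rule is an assumption, and it only
    handles names written with ASCII letters and digits.
    """
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [part.lower() for part in parts]


for name in ["deservedPinuks", "pinuksUsed"]:
    print(name, contains_hebrew(name), split_words(name))
```

A transcribed word such as pinuks passes straight through the character check, which is why deciding whether a word is Hebrew or English still needs a reader fluent in both languages.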

Analysing this data requires someone who is fluent in Hebrew and English. I am not a fluent, or even non-fluent, Hebrew speaker. My role in this debate is encouraging others, and at last I have some interesting data to show people.

The paper spends time showing how personal preferences result in a wide selection of names being chosen by different people for the same quantity. I cannot think of any software engineering papers that have addressed this issue for variable names, but there is lots of evidence from other fields; also see figure 7.33.

Those interested in searching source code for the impact of native language might like to look at the names of variables appearing as operands of the bitwise and logical operators. Some English words occur much more frequently in the names of these variables, compared to variables that are operands of arithmetic operators, e.g., flag, status, and signal. I predict that non-native English-speaking developers will make use of corresponding non-English words.
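
One way such a search might be started is sketched below, assuming the code being searched is Python so that the standard ast module can do the parsing; other languages would need their own parser. The function name operand_names and the two category labels are made up for this sketch, and only names that are direct operands are counted.

```python
import ast
from collections import Counter

# Operator groupings are an assumption made for this sketch.
BITWISE = (ast.BitAnd, ast.BitOr, ast.BitXor, ast.LShift, ast.RShift)
ARITHMETIC = (ast.Add, ast.Sub, ast.Mult, ast.Div, ast.FloorDiv, ast.Mod, ast.Pow)


def operand_names(source):
    """Count variable names that are direct operands, grouped by operator kind."""
    counts = {"bitwise_logical": Counter(), "arithmetic": Counter()}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BoolOp):                      # and / or
            kind, operands = "bitwise_logical", node.values
        elif isinstance(node, ast.UnaryOp):
            if isinstance(node.op, (ast.Not, ast.Invert)):    # not x, ~x
                kind = "bitwise_logical"
            else:                                             # -x, +x
                kind = "arithmetic"
            operands = [node.operand]
        elif isinstance(node, ast.BinOp):
            if isinstance(node.op, BITWISE):
                kind = "bitwise_logical"
            elif isinstance(node.op, ARITHMETIC):
                kind = "arithmetic"
            else:
                continue                                      # e.g. matrix multiply
            operands = [node.left, node.right]
        else:
            continue
        for operand in operands:
            if isinstance(operand, ast.Name):
                counts[kind][operand.id] += 1
    return counts


sample = "ok = flag & mask\nready = status and not signal\ntotal = price * count + tax\n"
result = operand_names(sample)
print(sorted(result["bitwise_logical"]), sorted(result["arithmetic"]))
```

Splitting the collected names into words and comparing word frequencies between the two groups would be the next step; that part is not shown.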

Categories: Uncategorized Tags: experiment, human language, identifier, naming

Variable naming based on lengths of existing variable names

December 17, 2013 Derek Jones No comments

Over the years I have spent a lot of time studying variable names and I sometimes encounter significant disbelief when explaining the more unusual developer variable name selection algorithms.

The following explanation from Rasmus Lerdorf, of PHP fame, provides a useful citable source for a variant on a common theme (i.e., name length).

“… Back when PHP had less than 100 functions and the function hashing mechanism was strlen(). In order to get a nice hash distribution of function names across the various function name lengths names were picked specifically to make them fit into a specific length bucket. This was circa late 1994 when PHP was a tool just for my own personal use and I wasn’t too worried about not being able to remember the few function names.”
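
To see why name length mattered, here is a minimal Python sketch of a hash table whose hash function is just the length of the name. It is not PHP's original code, and the function names listed are picked purely for illustration (no claim is made that they existed in 1994).

```python
from collections import defaultdict


def length_hash(name):
    """Hash a function name by its length, as strlen() would."""
    return len(name)


def bucket_by_length(names):
    """Group names into buckets keyed by their length hash."""
    buckets = defaultdict(list)
    for name in names:
        buckets[length_hash(name)].append(name)
    return dict(buckets)


# Illustrative names only.
names = ["strlen", "substr", "implode", "explode", "htmlspecialchars"]
for length, members in sorted(bucket_by_length(names).items()):
    print(length, members)
```

Every name of the same length lands in the same bucket, so spreading names across different lengths is the only way to keep the buckets short.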

Pointers to other admissions of youthful folly welcome.

Categories: Uncategorized Tags: naming, PHP, youthful folly

The most worthwhile R coding guidelines I know

March 2, 2013 Derek Jones 2 comments

Since my post questioning whether native R usage exists (e.g., a common set of R coding patterns) several people have asked about coding/style guidelines for R. My approach to style/coding guidelines is economic: adhering to a guideline involves paying a cost now for some future benefit. Obviously, to be worthwhile the benefit must be greater than the cost; there is also the issue of who pays the cost and who reaps the benefit (why would anybody pay the cost if somebody else reaps the benefit?). The following three topics are probably where the biggest benefits are to be had, and only the third is specific to R (and given the state of my R knowledge may be wrong).

Comment your code. Investing 5-10 seconds per few lines of code now could save substantially more time at some future date. Effective commenting is a skill that has to be learned; start learning now. Think of commenting as sending a text message or tweet to the person you will be in 6 months’ time (i.e., the person who can hum the tune but has forgotten the details).

Consistently use variable names that mean something to you. This should be a sub 2-second decision that is probably going to save you no more than 5-10 seconds, but in many cases you reap the benefit soon after the investment, without having to wait many months. Names evoke associations in your mind; take advantage of this associative lookup to reduce the cognitive load of working with your code. Effective naming is a skill that has to be learned; start learning now. There are people who ignore the evidence that different people’s linguistic preferences and associations can be very different and insist that everybody adhere to one particular naming convention; ignore them.

Code organization and structure. Experience shows that there are ways of organizing and structuring 1,000+ line programs that have a significant impact on the effort needed to actively work on the code; the more code there is, the greater the impact. R programs tend to be short, say around 100 lines (I dare say much longer ones exist). Apart from recommending that code be broken up into separate functions, I cannot think of any organizational/structural issue that is worth recommending for 100 lines of code (if you don’t appreciate the advantage of using separate functions you need some hands-on training, not words in a blog post).

Is that it, are there no other worthwhile recommendations? There might be, I just don’t have enough experience using R to know. Does anybody else have enough experience to know? I suspect not; where would they have gotten the information needed to do the cost/benefit analysis? Even in the rare case where a detailed analysis is made for a language the results are rather thin on the ground and somewhat inconclusive.

What is the reason behind those R style guides/coding guideline documents that have been written? The following are some possibilities:

reducing maintenance costs (the official reason touted by purveyors of received wisdom): this is a very good reason that is let down by the complete lack of any empirical evidence that following any guidelines makes the slightest difference to maintenance costs. You R users are likely to have a lot more experience than me dealing with people claiming stuff for which there is no evidence, and I will not presume to suggest how you might handle such claims (if somebody does show you some good data do please send me a copy).
marketing (sometimes openly given as a reason): managers like to tell, and customers like to hear, about the existence of such a document and its role in ensuring delivery of a quality product. If you are being shown around a company and are told that they follow some style guideline, it’s always interesting to see what happens when you ask to see a copy of this guideline document; not being able to find a copy is a surprisingly common occurrence.
fashion (rarely admitted to): behaving like a herd and following trend setters is a common human trait; not only are there lots of ways of designing clothes, but there are lots of ways in which code can be written. What kind of manager wants to have unfashionable developers working for them, and who wouldn’t like to take a few days off to attend a boutique conference or chat to a friendly uncle (these guys can be messianic speakers, and questioning them about lack of evidence can draw a negative response from the crowd)?

and no, I don’t have any empirical data to back up my guidelines 🙁

Categories: Uncategorized Tags: coding guidelines, comments, cost/benefit, fashion, naming, R, style

Using local context to disambiguate source

February 12, 2009 Derek Jones No comments

Developers can often do a remarkably good job of figuring out what a snippet of code does without seeing (i.e., knowing anything about) most of the declarations of the identifiers involved. In a previous post I discussed how frequency of occurrence information could be used to help parse C without using a symbol table. Other information that could be used is the context in which particular identifiers occur. For instance, in:

f(x);
y = (f)z;

while the code f(x); is probably a function call, the use of f as the type in a cast means that f(x) is actually a definition of an object x having type f.
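
The reasoning in this example can be mimicked by a crude Python sketch: collect the names that appear in cast position, then treat a call-like statement whose leading name is one of them as a declaration. The regular expressions and the classify function are assumptions made for illustration; they ignore keywords, control-flow statements, comments and much else, so this is nowhere near a real C parser.

```python
import re

# "(name)" followed by an identifier, and not itself preceded by an
# identifier character or ")", is treated as a cast. A rough heuristic.
CAST = re.compile(r"(?<![\w)])\(\s*([A-Za-z_]\w*)\s*\)\s*[A-Za-z_]")
# A whole statement of the form "name(ident);".
CALL_LIKE = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(\s*([A-Za-z_]\w*)\s*\)\s*;")


def classify(source):
    """Label each 'name(ident);' line as a declaration or a call, using cast context."""
    type_names = set(CAST.findall(source))
    results = []
    for line in source.splitlines():
        match = CALL_LIKE.match(line)
        if match:
            kind = "declaration" if match.group(1) in type_names else "call"
            results.append((line.strip(), kind))
    return results


print(classify("f(x);\ny = (f)z;\n"))   # cast context is present
print(classify("f(x);\ny = z;\n"))      # no cast context
```

With the cast present the statement f(x); is labelled a declaration; without it, the default guess of a function call is used.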

A project investigating the analysis of partial Java programs uses this context information as its sole means of disambiguating Java source (while they do build a symbol table, they do not analyze the source of any packages that might be imported). Compared to C, Java parsers have it easy, but Java’s richer type system means that semantic analysis can be much more complicated.

On a set of benchmarks the researchers obtained a very reasonable 91.2% accuracy in deducing the type of identifiers.

There are other kinds of information that developers probably use to disambiguate source: the operation that the code is intended to perform and the identifier names. Figuring out the ‘high level’ operation that code performs is a very difficult problem, but the names of Java identifiers have been used to predict object lifetime and appear to be used to help deduce operator precedence. Parsing source by just looking at the identifiers (i.e., treating all punctuators and operators as whitespace) has been on my list of interesting projects to do for some time, but projects that are likely to provide a more immediate interesting result keep getting in the way.
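
The first step of such a project, reducing source to its sequence of identifiers, might look like the following Python sketch. The function name is made up, and the sketch makes no attempt to skip keywords, string literals or comments, which a serious attempt would have to deal with.

```python
import re

# A word that starts with a letter or underscore; the leading \b stops the
# tail of a numeric literal such as 0x1F from being picked up as a name.
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def identifier_stream(source):
    """Return the identifiers in source order, treating everything else as whitespace."""
    return IDENTIFIER.findall(source)


print(identifier_stream("y = (f)z + count2 * 0x1F;"))
```

What to do with the resulting stream is the interesting (and unanswered) part of the question.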

Categories: Uncategorized Tags: ambiguous, context information, declarations missing, Java, naming, parsing

Naming used to predict object lifetime

December 5, 2008 Derek Jones No comments

One of the most surprising empirical results I heard about this year was that the name of a Java object could (reasonably) reliably be used to predict its lifetime on the heap. Being a huge advocate of the importance of naming I should not have been surprised.

The author, Jeremy Singer, invited me to Manchester to talk about my own experiments and I heard about his group’s latest project investigating how to subdivide a Java program so that bits of it can be executed on different processors. I suggested various ways in which naming might be used to group semantically related functionality (would it do better than simple statement colocation, you ask) and wait to see if the group goes with any naming ideas (my suggestions were accompanied by a fair amount of arm waving, so I might have to wait a while).

Categories: Uncategorized Tags: heap, lifetime, naming

The Shape of Code

Archive

Impact of native language on variable naming

