March 13, 2013 Derek Jones 4 comments

Writing a program in R is almost bureaucracy free: variables don’t need to be declared, the language does a reasonable job of guessing the type a value might need to be automatically be converted to, there is no need to create a function having a special name that gets called at program startup, the commonly used library functions are ready and waiting to be called and so on.

Not having a bureaucracy is all well and good when programs are small or short lived. Large programs need a bureaucracy to provide compartmentalization (most changes to X need to be prevented from having an impact outside of X, doing this without appropriate language support eventually burns out anybody juggling it all in their head) and long lived programs need a bureaucracy to provide version control (because R and its third-party libraries change over time).

Automatically installing a package from CRAN always fetches the latest version. This is all well and good during initial program development. But will the code still work in six months time? Perhaps the author of one of the packages used in the program submits a new version of that package to CRAN and this new version behaves slightly differently, breaking the previously working program. Once the problem is located the developer has either to update their code or manually install the older version of the package. Life would be easier if it was possible to specify the required package version number in the call to the library function.

Discovering that my code depends on a particular version of a CRAN package is an irritation. Discovering that two packages I use each have a dependency on different versions of the same package is a nightmare. Having to square this circle is known in the Microsoft Windows world as DLL hell.

There is a new paper out proposing a system of dependency versioning for package management. The author proposes adding a version parameter to the library function, plus lots of other potentially useful functionality.

Apart from changing the behavior of functions a program calls, what else can a package author do to break developer code? They can create new functions and variables. The following is some code that worked last week:

library("foo")  # The function get_question is in this package
library("bar")  # The function give_answer_42 is in this package
 
(the_question=get_question())
give_answer_42(the_question)

between last week and today the author of package foo (or perhaps the author of one of the packages that foo has a dependency on) has added support for the function solve_problem_42 and it is this function that will now get called by this code (unless the ordering of the calls to library are switched). What developers need to be able to write is:

library("foo", import=c("the_question"))  # The function get_question is in this package
library("bar", import=c("give_answer_42"))  # The function give_answer_42 is in this package
 
(the_question=get_question())
give_answer_42(the_question)

to stop this happening.

The import parameter enables developers to introduce some compartmentalization into my programs. Yes, R does have namespace management for packages, and I’m pleased to see that its use will be mandatory in R version 3.0.0, but this does not protect programs from functions the package author intends to export.

I’m not sure whether this import suggestion will connect with R users (who look very laissez faire to me), but I get very twitchy watching a call to library go off and install lots of other stuff and generate warnings about this and that being masked.

Categories: Uncategorized Tags: bureaucracy, dependency, import, language extensions, namespace, R

The fatal programming language research mistake

March 8, 2012 Derek Jones 7 comments

There is a fatal mistake often made by those involved in academic programming language research and a recent blog post (by an academic) asking if programming language research has a future has spurred me into writing about this mistake.

As an aside, I would agree with much of what the academic (Cristina (Crista) Videira Lopes) says about many popular modern programming languages being hacked together by kids who did not know much, if anything about, language design. However, this post is not a lament about the poor design quality of the languages commonly used in the commercial world; it is about the most common fatal mistake academics make when researching programming languages and a suggestion about how they can avoid making this mistake. What really endeared me to Crista was her critic of academic claims of language ‘betterness’ being completely unfounded (i.e., not being based on any empirical research).

The most common fatal mistake made by researchers in programing language design is to invent a new language. Creating an implementation for any language is a big undertaking and a new language has the added hurdles of convincing developers it is worth learning, providing the learning/reference materials and porting to multiple platforms. Researchers spend nearly all their time creating an implementation and a small percentage of their time actively researching the ‘new idea’.

The attraction of designing a new language is that it is regarded as ‘sexy’ activity and the first (and usually only) time around the work needed to create an implementations does not look that great.

If a researcher really does feel that their idea is so revolutionary it is worth creating a whole new language for and they want me, and others, to start using it, then they need to make sure they can answer yes to the following questions:

Have you, or your students, created an implementation of the language that provides reasonable diagnostics, executes programs at an acceptable rate and is available free of charge on the operating systems I use for software development?
Is sufficient documentation available for me to learn the language and act as a reference manual once I become more expert?
For the next five years will you, or your students, be providing, free of charge, prompt bug fixes to errors in your implementation?
Will you and your students spend the time necessary to build an active user community for your language?
For the next five years will you, or your students knowledgeable in the language, provide prompt support (via an email group or bulletin board) to user queries?

Some new languages from academia have managed to answer yes to these questions (Haskell, R and OCaml spring to mind, but only R looks like it will have any significant industrial take-up).

In practice most new languages fail to get past fragile implementations only ever used by their designer, with minimal new research to show for all the effort that went into them.

What programming language researchers need to do, at least if they want people outside of their own department to pay any attention to their ideas, is to experiment by adding functionality to an existing language. Using an existing language as a base has the advantages:

modifying an existing implementation is significantly less work than creating a new one,
having to address all of the features present in real world languages will help weed out poor designs that only look good on paper (I continue to be amazed that people can be involved in programming language research without knowing any language very well),
documentation for most of the language already exists,
more likely to attract early adopters, developers tend to treat existing language+extensions as being a much smaller jump than a new language.

Programming language research is something of a fashion industry and I can well imagine people objecting to having to deal with a messy existing language. Well yes, the real world is a messy place and if a new design idea cannot handle that it deserves to be lost to posterity.

One cannot blame students for being idealistic and thinking they can create a language that will take over the world. It is the members of staff who should be ridiculed for being either naive or intellectually shallow.

Categories: Uncategorized Tags: academic, language extensions, research

Compiler writing: The career path to World domination

November 7, 2009 Derek Jones 3 comments

Compiler writing is not usually thought of as a career path that leads to becoming Ruler of the World. Perhaps this is because compiler writing is a relatively new profession and us compiler writers are still toiling in obscurity awaiting the new dawn.

What might be a workable plan for a compiler writer to become Ruler of the World? One possibility is to write a compiler for the language in which most of the World’s critical software is written (i.e., C) and for that compiler to become the one that the vendors of this critical software all use (i.e., gcc). This compiler needs to do more that just compile the source code it is feed, it also needs to generate code that creates a backdoor in important programs (e.g., the login program).

But, you say, this cannot happen with gcc because its source is available for everybody to read (and spot any backdoor generator). In his 1984 Turing acceptance lecture Ken Thompson showed how a compiler could contain a backdoor that was not visible in its source. The idea is for the compiler writer to modify a compiler to detect when it is being used to compile itself and to insert the backdoor generating code into its own executable. This modified compiler is then used to compile itself and the resulting executable made the default compiler; the backdoor modifications are then removed from the compiler source, they are no longer needed because the previously compiled compiler will spot when it is being used to compile its own source and generate the code necessary to propagate the backdoor code into the executable it creates.

How would the world counter the appearance of such a modified gcc? Obviously critical programs would need to be recompiled by a version of gcc that did not contain the backdoor. Today there are several companies and many amateur groups that distribute their own Linux distributions which they build from source. It should be relatively easy to obtain a usable executable of gcc from 10 years ago; remember what is needed is a version capable of compiling the latest gcc sources.

The ideal time to create a backdoor’ed version of gcc is while its development was under the control of one person, so early in the development history that all versions available anywhere are very likely to be derived from it. How can we prove that the original author of gcc did not do just this?

It could be argued that the very substantial changes to the gcc sources (most of the source has probably been rewritten several times) mean that the coding patterns searched for by the executable to detect that it is compiling itself have long gone and at some point the backdoor failed to propagate itself to the next executable.

Compilers other than gcc might also include backdoors that propagate themselves. However, the method of propagation is likely to be different. Compiling the gcc sources with a non-gcc compiler creates an executable that should exhibit the same behavior as a gcc-compiled executable. Differences in the behavior of these independently built executables is a cause for concern (one difference might be caused by differences in the conversion of floating-point literals, a recent PhD thesis provides more detail).

The problem with compiling the gcc sources is that they make use of language extensions that few, if any, other compilers support. I know IBM added modified one of their C compilers to support those gcc extensions needed to compile the Linux kernel, but I don’t know if this compiler is capable of compiling the gcc sources. The LLVM project intended to support many gcc extensions but I don’t know if they aim to be able to compile the gcc sources.

Another option is to compare the assembler generated when gcc compiles itself against the corresponding source code. A very expensive task for source code measured in hundreds of thousands of lines. Adding the necessary language extension support to another compiler would probably be cheaper and also create a tool that could be used to check future releases of gcc.

Categories: Uncategorized Tags: backdoor, critical software, gcc, language extensions, llvm, source code, world domination

The Shape of Code

Archive

R needs some bureaucracy

The fatal programming language research mistake

Compiler writing: The career path to World domination

Recent Posts

Recent Comments

Archives

Meta