Archive for March, 2014

A trip down memory lane with Microsoft Word 1.1

March 31st, 2014 3 comments

Microsoft has donated the source code of Word for Windows 1.1a to the Computer history Museum and I have been rummaging around in this C code from the last 1980s.

What immediately struck me is how un-Microsoft Windows like the code looks, in some ways it looks more like Unix code. There are:

  • very few near/far/huge pointer declarations. The Intel x86 architecture is based on segmented addressing of memory, a fact that most developers are blissfully unaware of because 32-bit segments are big enough to be able to ignore this fact; back in the day we had 16-bit segments and there were near pointers that could only point within a segment, far pointers that could point to the contents of other segments (there were alignment issues associated with these) and huge pointers which are essentially 32-bit pointers. Many developers obsessed over saving time/space and did their utmost to only use near pointers (who is ever going to need a data structure larger than 64k?) and Windows code was often littered with different kinds of pointer usage,
  • very few #if/#endif. Why do developers use #if/#endif? They want to port their code to different versions of MS-DOS, use different vendor compilers and to use different third-party libraries. Microsoft were well known for not being backwards compatible with themselves (they have improved somewhat on that score) and obviously had no interest in using other vendor compilers and third-party libraries. Unix code of the day and today is of course packed with #if/#endif.

For any developer who has only used C since 2000 the most obvious code ‘quirk’ is probably the use of K&R style function definitions. This used to be the only way of defining functions until the first C Standard, in 1989, introduced function prototypes and the mass migration of the 1990’s occurred. The K&R style function definition looks like no big deal:

int a;
struct T b;
/* blah blah */

instead of:

f(int a, struct T b)
/* blah blah */

until you find out that in the first case, when f is called, there is no checking of the arguments against the parameters appearing in the definition. This lack of checking was/is a constant source of annoying faults that could/can take hours to track down; once a developer has experienced the benefits of automatic argument/parameter checking they soon convert to the prototype form.

The makefile included with the source contains compiler options that won’t be familiar to Windows developers. This is because a special purpose C compiler was used, one that generated P-code (the P here is for Packed not Portable). In the late 1980s, when this code was written, a 256K was considered a lot of memory for your average person’s computer and at +180K lines of code Word 1.1 was enormous. Word processors are on the whole not cpu intensive tasks and trading off a factor of 10 (I’m guessing, probably more) for a memory footprint saving of 2.5-4 (not so much of a guess) was obviously considered worthwhile (but then Microsoft apps have never been known for being nimble).

The extensive use of bit-fields is the other clue that memory is tight. Do this ever get used outside of comms software and embedded systems these days?

I was surprised at how clean and presentable this source code is. I should not have judged Microsoft developers by the horrors perpetrated by many developers targeting their platform.

Hack, a template for improving code reliability

March 24th, 2014 4 comments

My sole prediction for 2014 has come true, Facebook have announced the Hack language (if you don’t know that HHVM is the Hip Hop Virtual Machine you are obviously not a trendy developer).

This language does not follow the usual trend in that it looks useful, rather than being fashion fluff for corporate developers to brag about. Hack extends an existing language (don’t the Facebook developers know about not-invented-here?) by adding features to improve code reliability (how uncool is that) and stuff that will sometimes enable faster code to be generated (which has always been cool).

Well done Facebook. I hope this is the start of a trend of adding features to a language that help developers improve code reliability.

Hack extends PHP to allow programmers to express intent, e.g., this variable only ever holds integer values. Compilers can then check that the code follows the intent and flag when it doesn’t, e.g., a string is assigned to the variable intended to only hold integers. This sounds so trivial to be hardly worth bothering about, but in practice it catches lots of minor mistakes very quickly and saves huge amounts of time that would otherwise be spent debugging code at runtime.

Yes, Hack has added static typing into a dynamically typed language. There is a generally held view that static typing prevents programmers doing what needs to be done and that dynamic typing is all about freedom of expression (who could object to that?) Static typing got a bad name because early languages using it were too disciplinarian in a few places and like the very small stone in a runners shoe these edge cases came to dominate thinking. Dynamic languages are great for small programs and showing off to spotty teenagers students, but are expensive to maintain and a nightmare to work with on 10K+ line systems.

The term gradual typing is a good description for Hack’s type system. Developers can take existing PHP code and gradually give types to existing variables in a piecemeal fashion or add new code that uses types into code that does not. The type checker figures out what it can and does not get too upperty about complaining. If a developer can be talked into giving such a system a try they quickly learn that they can save a lot of debugging time by using it.

I would like to see gradual typing introduced into R, but perhaps the language does not cause its users enough grief to make this happen (it is R’s libraries that cause the grief):

  • Compared to PHP’s quirks the R quirk’s are pedestrian. In the interest of balance I should point out that Javascript can at times be as quirky as PHP and C++ error messages can be totally incomprehensible to everybody (including the people who wrote the compiler).
  • R programs are often small, i.e., 100 lines’ish. It is only when programs, written in dynamically typed languages, start to exceed around 10k+ lines that they start to fall in on themselves unless that one person who has everything in his head is there to hold it all up.

However, there is a sort of precedent: Perl programs tend to be short (although I don’t think they are as short as R) and it gradually introduced the option of stronger typing. But Perk did/does have one person who was the recognized language designer who could lead the process; R has a committee.

Tags: , , , ,

By now I ought to feel more knowledgeable about R

March 18th, 2014 2 comments

I was surprised to find recently that there are now over 15,000 lines of R code in the book I am working on. If I had written that much code in another ‘newly’ acquired language I would probably feel a lot more knowledgeable about it than I currently feel about R. Why don’t I feel more knowledgeable about R?

Those 15,000 lines are not all real lines, lots of cut-and-paste has been going on; yes, R is a cut-and-paste language just like Cobol and ‘web’ languages. ‘Real’ programmers often look down their noses at such languages, but that is just a failure on their part to understand what they are really all about. Perhaps I have written 5,000 actual lines of R, still a decent amount and half way to the 10,000 line minimum I ask newbies if they have reached.

An expert in a language should be able to pick up a random sample of code and to have been there, done that and got the t-shirt. I still regularly learn new stuff when reading other people’s code, so I’m still a long way from being an R expert. But then R is in the mold of a functional language and one characteristic of languages in this mold is that they provide umpteen different ways of doing the same thing. The combination of this language characteristic along with the lack of common culture in R usage (when this exists it significantly reduces the patterns of code usage commonly encountered) could mean that I am on the treadmill of forever and regularly learning new R coding techniques (which is great source for blog articles but gets tedious after a while); Perl is a lot like this.

As a compiler guy I’m used to learning a language by reading the language definition. Reading this document gives me a warm fuzzy feeling of knowing the language, this has nothing to do with being able to program in it and there is no way of knowing that I understood what the words meant. I was going to say that the R language definition was little more than some brief notes jotted down by somebody to be written up later, but checking the link to the page I discovered that somebody had been spending time significantly improving on what existed a few years ago; there is still a way to go but the R language definition is starting to look respectable. Hopefully my feeling of R knowledgeability will improve after I have read through this updated definition a few times.

Use of R is usually intimately bound up with the data being manipulated; on a per line of code basis much more so than other languages (in this regard it is like Cobol). Perhaps the need to have to learn lots more about the data than I normally have to adds to my feeling of not knowing. Would my feeling of knowledgeability increase if I worked with the same kind of data ll the time?

Open source: monoculture is more desirable than portability

March 8th, 2014 No comments

An oft repeated fable is that open source software is portable, all thanks to C and Unix. The reality is that open source lives in an environment that is evolving to become a monoculture that does not require portability, this is being driven by the law of the jungle.

First some background and history. Portability requires that source code have the same behavior on different platforms, or rather than programs built from that code have the same behavior, this requires that:

  • all compilers assign the same semantics to a given piece of code,
  • all operating systems include support the same set of libraries in the same way.

If you want portability across lots of compilers then Fortran has stood head and shoulders above the competition for decades. Now some C folk may point out that they have been compiling some large code base for decades, with few changes necessitated by compiler differences; yes this has been possible in certain niche markets where there is a dominant supplier who has a vested interest in not breaking customer code. C has a long history of widespread large variation in behavior of across compilers.

What about Cobol you ask? Cobol is all about data manipulation and unless you have data in the format expected by a Cobol program you have no need for that program. Nobody cares about portable Cobol programs unless they are also interested in portable data.

If you want portability across lots of operating systems the solution has always been to minimise the dependency on system/third-party library calls (to the extent of including source code for functionality often supported by an OS). The reason for minimising OS dependency is the huge variation in support for different libraries and a wide range of behaviors for supposedly the same functionality. But you say, Unix is an OS that did/does provide a common set of libraries that have the same behavior; no, this is history seen through rose tinted glasses as anyone who knows about the Unix wars will tell you.

In the last century, to experience ‘portability’ Unix developers had to live in a monoculture of either PDP-11s or Sun workstations.

Open source, as it existed in the 1970s, 80s and into the 90’s was Fortran code that ran on a surprisingly wide range of OS/cpu/compilers, along with a smattering of other languages. Back then there were not many software applications and when they did exist many were written in Fortran (Oracle being an early, lots of Fortran, example), this created a strong incentive for vendors to support a Fortran compiler that did things the same way as everybody else (which did not prevent them adding proprietary goodies to try and lure customers towards lock-in).

How did we get to today’s dominance of C and Unix? Easy, evolution at a rate that caused competitors to die out until there was a last man standing. That last man standing was gcc and Linux. The portability problem has been solved by removing the need to port code; it is compiled by the same compiler to run on the same cpu (Intel x86 family) to run under the same OS (Linux).

Of course some of today’s open source C is compiled using non-gcc compilers, but the percentage is small and specialised (a lot of the older code is portable because it used to exist in a multi-compiler/cpu/OS world and had to evolve into being portable). The gcc competitor, llvm, is working long and hard to ensure compatibility and somehow has to differentiate itself while being compatible, a tough fight for developer hearts and minds.

Differences in CPU characteristics are a big headache to any compiler writer wanting to support identical behavior across platforms; having a single cpu family as the market leader more or less solves this problem. ARM has become a major player in the CPU world, but it shares many developer visible characteristics with Intel x86 (e.g., 32-bit int, 64-bit long, pointer and ints are the same size and IEEE floating-point) and options are available for handling some of the other potential differences (e.g., right shift of signed integers).

The Unix wars have not gone away, they have moved to more far flung battlefields leaving behind some hard fought over common ground. Anybody who wants to see the scares left by these war only needs to look at the #ifs in system headers or the parameters selected inside .configure files.

Having everybody use the same compiler/cpu/OS saves having to make a huge time/money investment in making software portable, at least until the invention of photonic computers or the arrival of aliens (whose computers are unlikely to contain a CPU that shares Intel/ARM characteristics or have the same libraries as Linux).

C/Linux has not won in the sense that competitors have given up; in 20 years time the majority of open source in active use might be Javascript running inside a browser.