It has been a while since I posted anything about generating source that mimics the characteristics of human written code.
Generating function definitions is the easy bit, although variable selection is fiddly and naming needs to be handled.
The hardest part of mimicking human written code is linking functions together, via calls, in apparently meaningful ways. Within one source file it is probably possible to get away with individual calls to other local functions (most function definitions are called once) and a couple of calls to third party libraries. At the complete program level the code needs to tell a story, and doing this is very hard.
I think function calls can be divided into three categories, all based on the relationship between the developer who writes the code than does the call and the developer(s) who wrote the called function:
- Caller developer is the same person as the callee developer. One person gets to decide how things are done,
- Caller developer works on the same team as the callee developer. Here an interface needs to be negotiated with one or more other people,
- Caller developer is making use of a third-party library. The interface is pre-decided, take it or leave it (the same principle applies to library updates, the caller has to decide whether to stay with the existing version or make the changes needed to upgrade to the new version; Android is the (in)famous example of a frequently changed library that developers are under strong pressure to continually adapt to).
Calls to functions in third-party libraries tend to follow stylized sequences, e.g.,
open_* occurs before
close_* appears last. Most of the time a generator could get away not calling functions from third-party libraries, because the number of such calls is often small, but when they occur they had better look correct.
Needing to call a function written by another developer on the team opens up all sorts of possibilities: there is always the option of them modifying the function to make life easier for the new call, but the cost of modifying all the existing call sites (plus any associated call chains) may be too high, perhaps a new global variable can be used to communicate the desired information, or perhaps the proposed usage is a small cog in a large wheel and has to make do (or perhaps they don’t like the person asking for the change and reasons are invented for staying as-is).
Then there is correlation in time and space (this has a big impact on patterns of evolution, at least I think so; models of forest fires, i.e., growth, death, fires of various sizes creating space for new growth, is the obvious parallel):
- The longer a function exists the more likely it is to accumulate callers. A function definition can remain unchanged and yet over time become more and more difficult to change.
- It can be very expensive to make changes to an existing function definition when there are lots of calls to it.
What do measurements of code have to say? Almost nothing; existing studies mostly do weird things like treating a system’s call graph as-if it were a social network (using a currently trendy metaphor is good for getting papers published, even if the mapping is all wrong), plotting power law-like graphs and sprouting portentous nonsense.
What is really needed is measurements of forked systems; comparing systems derived from a common past will tell us how much natural variation exists due to individual choice.
FreeBSD, NetBSD and OpenBSD are the obvious poster children of forked, common heritage, systems written in C; I cannot think of any such systems written in Java or C++.