Archive

Author Archive

A first stab at low resolution image enhancement

July 27th, 2015 No comments

Team Clarify-the-Heat (Gary, Pavel and yours truly) were at a hackathon sponsored by Flir at the weekend.

The FlirONE attaches to the USB port of an Android phone or iPhone and contains an infrared sensor with 160 by 120 pixels, plus an optical sensor that provides a higher resolution image that can be overlaid on the thermal one. The sensor frequency range is 8 to 14 µm, i.e., ‘real’ infrared, not the side-band sliver obtained by removing the filter from visual light sensors.

At 160 by 120 the IR sensor resolution is relatively low, compared to today’s optical sensors, and team Clarify-the-Heat decided to create an iPhone App that merged multiple images to create a higher resolution image (subpixel interpolation) in real-time. An iPhone was used because the Flir software for this platform has been around longer (Android support is on its first release) and image processing requires lots of cpu power (i.e., compiled Objective-C is likely to be a lot faster than interpreted Java). The FlirONE frame rate is 8.6 images per second.

Modern smart phones contain 3-axis gyroscopes, and data on the change in rotational orientation was used to find the pixels from two images that corresponded to the same area of the viewed 2-D image. Phone gyroscope sensors drift over periods of a few seconds; some experimentation found that over such a period the drift on Gary’s iPhone was safely under a tenth of a degree, while one infrared pixel has a field of view of approximately 1/3 degree horizontally and 1/4 degree vertically. Some phone gyroscope sensors are known to be sensitive enough to pick up the vibrations caused by nearby conversations.
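As a rough sketch of the alignment arithmetic (the field-of-view figures are the ones quoted above; the function is an illustration, not the code we wrote):

/* Convert the gyroscope's change in orientation between two frames
   into a pixel offset for aligning them; one IR pixel covers roughly
   1/3 degree horizontally and 1/4 degree vertically. */
void rotation_to_pixel_offset(double delta_yaw_deg, double delta_pitch_deg,
                              double *dx_pixels, double *dy_pixels)
{
*dx_pixels = delta_yaw_deg/(1.0/3.0);    /* degrees per horizontal pixel */
*dy_pixels = delta_pitch_deg/(1.0/4.0);  /* degrees per vertical pixel */
}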

The plan was to get an App running in real-time on the iPhone. In practice debugging a couple of problems ate up the hours, and the system demonstrated uploaded data from the iPhone to a server, from which a laptop read and processed two images to create a higher resolution result (the intensity values from overlapping pixels are averaged to double, in our case, the resolution). This approach is very primitive compared to the algorithms that do sub-pixel enhancement by detecting image features (late in the day we found out that OpenCV supports some of this kind of stuff), but it requires much less image processing and should be practical in real-time on a phone.
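A minimal sketch of this kind of merge (assuming 160 by 120 frames and an offset already expressed in whole half-pixels; the real code also has to cope with fractional offsets and edge effects):

#define W 160
#define H 120

/* Place frame b at an offset of (dx, dy) half-pixels relative to
   frame a on a 2x resolution grid, averaging where the frames overlap. */
void merge_frames(const float a[H][W], const float b[H][W],
                  int dx, int dy, float out[2*H][2*W])
{
for (int y = 0; y < 2*H; y++)
   for (int x = 0; x < 2*W; x++)
      {
      float sum = a[y/2][x/2];        /* frame a sits at offset (0, 0) */
      int   n = 1;
      int   bx = x - dx, by = y - dy; /* same output pixel, frame b coordinates */

      if (bx >= 0 && bx < 2*W && by >= 0 && by < 2*H)
         {
         sum += b[by/2][bx/2];
         n++;
         }
      out[y][x] = sum/n;
      }
}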

We discovered that the ‘raw’ data returned by the FlirONE API has been upsampled to produce a 320 by 240 image. This means that the gyroscope data, used to align image pixels, has to be twice as accurate. We did not try to use magnetic field data (while not accurate the values do not drift) to filter the gyroscope readings.

Two IR images of yours truly from a FlirONE, with the ‘higher’ resolution image below:

Infrared images of me

The enhanced image shows more detail in places, but would obviously benefit from input from more images. The technique relies on partial overlap between pixels, which is a hit and miss affair (we were managing to extract around 5 images per second but did not get as far as merging more than two images at a time).

Team Clarify-the-Heat got an honorable mention, but then so did half of the 13 teams present, and I won a FlirONE in one of the hourly draws :-) It looks like they will be priced around $250 when they go on sale in mid-August. I wonder how long it will take before they are integrated into phones (yet more megapixels in the visual spectrum is a bit pointless)?

I thought the best project of the event was by James Rampersad, who processed the IR video stream using Eulerian Video Magnification to show blood pulsing through his face.

Deluxe Paint: 30 year old C showing its age in places

July 23rd, 2015 No comments

Electronic Arts have just released the source code of a version of Deluxe Paint from 1986.

The C source does not look anything like my memories of code from that era, probably because it was written on and for the Amiga. The Amiga used a Motorola 68000 cpu, which meant a flat address space, i.e., no peppering the code with near/far/huge data/function declaration modifiers (i.e., which cpu register will Sir be using to call his functions and access his data).

Function definitions use what is known as K&R style, because the C++ way of doing things (i.e., function prototypes) was just being accepted into C by the ANSI committee (the following is from BLEND.C):

void DoBlend(ob, canv, clip)
BMOB *ob;
BoxBM *canv;
Box *clip;
{...}

The difference between the two ways of doing things (apart from the syntax of putting the type information outside/inside the brackets) is that the compiler does not do any type checking between definition/call for the K&R style. So it is possible to call DoBlend with any number of arguments having any defined type and the compiler will generate code for the call assuming that the receiving end is expecting what is being passed. Having an exception raised in the first few lines of a function used to be a sure sign of an argument mismatch, while having things mysteriously fail well into the body of the code regularly ruined a developer’s day. Companies sold static analysis tools that did little more than check calls against function definitions.
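For comparison, here is what the same definition looks like when written using a prototype (the form that lets the compiler check each call):

void DoBlend(BMOB *ob, BoxBM *canv, Box *clip)
{...}

With a declaration such as void DoBlend(BMOB *, BoxBM *, Box *); visible at the call site, an argument count or type mismatch becomes a compile-time diagnostic rather than a runtime mystery.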

The void on the return type is unusual for 1986, as is the use of enum types. I would have been willing to believe that this code was from 1996. The clue that this code is from 1986 and not 1996 is that int is 16 bits, a characteristic that is implicit in the code (the switch to 32 bits started in the early 1990s when large amounts of memory started to become cheaply available in end-user systems).

So the source code for a major App, in its day, is smaller than the jpegs on many’ish web pages (jpegs use a compressed format and the zip file of the source is 164K). The reason that modern Apps contain orders of magnitude more code is that our expectation of the features they should contain has grown in magnitude (plus marketing likes bells and whistles and developers enjoy writing lots of new code).

When the source of one of today’s major Apps is released in 30 years time, will people be able to recognise patterns in the code that localise it to being from around 2015?

Ah yes, that’s how people wrote code for computers that worked by moving electrons around, not really a good fit for photonic computers…

Exhaustive superoptimization

July 10th, 2015 No comments

Optimizing compilers include a list of common code patterns and equivalent, faster executing, patterns that they can be mapped to (traditionally the role of peephole optimization, which in many early compilers was the only optimization).

The problem with this human written table approach is that it is fixed in advance and depends on the person adding the entries knowing lots of very obscure stuff.

An alternative approach is the use of a superoptimizer; these take a code sequence and find the functionally equivalent sequence that executes the fastest. However, superoptimizing interesting sequences often takes a very long time, so use tends to be restricted to helping figure out, during compiler or library development, what to do for sequences that are known to occur a lot and to be time critical.

A new paper, Optgen: A Generator for Local Optimizations, does not restrict itself to sequences deemed to be common or interesting; it finds the optimal sequence for all sequences by iterating through all permutations of operands and operators. Obviously this will take a while, and the results in the paper cover expressions containing two unknown operands + known constants, and expressions containing three and four operands that are not constants; the operators searched were: unary ~ and - and binary +, &, |, - and ^.

The following 63 optimization patterns involving two variables (x and y) plus constants (c0, c1, c2) were found, with the three right-hand columns listing whether LLVM (L), GCC (G) or the Intel compiler (I) supports the optimization (+) or not (-).

  optional-precondition => before  ->  after		L G I
 -~x -> x + 1						+ + -
 -(x & 0x80000000) -> x & 0x80000000			- + -
 ~-x -> x - 1						+ + -
 x +~x -> 0xFFFFFFFF					+ + -
 x + (x & 0x80000000) -> x & 0x7FFFFFFF			- - -
 (x | 0x80000000) + 0x80000000 -> x & 0x7FFFFFFF	+ - -
 (x & 0x7FFFFFFF) + (x & 0x7FFFFFFF) -> x + x		+ + -
 (x & 0x80000000) + (x & 0x80000000) -> 0		+ + -
 (x | 0x7FFFFFFF) + (x | 0x7FFFFFFF) -> 0xFFFFFFFE	+ + -
 (x | 0x80000000) + (x | 0x80000000) -> x + x		+ + -
 x & (x + 0x80000000) -> x & 0x7FFFFFFF			+ - -
 x & (x | y) -> x					+ + -
 x & (0x7FFFFFFF - x) -> x & 0x80000000			- - -
 -x & 1 -> x & 1					- + -
 (x + x)& 1 -> 0					+ + -
 is_power_of_2(c1) && c0 & (2 * c1 - 1) == c1 - 1
=> (c0 - x) & c1 -> x & c1				- - -
 x | (x + 0x80000000) -> x | 0x80000000			+ - -
 x | (x & y) -> x					+ + -
 x | (0x7FFFFFFF - x) -> x | 0x7FFFFFFF			- - -
 x | (x^y) -> x | y					+ - -
 ((c0 | -c0) &~c1) == 0 => (x + c0) | c1 -> x | c1	+ - +
 is_power_of_2(~c1) && c0 & (2 *~c1 - 1) ==~c1 - 1
=> (c0 - x) | c1 -> x | c1				- - -
 -x | 0xFFFFFFFE -> x | 0xFFFFFFFE			- - -
 (x + x) | 0xFFFFFFFE -> 0xFFFFFFFE			+ + -
 0 - (x & 0x80000000) -> x & 0x80000000			- + -
 0x7FFFFFFF - (x & 0x80000000) -> x | 0x7FFFFFFF	- - -
 0x7FFFFFFF - (x | 0x7FFFFFFF) -> x & 0x80000000	- - -
 0xFFFFFFFE - (x | 0x7FFFFFFF) -> x | 0x7FFFFFFF	- - -
 (x & 0x7FFFFFFF) - x -> x & 0x80000000			- - -
 x^(x + 0x80000000) -> 0x80000000			+ - -
 x^(0x7FFFFFFF - x) -> 0x7FFFFFFF			- - -
 (x + 0x7FFFFFFF)^0x7FFFFFFF -> -x			- - -
 (x + 0x80000000)^0x7FFFFFFF -> ~x			+ + -
 -x^0x80000000 -> 0x80000000 - x			- - -
 (0x7FFFFFFF - x)^0x80000000 -> ~x			- + -
 (0x80000000 - x)^0x80000000 -> -x			- + -
 (x + 0xFFFFFFFF)^0xFFFFFFFF -> -x			+ + -
 (x + 0x80000000)^0x80000000 -> x			+ + -
 (0x7FFFFFFF - x)^0x7FFFFFFF -> x			- - -
 x - (x & c) -> x &~c					+ + -
 x^(x & c) -> x &~c					+ + -
 ~x + c -> (c - 1) - x					+ + -
 ~(x + c) -> ~c - x					+ - -
 -(x + c) -> -c - x					+ + -
 c -~x -> x + (c + 1)					+ + -
 ~x^c -> x^~c						+ + -
 ~x - c -> ~c - x					+ + -
 -x^0x7FFFFFFF -> x + 0x7FFFFFFF			- - -
 -x^0xFFFFFFFF -> x - 1					+ + -
 x & (x^c) -> x &~c					+ + -
 -x - c -> -c - x					+ + -
 (x | c) - c -> x &~c					- - -
 (x | c)^c -> x &~c					+ + -
 ~(c - x) -> x +~c					+ - -
 ~(x^c) -> x^~c						+ + -
 ~c0 == c1 => (x & c0)^c1 -> x | c1			+ + -
 -c0 == c1 => (x | c0) + c1 -> x &~c1			- - -
 (x^c) + 0x80000000 -> x^(c + 0x80000000)		+ + -
 ((c0 | -c0) & c1) == 0 => (x^c0) & c1 -> x & c1	+ + -
 (c0 &~c1) == 0 => (x^c0) | c1 -> x | c1		+ - -
 (x^c) - 0x80000000 -> x^(c + 0x80000000)		+ + -
 0x7FFFFFFF - (x^c) -> x^(0x7FFFFFFF - c)		- - -
 0xFFFFFFFF - (x^c) -> x^(0xFFFFFFFF - c)		+ + -

The optional-preconditions are hand-written rules that the constants must obey for a particular expression mapping to apply, e.g., (c0 &~c1) == 0.

Hand-written rules using a bit-vector analysis of constants (for each bit of a constant, the value is one, zero or don’t-know) are used to prune the search space (otherwise the search of the two operands+constants case would still be running).

LLVM covers 40 of the 63 cases, GCC 36 and Intel just 1 (wow). I suspect that many of the unsupported patterns are rare, but now they are known it is worth implementing them just to be able to say that this particular set of possibilities is completely covered.
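These patterns are easy to sanity-check by brute force; the following sketch (my code, nothing to do with Optgen) verifies two of them over all 16-bit values, with the 0x80000000 constant scaled down to 0x8000 so the loop runs in a blink; unsigned arithmetic provides the wraparound behaviour the patterns rely on:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
unsigned fail = 0;

for (uint32_t i = 0; i <= 0xFFFF; i++)
   {
   uint16_t x = (uint16_t)i;

   /* -~x -> x + 1 */
   if ((uint16_t)(-(~x)) != (uint16_t)(x + 1))
      fail++;

   /* x^(x + 0x8000) -> 0x8000, the 16-bit analogue of
      x^(x + 0x80000000) -> 0x80000000 */
   if ((uint16_t)(x ^ (x + 0x8000u)) != 0x8000u)
      fail++;
   }
printf("failures: %u\n", fail);
return 0;
}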

R is now important enough to have a paid-for PR make-over

July 5th, 2015 8 comments

With the creation of the R consortium R has moved up a rung on the ladder of commercial importance.

R has captured the early adopters and has picked up a fair few of the early majority (I’m following the technology adoption life-cycle model made popular by the book Crossing the Chasm), i.e., it is starting to become mainstream. Being mainstream means that jobsworths are starting to encounter the language in situations of importance to them. How are the jobsworths likely to perceive R? From my own experience I would say it will be perceived as being an academic thing, which in the commercial world is not good, not good at all.

To really become mainstream R needs to shake off its academic image, and as I see it, the R consortium has been set up to make that happen. I imagine it will try to become the go-to point for journalists wanting information or a quote about things-related-to R. Yes, they will hold conferences with grandiose sounding titles and lots of business people will spend surprising amounts of money to attend, but the real purpose is to solidify the image of R as a commercial winner (the purpose of a very high conference fee is to keep the academics out and convince those attending that it must be important because it is so expensive).

This kind of consortium gets set up when some technology having an academic image is used by large companies that need to sell this usage to potential customers (if the technology is only used internally its wider image is unimportant).

Unix used to have an academic image, one of the things that X/Open was set up to ‘solve’. The academic image is now a thing of the past.

For the first half of the 1980s it looked like Pascal would be a mainstream language; a language widely taught in universities and perceived as being academic. Pascal did not get its own consortium and C came along and took its market (I was selling Pascal tools at the time and had lots of conversations with companies who were switching from Pascal to C and essentially put the change down to perception; it did not help that Pascal implementations did their best to hide/ignore the 8086 memory model, something of interest when memory is scarce).

How will we know when R reaches the top rung (if it does)? Well there are two kinds of languages, those that nobody uses and those that everybody complains about.

R will be truly mainstream once people feel socially comfortable complaining about it to any developer they are meeting for the first time.

2015: A new C semantics research group

June 30th, 2015 2 comments

A very new PhD student research group working on C semantics has just appeared on the horizon. You can tell they are very new to C semantics by the sloppy wording in their survey of C users (what is a ‘normal’ compiler and how does it differ from the ‘current mainstream’ compiler referred to in some questions? I’m surprised the outcome appeared clear to the authors, given the jumble of multiple choice options given to respondents).

Over the years a number of these groups have appeared, existed until their members received a PhD and then disappeared. In some cases one of the group members does something that shows a lot of potential (e.g., the C-semantics work), but the nature of academic research means that either the freshly minted PhD moves to industry or else moves on to another research area. Unfortunately most groups are overwhelmed by the task and pivot to concentrating on meaningless language subsets or on mathematical formalisms. Very, very occasionally interesting work gets supported once the PhD is out of the way, Coccinelle being the stand-out example for C.

It takes implementing a full compiler (as part of a PhD or otherwise) to learn C semantics well enough to do meaningful research on it. The world seems to be stuck in a loop of using research to educate know-nothings until they know-something and then sending them off on another track. This is why C language researchers keep repeating themselves every 10 years or so.

Will anybody in this new group do any interesting work? Alan Mycroft set the bar very high for Cambridge by submitting a 100 page comment document on the draft C89 standard that listed almost as much ambiguous wording as everybody else put together found (but he was implementing a compiler in his spare time and not doing it for a PhD, so perhaps he does not count).

One suggestion I would make to this new group is that if they really are interested in actual usage they should measure actual usage; developer beliefs about compiler behavior are rarely very accurate and are always heavily tainted by experiences from when they first started out.

A checklist for evaluating compiler semantic research.


High value IP auctions, finally a way to monetise the blockchain

June 20th, 2015 No comments

Team High-value-IP-auctions (Gary, Shlomie and yours truly) were at Hackcoin today. We targeted the opposite end of the market from Team Long Tail Licensing, i.e., high value at very low volume rather than low value at high volume.

One of the event sponsors was Nxt, a cryptocurrency I had not previously heard of. Nxt is unusual in that it is exclusively focused on the use of the blockchain as a tool for building applications; there is no mining to create new currency (it is based on proof-of-stake, with all the currency distributed at genesis). I was surprised at how well developed the software and documentation appeared to be (a five hour hack does not leave time for a detailed analysis).

So you have some very interesting information and want to provide a wealthy individual the opportunity to purchase exclusive access to it. There is a possibility that the individual concerned might not be very sporting about this opportunity and it would be prudent for the seller to remain anonymous throughout the negotiation and payment process.

A cryptocurrency blockchain is the perfect place to deposit information for which a global audience might be needed at some point in the future. The information can be stored in encrypted form (where it can hide in plain sight with all the other encrypted content), it will be rapidly distributed to a wide variety of independent systems and following a few simple rules allows the originating source to remain anonymous.

The wealthy individual gets sent details on how to read the information (i.e., a decryption key and a link to the appropriate block in the blockchain) and the Nxt account number where the requested purchase price should be deposited.

Once events have been set in motion the seller may not have reliable access to the Internet and would prefer a third party to handle the details.

The third party could be a monitor program running in the cloud (perhaps independent copies running on Amazon, Azure and Google to provide redundancy). This monitor program sleeps until the end of the offer period and then sends a request for the current balance on the account being used. If the account does not contain the purchase price, the encryption key and appropriate link is tweeted, otherwise the monitor program shuts itself down.
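A sketch of the monitor logic (the two helper functions are hypothetical placeholders standing in for HTTP requests to a Nxt node and to Twitter; this is the idea, not our actual code):

#include <time.h>
#include <unistd.h>

/* Hypothetical wrappers, standing in for HTTP requests to a Nxt node
   and to the Twitter API respectively. */
extern double nxt_account_balance(const char *account);
extern void   tweet(const char *message);

void monitor(const char *account, double asking_price,
             time_t offer_deadline, const char *key_and_link)
{
/* sleep until the end of the offer period */
while (time(NULL) < offer_deadline)
   sleep(60);

/* not paid: release the decryption key and blockchain link to the world */
if (nxt_account_balance(account) < asking_price)
   tweet(key_and_link);
/* otherwise shut down quietly */
}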

Those of you who don’t have any information that wealthy individuals might want to purchase could use this approach to run a kick-starter campaign, or any sale of digital goods that involves triggering product release after a minimum monetary amount is reached within a given amount of time.

Does the third party monitor program have to run outside of a blockchain environment? Perhaps it could be executed as a smart contract inside a cryptocurrency such as Ethereum. I did see mention of smart contracts inside Nxt, but unless I missed something they are not supported by the base API.

The designers of the Nxt blockchain have appreciated that they need a mechanism to stop it becoming weighed down by long dead information. The solution is pruned data, data that is removed from the blockchain after a period of time (the idea that a blockchain is an immutable database is great in theory, but dooms any implementation to eventual stasis).

Does our wealthy individual have any alternative options? Perhaps the information is copyright and the lawyers can be unleashed. I doubt that lawyers could prevent the information being revealed in this case, but copyright infringement via the blockchain is an issue that has yet to explode on the world.

The implementation was surprisingly straightforward and the only feature not yet working at the time of our presentation was the tweeting of the encryption key. We won first prize of 1 bitcoin!

Extracting absolute values from percentage data

June 17th, 2015 No comments

Empirical data on requirements for commercial software systems is extremely hard to find. Massimo Felici’s PhD thesis on requirements evolution contains some very interesting data relating to an avionics system, but in many cases the graphs are plotted without numbers next to the axis tick marks. I imagine these numbers were removed because of concerns about what people who had been promoted beyond their level of competence might say (I would also have removed them if asked by the company who had provided the raw data; it’s a data provider’s market).

As soon as I saw the plot below I knew I could reverse engineer the original numbers for the plot in the top left.

Requirements added/deleted/modified in each release

The top left plot is a total count of requirements added/deleted/modified in each of the 22 releases of the product. The eight other plots contain individual percentage information for each of the eight requirement features included in the total count.

The key is that requirement counts have integer values and the percentages in the F1-8 plots are ratios that restrict the requirement counts to being exactly divisible by a calculated value.

I used WebPlotDigitizer, the go-to tool for visually extracting data from plots, to obtain values for the height of each bar (for the time being assuming the unnumbered tick marks linearly increase in units of one).

Looking at the percentages for release one we see that 75% involved requirement feature F2 and 25% F4. This means that the total added/deleted/modified requirement count for release one must be divisible by 4. The percentages for release five are 64, 14 and 22; within a reasonable level of fuzziness this ratio is 9/2/3 and so the total count is probably divisible by 14. Repeating the process for release ten we get the ratio 7/4/1/28, probably divisible by 40; release six is probably divisible by 20, release nine by 10, release fourteen by 50 and fifteen by 10.

How do the release values extracted from the top left shape up when these divisibility requirements are enforced? A minimum value for the tick mark increment of 100 matches the divisibility requirements very well. Of course an increment of any multiple of 100 would match equally well, but until I spot a relationship that requires a larger value I will stick with 100.
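The divisibility check is easy to automate; the following sketch (my illustration, with a guessed tolerance for the visual extraction error) finds the smallest total count consistent with a release’s percentages:

#include <math.h>
#include <stdio.h>

/* Return the smallest total count, up to max_total, for which every
   percentage corresponds to (nearly) a whole number of requirements;
   -1 if none is found.  tolerance is how far from a whole number a
   count is allowed to be. */
int smallest_consistent_total(const double pct[], int num_pct,
                              int max_total, double tolerance)
{
for (int total = 1; total <= max_total; total++)
   {
   int ok = 1;

   for (int i = 0; i < num_pct; i++)
      {
      double count = pct[i]*total/100.0;
      if (fabs(count - round(count)) > tolerance)
         {
         ok = 0;
         break;
         }
      }
   if (ok)
      return total;
   }
return -1;
}

int main(void)
{
double release_one[] = {75.0, 25.0};   /* percentages quoted above */

printf("release one: total divisible by %d\n",
       smallest_consistent_total(release_one, 2, 1000, 0.02));
return 0;
}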

I was not expecting to be so lucky in being able to extract useful ratios and briefly tried to be overly clever. Diophantine analysis is the study of polynomial equations having integer solutions, and there are lots of tools available for solving these equations. The data I have is fuzzy, because it was extracted visually. A search for fuzzy Diophantine equations locates some papers on how to solve such equations, but no obvious tools (there is code for solving fuzzy linear systems, but this technique does not restrict the solution space to integers).

The nearest possible solution path I could find involving statistics was: Statistical Analysis of Fuzzy Data.

If a reader has any suggestions for how to solve this problem automatically, given the percentage values, please let me know. Images+csvs for your delectation.

Power consumed by different instruction operand value pairs

June 11th, 2015 No comments

We are all used to the idea that program performance (e.g., cpu, storage and power consumption) varies across different input data values. A very interesting new paper illustrates how the power consumption of individual instructions varies as instruction operand values change.

The graph below (thanks to Kerstin for sending me a color version, scale is in Joules) is a heatmap of the power consumed by the 8-bit multiply instruction of an Atmel AVR cpu for all possible 8-bit values (with both operands in registers and producing a 16-bit result). Spare a thought for James Pallister, who spent weeks of machine time properly gathering this data; by properly I mean he took account of the variability of modern processor power consumption and measured multiple devices (to relax, James researches superoptimizing for minimal power consumption).

Multiply power consumption for 8-bit values

Why is power consumption not symmetric across the diagonal from bottom left to top right? I naively assumed power consumption would be independent of operand order. Have a think before seeing one possible answer further down.

The important thing you need to know about digital hardware power consumption is that change of state (i.e., 0-to-1 or 1-to-0) is what consumes most of the power in digital circuits (there is still a long way to go before the minimum limit set by the Margolus–Levitin theorem is reached).
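A crude first-order model is that switching energy is proportional to the number of bits that change between the previous and current value held in a register or on a bus, i.e., their Hamming distance (a sketch of the counting, not a model used in the paper):

#include <stdint.h>

/* Count the bits that change state between two successive values;
   to first order, dynamic power is proportional to this count
   multiplied by a per-bit switching energy. */
unsigned bits_toggled(uint32_t prev, uint32_t curr)
{
uint32_t changed = prev ^ curr;
unsigned count = 0;

while (changed != 0)
   {
   count += changed & 1;
   changed >>= 1;
   }
return count;
}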

The plot below is for the bit-wise AND instruction on an XMOS XCore cpu; its fractal-like appearance maps straight to the changing bit-patterns generated by the instruction (Jeremy Morse did this work).

Bitwise AND power consumption for 8-bit values

Anyone interested in doing their own power consumption measurements should get a MAGEEC Energy Measurement Kit and for those who are really hardcore the schematics are available for you to build one yourself.

Why is power consumption asymmetrical? Think of paper and pencil multiplication: the smaller number is written under the larger number to minimise the number of operations that need to be performed. The ‘popular’ explanation of the top plot is that the cpu does the multiply in whatever order the operand values happen to be in, i.e., not switching the values to minimise power consumption. Do you know of any processors that switch operand order to minimise power consumption? Would making this check cost more than the saving? Chip engineers, where are you?

A plug for the ENTRA project and some of the other people working with James, Kerstin and Jeremy: Steve Kerrison and Kyriakos Georgiou.

Is your interesting project on hold because of lack of sufficient cpu time?

June 9th, 2015 3 comments

Do you have an interesting project that is stalled because of lack of cloud compute resources? If so I know some guys who may be able to help.

One of the prizes at a recent hackathon was around $8k of cloud computing per month for a year. The guys who won it have not been using the monthly allowance and would like to put it to good use.

What counts as an “interesting project”? You are dealing with hackers who enjoy working at the edge of things and want to be involved in a project that impresses other hackers (here ‘involved’ means telling other people they are involved, not actually helping you with the project in any way). While it is obviously a project that uses computers, it does not have to be about computing. Helping your me-too startup is very unlikely to be interesting.

Hackers are fans of open data, so you will have to have a very good reason not to make any data you produce public.

Send me an email briefly describing your project and why it needs this cloud computing resource, showing that you will not fritter it away because you don’t know what you are doing.

The clock is ticking.

Aggregate player preference for the first 20 buildings created in Illyriad

June 7th, 2015 No comments

I was at the Microsoft Gaming data hackathon today. Gaming is very big business and companies rarely publish detailed game data. Through contacts one of the organizers was able to obtain two gaming datasets, both containing just under 300M of compressed data.

Illyriad supplied a random snapshot of anonymised data on 50,000 users and Mediatonic supplied three months of player data.

Being a Microsoft event there were lots of C# developers, with data analysis people being thin on the ground. While there were plenty of gamers present I could not find any that knew the games for which we had data (domain experts are always in short supply at hackathons).

I happened to pick the Illyriad data to investigate first and stayed with it. The team sitting next to us worked on the Mediatonic data and while I got to hear about this data and kicked a few ideas around with them, I did not look at it.

The first thing to do with any dataset is to become familiar with what data it actually contains and the relationships between different items. I was working with two people new to data science who wanted to make the common beginner mistake of talking about interesting things we could do; it took a while for my message of “no point in talking about what we could do with the data until we know what data we have” to have any effect. Of course it is always worth listening to what a domain expert is interested in before looking at the data, as a source of ideas to keep in mind; it is not worth keeping in mind ideas from non-domain experts.

Quick Illyriad game overview: Players start with a settlement and construct/upgrade buildings until they have a legendary city. These buildings can generate resources such as food and iron; towns/cities can be conquered and colonized… you get the picture.

My initial investigation of the data did not uncover any of the obvious simple patterns, but did manage to find a way of connecting some pairs of players in a transaction relationship (the data for each player included a transaction list which gave one of 255 numeric locations and the transaction amount; I reasoned that the location/amount pair was likely to be unique).

The data is a snapshot in time, which appeared to rule out questions involving changes over time. Finally, I realized that time data was present in the form of the order in which each player created buildings in their village/town/city.

Buildings are the mechanism through which players create resources. What does the data have to say about gamers’ preferred building construction order? Do different players with different playing strategies use different building construction orders?

A search of the Illyriad website located various beginners’ guides containing various strategy suggestions, depending on player preferences for action.

Combining the order of the first 20 different buildings created by all 50,000 players into an aggregate preference building order, we get:

Library
Storehouse
Lumberjack
Clay Pit
Farmyard
Marketplace
Quarry
Iron Mine
Barracks
Consulate
Mage Tower
Paddock
Common Ground
Brewery
Tavern
Spearmaker
Tannery
Book Binder
Flourmill
Architects’ Office

A couple of technical points: it’s impractical to get an exact preference order for more than about 10 players, so a Monte Carlo approach is used by RankAggreg, and multiple instances of the same kind of building were treated as a single instance (some form of weighting might be used to handle this behavior).

The order of the top three ranked buildings is very stable, but some of the buildings in lower ranks could switch places with adjacent buildings with little impact on ranking error.

Do better players use different building orders than poor players? The data does not include player ability as such; it includes game ranking (a high ranking might be achieved quickly by a strong player or slowly over a longer period by a weaker player) and various other rankings (some of which could be called sociability).

Does the preference for buildings change as a player’s village becomes a town and then a city? At over 200 minutes of cpu time per run I have not yet had the time to find out. Here is the R code for you to try out some ideas:

library("plyr")
library("RankAggreg")
 
get_build_order=function(df)
{
# Remove duplicates for now
dup=duplicated(df$building_id)
 
# Ensure there are at least 20
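# (the -1:-20 below pads with dummy building ids, so that every town
#  contributes a full 20-element ranking to RankAggreg)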
build_order=c(df$building_id[!dup], -1:-20)
return(build_order[1:20])
}
 
# town_id,building_id,build_order_for_town
#1826159E-976D-4743-8AEB-0001281794C2,7,1
build=read.csv("~/illyriad/town_buildings.csv", as.is=TRUE)
 
build_order=daply(build, .(town_id), get_build_order)
 
build_rank=RankAggreg(build_order, 20)

What did other teams discover in the data? My informal walk around on Saturday evening found everybody struggling to find anything interesting to talk about (I missed the presentations on Sunday afternoon; perhaps a night’s sleep turned things around for people, we will have to check other blogs for news).

If I was to make one suggestion to the organizers of the next gaming data hackathon (I hope there is one), it would be to arrange to have some domain experts (i.e., people very familiar with playing the games) present.

ps. Thanks to Richard for organizing chicken for the attendee who only eats pizza when truly starving.