Semantic vs phonetic similarity for word pairs: a weekend investigation

April 17th, 2015

The Computational Semantics hackathon was one of the events I attended last weekend. Most of the suggested problems either looked like they could not reasonably be done in a weekend (it ran 10:00-17:00 on both days; I know, academic hackathons) or were uninspiring coding problems. Chatting to some of the academics present threw up an interesting idea that involved comparing word pair semantic and phonetic similarity (I have written about my interest in sounds-like and source code identifiers).

Team Semantic-sounds consisted of Pavel and yours truly (code and data).

The linguists I chatted to seemed to think that there would be a lot of word pairs that sounded alike and were semantically similar; I did not succeed in getting any of them to put a percentage on “a lot”. From the human communication point of view, words that both sound alike and have a similar meaning are likely to be confused with each other; should such pairs come into existence, they are likely to quickly disappear, at least if the words are in common usage. Sound symbolism related issues got mentioned several times, but we did not have any data to check out the academic enthusiasm.

One of the datasets supplied by the organizers was word semantic similarity data extracted from the Google news corpus. The similarity measure is based on similarity of occurrence, e.g., the two sentences “I like licking ice-cream” and “I like eating ice-cream” suggest a degree of semantic similarity between the words licking and eating; given enough sentences containing licking and eating there are clever ways of calculating a value that can be viewed as a measure of word similarity.
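
One common way of turning co-occurrence counts into a similarity value is to give each word a vector of counts (how often it occurs near each possible context word) and compare vectors using cosine similarity; the following is a minimal sketch of that calculation (not necessarily how the supplied data was produced):

# Cosine similarity between two word-context vectors, where each element
# counts how often the word occurs near a given context word. This is one
# of the "clever ways"; the context counts below are made up.
cosine_similarity = function(vec_1, vec_2)
   {
   sum(vec_1 * vec_2) / (sqrt(sum(vec_1^2)) * sqrt(sum(vec_2^2)))
   }

licking = c(10, 2, 0, 5)
eating  = c(12, 1, 1, 6)
cosine_similarity(licking, eating)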

The data contained 72,000+ words, giving a possible half a billion pairs (most having zero similarity). To prune this down a bit we took, for each word, the 150 other words that were most similar to it, giving around 10 million word pairs. Each word was converted to a phoneme sequence and a similarity distance calculated for each pair of phoneme sequences (which we called phonetic distance and claimed it was a measure of how similar the words sounded to each other).
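
The following is a minimal sketch of the kind of calculation involved (not the hackathon code; it assumes each phoneme sequence is encoded as a string containing one character per phoneme, and uses base R's adist for the Levenshtein distance):

# Normalised Levenshtein distance between two phoneme sequences, each encoded
# here as a string containing one character per phoneme (an assumption made
# for this sketch). Dividing by the longer length maps the result to 0-1.
phonetic_distance = function(phon_1, phon_2)
   {
   adist(phon_1, phon_2)[1, 1] / max(nchar(phon_1), nchar(phon_2))
   }

phonetic_distance("lIkIN", "itIN")   # licking/eating, in a made-up phoneme encoding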

The list of word pairs with high semantic/phonetic similarity was very noisy, with lots of pairs containing the same base word in plural, past tense or some other form, e.g., billion and billions. A Porter stemmer was used to remove all pairs where the words shared the same stem, reducing the list to 2.5 million pairs. Most of the noise now came from differences in British/American spelling. We removed all word pairs that contained a word that was not in the list of words that occurred in the common subset of the British and American dictionaries used by aspell; this reduced the list to half a million pairs.
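
The filtering amounted to something like the following sketch (the SnowballC package, the dictionary file name and the pairs data frame are assumptions made for illustration, not the hackathon code):

# Remove pairs whose words share a Porter stem and pairs containing a word
# not present in both the British and American aspell dictionaries.
library("SnowballC")

common_words = readLines("aspell_GB_US_common.txt")    # hypothetical word list

same_stem = wordStem(as.character(pairs$word_1)) == wordStem(as.character(pairs$word_2))
in_both_dicts = (pairs$word_1 %in% common_words) & (pairs$word_2 %in% common_words)

pairs = pairs[!same_stem & in_both_dicts, ]            # pairs: data frame of word pairs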

The output contained some interesting pairs, including: faultless/flawless, astonishingly/astoundingly, abysmal/dismal and elusive/illusive. These look like rarely used words to me (not enough time to add in word frequency counts).

Some pairs had surprisingly low similarity, e.g., artifact/artefact (the British/American spelling equality idea has a far from perfect implementation). Is this lower than expected semantic similarity because there is a noticeable British/American usage difference? An idea for a future hack.

A smoothed scatter plot of semantic vs phonetic similarity (for the most filtered pair list) shows lots of semantically similar pairs that don’t sound alike, but a few that do (I suspect that most of these are noise that better stemming and spell checking will filter out). The following uses Levenshtein distance for phoneme similarity, normalised by the maximum distance for a given phoneme sequence with all phoneme differences having equal weight:

Semantic vs phonetic similarity, Levenshtein

and using Jaro-Winkler distance (an alternative distance metric that is faster to calculate):

Semantic vs phonetic similarity, Jaro-Winkler
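
For anybody wanting to reproduce this, the Jaro-Winkler calculation might look something like the following (a sketch assuming the stringdist package, not necessarily what we used):

# Jaro-Winkler distance between two phoneme strings; p is the Winkler
# prefix weight. The phoneme encoding is the same made-up one as above.
library("stringdist")

stringdist("lIkIN", "itIN", method = "jw", p = 0.1)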

The empty band at low phonetic similarity is an artifact of the data being quantized (i.e., words contain a small number of components).

Is it worth going to the trouble of comparing phoneme sequences? Would comparing letter sequences be just as good? The following plot shows word pair letter distance vs phonetic similarity distance (there is a noticeable amount of off-diagonal data, i.e., for some pairs letter/phoneme differences are large):

Letter vs Phonetic, Jaro Winkler

It’s always good to have some numbers to go with graphical data. The following is the number of pairs having a given phonetic similarity (remember the starting point was, for each word, its 150 most semantically similar words). The spikes are caused by the discrete nature of word components.

Number of pairs having a given Levenshtein distance

Squinting at the above it is possible to see an exponential decline as phonetic word similarity increases. It would be interesting to have enough data to display a meaningful 3D plot, perhaps a plane can be fitted (with a log scale on the z-axis).

Rather than using the Google news corpus data as the basis of word pair semantic similarity we could have used the synonym sets from Wordnet. This rather obvious idea did not occur to me until late on Saturday, and there was no time to investigate. How did the small number of people who created the Wordnet data come up with lists of synonyms? If they simply thought very hard they might have been subject to the availability bias, preferentially producing lists of synonyms containing many words that sounded alike, because those that did not sound alike were less likely to be recalled. Another interesting idea to check out at another hack.

It was an interesting hack and, as often happens, more new questions were raised than were answered.

A disheartening Space Apps hackathon

April 11th, 2015

Every hackathon has its share of crazies; fortunately they rarely achieve sufficient critical mass to bother anybody else, generally wandering harmlessly around the venue. Hackathons involving outer space attract crazies in droves and much of the first day is spent with everybody floating around in zero gravity.

This morning I stopped by the NASA SpaceApp hackathon in London to pick up my pass, say hi to a few people I knew were going and let them know I would be back later. A computational semantics hackathon was happening a few miles away, but finishing for the day at five (I know, academic hackathons). My plan was to do interesting language analysis stuff, giving the crazies plenty of time to save the world (and then leave), before I returned to spend the rest of the weekend hacking on something space’ish.

The computational semantics hack was as interesting as it promised to be and I’m returning to it tomorrow (not the original plan).

I returned to the SpaceApp hack to find that while most of the crazies had gone, a lot of developers had also left. Had the crazies caused those with a firmer grip on reality to flee to the hills? The lack of coffee (yes, I did check several times that there were no plans to supply any coffee for the duration) and the wifi not being able to support everybody could not have helped. The few developers left (a fair few engineers, designers and ‘ideas’ people were still there) seemed to have been reduced to nibbling away at uninteresting bits (to me at least) of big problems. I left disheartened.

I think that part of the reason that non-crazies have trouble connecting with NASA’s proposed plans is that it is hard to tell the difference between NASA and the crazies. NASA’s mission has always been about politics first and science second: first as a means of boosting the US image around the world (the moon landing years) and then as a means of pork-barrel funding for favored politicians. To keep serious funding rolling in, NASA has to ignite the public imagination. Fortunately for them the public is not very good at working out the economics of the proposed ventures (e.g., a Google search on the economics of asteroid mining finds plenty of articles exposing the economic flaws of this proposal).

While I am happy for the US taxpayer to fund NASA to do interesting space stuff and share it with the rest of the world, I do wish they could do it in a way that made technical and economic sense (I have no problem with them feeding the crazies, who have as much right to enjoy hackathons as the rest of us).

Entropy: software researchers’ go-to topic when they have no idea what else to talk about

April 4th, 2015

If I’m reading a software engineering paper or blog and it starts discussing entropy my default behavior is to stop reading and move along. Entropy is what software researchers talk about when they cannot think of anything else to say about a topic.

The term entropy was first used in thermodynamics, around the mid-1800s, to define the relationship between the temperature of a body and its heat content. Once people found out that molecules could exist in different energy states within solids, they realized that temperature was actually an average of the different energies of the molecules in a body and that heat content was the sum of the vibrational and kinetic energies of the molecules within a body. These insights enabled entropy to also be defined in terms of the number of states that the molecules within a body could exist in, and the probability of these states being occupied; statistical mechanics, a specialist field within thermodynamics, was born. Note, it is a common mistake to associate entropy with disorder (and let’s bang that point home).

The problem with the term Entropy, outside of thermodynamics, is that people conflate, confuse and co-mingle various concepts associated with it. In fact this is one of the reasons Shannon chose to associate the word Entropy with his new theory of information, as he told it: Von Neumann told me, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”

While Shannon’s paper A Mathematical Theory of Communication is his most cited work, the second paper Prediction and Entropy of English has probably had a bigger impact in the world of techno-babble.

Shannon’s famous entropy formula is $K \sum_{i=1}^{N} -p_i \log p_i$, where there are N possible events with p_i the probability of event i occurring and K some constant. The derivation and use of this formula depends on various assumptions being true, perhaps the most important being that successive symbols are independent; if the next symbol depends on what went before it, the formula is more complicated and we are now dealing with conditional entropy.
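
As a concrete example, the following minimal sketch computes the entropy, in bits (i.e., K=1 and a base-2 log), of a sequence of symbols, using observed frequencies as estimates of the p_i:

# Shannon entropy of a symbol sequence, estimating p_i from the observed
# symbol frequencies; log2 gives the answer in bits.
shannon_entropy = function(symbols)
   {
   p = table(symbols) / length(symbols)
   -sum(p * log2(p))
   }

shannon_entropy(c("a", "b", "b", "c"))   # 1.5 bits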

Source code, to quite a good approximation, consists of a sequence of symbols that alternate between symbols selected from two separate alphabets, the alphabet of punctuators/operators and the alphabet of ‘words’. The following shows the source of a program separated into these two sets of symbols.

Source in constituent symbol sets

Source code, created as it is from an alternating sequence of two different symbol sets, has none of the characteristics assumed in Shannon’s formula derivation or by Maxwell & Boltzmann in their statistical mechanics derivation. Now source code might be usefully analysed using n-grams, but unless we let the n go to infinity $p \log p$ does not get a look-in.

Let’s drive another nail into this source code entropy nonsense.

Entropy techno-babble and the second law of thermodynamics are frequent bedfellows. This law specifies that the entropy of a closed system never decreases; it either stays the same or increases. Now, very many source code attributes have been found to be proportional to the log of the amount of code (as measured in lines); this is potentially a big problem for those wanting to calculate the entropy of source (it is a repeat of the Gibbs paradox).

If a function (which is claimed to have entropy S) is split in half to create two functions, the second law of thermodynamics requires that the sum of the entropies of the two new functions should be at least S. Entropy cannot scale logarithmically because splitting a function in two would reduce total entropy, i.e., $2\log(x/2) \le \log(x)$ whenever $x \le 4$ (since $2\log(x/2) = \log(x^2/4)$, and $x^2/4 \le x$ only when $x \le 4$). So source code entropy has to be an attribute that does not scale logarithmically, which goes against most of what is known about source.

If source code does not follow Maxwell–Boltzmann statistics (the equations obtained by working through the ideas behind statistical mechanics), what kind of statistics might it follow? Strange as it might sound, quantum mechanics offers some pointers. Fermi–Dirac and Bose–Einstein statistics are the result of working through the mathematics of small numbers of particles having particular attributes; most functions are small and far removed from the large number of items assumed in the derivation of Maxwell–Boltzmann statistics.

I appreciate that by pointing out the parallel with quantum mechanics I am running the risk of entanglement replacing entropy as the go-to topic for researchers scraping the barrel for something to talk about. But at least the mathematics of small numbers of items obeying certain rules is a model that is closer to source code.

Coq used to prove that false is true

March 26th, 2015

Coq, a proof system much beloved by formal methods researchers, has been used to prove that false is true. The ‘proof’ makes use of a bug in Coq, which I’m sure is being hurriedly fixed as I type.

This bug stands out from the other 4,000+ on Coq’s bug list because of what it has been used to prove; this will be a standing joke that the Coq community will have to endure for years to come. I have some sympathy for their plight, but if it results in formal methods researchers taking themselves a little less seriously and being a little more intellectually honest, then some good has come out of it.

I have had some surprising feedback about posts pointing out serious faults in claims made in papers by formal methods researchers. The feedback could be paraphrased as “They are doing important work and so these problems do not matter”. Do medical researchers get to claim whatever they like? It seems some do like to try.

It is always worth remembering that all computer aided mathematics programs contain bugs.

C code is 90% unspecified behavior: more uninformed scare mongering

March 19th, 2015

Another C coding guidelines document, another clueless blanket ban on use of code containing unspecified behavior (no link, so its visibility is not increased; the 90% is a back-of-the-envelope calculation, knock yourself out here).

The C Standard defines unspecified behavior as “… provides two or more possibilities and imposes no further requirements on which is chosen in any instance.” Given this one item of information, a ban on using constructs that contain unspecified behavior appears to be a good idea (writing code where the compiler gets to choose among several possible choices of behavior does not sound like a recipe for consistent program behavior).

What most people lack when thinking about unspecified behavior is an understanding of the design aims for the production of the C Standard; the aim was to be concise. An example of this conciseness is the wording covering the evaluation of subexpressions: “… the order of evaluation of subexpressions and the order in which side effects take place are both unspecified.”

Consider the subexpression x+y; should the compiler evaluate x first (putting its value in a register) and then y (putting its value in another register), or should it evaluate y followed by x? In most situations the final result does not depend on the choice of evaluation order and the Standard gives the compiler the freedom to choose the order that produces the best quality code.

A coding guideline that bans the use of code containing unspecified behavior bans the use of any binary operator (assignment is a binary operator in C, ruling out use of the statement z=0;). The only executable statements that could be written, following this guideline, would be calls of functions containing zero or one argument (the order of evaluation of arguments is unspecified, which rules out calls containing two or more arguments) or global variables appearing on their own in an expression statement.

One case where operand evaluation order matters is printf("Hello")+printf("World"), which can result in either HelloWorld or WorldHello being printed (printf returns the number of characters written). This is an example of the kind of usage that the authors of coding guidelines want to ban.

Coming up with guideline wording that delineates the undesirable unspecified behaviors from the harmless ones is hard. Requiring that the external behavior of code does not depend on the compiler’s choice of unspecified behavior is one possibility (now that power consumption can be an external behavior of note, this framing could be too narrow). The wording used by MISRA C is “No reliance shall be placed on … unspecified behavior”; this raises the flag that it is possible to rely on unspecified behavior and leaves it up to others to fill in the details.

Ethereum: Is it cost effective to create reliable contracts?

March 15th, 2015

The idea of embedding a virtual machine supporting user executable code inside a cryptocurrency continues to fascinate me. I attended the Ethereum workshop on using Solidity to develop dapps (distributed apps) yesterday.

Solidity seems to have been settled on as the high-level language in which programs for the EVM (Ethereum Virtual Machine) will be written, at least at launch. While Ethereum appear to know what they are doing in the design of the cryptocurrency (at least to my non-expert eyes), they are rank amateurs at language design. A good place to start learning is the rationales for Ada, Eiffel and Frink.

Ethereum uses the term Contracts to describe programs executed by the EVM.

Anybody publishing a Contract that involves transferring something that has a real-world impact on their financial state (e.g., funds in cryptocurrency that are exchangeable for government backed currency) will want a high degree of confidence in the reliability of the code.

It is easy to imagine that people will be actively reverse engineering all published Contracts looking for flaws that can be exploited to siphon off funds into their personal accounts.

Writing high reliability code is time consuming and expensive. One way of reducing time/cost is to use a language that implicitly performs lots of checking, rather than requiring the developer to insert lots of explicit checks.

The following is an example of a variable definition providing information to the compiler that can be used to insert checks in the generated code (e.g., that no value outside the range 1 to 1000 is assigned to amount_to_pay); the developer does not need to worry about explicitly adding checks in all the necessary places. This language functionality was invented in 1970 (in Pascal) but went out of fashion, for new languages, in the 1990s.

var amount_to_pay : 1..1000;

Languages such as Ada and Frink improve program reliability by providing functionality that reduces the effort developers need to invest in checking for unintended behavior.

Whatever language the code is written in, what happens if an attempt is made to assign 1001 to amount_to_pay? It does not matter whether the check is implicit or explicit, an error has occurred and has to be handled. As a developer of this code I want to know about the error (so I can fix it) and as a user of the code I do not want to lose money for executing a failed transaction.

At the moment the only Ethereum error handling solution I can think of is writing error information to the blockchain (in place of the information that would have been written on successful program execution); the user will still get charged for executing the program but at least will now have the evidence needed to obtain a refund (the user has to be charged to prevent denial of service attacks using Contracts containing a known fault).

I wonder who will find it worthwhile investing in creating the high reliability code needed for Contracts to be a viable solution to a problem?

Been invaded by mysterious sounds? You need SoundHound

March 12th, 2015

Team SoundHound (Gary, Pavel and yours truly) were at the Intel Hardware Hackathon last weekend. Hardware hacks are about the peripherals that are available on the day; I grabbed one of everything and we all sat down to play with them and come up with a great gadget to build. At an Intel sponsored event the computers are obviously Galileo and Edison (which is what we ended up using).

The Intel guys had a very fixed idea about how the hackathon was to run and specified that eight projects had to be proposed and selected (by everybody present) on the Friday night and that these would be worked on over the weekend (but we were allowed to change the idea; the logic escaped me). I proposed a vibration detection project to measure the impact of passing vehicles on houses next to busy roads. Nobody on what was to become SoundHound thought this was a great idea, but it was the only idea we had and there was nothing stopping us radically changing our mind over the weekend.

Vehicles generate sound which we can hear, but buildings are damaged by the lower frequency vibrations that we sometimes feel but don’t usually hear. A microphone would be useless; what was needed was a vibration sensor, and a packet of piezo vibration sensors (the SEN-09198) was provided. We did not have any amplifiers to boost the signal, so the sensors were wired directly onto the signal pins; this meant that the dynamic range of the signal was about 50 times smaller than the input could handle and our test vibrations were created by banging on the table on which the sensor sat (power spectrum of me doing just that below).

Power spectrum of fist banging on table

We worked away on the vehicle vibration idea waiting for something better to come along. Some household appliances contain motors that cause them to vibrate; perhaps we could detect these vibrations. The use case here was a fridge next to a wall transmitting vibration into another room which then generated a hard to pin-down noise in that room. Around this time it occurred to us that a ghost detector (everybody knows that walls vibrate when ghosts travel through them) might be a bigger market; after all, occult books sell well.

IBM were an event sponsor and had people on-hand to help us set up and use Bluemix, their cloud offering. The Edison board was communicating with our laptops via wifi and this got rerouted so the data was sent to IBM’s cloud storage. Collecting everybody’s vibration data meant we could do pattern matching on the power spectrum of the vibration to identify candidates for the device most likely to be generating it (whether the vibration characteristics of different devices are sufficiently different for the pattern matching to be able to distinguish them at some useful level is another matter; I imagine that the shape of a fridge causes its vibration to modulate slightly from the frequency of the local electricity supply).
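
The kind of matching we had in mind is sketched below (not the hackathon code; reference_spectra is an assumed list of named spectra for known devices, all computed from readings windows of the same length):

# Power spectrum of a window of sensor readings, and a crude matcher that
# picks the reference device whose spectrum correlates most strongly with it.
power_spectrum = function(readings)
   {
   Mod(fft(readings - mean(readings)))^2
   }

best_match = function(readings, reference_spectra)
   {
   spect = power_spectrum(readings)
   names(which.max(sapply(reference_spectra, function(ref) cor(spect, ref))))
   }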

One of the problems of producing a working demo at a hackathon is being held to a much higher standard of reality. Teams whose output is slideware illustrating the tenuous connection its members have to reality don’t seem to have any trouble getting unquestioning agreement that anything they propose will just work. We were asked how we would obtain the vibration data for different household devices, against which the customer data could be compared; a marketing type capable of convincingly talking about our crowd-sourced solution would have been useful.

Late Saturday afternoon saw team SoundHound out on the streets measuring vehicle vibration. Unfortunately, large lorries traveling at speed are thin on the ground in central London; also, guys standing next to a laptop with wires attached to something on the ground, filming everything that went past, had the effect of causing drivers to slow down to low vibration speeds (rather than speeding up to get away from us).

Perhaps we should have gone with the ghost detector sales pitch. The cloud data storage could have been rebranded as The Ethereum data matrix and flashing LEDs added to the hand-held sensor (the penguins below send the wrong message).

If it is possible to distinguish home appliances based on their vibration characteristics, and a database is created, there is an opportunity for salesmen to systematically check each house/flat ‘listening’ for appliances that are close to failing (motors vibrate more as they wear out): “Hello Sir, your fridge/boiler/air-conditioning is getting old and this week we have a special offer on replacements…”.

The picture below shows our 3-axis vibration sensor (plastic mounting/vibration interface laser cut by Elen from the FabLab, who do brown bag lunch and learn sessions if you are ever in central London). Two axes would probably have been enough, and maybe even one, but we had the sensors and wanted to impress.

3-axis vibration sensor

Extracting the original data from a heatmap image

March 4th, 2015

The paper Analysis of the Linux Kernel Evolution Using Code Clone Coverage analysed 136 versions of Linux (from 1.0 to 2.6.18.3) and calculated the amount of source code that was shared, going forward, between each pair of these versions. When I saw the heatmap at the end of the paper (see below) I knew it had to appear in my book. The paper was published in 2007 and I knew from experience that the probability of seven year old data still being available was small, but it looked so interesting I had to try. I emailed the authors (Simone Livieri, Yoshiki Higo, Makoto Matsushita and Katsuro Inoue) and received a reply from Makoto Matsushita saying that he had searched for the data and had been able to find the original images created for the paper, which he kindly sent me.

Shared code between Linux releases

I was confident that I could reverse engineer the original values from the png image and that is what I have just done (I have previously reverse engineered the points in a pdf plot by interpreting the pdf commands to figure out relative page locations).

The idea I had was to find the x/y coordinates of the edge of the staircase running from top left to bottom right. Those black lines appear to complicate things, but the RGB representation of black follows the same pattern as white, i.e., all three components are equal (0 for black and 1 for white). All I had to do was locate the first pixel containing an RGB value whose three components had at least one different value, which proved to be remarkably easy to do using R’s vector operations.
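
The following sketch shows the general idea (the file name is made up; readPNG is from the png package mentioned below):

# readPNG returns a height x width x channel array, with component values in
# the range 0-1; a pixel is black/white/grey when its R, G and B components
# are all equal.
library("png")

img = readPNG("linux_clone_heatmap.png")

is_colored = (img[ , , 1] != img[ , , 2]) | (img[ , , 2] != img[ , , 3])

# x coordinate of the first colored pixel in each row, i.e., the staircase edge.
edge_x = apply(is_colored, 1, function(row) which(row)[1])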

After reducing duplicate sequences to a single item I now had the x/y coordinates of the colored rectangle for each version pair; extracting an RGB value for each pair of Linux releases was one R statement. Now I needed to map RGB values to the zero to one scale denoting the amount of shared Linux source. The color scale under the heatmap contains the information needed and with some trial and error I isolated a vector of RGB pixels from this scale. Passing the offset of each RGB value on this scale to mapvalues (in the plyr package) converted the extracted RGB values.
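
A sketch of the mapping (the hack used mapvalues from the plyr package; match does much the same job here; scale_pixels and pair_pixels are assumed to be N-by-3 matrices of extracted RGB values):

# Convert each RGB triple to a single color string, locate each extracted
# color on the legend scale (left to right) and rescale its position to 0-1.
scale_colors = rgb(scale_pixels[ , 1], scale_pixels[ , 2], scale_pixels[ , 3])
pair_colors = rgb(pair_pixels[ , 1], pair_pixels[ , 2], pair_pixels[ , 3])

shared_source = (match(pair_colors, scale_colors) - 1) / (length(scale_colors) - 1)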

The extracted array has 130 rows/columns, which means information on 5 versions has been lost (no history is given for the last version). At the moment I am not too bothered, most of the data has been extracted.

Here is the result of calling the R function readPNG (from the png package) to read the original file, mapping the created array of RGB values to amount of Linux source in each version pair and calling the function image to display this array (I have gone for maximum color impact; the code has no for loops):

Heatmap of extracted data

The original varied the width of the staircase, perhaps by some measure of the amount of source code. I have not done that.

It’s suspicious that the letter A is not visible in some form. It’s embedded in the original data and I would have expected a couple of hits on that black outline.

The above overview has not bored the reader with my stupidities that occurred along the way.

If you improve the code to handle other heatmap data extraction problems, please share the code.

Hardware variability may be greater than algorithmic improvement

February 24th, 2015

I’m giving a talk at COW 39 this week and it is more user friendly to include links in this summary than link to a pdf of the slides (which actually looks horrible).

Microelectronic fabrication has now reached the point where it etches and deposits handfuls of atoms (around 20 or so). One consequence of working with such a small number of atoms is that variations in the fabrication process (e.g., plus or minus a few atoms here and there) can have a significant impact on component characteristics, e.g., a transistor consumes more/less power or can be switched faster/slower. It might be argued that things will average out over the few hundred+ million components inside a device and that all devices will be essentially the same; measurements show that in practice there is a lot of variation across the devices.

Short version: Some properties of supposedly identical microelectronic devices now vary by around 10% and this variability is likely to get larger in the future.

Nearly all published papers involving computer system power measurement are based on measuring a single system. Many of the claimed algorithmic improvements are less than 10%, i.e., less than the expected variation in power consumption across supposedly identical devices/systems. These days any empirical paper involving power consumption has to include measurements from many devices/systems if any credibility is to be given to the findings.

The following measurement data has been found while researching for a book (code and data used for talk); a beta version of the pdf will be available for download real soon now.

The following plot shows the feature size, in silicon atoms, of released microprocessors. Data is from Danowitz et al.

Line width of latest devices over time

A study by Wanner, Apte, Balani, Gupta and Srivastava measured the power consumed by 10 separate Atmel SAM3U microcontrollers at various temperatures (embedded processors get used in a wide range of environments). The following plot shows how power consumption varied between processors at different temperatures when idling and executing. The relationship between the power characteristics of different processors changes with temperature; a large amount when idling and a little bit when executing.

Power consumption at various temperatures

A study by Balaji, McCullough, Gupta and Agarwal measured the power consumption of six different Intel Core i5-540M processors, running at various frequencies, executing the SPEC2000 benchmark. The lines in the following plot were fitted to the data (grey crosses) using linear regression. The relationship between the power characteristics of different processors changes with clock frequency.

Power consumption at various frequencies
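
Fitting per-processor lines of this kind is a one-liner in R (a sketch, assuming a data frame of measurements with power, frequency and processor columns):

# Straight-line fit of power against clock frequency, with a separate slope
# and intercept for each processor; measurements is an assumed data frame.
power_model = lm(power ~ frequency * processor, data = measurements)
summary(power_model)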

A study by Bircher measured the power consumption of various system components, of an Intel server containing four Pentium 4 Xeons, when executing the SPEC CPU2006 benchmark (the power distribution for mobile devices would show the screen being the largest user of power, but I don’t have that data).

Power consumption at various frequencies

A study by Bircher measured memory (blue) and cpu (red) power consumption, in watts, when executing the SPEC CPU2006 benchmarks. There is no point optimizing cpu power if over half the total power is consumed by memory.

Memory and DIMM power consumption for various programs

A study by Krevat, Tucek and Ganger measured the performance of then-modern disk drives originally sold in 2002 (left) and 2006 (right). Different colors showing through on the right indicate that some disks have different performance characteristics than the others (faster performance -> less time -> less power; i.e., I don’t have power data for disk differences).

I/O performance of disc drives sold in 2002 and 2006

A study by Kalibera et al. executed a benchmark 2,048 times, followed by a system reboot, repeated 10 times. Performance is consistent within a particular reboot, but not across reboots.

FFT benchmark performance after reboots

The following plot is in the slide deck and is included here as a teaser (i.e., more later). It shows data for 2,386 processors (x-axis is slowdown, y-axis clock frequency, the legend at top right gives max power), which thanks to Barry is now publicly available (or at least will be once some web details are sorted out).

Line width of latest devices over time

Long tail licensing

February 16th, 2015

Team ‘Long Tail Licensing’ (Richard, Pavel, Gary and yours truly) took part in the Fintech startupbootcamp hackathon at the weekend.

As the team name suggests, the plan was to implement a system of payment and licensing for products in the long tail, i.e., a large number of low value products. Paypal is good for long tail payment, but does not provide a way for third parties to verify that a transaction has occurred (in fact Paypal does its best to keep transactions secret from everybody except those directly involved).

Our example use case was licensing of individual Github repositories. Most of today’s 3.4 million developers with accounts on Github would rather add more features to their code than try to sell it; the 16.7 million repositories definitely qualify as a long tail of low value products (i.e., under £100). Yes, Paypal could be (and is) used as a method of obtaining payment, but there is no friction-free method for handling licensing (e.g., providing proof of licensing to third parties).

Long Tail Licensing’s implementation used cryptocurrency for both payment and proof of licensing (by storing license information in the blockchain). For the hackathon we set up our own private Bitcoin blockchain to act as a test rig, supply fast mining and provide near instantaneous response.

To use Long Tail Licensing a developer creates the file .cryptolicense in the top level directory of their repo; this file contains information on the amount to pay, cryptocurrency account details and the text of the licensing terms. A link in the README.md file points at our server, which validates the .cryptolicense file and sets up a payment transaction from the licensee’s Bitcoin wallet; the licensee confirms the transaction and the payment is made.

The developer’s chosen license information is included in the blockchain transaction, providing the paperwork that third parties can view to verify what has been licensed. This licensing information could be in plain text, or public key encryption could be used to restrict who can read it (e.g., eBay could publish a public key that third parties could use to encrypt information so that only eBay’s compliance department could read it).

The implementation code includes links to private servers and other stuff that it should not; hackathon code is rarely written with security in mind. So those involved would rather it not be pushed to Github (perhaps it will get tidied up and made suitable for public consumption at a later date).

We did not win any of the prizes :-( . Well done to Manoj (a frequent hackathon collaborator) and his team for winning the $100k of Google cloud time prize.