
Success does not require understanding

I took part in the second Data Science London Hackathon last weekend (also my second hackathon) and it was a very different experience compared to the first hackathon. Once again Carlos and his team really looked after us.

  • The data was released 24 hours before the competition started and even though I had spent less than half an hour looking at it, at the start of the competition I did not feel under any time pressure; those 24 hours allowed me to get used to the data and come up with some useful looking ideas.
  • The instructions for the first competition had suggested that people form teams of 3-5 and there was a lot of networking happening before the start time. There was no such suggestion this time and as I networked around looking for people to work with I was surprised by the number of people who wanted to work alone; Jonny and Kannappan were the only members from my previous team (the Outliers) who had entered this event, with Kannappan wanting to work alone and Jonny joining me to create a two person team.
  • There was less community spirit this time. Possible reasons include: many more single-person teams sitting in corners doing their own thing; fewer people attending (it is the middle of the holiday season); fewer people staying over until the Sunday (perhaps single-person teams got disheartened and left, or the extra 24 hours of data access meant that teams ran out of ideas/commitment after 36 hours); or me being reduced to a single-person team (Jonny had to leave at 20:00), which meant I paid more attention to what was happening on the floor.

The problem was to predict what ratings different people would give to various music artists. We were given data involving 50 artists and 48,645 users (artists and users were anonymous) in five files (one contained the training dataset and another the test dataset).

A quick analysis of the data showed that while there were several thousand rows of data per artist there were only half a dozen rows per person, a very sparse dataset.
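That quick analysis can be sketched in a few lines of Python/pandas (the tiny table below is made up for illustration; the real files held 50 artists and ~48,645 users in a layout not shown here):

```python
import pandas as pd

# Hypothetical long-format ratings table: one row per (user, artist) rating.
ratings = pd.DataFrame({
    "User":   [1, 1, 2, 2, 2, 3],
    "Artist": ["A", "B", "A", "C", "D", "B"],
    "Rating": [90, 55, 70, 30, 85, 60],
})

# Row counts per artist vs per user expose the sparsity:
# thousands of rows per artist can still mean only a
# handful of ratings per user.
rows_per_artist = ratings.groupby("Artist").size()
rows_per_user = ratings.groupby("User").size()

print(rows_per_artist.mean())
print(rows_per_user.mean())
```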

The most frequent technique I heard mentioned during my initial conversations with attendees was machine learning. In my line of work I am constantly trying to understand what is going on (the purpose of this understanding is to control and make things better) and had considered anybody who uses machine learning to be clueless, dim-witted or just plain lazy; the problem with machine learning is that it gives answers without explanations (OK, decision trees do provide some insights). This insistence on understanding turned out to be my major mistake: the competition metric is based on correctness of answers, not on how well a competitor understands the problem domain. I had a brief conversation with a senior executive from EMI (who supplied the dataset and provided some of the sponsorship money) who showed up on Sunday morning; he had no problem with machine learning providing answers without explanations.

Having been overly ambitious last time, team Outliers went for extreme simplicity and started out with the linear model glm(Rating ~ AGE + GENDER...) being built for each artist (i.e., 50 models). For a small amount of work we got a score of just over 21 and a place of around 70th on the leader board; now we just needed to include information along the lines of “people who like Artist X also like Artist Y”. Unfortunately the only other member of my team (who did not share my view of machine learning and knew something about it) had a prior appointment and had to leave, leaving me to consume lots of cpu time on a wild goose chase that required me to have understanding.
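In Python terms the per-artist approach looks roughly like the sketch below (scikit-learn standing in for R's glm; the rows and the numeric gender encoding are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training rows: (artist, age, gender, rating).
train = [
    ("A", 25, 0, 80), ("A", 40, 1, 55), ("A", 31, 0, 70),
    ("B", 25, 0, 30), ("B", 40, 1, 90), ("B", 31, 1, 60),
]

# One linear model per artist, Rating ~ AGE + GENDER,
# mirroring the 50 separate glm() fits described above.
models = {}
for artist in {a for a, *_ in train}:
    rows = [(age, gender, rating)
            for a, age, gender, rating in train if a == artist]
    X = np.array([r[:2] for r in rows], dtype=float)
    y = np.array([r[2] for r in rows], dtype=float)
    models[artist] = LinearRegression().fit(X, y)

# Predict a rating for a new 30-year-old listener of artist "A".
pred = models["A"].predict(np.array([[30.0, 0.0]]))
```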

The advantages of being in a team include getting feedback from other members (e.g., why are you wasting your time doing that, look how much better this approach is) and having access to different skill sets (e.g., knowing what magic pixie dust values to use for the optional parameters to machine learning routines). It was Sunday morning before I abandoned the ‘understanding’ approach and started thrashing around using various machine learning techniques, which told me that people demographics (e.g., age and gender) were not particularly good predictors compared to other data, but did not reduce my score to the 13-14 range that could be seen on the leader board’s top 20.
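The kind of check that tells you demographics are weak predictors can be sketched as follows (scikit-learn rather than the R packages used on the day; the data is fabricated so that a made-up affinity score carries most of the signal):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Fabricated example: ratings driven mostly by an artist-affinity
# score, with age contributing little and gender nothing.
age = rng.integers(15, 65, n).astype(float)
gender = rng.integers(0, 2, n).astype(float)
affinity = rng.normal(0, 1, n)
rating = 50 + 20 * affinity + 0.1 * age + rng.normal(0, 5, n)

X = np.column_stack([age, gender, affinity])
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, rating)

# Feature importances rank affinity far above the demographics.
for name, imp in zip(["age", "gender", "affinity"],
                     forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```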

Realizing that seven hours was not enough time to learn how to drive R’s machine learning packages well enough to get me into the top ten, I switched tack and spent a lot more time wandering around chatting to people; those whose score was worse than mine were generally willing to chat. Some had gotten completely bogged down in data cleaning and figuring out how to handle missing data (a subject rarely covered in books but of huge importance in real life). I was surprised to find one team doing most of their coding in SQL (my suggestion to only consider Age+Gender improved their score from 35 to 22), and I mocked the people using Clojure (people using a Lisp-derived language think they have discovered the one true way and will suffer from self-doubt if they are not mocked regularly). Afterwards it struck me that well over 50% of attendees were not British (based on their accents); was this yet another indicator of how far British universities had dumbed down mathematics teaching, such that natives did not feel up to the challenge (well done to the Bristol undergraduate who turned up), or were the most gung-ho technical folk in London those who had traveled here from abroad to work?

The London winner was Dell Zhang, the only other person sitting at the table I was on (he sat opposite me throughout the competition), who worked quietly away for the whole 24 hours and seemed permanently unimpressed by the score he was achieving; he described his technique as “brute force random forest using Python” (the source will be made available on the Data Science website).

Reading through posts made by competitors after the event was as interesting as last time. Factorization Machines seem to be the hot new technique for making predictions based on very sparse data, and libFM is the software I needed to know about last weekend (no R package providing an interface to this C++ code is available yet).
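The model libFM fits is small enough to sketch: a second-order factorization machine predicts w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ,vⱼ⟩xᵢxⱼ, where each feature gets a latent factor vector vᵢ so that pairwise interactions can be estimated even from very sparse data. A minimal Python version of the prediction step (parameter values made up; no training loop) is:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction.

    x  : feature vector (very sparse in practice: one-hot user + one-hot artist)
    w0 : global bias
    w  : per-feature linear weights
    V  : per-feature latent factor vectors (one row per feature)
    """
    linear = w0 + w @ x
    # Pairwise term computed via the O(k*n) identity:
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f ((sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2)
    s = V.T @ x
    pairwise = 0.5 * np.sum(s**2 - (V.T**2) @ (x**2))
    return linear + pairwise

# Toy example: 3 features, 2 latent factors (all values made up).
x = np.array([1.0, 0.0, 1.0])     # e.g. one active user + one active artist
w0 = 10.0
w = np.array([0.5, -0.2, 1.0])
V = np.array([[0.1, 0.3],
              [0.4, 0.0],
              [0.2, -0.1]])
rating = fm_predict(x, w0, w, V)
```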

  1. yuenking
    July 24, 2012 08:40 | #1

    Hi,
    It was great chatting with you on that day (and night) and I hope that you do remember me! (the guy who keeps stealing your Brazilian nuts)
    I disagree with this point in your otherwise excellent post – “anybody who uses machine learning as being clueless, dim witted or just plain lazy”.
    Firstly, machine learning is a broad category of techniques, linear regression can be considered part of it too! Knowing how to formulate the problem and choosing a good algorithm is part of the challenge.
    Secondly, some understanding of the problem is needed for good feature selection. I am sure there are many people who used random forest like Dell, but they are not getting the same result.
    Lastly (similar to the previous point), understanding the data is vital to a good (or not too bad) score; it is not as black-box as you think. You can look at my submission as a benchmark separating people who understand the problem from those who don’t: my submission was just the Artist-User mean Rating + a little tuning + rough imputation for missing data, basically a modification of the sample code provided, and it has an RMSE of 14.5XX. If a model is built with this as a feature (instead of as the answer, like I have lazily done), the score should reach sub-14.
    Cheers,
    King

  2. July 24, 2012 10:54 | #2

    @yuenking
    Yes, I think you were the guy taking the nuts one at a time rather than a handful.

    There is certainly a lot of skill involved in selecting the appropriate machine learning technique to apply and the values to use for its options (as shown by people who used the same basic techniques and got very different scores).

    Let’s take the example of a fluid moving across a surface (e.g., an aircraft wing or the hull of a ship). If we have lots of measurements of the velocity and direction of the fluid we might decide to use machine learning to make predictions. However, this problem has been ‘solved’ in the sense that there is a set of equations, the Navier-Stokes equations, that describe fluid flow and can be used to obtain a great deal of understanding about what is going on. The Navier-Stokes equations tell us what attributes need to be measured to calculate a value of interest; machine learning will only tell us things about those measurements we happen to have fed into the model.
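    (For reference, the incompressible Navier-Stokes equations take the form:)

```latex
\nabla \cdot \mathbf{u} = 0, \qquad
\rho\left(\frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u} \cdot \nabla)\mathbf{u}\right)
  = -\nabla p + \mu \nabla^{2}\mathbf{u} + \mathbf{f}
```

    where u is the fluid velocity field, p the pressure, ρ the density, μ the viscosity and f any body forces.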

    As things stand today psychologists do not know enough about people to be able to formulate a set of equations, derived from human characteristics, that model music preferences; that level of understanding is not available to us. Machine learning is a useful technique for extracting information from data that is available and looks like it gives companies something to work with while they wait for psychologists to discover the applicable set of equations.

  3. July 24, 2012 13:18 | #3

    Interesting read. Thanks!

    However, I feel the title very subtly obfuscates a larger point that should be made. Success does most definitely require understanding, but not necessarily of how the exact solution came about.

    To be successful in any machine learning effort, one needs to have intricate understanding of what the problem is and how techniques can be applied to find a solution. I feel your interpretation of understanding – “the purpose of this understanding is to control and make things better” – falls short in this regard. Understanding exists on many more levels.

    As an example: to me, the engine of my car is a black box; I have very little idea how it works. My mechanic does know how engines work in general, but he cannot know the exact internal state of the engine in my car as I am cruising down the highway at 100 miles per hour. None of this “lack of understanding” prevents me from getting from A to B. I turn the wheel, I push the pedal and off we go.

    Fluid moving across a surface is actually an excellent example of the same idea. The Navier-Stokes equations might give you a nice little model that helps you make predictions and makes you feel like you understand, but the reality is that fluids are made up of gazillions of molecules, all of which you cannot hope to capture or understand. The model might work in many cases, but so does Newton’s law of universal gravitation.

    I’m not saying that models are not useful, they can be very useful, but they are merely tools to help you navigate a vastly complex world. Very much like machine learning models. 🙂
