
Incompetence borne of excessive cleverness

I have just got back from the 24-hour Data Science Global Hackathon; I was an on-site participant at Hub Westminster in London (thanks to Carlos and his team for doing such a great job looking after us all; around 50 of the 100 who registered turned up, a similar percentage to the other cities around the world). Participants had to be registered by 11:00 UTC and self-organize into teams of 3-5 ready for the start at 12:00 UTC, finishing 24 hours later. The worldwide event had been organized by our London hosts, who told us they expected the winning team to come from those in the room; Team Outliers (Wang, Jonny, Kannappan, Bob, Simon, yours truly, and Fran for the afternoon) started in an optimistic mood.

At 12:00 an air-quality training dataset plus test points were made available, and teams were given the opportunity to submit eight predictions in each of the two 12-hour periods. The online submissions were evaluated by Kaggle (one of the sponsors, along with EMC) to produce a mean absolute error that was used to rank teams.

The day before the event I had seen a press release saying that the task would involve air quality, and a quick trawl of the Internet threw up just the R package I needed, OpenAir; I also read a couple of Wikipedia articles on air pollution.

Team Outliers individually spent the first hour becoming familiar with the data and then got together to discuss ideas. Since I had a marker pen, was sitting next to a whiteboard, and was the only person with some gray hair, I attempted to manage the herding of the data-science cats. Later I plotted the pollution-monitoring sites on Google Maps and produced some visually impressive wind roses (these contributed nothing towards a better solution, but if we had had a client they could have been used to give the impression we were doing something useful).

People had various ideas about which techniques to use for building the best model and how the measurements present in the training set might be used to predict air quality (the training data had names such as target_N_S, where N and S were small integer values denoting the kind of pollutant and the site where the measurement was made). The training set included measurements of wind speed/direction and hours of sunlight, and a couple of members wanted to investigate whether these would make good predictors. Team Outliers had people looking at all the fancy stuff you find in textbooks, e.g., ARIMA for time series and SVMs for machine learning, while I was looking at getting the data into the form needed by OpenAir.

There were all sorts of problems with the data, just like real life: missing values (lots of them), some pollutants measured at only one or two sites, and fuzzy values such as the 'most common month' (whatever that might be). Some people looked at how best to overcome these data-quality problems.

20:30 arrived and some great food was laid out for dinner; no actual predictions yet, but they would be arriving real soon now. A couple of hours later my data-formatting project crashed and burned (running as a 32-bit process, R got upset about my request to create a vector needing 5.2 GB, so there was no chance of using the R-based OpenAir package, which needed the data in a different format from the one we had been given). Fifteen minutes to midnight I decided that we either used the eight submissions permitted in the first 12 hours or lost them, and wrote a dozen lines of R that built a linear model using one predictor variable (which I knew from some earlier plots was far from linear, but the coding was trivial and the lm function would take no time to build separate models for each of the 30-odd response variables). I submitted the predictions and we appeared on the scoreboard at number 65 out of 112. Being better than 47 other teams was a bit of a surprise.
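The panic fallback amounts to fitting an independent straight line for each response column. The original was a dozen lines of R built around lm; the sketch below is a hypothetical Python reconstruction (the column names and sample values are made up for illustration), fitting each target by closed-form least squares while skipping missing values:

```python
# Hypothetical reconstruction of the one-predictor fallback (the original
# used R's lm): fit y = a + b*x independently for each response column,
# skipping missing values, then predict at a test point.

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x over the non-missing (x, y) pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    return my - b * mx, b  # intercept, slope

predictor = [1.0, 2.0, 3.0, 4.0]            # e.g. hour of day (illustrative)
targets = {                                  # target_N_S columns, with gaps
    "target_1_1": [2.1, 4.0, None, 8.2],
    "target_2_1": [0.5, 0.9, 1.6, None],
}
models = {name: fit_line(predictor, ys) for name, ys in targets.items()}
test_x = 5.0
preds = {name: a + b * test_x for name, (a, b) in models.items()}
```

With 30-odd response columns this loop over targets is exactly why the approach took no time to code: one fitting routine, applied mechanically to every column.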

Panic over, we realised that the first 12 hours ended at 00:00 UTC, which was 01:00 BST (British Summer Time), so we had another hour. Wang made a couple of submissions that improved our score, and at around 02:00 I went to grab a few hours' sleep.

I was back online at 06:00 to find that Team Outliers had slipped to 95th place, as the Melbourne and San Francisco teams had improved during their daytime. More good food for breakfast at 08:30.

Jonny drew attention to the fact that the mean absolute error of our team's current score was almost twice that of the sample solution provided with the data. We had long ago dismissed this solution as too simplistic: it was effectively a database lookup, calculating the mean value of each pollutant in the training set at the various chunkID and hour points, which were then used as keys for looking up the prediction values required by the test dataset. Maybe team members ought to focus their attention on tweaking this very simple approach rather than on our 'cleverer' approaches.

I suggested modifying the sample solution to use the median rather than the mean (the median being less susceptible to outliers); this boosted our ranking back into the 60s. Jonny and Simon tried using a rolling mean, with no improvement; Wang tried other variations, also with no improvement.
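The lookup-table baseline and the median tweak can be sketched together. This is a minimal illustration, not the actual sample solution: the (chunkID, hour) keys come from the post, but the field layout and values are assumptions. The point is how little changes between the two variants, and why the median shrugs off outliers:

```python
# Sketch of the sample-solution-style baseline (hypothetical data layout):
# group training measurements by (chunkID, hour), then answer test queries
# by looking up a summary statistic for that key. Swapping mean for median
# was the one change that actually improved our score.

from collections import defaultdict
from statistics import mean, median

def build_lookup(rows, statistic=median):
    """rows: (chunkID, hour, value) triples -> {(chunkID, hour): summary}."""
    groups = defaultdict(list)
    for chunk_id, hour, value in rows:
        groups[(chunk_id, hour)].append(value)
    return {key: statistic(vals) for key, vals in groups.items()}

train = [(1, 0, 10.0), (1, 0, 12.0), (1, 0, 95.0),   # one outlier reading
         (1, 1, 5.0),  (1, 1, 7.0)]
by_mean = build_lookup(train, statistic=mean)
by_median = build_lookup(train, statistic=median)
# The outlier drags the mean for key (1, 0) up to 39.0,
# while the median stays at 12.0.
```

A prediction for a test point is then just `by_median[(chunk_id, hour)]`; pollution readings have occasional extreme spikes, which is why the robust statistic beat the mean here.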

Team Outliers finished the Hackathon in joint 61st place, along with 22 other teams, out of 114 submissions.

What did we do wrong? Mistakes include:

  • trying to do too much, given our various skill levels and the time available. For instance, I don't regularly reformat large data tables or calculate the error in machine-learning models, and while I can easily knock out the code, I still have to sit down and think about what needs to be written; somebody who does this sort of thing regularly would just know how to do it. Other team members seemed very familiar with the theory but were not used to churning out code quickly.
  • spending too much time studying all of the various kinds of measurements available in the training set, many of which were not available in the test dataset. We should have started off ignoring training-set measurements that were absent from the test set, perhaps returning to them later if time permitted.

Members of Team Outliers enjoyed themselves but were a little crestfallen that our clever stuff was not as good as such a crude, but insightful, approach. Most of us used R; a few made use of awk, Python, spreadsheets, and the Unix shell.

Our hosts are looking to run more data science hackathons this year, in particular one related to the music industry in a few months' time. If you are interested in taking part, keep an eye on their website.

Update (later the next day)

At least one team achieved some good results using ARIMA. Fran had started building an ARIMA model, had to leave and nobody else picked it up; I should have been paying more attention to ensure that ideas did not disappear when people left.

  1. efrique
    April 30th, 2012 at 21:42 | #1

    Both interesting and valuable. Thank you for being simultaneously candid and entertaining.

  2. Jason
    May 4th, 2012 at 08:45 | #2

    PlanetThanet team here (finished 5th globally from my flat in London). Would have loved to come to the event but Carlos informed me you had to be on site for the full 24 hours, which simply is not practical for some competitors. Big shame. Looking forward to the next one.

  3. May 4th, 2012 at 11:53 | #3

    Well done on your performance. Carlos wants to build a community, and the best way to do that is with people meeting face to face. Team Outliers had members who grew up all over the world, with one member traveling from Sweden and another from Newcastle to attend. I hope Carlos penalizes those who booked a place and did not show, preventing others who would have turned up from attending.

    Being in your flat probably prevented you being infected by all the PhDs we had wandering around, which would have caused you to do something overly clever.

  4. Rob
    May 4th, 2012 at 14:21 | #4

    This sounds very like the experience we had in Chicago, and we came to many of the same conclusions.

  5. May 8th, 2012 at 11:19 | #5

    @derek thanks for your kind words. Your post reflects the experience that many contestants had, from the strategic approach to the competition to the tactics and nitty-gritty of the tools and datasets. It's great that you have shared your experience and the lessons learned (mistakes? not really). Your post is spot on. That is precisely what we want: to share ideas; to meet new people; to build a real community; to raise awareness about our community; to discuss challenges, solutions and issues about data science… It was great talking to you in the early hours while we shared some nice food! Thanks!

    @jason congratulations on your 5th global position. Obviously we are not here to police whether you stay the full 24 hours or leave after 5 hours to sleep at home. Most people came and stayed the 24 hours. Others came and left by 3am. Others left and came back by 8am. The important thing was to come over and participate in the event. Everyone had fun, everyone met new interesting, smart people. I even met many people that I only knew via email. The whole idea, as Derek says, was to build community, encourage face-to-face meetings, and share personal experiences. We had a lot of fun. Some people even got a job offer. Others went clubbing in Soho but came back saying "the hackathon was very exciting and too much fun." Some people got great data science ideas for their startups. Some founders of startups met new partners. Other people came up with great new ideas and suggestions… Come to the next one on June 23rd, you won't regret it!
