## Predicting the next value in an integer sequence

There was a Kaggle meetup group hackathon on Saturday. Integer sequence learning is a recently posted Kaggle challenge; build a model to predict the next value in an integer sequence, with example data coming from the On-Line Encyclopedia of Integer Sequences. How could I not want to try my hand at this challenge and I signed up for the hackathon hoping to find like-minded folk.

My only previous encounter with mining the OEIS was a paper that attempted to combine two or more existing sequences to match one existing sequence and to contribute a few sequences I had found.

The event was well attended and I found a fellow enthusiast in Lampros.

We kicked around a few ideas and while I jumped in and started investigating the characteristics of the data, Lampros started searching for solutions that others had found to the next value in an integer sequence problem (a much more sensible approach, but probably not as much fun as jumping in feet first; this was his first hackathon, so he has not picked up any bad habits yet ;-).

This is one of those few problems when over-fitting is required for things to work.

I concentrated on characterizing the 113,845 sequences in the training set and the following is a summary (code and data):

- 74,465 sequences contained a maximum value less than and 98,916 less than . So symbolic maths is going to be needed (a shame since I had found tscount, an R package for analyzing count time series),
- a goodly 72,202 sequences are in sorted order, leaving 41,643 to go up and down,
- The least significant digit of the values in many sequences are a subset of the ten possible values. The following is a list of the number of unique least significant digits in the training sequences:
1 2 3 4 5 6 7 8 9 10 1289 4692 6800 11589 15773 13701 7635 8644 10837 32885

After removing sequences containing fewer than 20 values (the -1 count)

-1 1 2 3 4 5 6 7 8 9 10 33398 764 3006 3171 6068 8930 7898 3690 5604 9073 32243

Lampros found Martin Rubey’s work on guessing formulas for sequences, which had an Open Source implementation using FriCAS (source code, which needs Steel Bank Common Lisp to build, which itself needs your OS to support 512 open files {OS X default is 256}).

Other software and papers include: the gfun package plus paper, sequence prediction in Mathematica and a Masters thesis on Inductive Inference of Integer Sequences.

These systems are all based on fitting a, potentially very complicated, polynomial to each sequence and achieve a success rate of around 20% on the complete OEIS.

What are the characteristics of the 80% for which a polynomial fit does not predict the next value? Perhaps there are not enough values specified in these sequence to fit the necessary polynomial. Perhaps the values grow at a faster rate than can be fitted by any polynomial expression, e.g., the Busy beaver function.

A 11:00-17:00 hackathon is not really enough time to get anything sensible up and running. People doing Masters projects had more hours and were managing to get around 20% correct.

The competition leaderboard currently has somebody with a score of 0.74. This is a big lead over everybody else. Am I being cynical by thinking the model might be reading the OEIS text to find the formula describing the sequence and then evaluating this?

Update

Comparing Computer Models Solving Number Series Problems provides an interesting review of systems using an AI based approach, i.e., trying to mimic what people do.

## Recent Comments