The Shape of Code

Home > Uncategorized > Transition probabilities when adjacent sequence items must be different

Transition probabilities when adjacent sequence items must be different

September 25, 2012 Derek Jones Leave a comment Go to comments

Generating a random sequence from a fixed set of items is a common requirement, e.g., given the items A, B and C we might generate the sequence BACABCCBABC. Often the randomness is tempered by requirements such as each item having each item appear a given number of times in a sequence of a given length, e.g., in a random sequence of 100 items A appears 20 times, B 40 times and C 40 times. If there are rules about what pairs of items may appear in the sequence (e.g., no identical items adjacent to each other), then sequence generation starts to get a bit complicated.

Let’s say we want our sequence to contain: A 6 times, B 12 times and C 12 times, and no same item pairs to appear (i.e., no AA, BB or CC). The obvious solution is to use a transition matrix containing the probability of generating the next item to be added to the end of the sequence based on knowing the item currently at the end of the sequence.

My thinking goes as follows:

given A was last generated there is an equal probability of it being followed by B or C,
given B was last generated there is a 6/(6+12) probability of it being followed by A and a 12/(6+12) probability of it being followed by C,
given C was last generated there is a 6/(6+12) probability of it being followed by A and a 12/(6+12) probability of it being followed by B.

giving the following transition matrix (this row by row approach having the obvious generalization to more items):

                 Second item
               A      B       C
          A    0     .5      .5
First
item      B   .33     0      .67
 
          C   .33    .67      0

Having read Generating constrained randomized sequences: Item frequency matters by Robert M. French and Pierre Perruchet (from whom I take these examples and algorithm on which the R code is based), I now know this algorithm for generating transition matrices is wrong. Before reading any further you might like to try and figure out why.

The key insight is that the number of XY pairs (reading the sequence left to right) must equal the number of YX pairs (reading right to left) where X and Y are different items from the fixed set (and sequence edge effects are ignored).

If we take the above matrix and multiply it by the number of each item we get the following (if A occurs 6 times it will be followed by B 3 times and C 3 times, if B occurs 12 times it will be followed by A 4 times and C 8 times, etc):

                 Second item
               A      B       C
          A    0      3       3
First
item      B    4      0       8
 
          C    4      8       0

which implies the sequence will contain AB 3 times when counted forward and BA 4 times when counted backwards (and similarly for AC/CA). This cannot happen, the matrix is not internally consistent.

The correct numbers are:

                 Second item
               A      B       C
          A    0      3       3
First
item      B    3      0       9
 
          C    3      9       0

giving the probability transition matrix:

                 Second item
               A      B       C
          A    0     .5      .5
First
item      B   .25     0      .75
 
          C   .25    .75      0

This kind of sequence generation occurs in testing and I wonder how many people have made the same mistake as me and scratched their heads over small deviations from the expected results.

The R code to calculate the transition matrix is straight forward but obscure unless you have the article to hand:

# Calculate expression (3) from:
# Generating constrained randomized sequences: Item frequency matters
# Robert M. French and Pierre Perruchet
 
transition_count=function(item_count)
{
N_total=sum(item_count)
 
# expected number of transitions
ni_nj=(item_count %*% t(item_count))/(N_total-1)
diag(ni_nj) = 0
 
# expected number of repeats
d_k=item_count*(item_count-1)/(N_total-1)
 
# Now juggle stuff around to put the repeats someplace else
n=sum(ni_nj)
n_k=rowSums(ni_nj)
s_k=n - 2*n_k
 
R_i=d_k / s_k
R=sum(R_i)
 
new_ij=ni_nj*(1-R) + (n_k %*% t(R_i)) +  (R_i %*% t(n_k))
diag(new_ij)=0
 
return(new_ij)
}
 
transition_prob=function(item_count)
{
tc=transition_count(item_count)
 
tp=tc / rowSums(tc)  # relies on recycling
 
return(tp)
}

the following calls:

transition_count(c(6, 12, 12))
transition_prob(c(6, 12, 12))

return the expected results.

French and Perruchet provide an Excel spreadsheet (note this contains a bug, the formula in cell F20 should start with F5 rather than F6).

Categories: Uncategorized Tags: R, random, sequence, testing

Comments (2) Trackbacks (0) Leave a comment Trackback

jean yves torres

February 4, 2013 12:52 | #1

Reply | Quote

Hi

Do you have the translation in langage C ?

Thank in advance
rgds
JYT
Derek Jones

February 4, 2013 13:13 | #2

Reply | Quote

@jean yves torres
I used R because it kept the code short and simple. I don’t have a version in any other language.

No trackbacks yet.

Sequence generation with no duplicate pairs Impact of compiler optimization level on recovery from a hardware error

Transition probabilities when adjacent sequence items must be different

Recent Posts

Recent Comments

Archives

Meta