The Shape of Code

Home > Uncategorized > Randomizing a list of items using sort() and rand()

Randomizing a list of items using sort() and rand()

April 20, 2012 Derek Jones Leave a comment Go to comments

I’m busy putting together the experiment I will be running at the ACCU conference next week. If you are attending the conference please reserve your Thursday lunchtime slot for taking part as a subject!

Experiment generation invariably involves randomizing the sequence of items seen by every subject. While few languages support a randomise function many support sorting and random number generation. These two functions can be combined to create a randomize function; simply append a random number to the start of each item, sort it and then strip off the random number. Voilà a randomized list (awk code below).

function rand_items(items)
{
for (v in items)
   {
   items[v]=rand() " " items[v]
   }
asort(items)
for (v in items)
   {
   sp_pos=index(items[v], " ")
   items[v]=substr(items[v], sp_pos+1)
   }
}

This randomization problem is not yet listed on Rosetta code and probably has longer solutions in other languages.

Update (the next day).

The glow I have had for the last 10 years over coming up with a neat solution to a problem has now disappeared. Following the link kindly provided by D. Herring in the comments eventually lead me to the Fisher-Yates shuffle, which has O(n) performance (the call to sort is probably O(n log n) . The following shuffles a deck of cards:

for (i = 0; i < 52; i++)
   {
   j = i + (rand() % (52 - i));
   tmp = card[i];
   card[i] = card[j];
   card[j] = tmp;
   }

The proof of the uniform shuffling behavior of Fisher-Yates (also known as Knuth shuffle) is straight forward but not nearly as appealing as using rand and sort.

Categories: Uncategorized Tags: ACCU, awk randomize, experiment, Fisher-Yates

Comments (7) Trackbacks (0) Leave a comment Trackback

Anonymous

April 21, 2012 06:11 | #1

Reply | Quote

Using a similar idea you can randomly select N elements from a stream of elements of unknown size if you keep the N biggest values (after you prepend the random numbers).

For each element
…prepend random number
…if the new value is bigger than any of the saved values, replace the smallest saved value with this value.
strip N saved values of prefix
return the N saved values

So you don’t need to know the size of the stream/list you are reading, you will always have N randomly sampled elements as long as you read N elements 🙂
D Herring

April 21, 2012 12:33 | #2

Reply | Quote

Here’s the same algorithm in CL, circa 2008.
http://www.pvk.ca/Blog/Lisp/trivial_uniform_shuffling.html

I look forward to reading about your latest ACCU experiment. The results are usually quite interesting.
Derek-Jones

April 21, 2012 13:32 | #3

Reply | Quote

@D Herring
Thanks for the link which lead me to the Fisher-Yates shuffle and an update to the post.
Magnus

May 2, 2012 05:40 | #4

Reply | Quote

The standard C++ library has random_shuffle() in the algorithm include file, which is O(n).

random_shuffle(v.begin(), v.end());

which I suppose would be random_shuffle(begin(v), end(v)); in the new-fangled C++11. There’s a also an override where you can supply your own RNG.
Derek-Jones

May 2, 2012 15:10 | #5

Reply | Quote

@Magnus
Yes, it has been there since C++99. These days there are freely available libraries for doing almost everything. I guess my post should have said something like “if you are using a language that does not easily support linking against third party libraries (awk in my case) here is a useful technique”.
Paddy3118

June 5, 2012 09:24 | #6

Reply | Quote

The Knuth (a.k.a. Fisher-Yates shuffle) is on Rosetta Code here: http://rosettacode.org/wiki/Knuth_shuffle
Jonathan Thornburg

June 7, 2012 15:33 | #7

Reply | Quote

If you’re using C, you probably don’t want to use rand() : it’s a very poor random-number generator. (For example, on many implementations the least significant bit of rand()s result is either constant, or just tlggles bacn and forth between 0 and 1 with each successive call.) random() or drand48() are much better.