Saturday, February 13, 2010

Uniformly random

8A358779-D9B6-4C03-A4B4-8727DB3695BC

Every software developer should recognize the above string as a GUID.  Pronounced goo-id or gwid, it's a globally unique identifier: a pseudo-random 128-bit value, which means there are 2^128, or about 3.4 x 10^38, possible values.  If every computer in the world generated one GUID a second from the dawn of time until now, the GUID you see above theoretically would still never occur again.  Written out, a GUID consists of 32 characters with values from 0-F, or 0-15 for those who prefer to count in base 10.
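To make those numbers concrete, here is a minimal sketch in Python, chosen purely for illustration.  It assumes Python's standard uuid module (which calls these UUIDs) and only checks the arithmetic: 32 hex characters, and the size of the 128-bit space.

```python
import uuid

# Generate a random (version 4) GUID; Python calls these UUIDs.
guid = uuid.uuid4()

# Dropping the hyphens leaves 32 hexadecimal characters, 4 bits each.
hex_digits = guid.hex
print(guid, len(hex_digits))

# Size of the full 128-bit space, shown in scientific notation.
total_values = 2 ** 128
print(f"{total_values:.1e}")  # prints 3.4e+38
```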

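One of the best properties of a GUID is that it's uniformly random, which means that each of the 32 characters in the sequence has an equal chance of being any of the 16 possible values.  (Strictly speaking, a randomly generated GUID like the one above fixes a few bits to record its version and variant, which is why the third group starts with a 4, but everything else is random.)  This is really handy, and has some nice applications when it comes to data analysis.  If you assign each data element a GUID and sort based on that GUID, the data comes out in random order, so the first N elements are a random sample.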
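The sort-by-GUID idea can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: random_sample and records are hypothetical names, and the list of integers stands in for real data.

```python
import uuid

def random_sample(rows, k):
    """Return k rows picked by sorting on a freshly generated GUID per row."""
    # Pair each row with its own random GUID (hypothetical helper, for illustration).
    keyed = [(uuid.uuid4().hex, row) for row in rows]
    # Sorting on the GUID puts the rows in random order.
    keyed.sort(key=lambda pair: pair[0])
    # The first k rows of a randomly ordered list are a random sample.
    return [row for _, row in keyed[:k]]

records = list(range(10_000))  # stand-in data, not from any real table
sample = random_sample(records, 1000)
print(len(sample))
```

In everyday Python you'd probably just reach for random.sample; the point here is only to show that sorting on a random key does the same job, which is what makes the trick usable anywhere you can generate a GUID per row.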

Specifically, let's say you have a SQL Server database table and you'd like to get a random sample of 1000 records.  This query will get you there:

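-- Give every row a fresh GUID, sort by it, and keep the first 1000 rows.
select top 1000
    newid(),
    *
from
    MyTable
order by
    1  -- ordinal 1 is the first selected column, i.e. the newid() value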

Really neat.  Well, if you're a data geek it is at least...