Know your PRNGs - gzt does statistics

TL;DR version: if all you’re using the standalone R math library in C for is generating uniform random numbers, I have a little C program to remove that dependency as long as you don’t mind your seeds not having the same output as R. I might even fix that later.

I fell into a rabbit hole recently - I do a fair amount of work in C using the R standalone math library for pseudo-random number generation (PRNGs) and sometimes some of its special functions (rather than, say, working in C so it will be called from R or trying to implement a PRNG myself or a special function myself - leave RNGs and numerical analysis to experts). The usage is fairly simple - you call the standalone library, initialize your PRNG with set_seed(seed1, seed2) (this deserves some discussion, maybe later, it is an important topic), then go ahead and call runif() or whatever other function you’re interested in. There are a few pitfalls but that’s beyond the scope of this post. Please feel free to reach out if you ever need help, though!

Anyway, I’ve been working on a cluster recently where it just seems to be a pain to get anything installed on there. Rather than sort all that out, I looked at my code and realized all I was doing in this set of programs with the R math library was calling runif(). So, what would it take to replace it? I was simply using the default and I didn’t need perfect replication - that is to say, I didn’t need to get the same results as R with the same seeds, I just needed to be able to reproduce results if desired. I also want to be able to go back to the “real” library if I need something more than just the PRNG, so I only want to have to change a line or two in a header to switch.

So, what do I need to do?

Find out what PRNG is used by default for runif() and how it is implemented. Hopefully the C code in the the library is not too heavily SEXPed up or not too different from the reference implementation of the method. I’m an okay programmer but these are things that you don’t want to mess with.
Find out how set_seed() and initialization work and write a replacement function.
Put it into my own code.

So what PRNG are we using anyway

The helpful R documentation says the default method in R is the Mersenne Twister algorithm, which seems to be a de facto standard. The reference implementation is in pretty clean C code and seems fairly straightforward. Neat! There’s a little warning on the site that old methods of initialization aren’t so good, so we’ll have to see what R is doing and be careful when deciding what to do ourselves. In the future I might want to replicate the other methods that R offers but that adds some overhead.

The default method for the R standalone library called from C, though is the Marsaglia Multicarry. I could implement that to completely duplicate the Rmath.h experience or I could duplicate the actual R experience. I’m leaning toward the latter, as Marsaglia seems a little too easy!

Now to check the R code to see how they implement this. They have a helpful comment about what they’re doing.


/* ===================  Mersenne Twister ========================== */
/* From http://www.math.keio.ac.jp/~matumoto/emt.html */
/* New URL (accessed 2018-11-08):
   http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html

   The initialization method in the 1998 code and paper had a minor
   issue that was addressed with new initialization approaches in an
   update in 2002.  R has always used a different initialization
   approach and is not affected by that issue.
*/

/* A C-program for MT19937: Real number version([0,1)-interval)
   (1999/10/28)
     genrand() generates one pseudorandom real number (double)
   which is uniformly distributed on [0,1)-interval, for each
   call. sgenrand(seed) sets initial values to the working area
   of 624 words. Before genrand(), sgenrand(seed) must be
   called once. (seed is any 32-bit integer.)
   Integer generator is obtained by modifying two lines.
     Coded by Takuji Nishimura, considering the suggestions by
   Topher Cooper and Marc Rieffel in July-Aug. 1997.

   Copyright (C) 1997, 1999 Makoto Matsumoto and Takuji Nishimura.
   When you use this, send an email to: matumoto@math.keio.ac.jp
   with an appropriate reference to your work.

   REFERENCE
   M. Matsumoto and T. Nishimura,
   "Mersenne Twister: A 623-Dimensionally Equidistributed Uniform
   Pseudo-Random Number Generator",
   ACM Transactions on Modeling and Computer Simulation,
   Vol. 8, No. 1, January 1998, pp 3--30.
*/

Nice, we don’t need to worry about their initialization being out of date. Looking at their code, it looks like a simple port of the standard implementation. This is a relief, as I do not want to have to do any real coding. One nice thing: They change a division into a multiplication.

    return ( (double)y * 2.3283064365386963e-10 ); /* reals: [0,1)-interval */

Maybe it doesn’t matter in the end. #1 is done: the library’s algorithm isn’t really different from the reference implementation.

An important note here: a typical PRNG will generate a unsigned 32-bit int with approximately uniform distribution in its range. R takes that int and turns it into a uniform \([0,1)\) by dividing by \(2^32\). It uses these uniform values to generate other random variables and perform sampling. Frankly, you might want to uniformly sample from a large set of integers more efficiently than that or generate uniform doubles with more precision, so we might look into that. R has recently begun fixing their default for sampling from large sets of integers, so that’s good.

Initialization in R

Pseudo-random number generators have some stored state and then they have two parts:

Generating the next state.
Generating the next random number from the state.

Many generators put all their work in one or the other of these steps. If you know the state, you know the entire future of the PRNG. If there is an easy function for inverting step 2, you can find out the state of the PRNG by observing enough random numbers. The Mersenne Twister has a state that is 623-dimensional and you obtain the next random number by reading the next entry of the state and doing a simple, invertible function of it. When you have gone through all 623 entries, it generates a new state (ie 623 new random numbers). The other PRNGs in R have smaller states. However, you do not provide R with 623 32-bit integers, you provide it with 1 in R itself or 2 when calling it from C (note Marsaglia Multicarry has a state of size 2 and an identity function for step 2 as well). So, what are we doing?

In looking at the code, they aren’t using the outdated initialization which the authors of the MT algorithm warn against. It looks like they initialize by using the good ol’ LCG (with a multiplier of 69069, nice) with whatever (one) seed is provided to fill out the state space and then generate a new state based on that. R from set_seed() in C uses both values for initializing runif() with Marsaglia Multicarry, and perhaps we should want to emulate this behavior rather than consider how MT is initialized. For users of MT in R, this could be an issue for some, as, like, this means there are only 2^32 possible initializations, but it’s not a big problem. However, what are the odds we’ll be providing an actual 32 bits of entropy to the generator? It would be nice to be able to add in more entropy! The reference implementation also provides a method for initializing with an arbitrary array of values. This is good.

Setting the seed can be a real issue if you want a reproducible simulation and/or are running a lot of simulations (say, you’re operating in parallel). If you don’t care about reproducing your work and are only running one simulation, you can let R do its thing if you’re working interactively or, if you’re in C, you can use some tried-and-true methods like generating something based on the time or your process ID or some hardware source or entropy. If you’re doing something just for demonstration purposes and want to also provide some kind of time stamp, I like to make a seed based on the date (20190310 for today, for instance). When running things in parallel, whether you want it to be completely reproducible or not, initialization is going to require some care - basing something on the time is usually okay but if you’re starting a bunch of processes in near proximity to each other, there’s a possibility of collisions and you have to consider how the nearness of seeds is related to the results of your PRNG even if they don’t collide (spoiler: for some, including MT, this is a big problem).

All this to say: I made this take both provided seeds in set_seed() into account as well as provided a method for using the init_by_array() function. I also included a function that will initialize based on a hash of the arguments provided to call the program. This is handy for me because I will often run several copies of the same program with only minor changes in the arguments but I also want the results to be reproducible (at least while testing). The hash makes it so that each program will have a rather different seed even with only a minor change in the arguments but will have the same seed if I run it again with the exact same arguments.

Using This

Include the header for mt19337 and replaceR and you can use set_seed and runif just as you did before. I provide a set of reference results to check whether your installation is working.

You can compile and run the example file here:

gcc mt19937ar.c replaceR.c testreplaceR.c -o testreplaceR
./testreplaceR

To initialize with a hash of the arguments, you can call hash_init_rand(argc, argv) or if you are combining it with a set of seeds, hash_set_seed(argc, argv, seed1, seed2). As written, these are like the reference implementation and the library in that they use global variables for state. You can get around this.

Future thoughts

The MT random number generator has a few difficulties: it fails statistical tests, has a large state, has problems when it comes to seeds that differ very little (problem for parallelism), and is slow for what it gives you. Also, observing a few hundred draws will tell you everything you need to know about the state, which you may or may not care about.

There are other PRNGs that may or may not have these problems. In the future, I’ll talk about implementing them in R or what difficulties there are in doing that (I’ve been looking at PCG, for one).

Footnote: Marsaglia Multicarry

It’s pretty easy code, it fits on three lines:

  /* 0177777(octal) == 65535(decimal)*/
      I1= 36969*(I1 & 0177777) + (I1>>16);
      I2= 18000*(I2 & 0177777) + (I2>>16);
      return fixup(((I1 << 16)^(I2 & 0177777)) * i2_32m1); /* in [0,1) */

I1 is one seed, I2 is another seed, i2_32m1 is \(2^{32}-1\). Figuring out what’s going on there is an exercise, I guess.