## Problem definition

We have already seen a way to use direct numerical integration to determine the value of π by measuring the area of a half of a circle. For this example, we’re going to try to do the same thing, but with a different strategy. This makes use of the following simple observation.

In the above picture, you have the unit circle (whose area is π·1² = π) embedded inside a square with sides of length 2 (whose area is 2·2 = 4). Here’s the key idea: if one were to randomly throw darts at this picture and count how many land inside the circle versus the total number of darts thrown, the ratio of hits to total throws would eventually approach the ratio of the two areas, π/4. Note that we get that same ratio if we restrict ourselves just to the top right quadrant.

### Algorithmic strategy

As can be seen from the simple example above, it will take a large number of samples (i.e., “darts”) to get anywhere close to a reliable value for π. It may require hundreds of thousands or millions of samples to reach a sufficient number of digits of accuracy. Fortunately, this is a pretty easy thing to do, as each test is extremely simple. Also, there isn’t much that needs to be remembered in each loop -- you don’t need to keep track of the (x,y) location of each sample; you only need to keep a global count of the number of hits versus the number of tries. This is a general property of what are known as Monte Carlo techniques: one makes a large number of random guesses, and the end calculation is just a function of the number of hits versus the number of guesses.

Of course, determining π is a rather ridiculously simple application for this particular technique. However, note that if the area of the figure you were using in the simulation contained a number of holes in the circle or weird squiggly cut-outs, the Monte Carlo technique would still work exactly the same, while numerical integration would require making lots of extra, small rectangles to add and subtract the small pieces. Consider, for example, how you might go about finding the area for something like the Mandelbrot fractal below.

Calculating the area of the Mandelbrot set exactly with numerical integration is extraordinarily difficult (or more likely impossible). However, a Monte Carlo approach will yield a ratio of hits to total that converges to the correct ratio of the area of the set to the area of the surrounding region (rectangle).

Note that this is just the start of how you can apply Monte Carlo techniques (which are, indeed, named after the famous European center of gambling). Biased sampling techniques improve the rate at which you achieve an accurate result (maybe only 10,000 samples instead of 1,000,000), simulated annealing uses an adapted form to find the maximum or minimum value of a function, and random walk techniques can be applied to solving problems in social networks like the classic “Six Degrees of Kevin Bacon” puzzle.

Let us return to the problem at hand, that of approximating the value of π. As has been hinted above, our end calculation is very simple: π ≈ 4 · (hits / total samples).

By restricting ourselves to the positive quadrant, we have made the process of generating our samples much easier, since most random number functions that you will find (rand(), drand48(), etc.) are defined on [0,1). So in serial, the core of the implementation would look like the following in C:

```
for (i = 0; i < NUM_ITER; i++) {
    dx = drand48();
    dy = drand48();
    if ((dx*dx + dy*dy) < 1)
        hits++;
}
```

Here NUM_ITER is defined earlier as the total number of samples to be taken, and “hits” stores the number of those samples that fall strictly inside the unit circle.

The complete code is available here.

There are a few things to note in this implementation. First, the basic for loop is very simple, and there’s really only one variable (hits) which has to be maintained after the loop has completed. Otherwise, all of the data is strictly local to that loop iteration, which makes it ideally suited for parallelization. Even the one variable that isn’t strictly local, hits, is what is known as a reduction variable -- it’s just a sum of a lot of 1’s (when you hit) and 0’s (when you miss). If you were to break it into sub-sums, you would just have to add them all together at the end, and you would get the same result as if you had summed them serially. This will be important when we want to have separate threads doing their own parts of the computation (see OpenMP and Pthreads below).

### Implementation strategies

At its core, this code is what is known as ~~embarrassingly~~ enchantingly parallel. That is part of the appeal of Monte Carlo codes -- most of them are. So even though you need nearly a million samples to reliably get that π is approximately 3.14, you can trivially parallelize the computation to run on hundreds or thousands of cores (if you happen to have that many hanging around).

#### 1. OpenMP -- link to TP1

OpenMP was designed to make it extremely easy to parallelize codes that have large for loops like we have here. The core of parallelizing this code is done by adding just two lines to the serial code above and making small edits to two others.

```
/* Create the default number of threads */
#pragma omp parallel
{
    ... Initialize PRNG per calling thread ...

    /* Use the threads to automatically parallelize the for loop.
     * Each thread has its own dx & dy, and a private hits that is
     * summed at the end. */
    #pragma omp for reduction(+:hits) private(dx,dy)
    for (i = 0; i < NUM_ITER; i++) {
        dx = erand48(state);
        dy = erand48(state);
        if ((dx*dx + dy*dy) < 1)
            hits++;
    }
}
```

There is a significant tricky point that it’s good to bring up here that explains our switch from drand48() to erand48(). Although there isn’t any user-defined state that’s carried between iterations of the for loop in the serial code, there’s some implicit state **inside** the Pseudo-Random Number Generator (PRNG). The seed for the PRNG gets updated every time there’s a call to drand48() in the serial code. This means that there actually is a potential race condition if multiple threads were to try to update the seed value at the same time, and this might generate either bad data or (maybe worse) the same “random” value in both threads. Some OS implementations have chosen to protect every call to a PRNG with a mutex, but this can dramatically slow down performance, since only one thread can get a random number at a time.

A preferred way is to switch so that you use a thread-safe function that maintains per-thread state. In order to do that here, we’re going to use the erand48(state) function, where the seed value is maintained in a user-specified, per-thread state. That way, each thread can access as many random numbers as it wants. Thus, the code that was hidden in the “Initialize PRNG per calling thread” comment above looks like the following in this case:

```
#pragma omp parallel
{
    unsigned short state[3]; /* the PRNG state variable, thread-local */
    long myseed;
    struct timeval now;

    gettimeofday(&now, NULL);  /* get the current time */

    /* Initialize the random number generator with the time; multiplying
     * by the thread number makes sure there's a different seed for
     * every thread. */
    myseed = ((long) now.tv_sec) * (omp_get_thread_num() + 1);

    /* The 32 bits in the seed are shifted around to fill the 48 bits
     * of the state. */
    state[0] = (unsigned short) myseed;
    state[1] = (unsigned short) (myseed >> 16);
    state[2] = (unsigned short) (myseed << 3);
    ...
```

*Further Exploration:*

● If you have a scientific math library available, like the GNU Scientific Library (GSL -- http://www.gnu.org/software/gsl), the Intel Math Kernel Library, SPRNG (http://sprng.cs.fsu.edu/), the NAG library, and so on, then you may have options that are both more efficient and cleaner to code. How would your code change if you knew the random number generator was thread safe?

● Some random number generators are more efficient at generating large arrays of (pseudo-)random numbers than at generating them one at a time. A more efficient code could be written with two loops: an outer one that generates an array of 100 random numbers, and an inner one that uses those 100 numbers to compute 50 data points. How would this affect the parallelism?

● One disruptive system-level change is to eliminate software PRNGs entirely and use hardware-assisted random number generation. Such a solution will be available as a special register inside the CPU in some upcoming hardware. Consider how this would impact the parallelization and performance of a multi-threaded code.

**A Note on Random Numbers: **Random number generation functions are tricky. Technically speaking, they aren’t even random number generators -- they are what are known as pseudo-random number generators (PRNGs). First, note that the PRNG we’re using here, drand48(), needs to be initialized with a call to srand48(). All PRNGs need to be initialized to give them some starting seed (the “s” in srand48() is for seed). They are completely deterministic -- if you always use the same seed value, then you will always get the same values out for each call from drand48(). That is why we used the current time (at execution) as a way to initialize the PRNG in the full serial code; as long as you don’t call the code twice in one second, you won’t be using the same seed. However, that won’t be sufficient when we get to doing it in parallel, as you may end up with many threads starting at the same time.

There are ways to make random number generation better. Some operating systems use hardware infrastructure to generate real random numbers. There’s a long mathematical argument of how you should define the words “real” and “random” in the previous sentence; here I mean it in the looser, day-to-day sense of “can’t regenerate the result even if I knew the seed”. For very, very sensitive applications where you need very good random numbers, people have even built fancy physical devices that count the number of cosmic rays that hit a detector or similar sorts of physically random events (look up SGI’s old lavarand product for an amusing example). It is very rare that an application needs this degree of purity of random numbers, although changes in technology may make this sort of random number generation cheaper and more prevalent, even to the point of having it included on chip.

#### 1.b. OpenMP with leapfrogging

One approach for making sure every thread gets a different pseudorandom number is to take a single seed and arrange it so that, with N threads, thread k receives numbers k, k+N, k+2N, and so on from that stream. Obviously it would be a serious performance bottleneck to make all of the threads wait their turn to get their next random number. However, there’s a numerical technique that allows you to do this without any coordination at all. The trick is buried in the way that the PRNG uses number theory to generate the numbers, but here’s a replacement random number function that one could use to implement the leapfrogging approach.

```
//**********************************************************
// Parallel pseudo-random number generator
//
// USAGE:
//
// The pseudo-random sequence is seeded with a range
//
//    void seed(lower_limit, higher_limit)
//
// and then subsequent calls to the random number generator
// generate values in the sequence:
//
//    double random()
//
// A leapfrog method is used to assure non-overlapping
// sequences for each thread.
//
// Note: these functions are to be called from inside the
// OpenMP parallel region that will use the sequence.
//
// BACKGROUND:
//
// We are using a modulus of 2^31-1 and a multiplier from
// the Hoaglin LCGs in the following article:
//
//    http://random.mat.sbg.ac.at/~charly/server/node3.html#lcg
//
// We are using a zero addend just to make the leapfrog
// algorithm easier to implement.
//
// HISTORY:
//
// 9/2008: Written by Tim Mattson by cutting and pasting
// from a generator written by Larry Meadows
//***********************************************************
#include <omp.h>

static unsigned long long MULTIPLIER = 764261123;
static unsigned long long PMOD = 2147483647;
static unsigned long long mult_n;
double random_low, random_hi;

#define MAX_THREADS 128
static unsigned long long pseed[MAX_THREADS][4]; // [4] to pad to cache line
                                                 // size to avoid false sharing

unsigned long long random_last = 0;
#pragma omp threadprivate(random_last)

double random()
{
    unsigned long long random_next;
    double ret_val;

    //
    // compute an integer random number from zero to mod
    //
    random_next = (unsigned long long)((mult_n * random_last) % PMOD);
    random_last = random_next;

    //
    // shift into preset range
    //
    ret_val = ((double)random_next / (double)PMOD) * (random_hi - random_low)
              + random_low;
    return ret_val;
}

//
// set the seed, the multiplier and the range
//
void seed(double low_in, double hi_in)
{
    int i, id, nthreads;
    unsigned long long iseed;
    id = omp_get_thread_num();

    #pragma omp single
    {
        if (low_in < hi_in) {
            random_low = low_in;
            random_hi = hi_in;
        } else {
            random_low = hi_in;
            random_hi = low_in;
        }

        //
        // The leapfrog method ... adjust the multiplier so you stride
        // through the sequence by increments of "nthreads" and adjust
        // seeds so each thread starts with the right offset
        //
        nthreads = omp_get_num_threads();
        iseed = PMOD / MULTIPLIER;     // just pick a reasonable seed
        pseed[0][0] = iseed;
        mult_n = MULTIPLIER;
        for (i = 1; i < nthreads; ++i) {
            iseed = (unsigned long long)((MULTIPLIER * iseed) % PMOD);
            pseed[i][0] = iseed;
            mult_n = (mult_n * MULTIPLIER) % PMOD;
        }
    }
    random_last = (unsigned long long) pseed[id][0];
}
```

#### 1.c OpenMP with a library

There are numerous studies and papers on the reliability and quality of different types of random number generators, and it can be onerous to copy and paste different hand-written versions into your code. Fortunately, many open source and proprietary libraries, such as SPRNG, NAG, MKL, and GSL, offer the user easy access to thread-aware PRNG generation. Here’s a sample of what the core code change would look like, using GSL as an example.

```
r = gsl_rng_alloc(gsl_rng_mt19937);
gsl_rng_set(r, (unsigned long int)(init.tv_sec + init.tv_usec));
for (i = 0; i < numrect; i++) {
    dx = gsl_rng_uniform(r);
    dy = gsl_rng_uniform(r);
    if ((dx*dx + dy*dy) < 1)
        hits++;
}
```

#### 1.d OpenMP with Bull Mountain

The core issue with PRNGs is that they all have some sort of hidden structure that comes from their numerical origins, even though they may initially appear nearly random. Using physical sources that can be guaranteed in some sense to have better “randomness” not only can improve the performance of the code, it can also return the code to something closer to the original “simple” view of the parallel loop. Bull Mountain, as a particular technology, is what is known as a Cascade Corrected Random Number Generator (CCRNG), and it is available to threads as an on-chip register from which they can simply read 32 or 64 bits of random numbers using the rdrand instruction.

A piece of sample code that uses the somewhat higher-level code discussed in the Bull Mountain developer guide is included below. As technology like this is increasingly available, it is expected that more and more system library calls (like random()) will be implemented using this approach.

```
#define MAX_UINT_SIZE 4294967295U

double ccrandom() {
    int cx = 0;
    unsigned int rx;
    /* Retry until the hardware reports a successful read. */
    while (cx == 0) {
        cx = _rdrand32_step(&rx);
    }
    return (double) rx / (double) MAX_UINT_SIZE;
}
```

Here the _rdrand32_step() function is as defined in the Bull Mountain reference documents. Assuming this function, our core parallel loop becomes the following:

```
#pragma omp parallel for reduction(+:hits) private(dx,dy)
for (i = 0; i < NUM_ITER; i++) {
    dx = ccrandom();
    dy = ccrandom();
    if ((dx*dx + dy*dy) < 1)
        hits++;
}
```

Note that we no longer need to put the parallel and for pragmas on separate lines, and that all of the initialization has gone away.

*Further Exploration:*

This version of ccrandom() uses a 32-bit random number as its base. A better way to fill a double would be to pull 64 bits. An even better way for this problem might be to pull 128 bits and return a pair of doubles to fill in dx and dy simultaneously. Where is the right balance between using a standard function and writing a custom one? The status flag (cx in our ccrandom) allows you to check whether the physical device has managed to fill out the full 32 bits with high-quality randomness. Under normal operation, it is expected that this will always succeed. However, in the event of failure, this loop may block indefinitely. Changing it to be non-blocking requires deciding how to fail gracefully -- fall back to a PRNG, return an error, contact another chip to borrow its CCRNG?

#### 2. MPI -- link to TP1

As opposed to the OpenMP example above, MPI works by spawning separate processes which cooperate to spread the work out amongst themselves. One handy implication is that the non-thread-safe versions of the PRNG functions (i.e., drand48()) can be used just as they were in the serial code. The complete code example can be found here. Some of the key points are as follows:

```
MPI_Init(&argc, &argv);

/* Figure out how many iterations are needed for each process */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &commsize);
myiter = (unsigned long) NUM_ITER / commsize;
if (myrank < NUM_ITER % commsize)
    myiter++;
```

When writing MPI code, you can’t assume that you know how many processes you will be running on, but you do know how much work needs to be done (here, the global number of iterations stored in NUM_ITER). So the code itself must determine how to divide the work up as evenly as possible amongst all of the available processes. In MPI, the communicator size (here stored in commsize) represents the number of processes that have been spawned, and the rank is the number of **that** process within the communicator (ranks run from 0 to N-1). For example, with NUM_ITER=1000, a process needs to know that it is number 5 of 10 so that it does iterations 500 to 599.

In those cases where the number of iterations is not evenly divisible by the number of processes, an integer division of NUM_ITER by commsize will leave some iterations unassigned. One way to ensure every iteration is assigned is to give all of the leftover iterations to the rank N-1 process by adjusting its myiter count, but this can lead to a load imbalance, especially if the computation within an iteration is relatively long. A better solution is to assign a single extra iteration to some of the processes, so that the iteration counts of any two processes differ by at most one. The statement if (myrank < NUM_ITER%commsize) myiter++; assigns those extra iterations to the lowest-ranked processes.

```
MPI_Reduce(&hits, &totalhits, 1, MPI_UNSIGNED_LONG, MPI_SUM,
           0, MPI_COMM_WORLD);
if (myrank == 0) {
    printf("Pi is approximately %.8g.\n", 4.0 * totalhits / NUM_ITER);
}
MPI_Finalize();
```

MPI supports reductions through an explicit API call. Unlike in a shared memory paradigm, one needs to specify **in each process** the local variable that is part of the reduction and the target location for the eventual sum. Here, the target location is in the buffer *totalhits* located on the process with rank 0. That is why the test “if (myrank == 0)” is important -- totalhits only contains data on that one process.

#### 3. Pthreads -- link to TP1

Implementing the parallel version in Pthreads has features of both the OpenMP and MPI versions. Like in OpenMP, each thread will initialize a thread-safe PRNG upon spawning and then do a portion of the total number of iterations. However, like MPI, Pthreads does not offer direct support for automatic parallelization of for loops, so the bounds for the iterations need to be computed directly in each thread.

```
pthread_mutex_init(&hits_lock, NULL);
for (j = 0; j < NUM_THREADS; j++) {
    tnum[j] = j;
    pthread_create(&(threads[j]), NULL, calc_hits, &tnum[j]);
}
for (j = 0; j < NUM_THREADS; j++) {
    pthread_join(threads[j], NULL);
}
printf("Pi is approximately %.8g.\n", 4.0 * totalhits / NUM_ITER);
```

A new wrinkle compared to those two is that threads have no inherent ordering, so if the programmer wants threads to determine their portion of the iterations (based on being thread 5 of 7, for example), then each thread must explicitly be told its number by passing in an argument value, here stored in tnum[]. Note that we need to create the separate array rather than just passing a pointer to the loop index j, because j is continually updated by the main thread as it spawns new threads.

Additionally, we know that the main thread cannot progress until all of the other threads have contributed their portion of totalhits, so an explicit wait for them to exit must be used (i.e., pthread_join()). A further complication is that, unlike MPI or OpenMP, there is no automatic protection of reduction variables. To manage this, we must use an explicit lock so that only one thread can update the global value in totalhits at a time.

```
pthread_mutex_lock(&hits_lock);
totalhits += hits;
pthread_mutex_unlock(&hits_lock);
```

The full sample code can be seen at the following location.

#### 4. Go Language

Go’s goroutines and channels make the same decomposition straightforward: each goroutine counts its own hits with its own PRNG and sends the partial sum over a channel, where the main goroutine performs the reduction.

```
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"runtime"
)

var (
	nThrow = flag.Int("n", 1e6, "number of throws")
	nCPU   = flag.Int("cpu", 1, "number of CPUs to use")
)

func main() {
	flag.Parse()
	runtime.GOMAXPROCS(*nCPU) // Set number of OS threads to use.
	nParts := *nCPU * 16
	parts := make(chan int, nParts) // Channel to collect part results.
	for i := 0; i < nParts; i++ {
		go func() {
			hits := 0
			r := rand.New(rand.NewSource(rand.Int63()))
			n := *nThrow / nParts
			for i := 0; i < n; i++ {
				x := r.Float64()
				y := r.Float64()
				if x*x+y*y < 1 {
					hits++
				}
			}
			parts <- hits // Send the result back.
		}()
	}
	// Aggregate part results.
	hits := 0
	for i := 0; i < nParts; i++ {
		hits += <-parts
	}
	pi := 4 * float64(hits) / float64(*nThrow)
	fmt.Printf("PI = %g\n", pi)
}
```