## Problem definition

Often, we can estimate a desired quantity by finding the area under a curve (an integral). As an example of this type of computation, we will estimate the value of π. As everyone knows, the area of a circle is πr², and therefore for a unit circle the area is just π.

We will approximate the area of the upper semicircle (the blue region in the diagram above) by adding up the areas of rectangles that approximately cover it, as illustrated below.

This diagram shows 12 rectangles of equal width h that approximately cover the blue semicircular area. Each rectangle is positioned over a subinterval of the x-axis interval [-1,1], and the height of a rectangle is the function's value sqrt(1 - xi^2) at some value xi in that rectangle's subinterval. Thus, the area of a rectangle is sqrt(1 - xi^2) * h. We add up the areas of all these rectangles, then double that sum to get the approximation of π.

The more rectangles we use, the greater accuracy we expect in our rectangle approximation to the exact area under the semicircular curve y=f(x). Therefore, we will compute the sum with millions of thin rectangles in order to get better accuracy for our estimate of π.

Note: The accuracy of our estimate depends on how many rectangles we use, and to a lesser extent on how we choose the values xi. We have illustrated in the diagram above choosing xi as the midpoint of each subinterval, which visibly causes the sum of rectangle areas to be a reasonable approximation to the exact semicircular area we seek. If xi represents the midpoint of the ith rectangle, then

xi = -1 + (i + 1/2) * h

where

h = 2.0 / NUM_RECTANGLES

## Implementation strategies

We provide several sample implementations below. Your choice among them depends not only on which parallel platforms you have access to, but also on computational factors such as the size of your computation.

Download a Makefile and all code samples for the examples below here.

## 1. Serial computation

The following loop carries out the bulk of the computation by adding the areas of all the rectangles. Here, n_rect represents the number of rectangles.

```
sum = 0;
h = 2.0 / n_rect;
for ( i = 0; i < n_rect; i++ ) {
    x = -1 + ( i + 0.5 ) * h;
    sum += sqrt( 1 - x * x ) * h;
}
pi = sum * 2.0;
```

The program pi_area_serial.c implements the algorithmic strategy outlined above in C, sequentially, i.e., with no (programmer-level) parallelism.

Further Exploration:

The default number of rectangles is 10 million. Time a run of the program with that default number of rectangles, then use the program’s optional command-line argument to compare with the timing for other rectangle counts, such as 100 million, 1 billion, 100,000, etc. Do these timings vary as you might expect?

Are there any anomalies if you start from a small number of rectangles and keep doubling it? If so, how do you explain them?

The x values in pi_area_serial.c are calculated by repeatedly adding the width of each rectangle. How would it change the results to instead calculate x as is done in the code snippet above?

## 2. Shared memory

We will consider multiple approaches for implementing our computation in a shared memory model.

#### 2a. OpenMP

This is the simplest parallelization strategy we will consider for this problem. The example inserts a single OpenMP pragma to parallelize the loop that computes and sums the rectangle areas, as shown below. For this to be legal, the loop iterations must be independent. Notice how the midpoint of each rectangle, x, is computed within the loop rather than by repeatedly adding the width to an initial midpoint value (as in the downloadable serial code).

```
sum = 0;
h = 2.0 / n_rect;
/* NOTE: i is automatically private; n_rect and h are shared */
#pragma omp parallel for private( x ) reduction(+: sum)
for( i = 0; i < n_rect; i++ ) {
    x = -1 + ( i + 0.5 ) * h;
    sum += sqrt( 1 - x * x ) * h;
}
pi = sum * 2.0;
```

Comments on the code:

None of the OpenMP threads in the parallelized for loop modifies the variables n_rect or h, so it is safe for those variables to be shared among the threads.

The omp parallel for pragma parallelizes a for loop by giving different threads their own subrange of values of the loop-control variable (i in this case). Hence, that variable i is automatically private, or local, to each thread. The "work" variable, x, must also be private to each thread, or a race condition will result. This can be done through the private() clause or by declaring x within the parallel region (within the for loop).

The variable sum (which holds a partial sum of rectangle areas) must be computed locally for each thread, in order to avoid a race condition. As each thread finishes its work, its resulting value of sum (representing that thread's subtotal of rectangle areas) must be added to a grand total in order to produce the correct (global) value of sum after the for loop is finished. This subtotalling/grand-totalling procedure is accomplished using an OpenMP reduction() clause. The clause also ensures that each thread uses a local copy of sum.

The number of OpenMP threads used does not need to match the number of cores on your system. The number of threads can be controlled by setting the environment variable OMP_NUM_THREADS at runtime. The function omp_set_num_threads() may be used to set the number of threads from within the code.

Further Exploration:

Download the serial and OpenMP codes, then build and run them. Compare the timing results you collected for the sequential program with the performance of this parallel program using various numbers of threads (set via OMP_NUM_THREADS). Does the parallel program perform better? Is the speedup as much as you would expect? If not, can you hypothesize why?

#### 2b. Intel Threading Building Blocks (TBB)

There are two major parts to the TBB solution. The first, after the required #include lines to import the TBB definitions, is the call to parallel_reduce(). This algorithm call takes a range (0, num_rect), a body object (area), and a partitioner as parameters. The range is divided into sub-ranges until a sub-range is deemed small enough, which is the function of the partitioner. Each such sub-range is encapsulated into a task that can be executed by a thread.

Once the computation is complete, the sum of all the rectangle areas computed for the smallest sub-ranges has been gathered (reduced) into the sum component of the area object. Multiply this value by 2.0 to compute the approximation of pi.

```
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"

using namespace std;
using namespace tbb;

long long num_rect = 1000000000;
. . .
double pi;
double width = 2.0 / (double)num_rect;  // rectangle width over [-1,1]
MyPi area(&width);  // construct MyPi body with a pointer to the width

parallel_reduce(blocked_range<size_t>(0, num_rect), area, auto_partitioner());
pi = area.sum * 2.0;
```

The second major part of the solution is the body class MyPi defined below. This class defines the operator() method with the body of the serial code loop. Once a task has been defined (sub-range has been deemed indivisible), the loop in the operator() method computes the midpoint of the associated rectangle for each value within the range of the task. From this, the area of that rectangle is computed and added to the object’s sum component.

Once the entire range within a task has been used, the sum components from different tasks are added together through the join() method. This method accepts the sum from some other task and adds it to the local sum of the current task. This sum can then be used in another join() operation until the final sum of all tasks has been added together. This final result is then available through the sum component of the original body object used in the parallel_reduce() call.

```
class MyPi {
    double *const my_h;

public:
    double sum;

    void operator()( const blocked_range<size_t>& r ) {
        double h = *my_h;
        double x;
        for (size_t i = r.begin(); i != r.end(); ++i) {
            x = -1 + (i + 0.5) * h;
            sum = sum + sqrt(1.0 - x*x) * h;
        }
    }

    void join( const MyPi& y ) { sum += y.sum; }

    MyPi( double *const step ) : my_h(step), sum(0) {}
    MyPi( MyPi& x, split ) : my_h(x.my_h), sum(0) {}
};
```

Further Exploration:

● If you have a compiler that supports lambda expressions, find documentation or examples that describe the usage of the lambda version of parallel_reduce(). Modify the Threading Building Blocks code for computing pi to use the lambda variation rather than the explicit version.

#### 2c. Pthreads

An implementation of the area computation with the POSIX threads (Pthreads) explicit threading model is shown here. In the main() routine, NUMTHREADS threads are spawned to execute the function Area(). This function takes one argument: a pointer to the thread number, generated and stored in the tNum array. After the child threads are launched, the main() thread calls pthread_join to wait for each thread, in turn, to complete its computation. The computed area of the half circle is then stored in gPi. Multiply this result by 2.0 to compute the approximation to pi.

The threaded function Area() uses the thread number (0..NUMTHREADS-1) to initialize its local loop iteration variable. This value is used to compute the midpoint of a rectangle, the height of the rectangle, and then the area of the rectangle. Notice how the increment value in the for-loop is the number of threads. In the code given, the loop of the first thread (myNum == 0) takes on values 0, 4, 8, 12, etc., while the last thread (myNum == 3) uses the iteration values 3, 7, 11, 15, etc. This scheme ensures that every value in the NUM_RECT range is used, and used by exactly one thread.

Rather than update the shared summation variable, gPi, each time a new rectangle area is computed, a local partial sum variable is used within each thread. Once the loop has completed, each partial sum is added to the shared sum with a critical region protected by the mutex object gLock. In this way, protected updates to the shared variable are done only once per thread (4 times) rather than once per rectangle (NUM_RECT times).

```
#include <stdio.h>
#include <math.h>
#include <pthread.h>

#define NUM_RECT 10000000
#define NUMTHREADS 4

double gPi = 0.0;        // global accumulator for areas computed
pthread_mutex_t gLock;

void *Area(void *pArg) {
    int myNum = *((int *)pArg);
    double h = 2.0 / NUM_RECT;
    double partialSum = 0.0, x;  // local to each thread

    // use every NUMTHREADS-th step
    for (int i = myNum; i < NUM_RECT; i += NUMTHREADS) {
        x = -1 + (i + 0.5) * h;
        partialSum += sqrt(1.0 - x*x) * h;
    }

    pthread_mutex_lock(&gLock);
    gPi += partialSum;   // add partial to global final answer
    pthread_mutex_unlock(&gLock);
    return 0;
}

int main(int argc, char **argv) {
    pthread_t tHandles[NUMTHREADS];
    int tNum[NUMTHREADS], i;

    pthread_mutex_init(&gLock, NULL);
    for ( i = 0; i < NUMTHREADS; ++i ) {
        tNum[i] = i;
        pthread_create(&tHandles[i],       // Returned thread handle
                       NULL,               // Thread attributes
                       Area,               // Thread function
                       (void *)&tNum[i]);  // Data for Area()
    }
    for ( i = 0; i < NUMTHREADS; ++i ) {
        pthread_join(tHandles[i], NULL);
    }
    gPi *= 2.0;
    printf("Computed value of Pi: %12.9f\n", gPi);
    pthread_mutex_destroy(&gLock);
    return 0;
}
```

#### 2d. Windows threads

An implementation of the area computation with the Windows threads (Win32 threads) explicit threading model is shown here. There is not much difference between this version and the Pthreads version. Both spawn threads, assign those threads some portion of the rectangles, compute and sum the rectangle areas, and update the shared summation variable. The main thread blocks until all threads have terminated (via WaitForMultipleObjects()).

The difference between the two algorithms is how the rectangles are distributed to threads. In this version, the beginning and ending index values of the original iteration range are computed within each thread and used as the for-loop bounds. The number of rectangles each thread will handle is computed by dividing the number of rectangles by the number of threads; the index values are found by multiplying this ratio by the thread number and by the thread number + 1. For 1000 rectangles and four threads, the first thread (myNum == 0) starts with rectangle 0 and finishes with rectangle 249 (< (myNum+1) * (1000/4)), for a total of 250 rectangles per thread.

The one caveat that must be addressed with this method of dividing loop iterations is when the number of rectangles (NUM_RECT) is not evenly divisible by the number of threads (NUMTHREADS). For example, if the number of rectangles were 10000019 (a prime number), dividing by any number of threads would leave some iterations out of the computation when computing the iteration bounds as described above. With 4 threads, three iterations would remain unassigned to any thread. To account for such "leftover" iterations, they are given to the last thread by setting its end variable explicitly to the number of rectangles. If the time to compute a single iteration is significant, this distribution scheme could lead to load imbalance, and an alternate method of iteration assignment should be used.

```
#include <windows.h>
#include <stdio.h>
#include <math.h>

#define NUM_RECT 10000000
#define NUMTHREADS 4

double gPi = 0.0;
CRITICAL_SECTION gCS;

DWORD WINAPI Area(LPVOID pArg) {
    int myNum = *((int *)pArg);
    double h = 2.0 / NUM_RECT;
    double partialSum = 0.0, x;  // local to each thread
    int begin = myNum * (NUM_RECT / NUMTHREADS);
    int end = (myNum+1) * (NUM_RECT / NUMTHREADS);
    if (myNum == (NUMTHREADS-1)) end = NUM_RECT;

    for ( int i = begin; i < end; ++i ) {  // compute rectangles in range
        x = -1 + (i + 0.5) * h;
        partialSum += sqrt(1.0 - x*x) * h;
    }

    EnterCriticalSection(&gCS);
    gPi += partialSum;   // add partial to global final answer
    LeaveCriticalSection(&gCS);
    return 0;
}

int main(int argc, char **argv) {
    HANDLE threadHandles[NUMTHREADS];
    int tNum[NUMTHREADS];

    InitializeCriticalSection(&gCS);
    for ( int i = 0; i < NUMTHREADS; ++i ) {
        tNum[i] = i;
        threadHandles[i] = CreateThread(NULL,             // Security attributes
                                        0,                // Stack size
                                        Area,             // Thread function
                                        (LPVOID)&tNum[i], // Data for Area()
                                        0,                // Thread start mode
                                        NULL);            // Returned thread ID
    }
    WaitForMultipleObjects(NUMTHREADS, threadHandles, TRUE, INFINITE);
    gPi *= 2.0;
    DeleteCriticalSection(&gCS);
    printf("Computed value of Pi: %12.9f\n", gPi);
    return 0;
}
```

#### 2e. Go Language

Go is an open-source language with built-in support for concurrency, developed at Google. You can get more information on the language at http://golang.org/. In particular, here is an introduction to its concurrency constructs: http://golang.org/doc/effective_go.html#concurrency. You can find the Pi integration program in easily downloadable form here.

```
package main

import (
	"flag"
	"fmt"
	"math"
	"runtime"
)

var (
	nRect  = flag.Int("rect", 1e6, "number of rectangles")
	nGrain = flag.Int("grain", 1e4, "parallel task grain size (in rectangles)")
	nCPU   = flag.Int("cpu", 1, "number of CPUs to use")
)

func main() {
	flag.Parse()
	runtime.GOMAXPROCS(*nCPU)   // Set number of OS threads to use.
	nParts := 0                 // Number of parallel tasks.
	parts := make(chan float64) // Channel to collect part results.
	for i := 0; i < *nRect; i += *nGrain {
		nParts += 1
		end := i + *nGrain // Calculate end of this part.
		if end > *nRect {
			end = *nRect
		}
		// Launch a concurrent goroutine to process this part.
		go func(begin, end int) {
			sum := 0.0
			h := 2.0 / float64(*nRect)
			for i := begin; i < end; i++ {
				x := -1 + (float64(i)+0.5)*h
				sum += math.Sqrt(1-x*x) * h
			}
			parts <- sum // Send the result back.
		}(i, end)
	}
	// Aggregate part results.
	sum := 0.0
	for p := 0; p < nParts; p++ {
		sum += <-parts
	}
	pi := sum * 2.0
	fmt.Printf("PI = %g\n", pi)
}
```

## 3. Distributed memory

On the surface, distributed memory bears a lot of similarity to shared memory, in that the work can be evenly divided among computational elements. The key detail to be aware of is cost: a shared memory thread requires few resources to put into play, while an MPI process requires significant care and feeding. As an analogy, if reaching the register file costs as much as getting a drink from the kitchen refrigerator, then shared memory is like going out to the refrigerator in the garage, while reaching remote memory in another MPI process is like traveling 40 miles for the drink. The impact of this will not be seen in this code, but the code does provide a framework one can use to add a more representative CPU and bandwidth load.

The MPI version is also similar to the OpenMP version, in that both the OpenMP thread and the MPI process learn their portion of the work from their rank among all the other threads/processes. An MPI_Reduce is performed at the end to carry out a log(n) summation of the partial areas from each process.

pi_area_mpi.c

Further Exploration:

What happens when you have more than one process running on a particular processor? How can you know how many processors you have, or whether you have more than one process running on a processor?

## 4. Data Parallel (GPGPU)

#### 4a. CUDA

Several implementations above use a reduction operation that can combine partial results in log(n) time. CUDA has a more complicated memory model, namely a grid of blocks, each containing a group of synchronizable threads. Each thread gets its work as in the shared and distributed memory versions, with the slight difference that the division of labor must first be transferred from CPU memory to CUDA device memory. The CUDA code also performs a parallel reduction of thread results within each block. The reduction becomes starved of actual work, with only one thread performing the final add, but it completes in log(n) time overall, so it is still a win. The block results are then transferred from device memory back to CPU memory, where a final linear reduction is performed. There is a sample implementation available here.

Further Exploration:

The code uses 32 blocks per grid and 256 threads per block. Must these numbers be used? What are the advantages/disadvantages of changing them? Is the ratio between these numbers significant?

This code uses floats. What differences do you see compared with the other area-under-the-curve codes? How can you reduce any differences while still using floats? Can you use doubles with CUDA? If so, how do you test whether you can?


The thread values within a block are coalesced by repeated passes that narrow the active set down to an effective length of 1. Was this a wise choice? Why wasn't the same technique used for all threads at once, rather than block by block for the threads within each block?

#### 4b. Intel Array Building Blocks

The Intel® Array Building Blocks (ArBB) code is written to execute in a data-parallel fashion. No explicit loop is written since computations are applied to each element of data containers. This does mean that extra space must be allocated to hold values that would otherwise be generated dynamically by a for-loop. The details are below, or can be downloaded directly.

In the main() routine, the first line allocates a (dense) container of double-precision floating-point numbers (f64), iterations, and initializes it (via arbb::indices()) to hold the values from 0.0 to num_rect-1 in strides of 1.0. This "simulates" the for-loop seen in previous examples. The next line allocates a container to hold the computed area of each individual rectangle (indexed by the corresponding elements of the iterations container). The arbb::call() function launches an execution of the calc_pi routine, passing as parameters the two previously allocated containers and the width of each rectangle. The sole task of calc_pi() is to map the computation found in the calc_area() routine onto each element of the iVals container, placing the results into the corresponding locations of the areas container. The calc_area() function is written to handle a single rectangle; the ArBB runtime performs this computation on all elements of the iVals container and updates the areas container elements. Finally, the arbb::sum() function performs an addition reduction of all elements within the areas container, that sum is multiplied by 2.0, and the product is stored in the ArBB scalar pi.

```
#include <iostream>
#include <cstdlib>
#include <arbb.hpp>

const long num_rect = 10485760;

void calc_area(arbb::f64 &y, arbb::f64 h, arbb::f64 i)
{
    arbb::f64 x = -1.0 + (i + 0.5) * h;
    y = arbb::sqrt(1.0 - x*x) * h;
}

void calc_pi(arbb::dense<arbb::f64> &areas, arbb::f64 width,
             arbb::dense<arbb::f64> iVals)
{
    arbb::map(calc_area)(areas, width, iVals);
}

int main(int argc, char *argv[])
{
    arbb::dense<arbb::f64> iterations =
        arbb::indices(arbb::f64(0.0), num_rect, arbb::f64(1.0));
    arbb::dense<arbb::f64> areas(num_rect);
    arbb::f64 h = 2.0 / num_rect;   // rectangle width over [-1,1]
    arbb::call(calc_pi)(areas, h, iterations);
    arbb::f64 pi = arbb::sum(areas) * 2.0;
    std::cout << "Pi = " << arbb::value(pi) << std::endl;
    return 0;
}
```

## Looking ahead

#### Parallel patterns

These example programs for computing pi as an area of a semicircle exhibit a number of parallel design patterns. In the parallel patterns section of this tech pack, patterns are organized in a hierarchical structure, following the OPL community (see [OPL white paper] and [patterns wiki]). Here, we will list many of the patterns present in some of our solutions of the problem of computing pi.

Parallel algorithm strategy patterns. These patterns define high-level strategies to exploit the concurrency in a computation for execution on a parallel computer. Our example codes all illustrate the Geometric decomposition, but few fit OPL’s Data parallel parallel-algorithm strategy pattern.

Most of our parallel implementations exhibit the Geometric decomposition parallel-algorithm strategy pattern. For example, in the OpenMP solution, the OpenMP system parallelizes a for loop by dividing its loop-control range into subranges. We can view each subrange as a rectangle-based geometric approximation of some portion of the semicircle. In fact, we use the pattern name Geometric decomposition for any decomposition into "chunks" of data, such as contiguous subranges of a loop-control range, even if the resulting computation does not correspond to a geometric figure.

Many programmers might say that these programs also show "data parallelism," since they process different data in different threads or processes. In fact, some authors consider nearly any parallel computation to exhibit "data parallelism" (and other authors argue that nearly any parallel computation exemplifies "task parallelism"). We will follow the OPL group and narrow the Data parallel parallel-algorithm strategy pattern to situations that apply a stream of instructions to elements of a data structure. By this definition, the Array Building Blocks implementation and arguably the CUDA implementation qualify as using the Data parallel pattern.

Implementation strategy patterns. There are two types of patterns at this level: program structure patterns focus on organizing a program; and data structure patterns describe common data structures specific to parallel programming. Several such patterns appear among the code examples above.

The OpenMP solution exemplifies the Loop parallel program-structure pattern, because the OpenMP pragma requests that the summation loop be carried out in parallel, using OpenMP’s automatic mechanism for carrying out subranges of the summation in parallel. The TBB implementation pi_area_tbb.c provides another example, through its use of tbb::parallel_reduce which performs a multithreaded sum over subranges.

The Strict data parallel program-structure pattern provides a particular approach to implementing the higher-level Data parallel algorithm strategy pattern, which we may think of as an alternative to a Loop parallel implementation: instead of visiting each value in turn, we think of processing all the values simultaneously, applying a single instruction stream to each of those values in parallel. The ArBB code provides a higher-level example of this pattern, and the CUDA implementation shows that pattern at a lower level (i.e., closer to the hardware architecture).

The Pthreads, Windows threads, and Go code examples illustrate the Master-worker algorithm strategy pattern, in which one thread assigns work tasks to worker threads or processes, seeking a balanced load. When those tasks have varying execution times, a program may seek load balancing by having workers process one task at a time and return for a next task. However, each of these cases achieves load balancing by assigning tasks known to require about the same time to each worker (i.e., adding up areas of nearly equal numbers of rectangles per worker).

The Pthreads and Windows threads implementations also show a Fork-join algorithm strategy pattern, in the way they create, launch, and await completion of their threads. The Go implementation also demonstrates that pattern, accomplishing the "join" operation on each goroutine by collecting that goroutine's results over the shared communication channel.

Finally, we note that the ArBB code illustrates the Distributed array data structure pattern.

Concurrent execution patterns. This level consists of advancing program counters patterns, which concern the timing relationships for executing instructions among different threads or processes, and coordination patterns, which provide mechanisms for processes or threads to correctly access the data they need (cf. interprocess communication). Our examples illustrate several concurrent-execution patterns.

The SIMD (Single Instruction Multiple Data) advancing-program-counter pattern appears in both the ArBB code and the CUDA code. This pattern refines the Strict data parallel pattern, by specifying that concurrent execution should happen without programming how it will happen.

The Thread pool advancing-program-counter pattern appears in the Pthreads and Windows threads code examples, in that those implementations each create an array of threads for parallel computation (here, in order to serve as worker threads in a master-worker strategy). Programs with multiple parallel segments may reuse such threads for subsequent parallel operations; however, this simple example has only one portion of parallel code, so these programs use their threads only once each.

The reduce operations in the OpenMP and TBB implementations present the Collective communication coordination pattern.

The Mutual exclusion coordination pattern appears in the pthread_mutex_t variable gLock in the Pthreads example, and in the CRITICAL_SECTION variable gCS in the Windows threads example.

Finally, the Go code illustrates the Message passing coordination pattern, when goroutines computing subtotals of areas of rectangles communicate those results to the original process using the channel parts.