## Problem definition

Discrete optimization problems have a finite or infinite set of states that can satisfy the constraints of the problem and a cost function that yields a real number result for each possible state. Since it is typically prohibitive to enumerate all possible states, the search for a minimal-cost solution can be formulated as a search through the state space. In this type of computation, states are related to one another by some transformation rule that controls the move from one state to another. These relationships can be modeled as a graph, which may be built dynamically as the states are explored. Depth-First Search can be used on this graph to explore the relevant state space.

One good example of such optimization problems is finding the next best move in zero-sum perfect-information games like tic-tac-toe, awari, chess, or go. The nodes of the state-space graph are board (game) positions and the edges of the graph are the legal moves that take one position to another. Finding the next best move will start from a given board position and branch to other positions via all legal moves; your opponent’s possible moves branch out from all of these nodes, and so on. The resulting graph is better known as a game tree.

To illustrate Depth-First Search, we will count the number of winning board configurations for the X player (first move) in a game of tic-tac-toe. The graph to search will be constructed blindly. The nodes of the graph will be legal board positions of a game and the edges will correspond to a legal move being made by adding the next player’s token into an open square. The graph will not strictly be a tree since two different sequences of legal moves can yield the same board configuration. The figure shows two separate positions within a tic-tac-toe move graph that share a common position among all the legal moves from the root positions.

### Algorithmic strategy

The idea behind graph searching is to visit and evaluate nodes in the graph through some coherent method. The Depth-First Search (DFS) algorithm utilizes a stack (last-in, first-out) and a boolean array (one element per node) to denote when a node has been visited. If the graph to be searched is connected, any node can be placed into the stack to start the algorithm. (If the graph is directed, it may not be connected for all nodes. However, if there is a node that can trace a path to every other node, e.g., the root of a tree, this node can be used to initialize the stack.) If the connectedness of the graph is unknown (or the purpose of the search is to find the connected components of the graph), all nodes should be initially pushed onto the stack. DFS then loops until the stack is empty. During each iteration, the top node on the stack is popped off. If this node has been marked as visited, it is discarded. If the node has not been visited previously, it is marked as “visited” in the boolean array, the node is processed, and then all adjacent nodes are pushed onto the stack.

The idea behind depth-first search is to visit a node and then visit one adjacent node. From this node, the next node to visit is an adjacent node. Visualize a family tree where each successive generation is ordered from oldest to youngest children. Starting from the root, once the processing of that node is completed, all child nodes are pushed into the stack from eldest to youngest. The next node selected to be visited is the eldest child node, and after processing the eldest child, all of its child nodes are pushed into the stack. The search proceeds by next visiting the eldest child of the eldest child of the root, which adds all her children to the stack after processing. The search proceeds by visiting nodes on a path that goes from the root through eldest children, initially ignoring brothers, sisters, cousins, aunts, uncles, nephews and nieces, to the leftmost leaf of the tree. This “plunging straight to the depths” of the graph is where the algorithm got its name.
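The eldest-child-first traversal described above can be sketched with a small adjacency matrix. The tree here is hypothetical (a three-level “family tree” of seven nodes, not from the text); pushing children in reverse order makes the eldest child come off the stack first:

```c
#define N 7

/* Tree: 0 is the root; 1, 2 are its children (eldest first);
   3, 4 are children of 1; 5, 6 are children of 2. */
static const int tree[N][N] = {
    /*        0  1  2  3  4  5  6 */
    /* 0 */ { 0, 1, 1, 0, 0, 0, 0 },
    /* 1 */ { 0, 0, 0, 1, 1, 0, 0 },
    /* 2 */ { 0, 0, 0, 0, 0, 1, 1 },
};

/* Iterative DFS from the root; records the visit order in order[]
   and returns the number of nodes visited. */
int dfs_order(int order[])
{
    int stack[N * N], top = 0, visited[N] = {0}, count = 0;
    stack[top++] = 0;                    /* push root */
    while (top > 0) {
        int k = stack[--top];            /* pop */
        if (visited[k]) continue;
        visited[k] = 1;
        order[count++] = k;
        for (int i = N - 1; i >= 0; --i) /* reverse: eldest ends on top */
            if (tree[k][i]) stack[top++] = i;
    }
    return count;
}
```

Starting from node 0, the visit order is 0, 1, 3, 4, 2, 5, 6: the search plunges down through node 1 to leaf 3 before the root’s younger child (node 2) is ever touched.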

For the tic-tac-toe counting code, when a winning position has been found, the DFS will not continue from that node. That is, none of the child nodes from that position will be added to the stack. Once a winning position (for either player) has been achieved, the game is over and no more moves are executed. Any game positions that are reachable in the graph from a winning node will need to be reached from some other node in another part of the graph. For example, the diagram here shows a win for X and a succeeding “legal” move that extends the graph (but would not be generated in a true game). However, the succeeding node can be reached through another route, as illustrated by the nodes on the right.

## Implementation strategies

### 1. Serial code

Once the stack has been primed with one or more nodes of the graph, the Depth-First Search algorithm loops while the stack is not empty, processing each non-visited node in each iteration. Part of the processing is pushing any adjacent nodes onto the stack. An adjacency matrix is used to represent the graph to be searched. Besides the adjacency matrix of the graph, the algorithm needs a method to keep track of which nodes have been visited. For this, an array with one element per node serves as an indicator of whether a node has been visited (e.g., an integer array where 0 denotes “not visited” and 1 denotes previously “visited”). The pseudo-code for this algorithm is given here:

```
push “root” node onto stack S;
while (S not empty) {
    v = pop(S);
    if ( not visited[v] ) {
        visited[v] = 1;
        Process(v); // perform computation on node v
        forall nodes, k, adjacent to v
            push(k, S);
    }
}
```

The code fragment in Code Sample 1 contains an iterative implementation of a Depth-First Search function, DFSearch(), and the associated function to perform the visit computations on a selected node. Besides the adjacency matrix and the array to keep track of which nodes have been visited, the code also assumes that there is a global integer declared, V, that holds the number of nodes in the graph and, consequently, the number of rows and columns of adj. The functions win4X() and win4O() are the “processing” of the position represented by the node k in the graph. If the position represents a win for the X player, the win counter is incremented. Otherwise, if the position is not a win for the O player, the nodes in the graph that are adjacent to node k are pushed onto the stack S.

CODE SAMPLE 1 - Iterative implementation

```c
int *visited;   // notes when a node has been visited
int **adj;      // adj[][] is adjacency matrix
int V;          // number of nodes in graph
int countXWins; // number of wins found for the X player
stack S;        // stack of nodes (indices)

void DFSearch()
{
    int i, k;

    for (k = 0; k < V; ++k) visited[k] = 0;
    for (k = V-1; k >= 0; --k) {
        push(S, k);
    }
    while (size(S) > 0) {
        k = pop(S);
        if (!visited[k]) {
            visited[k] = 1;
            if (win4X(k)) {
                ++countXWins;
                continue;
            }
            else if (!win4O(k)) {
                for (i = V-1; i >= 0; --i) {
                    if (adj[k][i]) push(S, i);
                }
            }
        }
    } // end while
}
```

The pushing of nodes onto stack S in the body of the while-loop (when node k is neither a win for the X player nor the O player) could also test whether the adjacent node has been visited prior to being pushed onto the stack. As the search progresses, fewer nodes would then be placed on the stack. However, there is still the need to test whether a popped node remains unvisited at the top of the while-loop body. There is always the chance that a node will be pushed onto the stack, popped off, and visited before a previously pushed instance is popped off and tested.
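The effect of the push-time check can be sketched on a small generic graph (an illustrative toy with no win tests; the five-node cycle and function names are assumptions, not from the samples):

```c
#include <string.h>

#define NV 5

/* Small undirected test graph: a cycle 0-1-2-3-4-0. */
static const int g[NV][NV] = {
    {0,1,0,0,1},
    {1,0,1,0,0},
    {0,1,0,1,0},
    {0,0,1,0,1},
    {1,0,0,1,0},
};

/* DFS from node 0. If filter is nonzero, adjacent nodes are pushed
   only when still unvisited at push time. Returns the number of
   pushes performed; *visitCount receives the nodes visited. */
int dfs_pushes(int filter, int *visitCount)
{
    int stack[NV * NV], top = 0, visited[NV], pushes = 0;
    memset(visited, 0, sizeof visited);
    stack[top++] = 0; pushes++;
    *visitCount = 0;
    while (top > 0) {
        int k = stack[--top];
        if (visited[k]) continue;   /* still needed in both versions */
        visited[k] = 1;
        (*visitCount)++;
        for (int i = NV - 1; i >= 0; --i)
            if (g[k][i] && (!filter || !visited[i])) {
                stack[top++] = i;
                pushes++;
            }
    }
    return pushes;
}
```

Both versions visit all five nodes, and the filtered version performs noticeably fewer pushes. Note that even with filtering, node 4 is pushed twice on this cycle (it sits unvisited on the stack while another copy is pushed), which is exactly why the popped-node test cannot be removed.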

An explicit stack is not needed for the DFS algorithm. A recursive solution can use the call stack to keep track of which node is currently being searched and “return” to a parent node (once all the other adjacent nodes have been processed or found to be visited) to visit the next node from that parent. Code Sample 2 shows a recursive implementation of counting tic-tac-toe wins using Depth-First Search.


CODE SAMPLE 2 - Recursive implementation

```c
int *visited;   // notes when a node has been visited
int **adj;      // adj[][] is adjacency matrix of graph
int V;          // number of nodes in graph
int countXWins; // number of wins found for the X player

void visit(int k)
{
    int i;

    visited[k] = 1;
    if (win4X(k)) {
        ++countXWins;
        return;
    }
    else if (!win4O(k)) {
        for (i = 0; i < V; i++) {
            if (adj[k][i])
                if (!visited[i]) visit(i);
        }
    }
}

void DFSearch()
{
    int k;

    for (k = 0; k < V; k++) visited[k] = 0;
    k = 0;
    visit(k);
}
```

The DFSearch() function first resets the visited array to all ‘0’ entries since none of the nodes in the graph have yet been visited. The root node of the tic-tac-toe graph (node 0) is then used as the parameter to the initial call to visit(). If the graph is a collection of connected components, a for-loop could be used to run over all nodes in the graph. Once all the nodes of one component have been visited, control returns to the DFSearch() function, where the for-loop finds the next unvisited node, which is used for the next call to visit(). This ensures that all nodes within any component are eventually visited.

### Parallel considerations for DFS

In a parallel implementation of Depth-First Search, the visited array needs to be shared since all threads will need access to check on a node’s visit history and to update that history when the node is actually used. A single lock object on the entire array would be the easiest way to regulate correct access. The biggest problem with this is the very real possibility that threads will sit idle waiting to read or update one element of the visited array. This will generate a very big overall performance hit and should be avoided if possible.

The best performance will be achieved by reducing or avoiding contention on the locking object. Thus, rather than have a single lock guard the entire array, a lock for each individual array element will reduce the instances of multiple threads needing concurrent access to the same element in the array. The drawback to the one element/one lock scheme is that for a graph with V nodes, V lock objects will need to be allocated alongside the visited array. This is a tradeoff of space for performance. Something in between the two extremes is needed to balance the contention and memory space issues.

One solution for such cases is to use modulo locks. If the multiple data items that require mutually exclusive access are indexed, a fixed number of locks can be allocated and the result of the item index modulo the number of locks is used to index the lock protecting access to the given item. For example, if two lock objects are used, one will protect access to the even-indexed items and the other will regulate access to the odd-indexed items. The contention on each lock should be cut in half from what it would be with a single all-encompassing lock, which should yield some performance benefit over using a single lock.
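A minimal sketch of modulo locks with Pthreads follows; the names vMutex and NUM_LOCKS match the pseudo-code samples later in the text, while the helper functions are illustrative assumptions:

```c
#include <pthread.h>

#define NUM_LOCKS 4   /* a fixed, small pool of locks */

static pthread_mutex_t vMutex[NUM_LOCKS];

/* Map an item index to the index of the lock that protects it. */
int lock_index(int k) { return k % NUM_LOCKS; }

void init_locks(void)
{
    for (int j = 0; j < NUM_LOCKS; ++j)
        pthread_mutex_init(&vMutex[j], NULL);
}

/* Acquire and release the lock guarding item k. */
void lock_for(int k)   { pthread_mutex_lock(&vMutex[lock_index(k)]); }
void unlock_for(int k) { pthread_mutex_unlock(&vMutex[lock_index(k)]); }
```

With NUM_LOCKS set to 4, items 2 and 6 share a lock while items 3 and 4 do not; contention on any one lock covers only every fourth item.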

What is the optimal number of locks to be used? As a rule of thumb, a number of locks equal to the number of threads is an obvious value. If two locks cut the contention time in half, a number of locks equal to the number of threads should avoid all contention, with each thread never needing the same lock held by another thread. That won’t happen, of course, but it is a good goal. Twice the number of threads should still be relatively small and will help spread out any expected contention even better.

#### Locking a conditional expression evaluation

Both reads and writes of shared variables must be protected. For our example, there is no problem with computing the lock index for a given visited[k] element and obtaining the lock object before updating the node’s status. In both the iterative and recursive versions of DFS presented, there is a read access of visited[k] in the conditional expression used to determine if the node had been previously visited. You can’t put a lock/unlock sequence in the conditional expression itself. However, you can read the value of the protected variable into a local variable and use the value of that local variable within a conditional expression evaluation.

Code Sample 3 shows pseudo-code that does just that. The lock object, vMutex[j], is used to protect the critical region on the read access of visited[k]. (This object must also be used to protect the critical region that updates the visited[k] element.) The lVisited variable holds the local copy of the visited[k] value and the local integer j is used to hold the lock object index computed from the modulus operation.

Code Sample 3

```
j = k % NUM_LOCKS; // find index of lock protecting visited[k]
LOCK(vMutex[j]);
lVisited = visited[k];
UNLOCK(vMutex[j]);
if (NOT lVisited) {
    LOCK(vMutex[j]);
    visited[k] = 1;
    UNLOCK(vMutex[j]);
    /*
        Body of if statement
    */
}
```

Unfortunately, the value of lVisited is only good as long as execution remains within the critical region in which that value is assigned. To see that the proposed solution in Code Sample 3 is inadequate, assume there are two threads, T0 and T1, each with the same local value of k. T0 reads visited[k], sets the local value of lVisited, exits the critical region, and is swapped out of the core, where T1 resumes execution. T1 enters the initial critical region, finds that node k has not been visited, and sets its local value of lVisited. Alternatively, if there are multiple cores and the threads are executing in parallel, T1 can enter the initial critical region while T0 is testing its local value of lVisited after the initial critical region. In either event, at this point, both T0 and T1 will execute the code to visit node k.

Both the reading and the update of visited[k] should be in the same critical region to prevent the value of visited[k] from being changed while a thread is attempting to read it. The result of the conditional test (the reading of visited[k]) must be used to determine whether the node needs to be visited (and to update the node’s visited status), and that result must be communicated outside the critical region to ensure that some thread (and only one thread) will process the node.

With this in mind, Code Sample 4 shows a modified version of the code from Code Sample 3 that protects both the read and write access to visited[k] with a modulo lock and still lets the result of the test on visited[k] control when a thread will execute the visit code.

CODE SAMPLE 4

```
j = k % NUM_LOCKS;
LOCK(vMutex[j]);
if (!visited[k]) {
    iWillVisitK = 1;
    visited[k] = 1;
}
UNLOCK(vMutex[j]);
if (iWillVisitK) {
    /*
        Body of if statement
    */
    iWillVisitK = 0;
}
```

The pseudo-code above has only the one critical region and uses the local variable iWillVisitK (initialized to 0 or FALSE) to preserve the results of the conditional expression evaluation. Also, within the critical region, if node k has not been previously visited, the visited[k] element is set to ensure that the thread setting this value is going to be the only thread that will execute the visit computation for this node of the graph.
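The exactly-once property of this pattern can be exercised with a Pthreads sketch (the driver and its names are illustrative assumptions, not part of the original samples): several threads race to claim every node, and the total number of successful claims must equal the number of nodes because the test and the update share one critical region.

```c
#include <pthread.h>
#include <string.h>

#define NUM_LOCKS 4
#define NNODES 1000
#define NTHREADS 4

static pthread_mutex_t vMutex[NUM_LOCKS];
static int visited[NNODES];
static int claims[NTHREADS];    /* nodes successfully claimed per thread */

/* Returns 1 exactly once per node k, no matter how many threads ask. */
static int try_visit(int k)
{
    int j = k % NUM_LOCKS, iWillVisitK = 0;
    pthread_mutex_lock(&vMutex[j]);
    if (!visited[k]) {          /* test and update under one lock */
        iWillVisitK = 1;
        visited[k] = 1;
    }
    pthread_mutex_unlock(&vMutex[j]);
    return iWillVisitK;
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (int k = 0; k < NNODES; ++k)
        if (try_visit(k)) claims[id]++;
    return NULL;
}

/* Runs the race and returns the total number of successful claims. */
int run_claim_race(void)
{
    pthread_t t[NTHREADS];
    int total = 0;
    memset(visited, 0, sizeof visited);
    memset(claims, 0, sizeof claims);
    for (int j = 0; j < NUM_LOCKS; ++j)
        pthread_mutex_init(&vMutex[j], NULL);
    for (long i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; ++i)
        total += claims[i];
    return total;
}
```

If the read and write were split into two critical regions, as in Code Sample 3, the total could exceed NNODES because two threads could both claim the same node.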

#### Is the parallel version still Depth-First Search?

In both the iterative and recursive serial versions, the order of node visits can be shown to follow the expected DFS order. During parallel execution, one or more threads may be popping nodes while another thread is pushing them onto the shared stack. Likewise, tasks of recursive calls may be executed in a nondeterministic order. Thus, there is no guarantee of a strictly DFS order when nodes are visited in parallel. The parallelization of the DFS algorithm may result in some hybrid node visitation order between Breadth-First (which uses a queue instead of a stack) and Depth-First search.

Even so, many of the properties of a DFS on a graph can be preserved. The programmer, however, must pay particular attention to ensuring that the desired properties of DFS are maintained, even in the parallel code.
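The contrast between the two orders can be sketched by changing only the container policy (an illustrative toy on a seven-node complete binary tree; none of this code is from the samples above):

```c
#include <string.h>

#define NB 7

/* Complete binary tree on 7 nodes: children of k are 2k+1 and 2k+2. */
static int child(int a, int b) { return b == 2*a + 1 || b == 2*a + 2; }

/* Traverse from node 0, pulling work either LIFO (useStack != 0,
   depth-first) or FIFO (breadth-first). order[] receives the visit
   order; the number of nodes visited is returned. */
int traverse(int useStack, int order[])
{
    int buf[NB * NB], head = 0, tail = 0, visited[NB], count = 0;
    memset(visited, 0, sizeof visited);
    buf[tail++] = 0;
    while (head < tail) {
        int k = useStack ? buf[--tail] : buf[head++];
        if (visited[k]) continue;
        visited[k] = 1;
        order[count++] = k;
        for (int i = 0; i < NB; ++i)
            if (child(k, i)) buf[tail++] = i;
    }
    return count;
}
```

The stack yields 0, 2, 6, 5, 1, 4, 3 (plunging to a leaf before returning for siblings), while the queue yields the level order 0, 1, 2, 3, 4, 5, 6; a parallel run with a shared container can interleave anywhere between these two extremes.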

### 2. Shared memory

#### 2a. OpenMP -- recursive implementation

This is the simplest parallelization strategy we will consider for this problem. The example uses the OpenMP task construct to spawn an independent execution of each recursive call.

```c
int *visited;       // notes when a node has been visited
int **adj;          // adj[][] is adjacency matrix of graph
int V;              // number of nodes in graph
int countXWins = 0; // number of wins found for the X player

void visit(int k)
{
    int i, iWillVisitK = 0;

    #pragma omp critical
    if (!visited[k]) {
        iWillVisitK = 1;
        visited[k] = 1;
    }

    if (iWillVisitK) {
        if (win4X(k)) {
            #pragma omp atomic
            ++countXWins;
            return;
        }
        else if (!win4O(k)) {
            for (i = 0; i < V; i++) {
                if (adj[k][i]) {
                    #pragma omp task
                    visit(i);
                }
            }
        }
    }
}

void DFSearch()
{
    int k;

    #pragma omp parallel
    {
        #pragma omp for
        for (k = 0; k < V; k++)
            visited[k] = 0;

        #pragma omp single
        visit(0); // start at root node with index 0
    } // end parallel
}
```

Comments on the code:

● A single region is used to allow only one thread to execute the initial call to visit(). Subsequent recursive calls to visit() are done from an OpenMP task, which will involve the other threads in the team that are waiting at the single region’s barrier.

● An OpenMP critical construct is used to protect access and checking of the visited status for a node. Since the critical region is not given a name, all tasks executing visit() will block on the same critical construct.

● If a thread knows that it is handling node k, the graph node is checked for a win by the X player. If this is not the case, a check for a win by the O player is conducted, and if that also fails, the nodes adjacent to node k are explored through calls (and created tasks) to visit() on each adjacent node.

● The update to the shared countXWins is done atomically.

● The number of OpenMP threads used does not need to match the number of cores on your system. The number of threads can be controlled by setting the environment variable OMP_NUM_THREADS at runtime. The function omp_set_num_threads() may be used to set the number of threads from within the code.

Further Exploration:

● Download OpenMP code. Using the OpenMP lock facility, implement modulo locks in place of using a critical construct. Is there a noticeable improvement in the execution time with the new code?

● Another modification to try is to not start a task on every recursive call. Instead, save one call to be done by the task that is spawning the new tasks. The body of the visit() function would need a loop to execute while a new node to visit is available. Whenever a win for the X player is found, the function will increment the counter and exit. If neither player has a win in the current graph node, then the first node adjacent to the current node k is put aside and any other adjacent nodes are used to spawn a new task. The node put aside then becomes the current node k for another iteration of the visit() loop.

#### 2b. Windows Threads -- iterative version

This implementation assumes the existence of a thread-safe stack (type Stack). (The Intel(R) Threading Building Blocks concurrent_queue container would be an appropriate substitute if the order of node visitation is not critical to the algorithm.) A semaphore object is used to control access and keep a count of the number of items in the stack. A shared integer, gCount, is used to count the nodes as they are visited. When the count reaches V, the graph search is done.

```c
long *visited;                // notes when a node has been visited
long gCount = 0;              // number of nodes visited so far
long countXWins = 0;          // number of wins found for the X player
int **adj;                    // adj[][] is adjacency matrix
int V;                        // number of nodes in graph
stack S;                      // thread-safe stack of node indices
HANDLE hSem;                  // semaphore tracking the stack size
HANDLE tSignal;               // event signaled when the search is done
HANDLE hThreads[NUM_THREADS];

unsigned __stdcall pDFSearch(void *pArg)
{
    int k, i, iWillVisitK = 0;

    while (1) {
        WaitForSingleObject(hSem, INFINITE); // semaphore count of stack size
        if (gCount == V) break;
        k = pop(S);
        if (!InterlockedCompareExchange(&visited[k], 1L, 0L)) {
            iWillVisitK = 1;
            InterlockedIncrement(&gCount);
        }
        if (iWillVisitK) {
            if (win4X(k)) {
                InterlockedIncrement(&countXWins);
            }
            else if (!win4O(k)) {
                int semCount = 0;
                for (i = V-1; i >= 0; i--) {
                    if (adj[k][i]) {
                        push(S, i);
                        semCount++;
                    }
                }
                if (semCount) ReleaseSemaphore(hSem, semCount, NULL);
            }
            iWillVisitK = 0;
            if (gCount == V) SetEvent(tSignal);
        }
    }
    return 0;
}

// . . .
// Code to start DFS, wait for completion, and terminate threads
push(S, 0);                                 // load root node into stack
hSem = CreateSemaphore(NULL, 1, V*V, NULL); // initialize semaphore count to 1
for (i = 0; i < NUM_THREADS; ++i)
    hThreads[i] = (HANDLE) _beginthreadex(NULL, 0, pDFSearch,
                                          NULL, 0, NULL);
WaitForSingleObject(tSignal, INFINITE);     // wait for completion signal
ReleaseSemaphore(hSem, NUM_THREADS, NULL);  // release any waiting threads
```

Comments on the code:

● The whole DFS computation is contained within an infinite while-loop. Each thread will use the semaphore (hSem) to determine if there are nodes on the stack. If there are items on the stack (the semaphore count is greater than 0), the count is decremented by the WaitForSingleObject() function. Before going on, the code checks the search termination criteria. If all the nodes in the graph have been visited (gCount == V), there’s no reason for a thread to continue, so the thread will break out of the while-loop and terminate. If the search hasn’t gotten to all nodes, the thread pops a node index from the stack into the local integer k.

● For the special case of the conditional expression evaluation and the update of a single item in the visited array, the code uses InterlockedCompareExchange(d, e, c). When called, this function stores the current value of d in a temp location, compares the value of d to c, and, if they are equal, stores the value of e into d before returning the original value of d from the temp location. All of this is done atomically. To use InterlockedCompareExchange() to replace the critical-region algorithm described in Code Sample 4, d references visited[k], e is ‘1’, and c is ‘0’. If visited[k] is ‘0’ (the node has not been visited), comparing it to c makes the equality test TRUE, and the value in e is stored in visited[k]. This atomically sets the status of the node to visited, and the return of ‘0’ from the originally stored value signifies that the node was previously unvisited. On the flip side, if visited[k] is ‘1’, the comparison to c will not be equal, no change is made to the value stored in this array element, and the return of ‘1’ signifies that the node has already been visited by a thread.

● If the iWillVisitK flag is set, the thread does the visit computation on node k. If there was no win detected for either player, row k of the adjacency matrix (adj) is searched. Each adjacent node is pushed onto the stack and a local counter keeps track of how many new nodes are added to the stack. After all the nodes have been placed on the stack, the semaphore value is updated. By using the local counter, I only need to call ReleaseSemaphore() once, and only if there were adjacent nodes found. Finally, the iWillVisitK flag is reset in preparation for the next node to be taken from the stack.

● Before going back to the stack, the thread examines the value of gCount. If all nodes have been visited, a signal is sent to an external thread (likely the thread that spawned the threads for the search) to indicate that the search is done. While a break could be placed here, the test and break just after the WaitForSingleObject() call on the semaphore is needed for those threads that don’t detect the last node to be visited.

● In the code to prepare and launch the threads for the DFS, the first line pushes the root node onto the stack S. The count of the semaphore object, hSem, is initialized to 1, the number of nodes on the stack, and the maximum count value is set at V*V. This would correspond to a fully connected graph with V nodes, which has the largest number of edges for the given number of nodes. The threads are created by calling _beginthreadex() and the returned HANDLE for each thread is stored in the hThreads array. After creating the threads, the spawning thread waits on the Windows event that will signal completion of the search. In case there are threads stalled waiting on an empty stack, ReleaseSemaphore() is called to release those threads in order for them to determine that the search has completed.

● Notice that the setting of the tSignal event in pDFSearch() was done after the node had been processed and extra, ineffectual nodes were added to the stack. Why not put the test for completion and sending of the signal right after the InterlockedIncrement() call that results in gCount achieving the target value? This does seem like a more logical place, but it could lead to a problem in the spawning thread’s use of the search results. If the signal is sent before the last node has actually been processed, the spawning thread can wake up, set the semaphore’s count to ensure the search threads aren’t ensnared by an empty stack, and then proceed to use the (incomplete) results of the search. To guarantee that all node processing has finished, the spawning thread would need another synchronization point after setting the semaphore value. By not setting the tSignal event until after the last node is finished with the required visit processing, the spawning thread knows that all search processing is finished when the tSignal event is set. Alternatively, a WaitForMultipleObjects() call could wait on the thread terminations. However, this would keep the spawning thread paused after the DFS has completed, until every search thread has actually exited.
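For reference, the claim-a-node test performed by InterlockedCompareExchange() can be sketched portably with C11 atomics (an illustrative fragment, not part of the Windows code above):

```c
#include <stdatomic.h>

#define NN 16

static atomic_long visited[NN]; /* 0 = unvisited, 1 = visited */

/* Atomically claim node k. Returns 1 if this call changed the node
   from unvisited to visited, 0 if some earlier call already had.
   Mirrors InterlockedCompareExchange(&visited[k], 1L, 0L) == 0. */
int claim(int k)
{
    long expected = 0;
    return atomic_compare_exchange_strong(&visited[k], &expected, 1);
}
```

The first call on a given index succeeds and every later call fails, which is exactly the property that makes the iWillVisitK flag safe to set lock-free.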

#### 2c. Go Language

```go
package main

import (
	"flag"
	"fmt"
	"math"
	"math/rand"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

var (
	nVertex = flag.Int("v", 1000, "number of vertices")
	nEdge   = flag.Int("e", 100, "mean number of edges per vertex")
	nCPU    = flag.Int("cpu", 1, "number of CPUs to use")
)

type Graph struct {
	adj  [][]bool // Adjacency matrix.
	comp []uint32 // Component index (0 means not marked).
}

func main() {
	flag.Parse()
	runtime.GOMAXPROCS(*nCPU) // Set number of OS threads to use.
	t := time.Now()
	g := MakeGraph(*nVertex, *nEdge)
	fmt.Printf("make graph: %v\n", time.Since(t))
	t = time.Now()
	g.Mark()
	fmt.Printf("mark graph: %v\n", time.Since(t))
}

// MakeGraph creates a random graph with v vertices
// and e edges per vertex (on the average).
func MakeGraph(v, e int) *Graph {
	var g Graph
	g.comp = make([]uint32, v)
	g.adj = make([][]bool, v)
	for i := 0; i < v; i++ {
		g.adj[i] = make([]bool, v)
	}
	var wg sync.WaitGroup
	wg.Add(v)
	for i := 0; i < v; i++ {
		go func(i int) {
			r := rand.New(rand.NewSource(int64(i)))
			c := float64(v) / float64(e)
			for j := 0; j < i; j++ {
				if r.Float64()*c < 1 {
					g.adj[i][j] = true
					g.adj[j][i] = true
				}
			}
			wg.Done()
		}(i)
	}
	wg.Wait()
	return &g
}

// Mark marks connected components in g.
func (g *Graph) Mark() {
	comp := 0 // Component index sequence.
	splitThreshold := 0
	if *nCPU > 1 {
		splitThreshold = math.Ilogb(float64(*nCPU)) + 2
	}
	for i := 0; i < len(g.adj); i++ {
		if g.comp[i] == 0 {
			comp++
			g.comp[i] = uint32(comp)
			var wg sync.WaitGroup
			g.visit(i, comp, splitThreshold, &wg)
			wg.Wait()
		}
	}
}

// visit visits a single vertex n.
func (g *Graph) visit(n, comp, splitThreshold int, wg *sync.WaitGroup) {
	v := len(g.adj)
	if splitThreshold == 0 {
		for i := 0; i < v; i++ {
			if !g.adj[n][i] || g.comp[i] != 0 {
				continue
			}
			if !atomic.CompareAndSwapUint32(&g.comp[i], 0, uint32(comp)) {
				continue
			}
			g.visit(i, comp, splitThreshold, wg)
		}
		return
	}
	visit := make([]int, 0, v)
	for i := 0; i < v; i++ {
		// Check that the vertex is adjacent and is not yet marked.
		if !g.adj[n][i] || g.comp[i] != 0 {
			continue
		}
		// Try to mark the vertex.
		// If it fails then we lose a race with a concurrent goroutine.
		if !atomic.CompareAndSwapUint32(&g.comp[i], 0, uint32(comp)) {
			continue
		}
		visit = append(visit, i)
	}
	if len(visit) == 0 {
		return
	} else if len(visit) == 1 {
		g.visit(visit[0], comp, splitThreshold, wg)
		return
	}
	mid := len(visit) / 2
	g.visitSet(visit[:mid], comp, splitThreshold-1, wg)
	g.visitSet(visit[mid:], comp, splitThreshold-1, wg)
}

func (g *Graph) visitSet(set []int, comp, splitThreshold int, wg *sync.WaitGroup) {
	wg.Add(1)
	go func() {
		for i := 0; i < len(set); i++ {
			g.visit(set[i], comp, splitThreshold, wg)
		}
		wg.Done()
	}()
}
```

## What else is DFS good for?

Given a graph of nodes and edges, a computation may need to visit every node in the graph in search of some specific node or simply to survey the contents of the graph. For such problems, Depth-First Search is an excellent method. One problem that can be solved by visiting every node is determining whether an undirected graph is connected (each node is reachable from every other node); similarly, you can identify and label the connected components that make up the graph. Determining whether there is a cycle in the graph can also be done through a Depth-First Search.
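The component-labeling use just mentioned can be sketched recursively, much as the Go Mark() routine does in parallel (the six-node graph here is an illustrative assumption):

```c
#define NG 6

/* Two components: {0,1,2} is a triangle (so it contains a cycle),
   {3,4} is a single edge; node 5 is isolated. */
static const int gr[NG][NG] = {
    {0,1,1,0,0,0},
    {1,0,1,0,0,0},
    {1,1,0,0,0,0},
    {0,0,0,0,1,0},
    {0,0,0,1,0,0},
    {0,0,0,0,0,0},
};
static int comp[NG]; /* 0 means not yet labeled */

/* Recursive DFS that stamps component number c on everything
   reachable from node k. */
static void label(int k, int c)
{
    comp[k] = c;
    for (int i = 0; i < NG; ++i)
        if (gr[k][i] && comp[i] == 0) label(i, c);
}

/* Labels every node with its component number and returns the
   number of connected components found. */
int components(void)
{
    int c = 0;
    for (int k = 0; k < NG; ++k) comp[k] = 0;
    for (int k = 0; k < NG; ++k)
        if (comp[k] == 0) label(k, ++c);
    return c;
}
```

On this graph the search finds three components; the graph is connected exactly when the returned count is 1.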