Let’s get started...
This series of Gems describes the concurrent maze solver project ("amazing") included with the GNAT Pro examples. The first Gem in the series introduced the project itself and explained the concurrent programming design approach. This second Gem explores the principal change that was required for optimal performance on multicore architectures. This change solved a critical performance bottleneck that was not present when the original program was first deployed in the 1980s, illustrating one of the fundamental differences between traditional multiprocessing and modern multicore programming.
The original target machine was a Sequent Balance 8000, a symmetric multiprocessor with eight CPUs and shared memory. The operating system transparently dispatched Ada tasks to processors, so one could write a highly portable concurrent Ada program for it. In the 1980s this was a very attractive machine, as you might imagine. The resulting program successfully demonstrated Ada's capability to harness such architectures, as well as the general benefits of parallel execution. In particular, the execution time for the sequential version of the maze solver grew at an alarming rate as the number of maze solutions grew larger, whereas the parallel version showed only modest increases. (Remember, the point is to find all the possible solutions to a given maze, not just one.)
The program was indeed highly portable and ran on a number of very different vendors' machines, some parallel and some not. Over time, we have incorporated the language revisions' advances, primarily protected types, and added features such as command-line switches for flexibility, but the architecture and implementation have largely remained unchanged. Until recently, that is.
As described in the first Gem in this series, the program "floods" the maze with searcher tasks in a classic divide-and-conquer design, each searcher looking for the exit from a given starting point. The very first searcher starts at the maze entrance, of course, but as any searcher task encounters intersections in the maze, it assigns another identical task to each alternative location, keeping one for itself. Thus, a searcher task that finds the exit has discovered only part of a complete solution path through the maze. If the very first searcher happened to find the exit, it would have a complete solution, but all the other searchers have only a part of any given solution path because they did not start at the entrance.
As the searchers traverse the maze they keep track of the maze locations they visit so that those locations can be displayed if the exit is eventually found. But as we have seen, those locations comprise only a partial path through the maze. Therefore, when a successful searcher displays the entire solution it must also know the locations of the solution prior to its own starting point, as well as the locations it traversed itself to reach the exit. To address that requirement, when a searcher is initiated at a given starting location it is also given the current solution as it is known up to that location. The very first searcher is simply given an empty solution, known as a "trail" in the program. Successful searchers display both the part they discovered and the part they were given when started.
Note that these partial solutions are potentially shared, depending on the maze. (Solutions are unique if any constituent maze locations are different, but that does not preclude partial sharing.) Those maze locations closer to the entrance are likely to be heavily shared among a large number of unique solutions. Conceptually, the complete solutions form a tree of location sequences, with prior shared segments appearing earlier in the tree and unique subsegments appearing beneath them. The maze entrance appears once, in the root at the top of the tree, whereas the maze exit appears at the end of every solution.
Imagine, then, how one might want to represent this tree. Given that segments of the solutions – the trails – are likely shared logically, perhaps we can also share them physically. However, as a shared data structure, race conditions are an obvious concern. We therefore want a representation that will minimize the locking required for mutual exclusion. We also want a representation that can contain any number of location pairs per segment because the mazes are randomly generated initially. That is, we don't know how many locations any given solution will contain, much less how many solutions there will be.
An unbounded, dynamically allocated list of maze locations meets these goals nicely. It can directly represent the logical sharing and can handle trails of any length as long as sufficient memory is available. Even better, no mutual exclusion locking is required because we only need to append list segments to prior, existing segments. There is no need to alter the prior segments themselves, so there is no need to lock the tree at all!
The representation seems ideal, and for the original symmetric multiprocessor target it was a reasonable approach, but when the program was run on modern multicore machines the performance was very poor. Indeed, individual processor utilization was so poor that the sequential version of the maze solver was quite competitive with the concurrent version.
Poor processor utilization is the key to the problem. Even though we are harnessing multiple processors and can have as many threads available per processor as we may want, individual processors are performing poorly. The problem is caused by poor cache utilization, itself a result of poor locality of reference. Specifically, the dynamically allocated elements within the trails are not in memory locations sufficiently close to one another to be in the same cache line, thereby causing many cache misses and poor overall processor performance.
The issue is that searcher tasks must also examine the locations within their prior solution trails as they search for the exit. (In other words, not only when displaying solutions.) They do so to prevent false circular solutions through the maze, made possible by the presence of intersections. Therefore, the searcher tasks must determine whether they have been to a given location in the maze before including that location in their solution. Not all location pairs in a trail need be visited, however, to perform this check. The presence of an intersection in a prior path segment suffices to indicate circular motion, so each trail includes a list of intersections, and it is this secondary list that the searchers examine. Unfortunately any benefits of that implementation optimization are overwhelmed by the results of the cache misses.
A different trail representation is needed for programs intended for multicore targets, one with much better locality of reference. Arrays have that precise characteristic, so we have chosen a bounded, array-backed list to represent trails. That choice will not surprise those familiar with this problem, even though the resulting copying and lack of physical sharing would argue against it.
In the next Gem in this series we will provide the details of this implementation change and the reusable components involved.
As mentioned, the "amazing" project is supplied with the GNAT Pro native compiler. Look for it in the share/examples/gnat/amazing/ directory located under your compiler’s root installation. Note that the described design change will appear in future releases of the compiler.