Gem #93: High Performance Multi-core Programming - Part 1
by Pat Rogers, AdaCore
Let’s get started…
Chameneos-Redux is part of the Computer Language Shootout, a suite of benchmarks that compares the implementations of various programming languages across different kinds of applications and platforms. The program is required to perform a specified number of rendezvous between mythical "chameneos" creatures, where each creature is represented by a distinct thread. Each rendezvous is symmetric, in that the participating creatures can be either the caller or the called member of any given encounter.
Although both single-core and multi-core machines are used in the benchmarks, we focus only on the multi-core versions with our Ada implementation. The multi-core benchmark results for all implementations are available here: Chameneos-Redux.
As you will see on that web page, there are several implementations. (All implementations are supplied by volunteers, written against common requirements.) The fastest implementations are currently written in GNU C, GNU C++, SBCL Lisp, and GNAT Ada. After those, the other implementations, also written in C, Java, C++, Ada, and numerous other languages, are considerably slower. On the author's development machine, the C++ and Ada implementations have essentially the same performance, with a slight advantage for C++, but on the official benchmark machine our Ada version is noticeably slower than the C++ version. The discrepancy is under investigation. The overall results also fluctuate occasionally, presumably due to perturbations in the benchmarking platform, so the top-ranked versions can move up or down a little relative to each other.
The reason the other implementations are considerably slower is that they do not use the same fundamental design. That fact is our first point to be made about performance: design trumps tuning. When it comes to performance, no amount of tweaking can compensate for an inherently slow design. For example, an earlier Ada implementation (supplied by another volunteer) used an asynchronous select statement. In addition to the difficulties in using this statement correctly, the semantics are such that it is inherently expensive, and would be so in any programming language. Worse yet, the asynchronous select statement was the "else part" of a selective accept statement for a rendezvous, all within a loop. Effectively the program was polling in the most expensive manner imaginable. Worst of all, this code was in the worst possible place – the code implementing the primary behavior of the threads. As a vehicle for displaying Ada constructs the design was impressive, but as a demonstration for performance it was not competitive.
In contrast, the three fastest designs (including the current Ada implementation) all use a shared variable for this most critical aspect of the implementation. The shared variable is accessed using a specific machine instruction that locks the memory bus and then atomically reads and updates the value. By packing both the number of rendezvous completed and the identities of the creatures awaiting rendezvous into a single shared word, this single-instruction approach provides an extremely fast method of updating program state and passing data among the threads.
The machine instruction is a "compare-and-swap" (CAS) instruction that can be accessed as a GCC compiler "built-in". It can also be accessed by writing the machine-code insertion directly, but the built-in is more convenient. To import it into an Ada program, one declares the subprogram and then uses pragma Import for the completion as usual, but with a convention of "intrinsic" because it is an instruction sequence issued directly by the compiler.
   function Sync_Val_Compare_And_Swap_32
     (Destination : access Unsigned_32;
      Comparand   : Unsigned_32;
      New_Value   : Unsigned_32)
      return Unsigned_32;

   pragma Import
     (Intrinsic, Sync_Val_Compare_And_Swap_32, "__sync_val_compare_and_swap_4");
The declaration for the built-in is that of a function, because the value prior to the swap is returned. Specifically, the CAS built-in compares Comparand to Destination.all, and if they have the same value, writes New_Value into Destination.all. The caller can then check the value returned to see whether the update actually took place, as well as using that value for other purposes. (There is also a version that returns a Boolean value indicating whether the swap occurred, instead of returning the prior value.)
The other requirement for importing an intrinsic built-in is to deal with overloading. There are several forms of the CAS instruction, depending on the size of the data in question, resulting in overloaded versions of the built-in. GNAT currently does not automatically resolve overloaded intrinsic operations, so the External_Name parameter to pragma Import must identify which version is intended. A suffix is appended to the name for this purpose. In the declaration above, we are using 32-bit quantities, so we specify the name as shown, in which the suffix "_4" indicates the number of bytes to manipulate and thus the version of the built-in desired.
In a future Gem we will explore additional steps used to increase performance, including a user-defined allocator that ensures all allocations are cache-aligned, tuning by specifying adherence to language restrictions, and assigning threads to cores so that threads execute with maximum performance.