In this paper, we revisit the design of synchronization
primitives---specifically barriers, mutexes, and semaphores---and how they
apply to the GPU. Previous implementations are insufficient due to the
differences between the hardware and programming models of the GPU and the CPU. We create
new implementations in CUDA and analyze the performance of spinning on the GPU,
as well as a method of sleeping on the GPU, by running a set of memory-system
benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class
GPUs from NVIDIA. From our results we define higher-level principles that are
valid for generic many-core processors, the most important of which is to limit
the number of atomic accesses required for a synchronization operation because
atomic accesses are slower than regular memory accesses. We use the results of
the benchmarks to critique existing synchronization algorithms and guide our
new implementations, and then define an abstraction of GPUs to classify any GPU
based on the behavior of the memory system. We use this abstraction to create
suitable implementations of the primitives specifically targeting the GPU, and
analyze the performance of these algorithms on Tesla and Fermi. We then predict
performance on future GPUs based on characteristics of the abstraction. We also
examine the roles of spin waiting and sleep waiting in each primitive and how
their performance varies based on the machine abstraction, then give a set of
guidelines for when each strategy is useful based on the characteristics of the
GPU and expected contention.
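
As an illustrative sketch only, and not the paper's implementation, the following CUDA fragment shows one way the atomic-access principle can play out: a test-and-test-and-set spin lock polls the lock word with ordinary reads and issues a single atomicCAS only when the lock appears free, so most waiting traffic avoids atomic operations. The names lock_acquire, lock_release, and counter_kernel are hypothetical.

// Sketch of a GPU spin lock that limits atomic accesses (test-and-test-and-set).
// Assumes a lock word and counter allocated in global memory and initialized to 0.

__device__ void lock_acquire(volatile int *lock)
{
    while (true) {
        // Cheap non-atomic poll: wait until the lock looks free.
        while (*lock != 0) { /* spin */ }
        // Single atomic attempt; succeeds only if the lock is still free.
        if (atomicCAS((int *)lock, 0, 1) == 0)
            return;
    }
}

__device__ void lock_release(volatile int *lock)
{
    __threadfence();           // make critical-section writes visible first
    atomicExch((int *)lock, 0);
}

__global__ void counter_kernel(int *lock, int *counter)
{
    // Only one thread per block contends, avoiding intra-warp lock deadlock.
    if (threadIdx.x == 0) {
        lock_acquire(lock);
        *counter += 1;         // critical section
        lock_release(lock);
    }
}

Restricting contention to one thread per block is the usual precaution on lockstep-scheduled warps, where a lane holding the lock can otherwise be starved by sibling lanes spinning on it.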