This paper aims at bridging the gap between the lack of synchronization mechanisms in recent graphics processor (GPU) architectures and the need for synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the new wait-free objects have time complexity O(N), where N is the number of concurrent processes. The wait-free objects have space complexity O(N^2), which is optimal. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need for strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.
Graphics processors (GPUs) are emerging as powerful computational co-processors for general-purpose computations. The demands of graphics applications have driven GPUs to be today's most powerful computational hardware for the dollar [4]. Moreover, unlike previous GPU architectures, which are single-instruction multiple-data (SIMD), recent GPU architectures (e.g. Compute Unified Device Architecture (CUDA) [1]) are single-program multiple-data (SPMD). The latter consist of multiple SIMD multiprocessors, each of which can execute a different instruction at the same time. This extends the set of general-purpose applications on GPUs, which are no longer restricted to the SIMD programming model. These facts have motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computations such as physics simulations, data mining and signal processing [4].
However, recent GPU architectures also create challenges for synchronization between their SIMD multiprocessors (or SIMD cores). Since the GPUs are optimized for processing 3D graphics (e.g. graphics rendering), they consist of a massive number of hardware threads, each of which basically operates on a pixel of a 3D image. In such data-parallel applications, the hardware threads do not need to synchronize with each other. As a result, the GPUs typically do not support any strong synchronization primitives like test-and-set and compare-and-swap (e.g. NVIDIA Tesla and Quadro Plex series with up to 64 cores) [1]. This fact prevents the GPUs from being deployed more widely. For instance, the GPUs cannot be used for conventional concurrent programming, which typically relies on synchronization mechanisms like semaphores and monitors, nor for advanced lock-free/wait-free programming.
Copyright is held by the author/owner(s). PODC'08, August 18-21, 2008, Toronto, Ontario, Canada. ACM 978-1-59593-989-0/08/08.
Based on the intrinsic features of recent GPU architectures, we first generalize the architectures to an abstract model of a chip with multiple SIMD cores sharing a memory. Each core can process M threads (in a SIMD manner) in one clock cycle. Each thread of a core accesses the shared memory using (atomic) read/write operations. We observe that, owing to the SIMD architecture, each SIMD core with M hardware threads can read/write M memory locations in one atomic step (e.g. the CUDA architecture [1]). Using these M-register read/write operations, we construct wait-free and t-resilient synchronization objects [2, 3] for this model. The wait-free and t-resilient objects can be deployed as building blocks in parallel programming to help parallel applications tolerate crash failures and gain performance.
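To make the abstract model above more concrete, the following is a minimal Python sketch of a shared memory in which one step of a SIMD core reads or writes up to M locations atomically. It is only an illustration of the M-register read/write operations, not the construction of the wait-free objects themselves. The names `SimdSharedMemory`, `m_write`, `m_read` and `count_torn_snapshots` are hypothetical, and the lock is a simulation device that stands in for the atomicity the hardware is assumed to provide; the model itself uses no locks.

```python
import threading


class SimdSharedMemory:
    """Toy model (assumption): one SIMD core step touches up to M cells atomically."""

    def __init__(self, size, m):
        self.m = m                     # threads per core = max cells per atomic step
        self._cells = [0] * size
        # Simulation only: the lock stands in for the hardware atomicity of one
        # SIMD step. The modeled architecture does not use locks.
        self._step = threading.Lock()

    def m_write(self, assignments):
        """Atomically write {address: value} for at most M addresses."""
        if len(assignments) > self.m:
            raise ValueError("a core can write at most M cells per step")
        with self._step:
            for addr, value in assignments.items():
                self._cells[addr] = value

    def m_read(self, addresses):
        """Atomically read at most M addresses and return their values."""
        if len(addresses) > self.m:
            raise ValueError("a core can read at most M cells per step")
        with self._step:
            return [self._cells[a] for a in addresses]


def count_torn_snapshots(m=4, rounds=2000):
    """Two simulated cores overwrite the same M cells while a reader samples them.

    Returns the number of snapshots that mix values from different writes.
    """
    mem = SimdSharedMemory(size=m, m=m)
    addresses = list(range(m))

    def writer(core_id):
        for _ in range(rounds):
            mem.m_write({a: core_id for a in addresses})

    writers = [threading.Thread(target=writer, args=(cid,)) for cid in (1, 2)]
    for t in writers:
        t.start()

    torn = 0
    for _ in range(rounds):
        snapshot = mem.m_read(addresses)
        if len(set(snapshot)) != 1:    # a torn snapshot would mix core ids
            torn += 1

    for t in writers:
        t.join()
    return torn
```

Because every M-register write and read is a single atomic step in this sketch, a reader always observes all M cells as written by one core, never a mixture of two writes. This multi-location atomicity is the property that single-register read/write operations lack.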
ACKNOWLEDGMENTS
A long version of this paper appeared at the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2008. Phuong Ha's and Otto Anshus's work was supported by the Norwegian Research Council (grant numbers 159936/V30 and 155550/420). Philippas Tsigas's work was supported by the Swedish Research Council (VR) (grant number 37252706).
