A non-blocking FIFO queue algorithm for multiprocessor shared memory systems is presented in this paper. The algorithm is very simple and fast, and it scales very well in both symmetric and non-symmetric multiprocessor shared memory systems. Experiments on a 64-node SUN Enterprise 10000, a symmetric multiprocessor system, and on a 64-node SGI Origin 2000, a cache-coherent non-uniform memory access multiprocessor system, indicate that our algorithm considerably outperforms the best of the known alternatives on both multiprocessors at any level of multiprogramming. This work introduces two new, simple algorithmic mechanisms. The first lowers the contention on key variables used by the concurrent enqueue and/or dequeue operations, which consequently results in the good performance of the algorithm. The second deals with the pointer recycling problem, an inconsistency problem that all non-blocking algorithms based on the compare-and-swap synchronisation primitive have to address. In our construction we chose compare-and-swap because it is an atomic primitive that scales well under contention and either is supported by modern multiprocessors or can be implemented efficiently on them.
INTRODUCTION
Concurrent FIFO queue data structures are fundamental data structures used in many applications, algorithms and operating systems for multiprocessor systems. To protect the integrity of the shared queue, concurrent operations that have been created either by a parallel application or by the operating system have to be synchronised. Typically, algorithms for concurrent data structures, including FIFO queues, use some form of mutual exclusion (locking) to synchronise concurrent operations. Mutual exclusion protects the consistency of the concurrent data structure by allowing only one process (the holder of the lock of the data structure) at a time to access the data structure and by blocking all the other processes that try to access the concurrent data structure at the same time. Mutual exclusion and, in general, solutions that introduce blocking suffer from priority inversion, deadlock scenarios and performance bottlenecks. The time that a process can spend blocked while waiting to get access to the critical section can form a substantial part of the algorithm execution time [5, 9, 10, 14].
There are two main reasons that locking is so expensive. The first reason is the convoying effect that blocking synchronisation suffers from: if a process holding the lock is preempted, any other process waiting for the lock is unable to perform any useful work until the process that holds the lock is scheduled again. When we take into account that the multiprocessor running our program is used in a multiprogramming environment, convoying effects can become serious. The second reason is that locking tends to produce a large amount of memory and interconnection network contention: locks become hot memory spots. Researchers in the field first designed different lock implementations that lower the contention when the system is in a high congestion situation, and that give different execution times under different contention instances. The overhead due to blocking, however, remained.
To address the problems that arise from blocking, researchers have proposed non-blocking implementations of shared data structures. Non-blocking implementation of shared data objects is a new alternative approach to the problem of designing scalable shared data objects for multiprocessor systems.
This work is partially supported by: i) the national Swedish Real-Time Systems research initiative ARTES (www.artes.uu.se), supported by the Swedish Foundation for Strategic Research, and ii) the Swedish Research Council for Engineering Sciences.
Non-blocking implementations allow multiple tasks to access a shared object at the same time, but without enforcing mutual exclusion to accomplish this. Since in non-blocking implementations of shared data structures one process is not allowed to block another process, non-blocking shared data structures have the following significant advantages over lock-based ones:
1. they avoid lock convoys and contention points (locks).
2. they provide high fault tolerance (processor failures will never corrupt shared data objects) and eliminate deadlock scenarios, where two or more tasks are waiting for locks held by the other.
3. they do not give rise to priority inversion scenarios.
Among all the innovative architectures for multiprocessor systems that have been proposed over the last forty years, shared memory multiprocessor architectures are gaining a central place in high performance computing. Over the last decade many shared memory multiprocessors have been built, and almost all major computer vendors now develop and offer shared memory multiprocessor systems. There are two main classes of shared memory multiprocessors: the Cache-Coherent Nonuniform Memory Access (ccNUMA) multiprocessors and the symmetric or Uniform Memory Access (UMA) multiprocessors. Their differences come from the architectural philosophy they are based on. In symmetric shared memory multiprocessors every processor has its own cache, and all the processors and memory modules attach to the same interconnect, which is a shared bus. ccNUMA is a relatively new system topology that is the foundation for next-generation shared memory multiprocessor systems. As in UMA systems, ccNUMA systems maintain a unified, global coherent memory and all resources are managed by a single copy of the operating system. A hardware-based cache coherency scheme ensures that data held in memory is consistent on a system-wide basis. In contrast to symmetric shared memory multiprocessor systems, in which all memory accesses are equal in latency, in ccNUMA systems memory latencies are not all equal, or uniform (hence the name Non-Uniform Memory Access). Accesses to memory addresses located on "far" modules take longer than those made to "local" memory.
This paper addresses the problem of designing scalable, practical FIFO queues for shared memory multiprocessor systems.
First, we present a non-blocking FIFO queue algorithm. The algorithm is very simple: it implements the FIFO queue as a circular array and introduces two new algorithmic mechanisms that we believe can be of general use in the design of efficient non-blocking algorithms for multiprocessor systems. The first mechanism keeps the contention on key variables generated by concurrent enqueue and/or dequeue operations at low levels; contention on shared variables degrades performance not only in the memory banks where the variables are located but also in the processor-memory interconnection network. The second mechanism deals with the pointer recycling (also known as ABA) problem, a problem that all non-blocking algorithms based on the compare-and-swap primitive have to address. The performance improvements are due to these two mechanisms and to the simplicity of the algorithm, which comes from the simplicity and richness of the structure of circular arrays. We chose the compare-and-swap primitive because it scales well under contention and either is supported by modern multiprocessors or can be implemented efficiently on them. Last, we evaluate the performance of our algorithm on a 64-node SUN Enterprise 10000 multiprocessor and a 64-node SGI Origin 2000. The SUN system is a Uniform Memory Access (UMA) multiprocessor system while the SGI system is a Cache-Coherent Nonuniform Memory Access (ccNUMA) one. The SUN Enterprise 10000 supports compare-and-swap while the SGI Origin 2000 does not. The experiments clearly indicate that our algorithm considerably outperforms the best of the known alternatives on both UMA and ccNUMA machines with respect to both dedicated and multiprogramming workloads.
Second, the experimental results also give a better insight into the performance and scalability of non-blocking algorithms on both UMA and ccNUMA large scale multiprocessors with respect to dedicated and multiprogramming workloads. They confirm that non-blocking algorithms can perform better than blocking ones on both UMA and ccNUMA large scale multiprocessors, and that their performance and scalability advantage increases as the level of multiprogramming increases.
Concurrent FIFO queue data structures are fundamental data structures used in many multiprocessor programs and algorithms and, as can be expected, many researchers have proposed non-blocking implementations for them. Lamport [6] introduced a wait-free queue that does not allow more than one enqueue operation or dequeue operation at a time. Herlihy and Wing [4] presented an algorithm for a non-blocking linear FIFO queue which requires an infinite array. Prakash, Lee and Johnson [11] presented a non-blocking and linearisable queue algorithm based on a singly-linked list. Stone describes a non-blocking algorithm based on a circular queue. Massalin and Pu [8] present a non-blocking array-based queue which requires the double-compare-and-swap atomic primitive, which is available only on some members of the Motorola 68000 family of processors. Valois [12] presents a non-blocking queue algorithm together with several other non-blocking data structures; his queue is an array-based one. Michael and Scott [10] presented a non-blocking queue based on a singly-linked list, which is the most efficient and scalable non-blocking algorithm among those mentioned above.
The remainder of the paper is organised as follows. In Section 2 we give a brief introduction to shared memory multiprocessors. Section 3 presents our algorithm together with a proof sketch. In Section 4, the performance evaluation of our algorithm is presented. The paper concludes with Section 5.
SHARED MEMORY MULTIPROCESSORS: ARCHITECTURE AND SYNCHRONIZATION
There are two main classes of shared memory multiprocessors: the Cache-Coherent Nonuniform Memory Access (ccNUMA) multiprocessors and the symmetric multiprocessors. The most familiar design for shared memory multiprocessor systems is the "fixed bus" or shared-bus multiprocessor system. The bus is a path shared by all processors but usable by only one at a time to handle transfers between CPU and memory. By communicating on the bus, all CPUs share all memory requests and can synchronise their local cache memories. Such systems include the Silicon Graphics Challenge/Onyx systems, OCTANE, Sun's Enterprise (300-6000), Digital's 8400, and many others; most server vendors offer such systems.
Central Crossbar
Mainframes and supercomputers have often used a crossbar "switch" to build shared memory multiprocessor systems with higher bandwidth than is feasible with busses; the switch allows multiple paths to be active at once. Such systems include most mainframes, the CRAY T90, and Sun's new Enterprise 10000. Figure 1 (a) graphically describes the architecture of the new SUN Enterprise 10000. Shared-bus and central crossbar systems are usually called UMAs, or Uniform Memory Access systems; that is, any CPU is equally distant in time from all memory locations. Uniform memory access shared memory multiprocessors dominate the server market and are becoming more common on the desktop. The price of these systems rises quite fast as the number of processors increases.
ccNUMA is a relatively new system topology that is the foundation for many next-generation shared memory multiprocessor systems. Based on "commodity" processing modules and a distributed, but unified, coherent memory, ccNUMA extends the power and performance of shared memory multiprocessor systems while preserving the shared memory programming model. As in UMA systems, ccNUMA systems maintain a unified, global coherent memory and all resources are managed by a single copy of the operating system. A hardware-based cache coherency scheme ensures that data held in memory is consistent on a system-wide basis. I/O and memory scale linearly as processing modules are added, and there is no single backplane bus. The nodes are connected by an interconnect, whose speed and nature vary widely. Normally, the memory "near" a CPU can be accessed faster than memory locations that are "further away". This attribute leads to the "Non" in Non-Uniform.
A widely available hardware synchronisation primitive that can be found on many common architectures is compare-and-swap. The compare-and-swap primitive takes as arguments the pointer to a memory location, and old and new values.
As can be seen from Figure 3, which describes the specification of the compare-and-swap primitive, it atomically checks the contents of the memory location that the pointer points to and, if they are equal to the old value, updates the memory location to the new value. In either case, it returns a boolean value that indicates whether it has succeeded or not. The IBM System 370 was the first computer system that introduced compare-and-swap. SUN Enterprise 10000 is one of the systems that support this hardware primitive. Some newer architectures, SGI Origin 2000 included, introduce the load-linked/store-conditional instruction, with which the compare-and-swap primitive can be implemented. Load-linked/store-conditional consists of two simpler operations, load-linked and store-conditional. The load-linked operation loads a word from memory into a register. The matching store-conditional stores a possibly new value back into the memory word, unless the value at the memory word has been modified in the meantime by another process. If the word has not been modified, the store succeeds and a 1 is returned. Otherwise the store-conditional fails, the memory is not modified, and a 0 is returned. The specification of this operation is shown in a separate figure.
The compare-and-swap primitive, though, gives rise to the pointer recycling (also known as ABA) problem. The ABA problem arises when a process p reads the value A from a shared memory location, computes a new value based on A, and then uses compare-and-swap to update the same memory location after checking that the value in this memory location is still A, mistakenly concluding that no operation changed the value of this memory location in the meantime. But between the read and the compare-and-swap operation, other processes may have changed the contents of the memory location from A to B and then back to A again.
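To make the semantics above concrete, the following is a minimal Python sketch of compare-and-swap and of the ABA scenario. It is an illustration under stated assumptions, not the specification given in Figure 3: the class name Cell is hypothetical, and a lock merely stands in for the atomicity that the hardware primitive provides, so the sketch models the semantics and is not itself non-blocking.

```python
import threading


class Cell:
    """A shared memory word with an emulated compare-and-swap (hypothetical name).

    The lock only models the atomicity of the hardware primitive.
    """

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, old, new):
        # Atomically: if the word still holds `old`, store `new`.
        with self._lock:
            if self.value == old:
                self.value = new
                return True
            return False


cell = Cell("A")
seen = cell.value                  # process p reads A
cell.compare_and_swap("A", "B")    # another process changes A to B ...
cell.compare_and_swap("B", "A")    # ... and then back to A
# p's compare-and-swap succeeds although the word was changed twice.
aba_undetected = cell.compare_and_swap(seen, "C")
```

The final compare-and-swap cannot distinguish "nothing happened" from "A became B and then A again", which is exactly the situation the paragraph above describes.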
In this scenario the compare-and-swap primitive fails to detect the existence of operations that changed the value of the memory location; in many non-blocking implementations of shared data structures this is something that we would like to be able to detect without having to use the read_modify_write operation, which has very high latency and creates high contention. A common solution to the ABA problem is to split the shared memory location into two parts: a part for a modification counter and a part for the data. In this way, when a process updates the memory location, it also increments the counter in the same atomic operation. There are several drawbacks to such a solution. The first is that the effective word length decreases, as the counter now occupies part of the word. The second is that when the counter wraps around there is a possibility for the ABA scenario to occur, especially in systems with many fast processors such as the systems that we are studying. In this paper we present a new, very simple and efficient technique to overcome the ABA problem; the technique is described in the next section together with the algorithm.
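The counter-based solution and its two drawbacks can be sketched as follows. This is a hedged illustration only: the helper names pack, unpack and update are hypothetical, and the counter width of 2 bits is an assumption chosen deliberately small so that the wrap-around is easy to observe.

```python
COUNTER_BITS = 2                      # assumption: tiny counter to expose wrap-around
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack(data, counter):
    """Store data and modification counter in one word (data loses COUNTER_BITS bits)."""
    return (data << COUNTER_BITS) | (counter & COUNTER_MASK)


def unpack(word):
    return word >> COUNTER_BITS, word & COUNTER_MASK


def update(word, new_data):
    """The word a successful compare-and-swap would install: new data, counter + 1."""
    _, counter = unpack(word)
    return pack(new_data, counter + 1)


A, B = 5, 9
w0 = pack(A, 0)

# A -> B -> A: the data is A again, but the counter differs, so a
# compare-and-swap expecting w0 would fail and the change is detected.
w2 = update(update(w0, B), A)
change_detected = w2 != w0

# After 2**COUNTER_BITS updates the counter wraps and the word repeats.
w = w0
for _ in range(2 ** COUNTER_BITS // 2):
    w = update(update(w, B), A)
wrapped_back = w == w0
```

A wider counter postpones the wrap-around but takes more bits away from the data, which is the trade-off the text points out.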
THE ALGORITHM
During the design phase of any efficient non-blocking data structure, a large effort is spent on guaranteeing the consistency of the data structure without generating many interconnection transactions. The reason for this is that the performance of any synchronisation protocol for multiprocessor systems heavily depends on the interconnection transactions that it generates. A high number of transactions causes a degradation in the performance of memory banks and the processor/memory interconnection network.
As a first step, when designing the algorithm presented here, we tried to use simple synchronisation instructions (primitives) with low latency that do not generate a lot of coherence traffic but are still powerful enough to support the high-level synchronisation needed for the non-blocking implementation of a FIFO queue. In the construction described in this paper we chose the compare-and-swap atomic primitive since it meets the three important goals that we were looking for. First, it is a quite powerful primitive and, when used together with simple read and write registers, is sufficient for building a non-blocking implementation of any "interesting" shared data structure [3]. Second, it is either supported by modern multiprocessors or can be implemented efficiently on them. Finally, it does not generate a lot of coherence traffic. The only problem with the compare-and-swap primitive is that it gives rise to the pointer recycling (also known as ABA) problem.
As a second step, we tried to use the compare-and-swap operation as little as possible. The compare-and-swap operation is an efficient synchronisation operation, and its latency increases linearly with the number of processors that use it concurrently, but it is still a transactional operation that generates coherence traffic. Read or update operations, on the other hand, require a single message in the interconnection network and do not generate coherence traffic.
As a third step, we propose a simple new solution to the ABA problem that does not generate a lot of coherence traffic and does not restrict the size of the queue.
Figure 6 and Figure 7 present commented pseudo-code for the new non-blocking queue algorithm. The algorithm is simple and practical, and we were surprised not to find it in the literature.
The non-blocking queue is implemented as a circular array with a head and a tail pointer and a special shared variable called vnull. The vnull shared variable has been introduced to help us avoid the ABA problem, as we are going to see at the end of this section. During the design phase of the algorithm we realised that: i) we could use the structural properties of a circular array to reduce the number of compare-and-swap operations that our algorithm uses as well as to overcome the ABA problem more efficiently, and ii) all previous non-blocking implementations were trying to guarantee that the tail and the head pointers always show the real head and tail of the queue, but by allowing the tail and head pointers to lag behind we could reduce the number of compare-and-swap operations even further, asymptotically close to the optimum.
We assume that enqueue operations insert data at the tail of the queue and dequeue operations remove data from the head of the queue if the queue is not empty. In the algorithm presented here we allow the head and the tail pointers to lag at most m behind the actual head and tail of the queue; in this way only one in every m operations has to consistently adjust the tail or head pointer by performing a compare-and-swap operation. Since we implement the queue as a circular array, each queue operation that successfully enqueues or dequeues data knows the index of the array where the data have been placed or taken from, respectively. If this index is divisible by m, the operation will try to update the head/tail of the queue; otherwise it will skip the step of updating the head/tail and let the head/tail lag behind the actual head/tail. In this way, the amortised number of compare-and-swap operations for an enqueue or dequeue operation is only 1 + 1/m; one compare-and-swap operation per enqueue/dequeue operation is necessary.
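The lagging-pointer idea can be illustrated with a deliberately simplified, single-threaded sketch of the enqueue side only. It is not the pseudo-code of Figure 6: the class name LaggingTailArray and the cas_count field are hypothetical, plain assignments stand in for the compare-and-swap operations a concurrent version would issue, and the head pointer, dequeue and the empty-cell markers are omitted.

```python
class LaggingTailArray:
    """Enqueue side of a circular array whose tail pointer may lag (sketch only)."""

    def __init__(self, length, m):
        self.cells = [None] * length   # None marks an empty cell in this sketch
        self.tail = 0                  # shared tail pointer, allowed to lag
        self.m = m
        self.cas_count = 0             # CAS operations a concurrent version would issue

    def enqueue(self, item):
        length = len(self.cells)
        idx = self.tail
        scanned = 0
        # Plain reads: walk forward from the lagging tail to the actual tail.
        while self.cells[idx] is not None:
            idx = (idx + 1) % length
            scanned += 1
            if scanned == length:
                raise OverflowError("queue is full")
        self.cells[idx] = item         # stands in for a CAS on the array cell
        self.cas_count += 1
        if idx % self.m == 0:          # only indices divisible by m touch the tail
            self.tail = (idx + 1) % length   # stands in for a CAS on the tail
            self.cas_count += 1
        return idx


q = LaggingTailArray(length=64, m=4)
for i in range(32):
    q.enqueue(i)
```

With m = 4, the 32 enqueues account for 32 + 8 = 40 simulated compare-and-swap operations, that is 1 + 1/m per operation, while the tail pointer trails the actual tail by fewer than m cells.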
The drawback that such a technique introduces is that each operation will need on average m/2 more read operations to find the actual head or tail of the queue. But if we fix m so that the latency of (m - 1)/m compare-and-swap operations is larger than the latency of m/2 read operations, there will be a performance gain from the algorithm, and these performance gains will increase as the number of processes increases.
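The trade-off in choosing m can be written down as a small calculation. The sketch below only restates the inequality from the text; the function names are hypothetical and the latency figures used in the example are invented for illustration, not measurements from either machine.

```python
def amortised_cas(m):
    """Average compare-and-swap operations per queue operation: 1 + 1/m."""
    return 1 + 1 / m


def extra_reads(m):
    """Average additional reads needed to find the actual head or tail: m/2."""
    return m / 2


def lagging_pays_off(m, cas_latency, read_latency):
    """True when (m - 1)/m CAS latencies outweigh m/2 read latencies."""
    saved = (m - 1) / m * cas_latency
    added = extra_reads(m) * read_latency
    return saved > added


# Hypothetical latencies in arbitrary time units.
small_m_gain = lagging_pays_off(4, cas_latency=10.0, read_latency=1.0)
large_m_gain = lagging_pays_off(64, cas_latency=10.0, read_latency=1.0)
```

Under these assumed latencies a small m is beneficial while a very large m is not, because the cost of the extra reads grows linearly in m whereas the saving is bounded by one compare-and-swap.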
It is definitely true that array-based queues are inferior to link-based queues in that they require an inflexible maximum queue size. On the other hand, they do not require the memory management schemes that link-based queue implementations need, and they benefit from spatial locality significantly more than link-based queues. Taking these into account, and having a simple, fast and practical implementation in mind, we decided to use a circular array in our construction.
We have used the compare-and-swap primitive to atomically swing the head and tail pointers from their current value to a new one. For the SGI Origin 2000 system we had to emulate the compare-and-swap atomic primitive with the load-linked/store-conditional instruction; this implementation is shown in Figure 4. However, using compare-and-swap in this manner is susceptible to the ABA problem. In the past, researchers have proposed attaching a counter to each pointer, reducing in this way the size of the memory that these pointers can efficiently point at. In this paper we observe that the circular array itself works like a counter mod l, where l is the length of the circular array, and we can fix l to be arbitrarily large. In this way, by designing the queue as a circular array, we overcome the ABA problem the same way the counters do but without having to attach expensive counters to the pointers, which would restrict the pointer size. When an enqueue operation takes place, the tail changes in one direction and goes back to zero when it reaches the end of the array. Hence, the tail will change back to the same old value after the shared object finishes l enqueue operations and not after two successive operations (exactly as when using a counter mod l). The same also holds for the dequeue operations.
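The observation that the array index behaves like a counter mod l can be checked with a few lines of Python. The function names are hypothetical and the array length of 8 is an arbitrary choice for the example.

```python
def advance(index, length):
    """Move a head or tail index one cell forward in a circular array."""
    return (index + 1) % length


def steps_until_repeat(start, length):
    """Number of advances before the index shows the value `start` again."""
    index, steps = advance(start, length), 1
    while index != start:
        index = advance(index, length)
        steps += 1
    return steps


l = 8                                   # assumed array length for the example
repeat_after = steps_until_repeat(3, l)
```

A stale compare-and-swap on the tail can therefore only succeed by mistake after l intervening enqueue operations, which is the same protection a counter mod l attached to the pointer would give.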
The atomic operations on the array are other potential places where the ABA problem can take place giving rise to the following scenario:
1. the array location x is the actual tail of the queue and its content is NULL
In order to overcome this specific ABA instance, instead of using a counter with all its negative side-effects, we introduce a new simple mechanism that we were surprised not to find in the literature. The idea is very simple: instead of using one value to describe that an entry in the array is empty, we use two, NULL(0) and NULL(1). When a processor dequeues an item, it will swap into the cell one of the two NULLs in such a way that two consecutive dequeue operations on the same cell give different NULL values to the cell. Returning to the ABA scenario described above, the scenario would now look like this:
With this mechanism the ABA scenario that was taking place before, when a process was preempted by only one other process, now changes to an ABA'B'A scenario. The ABA'B'A scenario is still a pointer recycling problem, but in order for it to take place l dequeue operations are needed to take the system from A to B and subsequently to A'; after that, l more dequeue operations are needed to take the system from A' to B' and then to A. Moreover, all these operations have to take place while the process that will experience the pointer recycling is preempted. Taking into account that l is an arbitrarily large number, the probability that the above ABA [...]
For the rest of this paper we assume that we have selected l to be large enough not to give rise to the pointer recycling problem in our system.
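The effect of alternating between two empty markers can be sketched as follows. This is an assumption-laden illustration and not the pseudo-code of Figures 6 and 7: the class Null, the helpers enqueue_into and dequeue_from, and the choice to store a tag next to each item (remembering which marker the item replaced) are all hypothetical.

```python
class Null:
    """One of the two distinguishable 'empty' markers (hypothetical class)."""

    def __init__(self, tag):
        self.tag = tag


NULLS = (Null(0), Null(1))             # NULL(0) and NULL(1)


def enqueue_into(cell_value, item):
    """Value an enqueue would install in an empty cell: the item plus the
    tag of the marker it replaces (assumption of this sketch)."""
    assert isinstance(cell_value, Null)
    return (item, cell_value.tag)


def dequeue_from(cell_value):
    """Return the item and the marker to leave behind, which is the marker
    the cell did NOT hold before the item was enqueued."""
    item, tag = cell_value
    return item, NULLS[1 - tag]


cell = NULLS[0]
seen = cell                            # a slow enqueuer reads NULL(0), then is preempted

cell = enqueue_into(cell, "x")         # other processes enqueue ...
item, cell = dequeue_from(cell)        # ... and dequeue on the same cell
stale_cas_would_succeed = cell is seen # the cell now holds NULL(1), not NULL(0)

cell = enqueue_into(cell, "y")         # only a second full cycle on this cell
item2, cell = dequeue_from(cell)       # brings NULL(0) back (the ABA'B'A case)
recycled_after_two_cycles = cell is seen
```

With a single empty marker the stale compare-and-swap would already succeed after the first cycle; with two markers it takes two complete cycles on the same cell while the slow process is preempted.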
The accessing of the shared object is modelled by a history h. A history h is a finite (or infinite) sequence of operation invocation and response events. Any response event is preceded by the corresponding invocation event. In our case there are two different operations that can be invoked, an Enqueue operation or a Dequeue operation. An operation is called complete if there is a response event in the same history h; otherwise, it is said to be pending. A history is called complete if all its operations are complete. In a global time model each operation q "occupies" a time interval [s_q, f_q] on one linear time axis (s_q < f_q); we can think of s_q and f_q as the starting and finishing time instants of q. During this time interval the operation is said to be pending. There exists a precedence relation on the operations in a history, denoted by <_h, which is a strict partial order: q1 <_h q2 means that q1 ends before q2 starts. Operations incomparable under <_h are called overlapping. A complete history h is linearisable if the partial order <_h on its operations can be extended to a total order →_h that respects the specification of the object [4].
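The precedence relation and the notion of overlapping operations translate directly into code. The following is a small sketch under the global time model described above; the type name Op and the example time instants are hypothetical.

```python
from typing import NamedTuple


class Op(NamedTuple):
    """An operation q occupying the interval [start, finish] on the time axis."""
    name: str
    start: float    # s_q
    finish: float   # f_q


def precedes(q1, q2):
    """q1 <_h q2: q1 ends before q2 starts."""
    return q1.finish < q2.start


def overlapping(q1, q2):
    """Operations that are incomparable under <_h."""
    return not precedes(q1, q2) and not precedes(q2, q1)


enq_x = Op("Enqueue(x)", 0.0, 2.0)
enq_y = Op("Enqueue(y)", 3.0, 4.0)
deq = Op("Dequeue", 1.0, 5.0)
```

Here Enqueue(x) precedes Enqueue(y), while the Dequeue overlaps both, so a linearisation is free to order the Dequeue before, between or after the two enqueues as long as the queue specification is respected.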
Any possible history produced by our implementation can be mapped to a history where operations use an auxiliary array that is not bounded on the right side. In order to simplify the proof we will use this new auxiliary array. Our algorithm guarantees that enqueue operations enqueue data at consecutive array entries from left to right on this array, and dequeue operations dequeue items also from left to right. In this way it makes sure that items are dequeued in the order in which they were enqueued. From the previous theorem we also have that when an Enqueue(x) operation finishes after writing x to some entry e of the array, the head pointer of our implementation, which guides the dequeue operations, will not pass this entry e, thus making sure that no enqueued item is going to be lost. The above sketches a proof for the next theorem:
Theorem 2. In a complete history such that Enqueue(x) → Enqueue(y), either Dequeue(x) → Dequeue(y) or Dequeue(y) and Dequeue(x) overlap.
We should point out that the technique of using 2 different NULL values can be extended to k different values, in which case more than k·l dequeue operations must take place while an operation is preempted in order to cause the pointer recycling problem. We think that the scheme with 2 NULL values is simple enough and sufficient for the systems that we are looking at.
The dequeue operation that dequeues x is the only one that succeeds in reading and "emptying" the array entry where x was written, because of the atomic compare-and-swap operation; this ensures that no other operation dequeues the same item. Moreover, since the array entry was written by an enqueue operation, dequeue operations will always dequeue items that have "really" been enqueued. The above sketches the proof of the following theorem:
Theorem 3. If x has been dequeued, then it was enqueued, and Enqueue(x) → Dequeue(x).
The last 2 theorems guarantee the linearisable behaviour of our FIFO queue implementation [4]. Due to space constraints, we have only sketched the proofs of these theorems.
PERFORMANCE EVALUATION
We implemented our algorithm and conducted our experiments on a SUN Enterprise 10000 with 64 250MHz UltraSPARC processors and on an SGI Origin 2000 with 64 195MHz MIPS R10000 processors. The SUN multiprocessor is a symmetric multiprocessor while the SGI multiprocessor is a ccNUMA one. To ensure the accuracy of the results, we had exclusive access to the multiprocessors while conducting the experiments. For the tests we compared the performance of our algorithm (new) with the performance of the algorithm by Michael and Scott (MS) [10], because their algorithm appears to be the best non-blocking FIFO queue algorithm. In our experiments we also included a solution based on locks (ordinary lock) to demonstrate the superiority of non-blocking solutions over blocking ones.
Experiments on SUN Enterprise 10000
We conducted 3 experiments on the SUN multiprocessor, and in all of them we had exclusive use of the machine. In the first experiment we measured the average time taken by all processors to perform one million pairs of enqueue/dequeue operations. In this experiment (Figure 8a) a process enqueues an item, then dequeues an item, and then repeats. In the second experiment (Figure 8b) processes stay idle for some random time between every two consecutive queue operations. In the third experiment we used parallel quick-sort, which uses a queue data structure, to evaluate the performance of the three queue implementations. Parallel quick-sort had to sort 10 million randomly generated keys. The results of this experiment are shown in Figure 8c. The horizontal axis in the figures represents the number of processors, while the vertical one represents execution time normalised to that of the Michael and Scott algorithm.
The first two experiments (on 58 processors) show that the new algorithm outperforms the MS algorithm by more than 30% and the spin-lock algorithm by more than 50%. The third experiment shows that the new queue implementation offers 40% better response time to the sorting algorithm.
Experiments on the SGI multiprocessor
On the SGI machine, the first three experiments were basically the same experiments that we performed on the SUN multiprocessor. The only difference is that on the SGI machine we could choose to use the system as a dedicated system (multiprogramming level one) or as a multiprogrammed system with two and three processes per processor (multiprogramming levels two and three respectively). For the SUN multiprocessor this was not possible. Figures 9, 10 and 11a show the performance results graphically. What is very interesting is that our algorithm gives almost the same performance improvements on both machines.
Figure 7: The dequeue operation
On the SGI multiprocessor, it was also possible to use the Radiosity application from the SPLASH-2 shared-address-space parallel applications [13]. Figure 11b shows the performance improvement compared with the original SPLASH-2 implementation. The vertical axis represents execution time normalised to that of the SPLASH-2 implementation.
CONCLUSIONS
In this paper we presented a new bounded non-blocking concurrent FIFO queue algorithm for shared memory multiprocessor systems. The algorithm is simple and introduces two new simple algorithmic mechanisms that can be of general use in the design of efficient non-blocking algorithms. The experiments clearly indicate that our algorithm considerably outperforms the best of the known alternatives on both UMA and ccNUMA machines with respect to both dedicated and multiprogramming workloads. The experimental results also give a better insight into the performance and scalability of non-blocking algorithms on both UMA and ccNUMA large scale multiprocessors with respect to dedicated and multiprogramming workloads, and they confirm that non-blocking algorithms can perform better than blocking ones on both UMA and ccNUMA large scale multiprocessors.
