Abstract: Multicore architectures have established themselves as the new generation of computer architectures. As part of the evolution from one core to many cores, memory access mechanisms have advanced rapidly. Several new memory access mechanisms have been implemented in many modern commodity multicore architectures. By specifying how processing cores access shared memory, memory access mechanisms directly influence the synchronization capabilities of multicore architectures. Therefore, it is crucial to investigate the synchronization power of these new memory access mechanisms. This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures such as the Compute Unified Device Architecture (CUDA). We first define three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that the coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore architectures, without the need for synchronization primitives other than reads and writes. In the case of the contemporary CUDA processors, our results imply that the coalesced memory access mechanisms have consensus numbers up to 64.

INTRODUCTION
One of the fastest evolving multicore architectures is the graphics processor. The computational power of graphics processors (GPUs) doubles every 10 months, surpassing Moore's Law for traditional microprocessors [2]. Unlike previous GPU architectures, which are single-instruction multiple data (SIMD), recent GPU architectures (e.g., Compute Unified Device Architecture (CUDA) [3]) are single-program multiple data (SPMD). The latter consist of multiple SIMD multiprocessors, each of which can execute a different instruction at the same time. This extends the set of applications on GPUs, which are no longer restricted to the SIMD programming model. Consequently, GPUs are emerging as powerful computational coprocessors for general-purpose computations.
Along with their advances in computational power, GPUs' memory access mechanisms have also evolved rapidly. Several new memory access mechanisms have been implemented in current commodity graphics/media processors such as CUDA [3] and the Cell BE architecture [4]. For instance, in CUDA, single-word write instructions can write to words of different sizes, and the size (in bytes) is no longer restricted to be a power of 2 [3]. Another advanced memory access mechanism implemented in CUDA is the coalesced global memory access mechanism.
The simultaneous global memory accesses by threads of an SIMD multiprocessor, during the execution of a single read/write instruction, are coalesced into a single aligned memory access if the simultaneous accesses follow the coalescence constraint [3] . It is well known that by specifying how processing cores access shared memory, memory access mechanisms directly influence the synchronization capabilities of multicore architectures. Therefore, it is crucial to investigate the synchronization power of the new memory access mechanisms.
Research on the synchronization power of memory access operations (or objects) in conventional architectures has received a great amount of attention in the literature. The synchronization power of memory access objects/mechanisms is conventionally determined by their consensus-solving ability, namely, their consensus number [5], [6]. The consensus number of an object type is either the maximum number of processes for which wait-free consensus can be solved using only objects of this type and registers, or infinity if such a maximum does not exist. An object is universal in a system of $n$ processes if and only if it has consensus number $n$ or higher. For hard real-time systems, it has been shown that any object with consensus number $n$ is universal for an arbitrary number of processes running on $n$ processors [7]. For systems that allow processes to simultaneously access $m$ objects of type $T$ in one atomic operation (or multiobject operation), where $T$ has a consensus number at least two, upper and lower bounds on the consensus number of the multiobject type called $T^m$ have been provided [8], [9], [10]. In the case of registers (which have consensus number one), the m-register assignment, which allows processes to write to $m$ arbitrary registers atomically, has been proven to have consensus number $(2m - 2)$, for $m > 1$ [5]. Using the m-register assignment, we can construct $(2m - 3)$-resilient read-modify-write objects [11]. An object implementation is t-resilient if nonfaulty processes can complete their operations on the object as long as no more than $t$ processes fail [12], [13].
Note that the aforementioned CUDA coalesced memory accesses are neither the atomic m-register assignment [5] nor the multiobject types [8] , [9] , [10] . They are not the atomic m-register assignment since they do not allow processes to atomically write to m arbitrary memory words; instead, processes can atomically write to m memory words only if the m memory words are located within an aligned size-bounded memory portion (i.e., memory alignment restriction) (cf. Section 2). The CUDA coalesced memory accesses are not the multiobject type since their base object type T is the conventional memory word, which has consensus number one.
This paper investigates the consensus number of the new memory access mechanisms implemented in current graphics processor architectures. We first define three new memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their respective consensus number. These tight results show that the new memory access mechanisms can facilitate strong synchronization between the threads of multicore architectures, without the need of synchronization primitives other than reads and writes.
We first define a new memory access model, the svword model, where svword stands for the size-varying word access, the first of the two aforementioned advanced memory access mechanisms implemented in CUDA. Unlike single-word assignments in conventional architectures, the new single-word assignments can write to words of size b (in bytes), where b can vary from 1 to an upper bound B and b is no longer restricted to be a power of 2 (e.g., built-in type float3 in [3] ). By carefully choosing b for the single-word assignments, we can partly overlap the bytes written by two assignments, namely, each of the two assignments has some byte(s) that is not overwritten by the other overlapping assignment (cf. Fig. 1a for an illustration). Note that words of size d must always start at addresses that are multiples of d, which is called alignment restriction as defined in the conventional computer architecture. The alignment restriction prevents single-word assignments in conventional architectures from partly overlapping each other since the word size is restricted to be a power of 2. This observation has motivated us to develop the svword model.
Inspired by the coalesced memory accesses, the second of the aforementioned advanced memory access mechanisms, we define two other models, the aiword and asvword models, to capture the fundamental features of the mechanism. In CUDA, the global shared memory is considered to be partitioned into segments of equal size and aligned to this size. Simultaneous memory accesses to the same segment by threads of an SIMD multiprocessor (or half-warp in CUDA terms [14]), during the execution of a single read/write instruction, can be coalesced into a single memory access. The coalescence happens even if some of the threads do not actually access memory (cf. [3, Fig. 5-1] or Fig. 1c). This allows an SIMD multiprocessor (or a process) to atomically write to multiple memory locations (within a segment) that are not at consecutive addresses. Accesses to the same segment by different processes are executed sequentially.
We generally model this mechanism as an aligned-inconsecutive-word access, aiword, in which the memory is aligned to A-unit words and a single-word assignment can write to an arbitrary nonempty subset of the $A$ units of a word. Note that the single-aiword assignment is not the atomic m-register assignment [5] due to the memory alignment restriction.¹ Our third model, asvword, is an extension of the second model aiword in which aiword's $A$ memory units are now replaced by $A$ svwords of the same size $b$. This model is inspired by the fact that the read/write instructions of different coalesced global memory accesses can access words of different sizes [3].
The contributions of this paper can be summarized as follows:
• We develop a general memory access model, the svword model, to capture the fundamental features of the size-varying word accesses. In this model, a single-word assignment can write to a word composed of $b$ consecutive memory units, where $b$ can be any integer between 1 and an upper bound $B \ge 2$. We prove that the single-svword assignment has consensus number exactly 3 when $B \ge 5$, consensus number 2 when $B \in \{3, 4\}$, and consensus number 1 when $B = 2$. We also introduce a technique to minimize the size of (proposal) values in consensus algorithms, which allows a single-word assignment to write many values atomically and handle the consensus problem for several processes (cf. Section 4).
• We develop a general memory access model, the aiword model, to capture the fundamental features of the coalesced memory accesses. The second model is an aligned-inconsecutive-word access model in which the memory is aligned to A-unit words and a single-word assignment can write to an arbitrary nonempty subset of the $A$ units of a word. We present a wait-free consensus algorithm for $N = \lfloor \frac{A+1}{2} \rfloor$ processes using only single-aiword assignments, and subsequently, prove that the single-aiword assignment has consensus number exactly $N = \lfloor \frac{A+1}{2} \rfloor$.
• We develop a general memory access model, asvword, to capture the fundamental features of the combination of the size-varying word accesses and the coalesced memory accesses. The third model is an extension of the second model aiword in which aiword's $A$ units are $A$ svwords of the same size $b$, $b \in \{1, B\}$ (cf. Section 6). We prove that the consensus number of the single-asvword assignment is exactly $N$, where $N$ depends on $A$ and $B$.

¹ In this paper, we use the term "single" in single-*word assignment when we want to emphasize that the assignment is not the multiple assignment [5].

In the case of the contemporary CUDA processors (with compute capability 1.2 and higher), in which $A = 32$ and $B = 4$, the consensus number of the asvword model is 64.

The rest of this paper is organized as follows: Section 2 presents the three new memory access models. Sections 4, 5, and 6 present the exact consensus numbers of the first, second, and third models, respectively. Finally, Section 7 concludes this paper.
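The consensus numbers stated in the contributions for the svword and aiword models can be tabulated with a short sketch. The function names below are hypothetical helpers introduced only for illustration; they restate the bounds quoted above and do not implement any consensus algorithm. The asvword model is left out because its closed form is not reproduced in this section.

```python
def svword_consensus_number(B: int) -> int:
    """Consensus number of the single-svword assignment for upper bound B >= 2."""
    if B < 2:
        raise ValueError("the svword model assumes B >= 2")
    if B >= 5:
        return 3
    if B >= 3:          # B in {3, 4}
        return 2
    return 1            # B == 2


def aiword_consensus_number(A: int) -> int:
    """Consensus number floor((A + 1) / 2) of the single-aiword assignment."""
    if A < 1:
        raise ValueError("an aiword consists of at least one unit")
    return (A + 1) // 2
```

For instance, under these formulas an aiword of $A = 8$ units (the size used in Fig. 1c) gives consensus number 4.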
MODELS
General Descriptions
Before describing the details of each of the three new memory access models, we present the common properties of all these three models. The shared memory in the three new models is sequentially consistent [15] , [16] , which is weaker than the linearizable one [17] assumed in most of the previous research on the synchronization power of the conventional memory access models [5] . Processes are asynchronous. The new models use the conventional one-dimensional memory address space. In these models, one memory unit is a minimum number of consecutive bytes/bits which a basic read/write operation can atomically read from/write to. These memory models address individual memory units. Memory is organized so that a group of d consecutive memory units called word can be stored or retrieved in a single basic write or read operation, respectively, and d is called word size. Words of size d must always start at addresses that are multiples of d, which is called alignment restriction as defined in the conventional computer architecture.
The First Model Svword
The first model is a size-varying-word access model (svword) in which a single read/write operation can atomically read from/write to a word consisting of $b$ consecutive memory units, where $b$, called the svword size, can be any integer between 1 and an upper bound $B$. The upper bound $B$ is the maximum number of consecutive units which a basic read/write operation can atomically read from/write to. Svwords of size $b$ must always start at addresses that are multiples of $b$ due to the memory alignment restriction. We denote by b-svword an svword consisting of $b$ units, by b-svwrite a b-svword assignment, and by b-svread a b-svword read operation. Reading a unit $U$ is denoted by 1-svread($U$) or just by $U$ for short. This model is inspired by the CUDA graphics processor architecture in which basic read/write operations can atomically read from/write to words of different sizes (cf. built-in types float1, float2, float3, and float4 in [3, Section 4.3.1.1]). Fig. 1a illustrates how 2-svwrite, 3-svwrite, and 5-svwrite can partly overlap their units with addresses from 14 to 20, with respect to the memory alignment restriction.
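The alignment restriction and the partial overlap described above can be made concrete with a minimal sketch. The helper names `svword_units` and `partly_overlap` are assumptions made for illustration and do not come from the paper; addresses are plain integers counting memory units.

```python
def svword_units(start: int, b: int, B: int) -> range:
    """Units covered by a b-svword starting at `start`, enforcing the model's rules."""
    if not 1 <= b <= B:
        raise ValueError("svword size b must satisfy 1 <= b <= B")
    if start % b != 0:
        raise ValueError("a b-svword must start at an address that is a multiple of b")
    return range(start, start + b)


def partly_overlap(w1: range, w2: range) -> bool:
    """True if the words share a unit and each also has a unit the other lacks."""
    s1, s2 = set(w1), set(w2)
    return bool(s1 & s2) and bool(s1 - s2) and bool(s2 - s1)


# The layout of Fig. 1a on units 14..20, with B = 5.
w2 = svword_units(14, 2, B=5)   # units 14, 15
w5 = svword_units(15, 5, B=5)   # units 15..19
w3 = svword_units(18, 3, B=5)   # units 18, 19, 20
```

Here `partly_overlap(w2, w5)` and `partly_overlap(w5, w3)` hold, while `w2` and `w3` are disjoint. When all sizes are powers of 2, two aligned words are always nested or disjoint, which is why conventional single-word assignments cannot partly overlap.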
The Second Model Aiword
The second model is an aligned-inconsecutive-word access model (aiword) in which the memory is aligned to A-unit words and a single read/write operation can atomically read from/write to an arbitrary nonempty subset of the $A$ units of a word, where $A$ is a constant. Aiwords must always start at addresses that are multiples of $A$ due to the memory alignment restriction. We denote by A-aiword an aiword consisting of $A$ units, by A-aiwrite an A-aiword assignment, and by A-airead an A-aiword read operation. Reading only one unit $U$ (using airead) is denoted by $U$ for short. In the aiword model, an aiwrite operation executed by a process cannot atomically write to units located in different aiwords due to the memory alignment restriction (cf. Fig. 1b).

This model is inspired by the coalesced global memory accesses in the CUDA architecture [3]. The CUDA architecture can be generalized to an abstract model of an MIMD² chip with multiple SIMD cores sharing memory. Each core (or streaming multiprocessor, SM, in CUDA terms) executes $A$ identical instructions (on different data) simultaneously, but different cores can simultaneously execute different instructions. The sequence of instructions that are being executed by one SIMD core is called a process. Namely, each process consists of $A$ parallel threads that are running in an SIMD manner. The process accesses the shared memory using the CUDA memory access mechanism. In CUDA, the global shared memory is considered to be partitioned into segments of equal size and aligned to this size. Simultaneous memory accesses to the same segment by threads of an SIMD core during the execution of a single read/write instruction can be coalesced into a single memory access. The coalescence happens even if some of the threads do not actually access memory (cf. [3, Fig. 5-1] or Fig. 1c).
This allows an SIMD core (or a process consisting of $A$ parallel threads running in an SIMD manner) to atomically access multiple memory locations (within a segment) that are not at consecutive addresses. Accesses to the same segment by different processes are executed sequentially. Fig. 1c illustrates the coalesced memory access, where $A = 8$. The left SIMD core can write atomically to four memory locations 0, 3, 5, and 7 by letting only four of its eight threads $t_0$, $t_3$, $t_5$, and $t_7$ simultaneously execute a write operation (i.e., divergent threads). The right SIMD core can write atomically to its own memory location 1 and shared memory locations 3, 5, and 7 by letting only four threads $t_1$, $t_3$, $t_5$, and $t_7$ simultaneously execute a write operation. Note that the CUDA architecture allows threads from different SIMD cores to communicate through the global shared memory [18].
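The aiword restriction can be summarized as a simple predicate: a set of unit addresses can be written by one A-aiwrite only if it is nonempty and lies within a single aligned A-unit word. The sketch below uses a hypothetical helper name and models addresses as integers; it checks the restriction only and does not model atomicity or CUDA hardware.

```python
def can_aiwrite_atomically(addresses, A: int) -> bool:
    """True if all addresses fall in one aligned A-unit aiword (and there is at least one)."""
    units = set(addresses)
    if not units:
        return False                      # an aiwrite writes a nonempty subset
    words = {u // A for u in units}       # index of the aligned aiword holding each unit
    return len(words) == 1


# The two writes of Fig. 1c with A = 8.
left_core = [0, 3, 5, 7]
right_core = [1, 3, 5, 7]
```

Both write sets of Fig. 1c pass the check, whereas a set such as `[7, 8]` spans two aiwords and therefore cannot be written by a single aiwrite.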
The Third Model Asvword
The third model is a coalesced memory access model (asvword), an extension of the second model aiword, in which aiword's $A$ units are now replaced by $A$ svwords of the same size $b$, $b \in [1, B]$. Namely, the second model aiword can be considered a special case of the third model asvword where $B = 1$.

Let $A \times b$-asvword be the asvword that is composed of $A$ svwords each of which consists of $b$ memory units. $A \times b$-asvwords, whose size is $A \cdot b$, must always start at addresses that are multiples of $A \cdot b$ due to the memory alignment restriction. We denote by $A \times b$-asvwrite an $A \times b$-asvword assignment and by $A \times b$-asvread an $A \times b$-asvword read operation. Reading only one unit $U$ (using $A \times 1$-asvread) is denoted by $U$ for short. Due to the memory alignment restriction, an $A \times b$-asvwrite operation cannot atomically write to b-svwords located in different $A \times b$-asvwords. Since in reality $A$ and $B$ are powers of 2, in this model we assume that either $B = tA$ (in the case of $B \ge A$) or $A = tB$ (in the case of $B < A$), where $t \in \mathbb{N}^*$. For the sake of simplicity, we assume that $b \in \{1, B\}$ holds. A variant of the model in which $b = 2^c$, $c = 0, 1, \ldots, \log_2 B$, and $A$, $B$ are powers of 2, can be established from this model (cf. Section 6). Since both $A \times 1$-asvwords and $A \times B$-asvwords are aligned from the address base of the memory space, any $A \times B$-asvword can be aligned with $B$ $A \times 1$-asvwords, as shown in Fig. 2.

Fig. 2 illustrates the asvword model in which each dash-dotted rectangle/square represents an svword and each red/solid rectangle represents an asvword composed of eight svwords (i.e., $A = 8$). The two rows show the memory alignment corresponding to the size $b$ of svwords, where $b$ is 1 or 2 (i.e., $B = 2$), on the same 16 consecutive memory units with addresses from 0 to 15. An asvwrite operation can atomically write to some or all of the eight svwords of an asvword. Unlike the aiwrite operation in the second model, which can atomically write to at most 8 units (or $A$ units), the asvwrite operation in the third model can atomically write to 16 units (or $A \cdot B$ units) using a single $8 \times 2$-asvwrite operation (i.e., write to the whole set of eight 2-svwords, cf. row $b = 2$). For an $8 \times 1$-asvword on row $b = 1$, there are two methods to update it atomically using the asvwrite operation: 1) writing to the whole set of eight 1-svwords using a single $8 \times 1$-asvwrite (cf. SIMD core 1) or 2) writing to a subset consisting of four 2-svwords using a single $8 \times 2$-asvwrite (cf. SIMD core 2). However, if only one of the eight units of an $8 \times 1$-asvword (e.g., unit 14) needs to be updated and the other units (e.g., unit 15) must remain untouched, the only possible method is to write to the unit using a single $8 \times 1$-asvwrite (cf. SIMD core 3). The other method, which writes to one 2-svword using a single $8 \times 2$-asvwrite, will have to overwrite another unit that is required to stay untouched (cf. SIMD core 4).
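The Fig. 2 discussion can be replayed with a small sketch that computes which memory units a single asvwrite touches. The function `asvwrite_units` and its parameters (`word_index` for the position of the aligned asvword, `selected` for the chosen svword slots) are hypothetical names introduced for illustration only.

```python
def asvwrite_units(word_index: int, A: int, b: int, selected) -> set:
    """Units written by one A x b-asvwrite on the asvword with the given index.

    `selected` lists the svword slots (0..A-1) of that asvword to be written.
    """
    slots = set(selected)
    if not slots:
        raise ValueError("an asvwrite must write at least one svword")
    if any(not 0 <= s < A for s in slots):
        raise ValueError("svword slots must lie in 0..A-1")
    base = word_index * A * b             # asvwords start at multiples of A * b
    units = set()
    for s in slots:
        units.update(range(base + s * b, base + (s + 1) * b))
    return units


A = 8
whole_8x2 = asvwrite_units(0, A, 2, range(A))       # all 16 units 0..15
upper_half_8x1 = asvwrite_units(1, A, 1, range(A))  # 8x1-asvword covering units 8..15
upper_half_8x2 = asvwrite_units(0, A, 2, [4, 5, 6, 7])
only_14 = asvwrite_units(1, A, 1, [6])              # single 8x1-asvwrite on unit 14
spill_15 = asvwrite_units(0, A, 2, [7])             # 8x2-asvwrite also touches unit 15
```

In this sketch the two ways of updating the $8 \times 1$-asvword on units 8 to 15 yield the same unit set, while updating unit 14 through a 2-svword necessarily also writes unit 15.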
PRELIMINARY RESULTS ON WAIT-FREE CONSENSUS
This paper uses the conventional terminology from bivalency arguments [13], [5], [19]. The configuration of an algorithm at a moment in its execution consists of the value of every shared object and the internal state of every process. A configuration is univalent if all executions continuing from this configuration yield the same consensus value and multivalent otherwise. A configuration is critical if the next operation $op_i$ by any process $p_i$ will carry the algorithm from a multivalent to a univalent configuration. The operations $op_i$ are called critical operations. The critical value of a process is the value that would get decided if that process takes the next step after the critical configuration.
Definition 3.1 (Wait-free consensus). Wait-free consensus is a problem in which each process starts with an input value from some set $S$, $|S| \ge 2$, and must eventually produce an output value so that the following properties are satisfied in every execution:
• Agreement: the output values of all processes are identical;
• Validity: the output value of each process is the input value of some process;
• Wait freedom: each process produces an output value after a finite number of steps.
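The agreement and validity properties of Definition 3.1 can be checked on the record of a single finished execution, as in the following sketch. The function name and the dictionary layout (process id mapped to value) are assumptions for illustration. Wait freedom is a property of all executions and cannot be established from one record, so the sketch only reports whether every process that started has produced an output.

```python
def check_consensus_record(inputs: dict, outputs: dict) -> dict:
    """Check one recorded execution: maps process id -> input / output value."""
    decided_values = set(outputs.values())
    return {
        # Agreement: all produced outputs are identical.
        "agreement": len(decided_values) <= 1,
        # Validity: every output is the input of some process.
        "validity": decided_values <= set(inputs.values()),
        # Only a necessary symptom of wait freedom: everyone has decided.
        "all_decided": set(outputs) == set(inputs),
    }


good = check_consensus_record({0: "a", 1: "b", 2: "c"}, {0: "b", 1: "b", 2: "b"})
split = check_consensus_record({0: "a", 1: "b"}, {0: "a", 1: "b"})
invented = check_consensus_record({0: "a", 1: "b"}, {0: "z", 1: "z"})
```

The second record violates agreement and the third violates validity, although each satisfies the other property.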
Definition 3.2 (Consensus number). The consensus number of an object type is either the maximum number of processes for which wait-free consensus can be solved using only objects of this type and registers,³ or infinity if such a maximum does not exist.
Before proving the consensus number of single-*word assignments, we present the essential features of any wait-free consensus algorithm ALG for $N \ge 2$ processes using only single-*word assignments and registers, where *word can be svword, aiword, or asvword.

Lemma 3.1. ALG must have a critical configuration $C^*$, and the critical operations $op_i$ of processes $p_i$ with different critical values must write to the same object.

Proof. We first prove by contradiction that ALG must have a critical configuration $C^*$. Suppose that ALG has no critical configuration. Since ALG solves wait-free consensus for $N \ge 2$ processes with different input values, ALG's initial configuration $C_0$ is multivalent due to ALG's validity property (cf. Definition 3.1). Since ALG has no critical configuration due to the hypothesis, in any multivalent configuration $C_i$ (e.g., $C_0$), there always exists an operation that carries ALG from $C_i$ to another multivalent configuration $C_{i+1}$. That means there must exist a nonterminating execution, a contradiction to ALG's wait-freedom property (cf. Definition 3.1).
We now prove by contradiction that the critical operations $op_i$ of processes $p_i$ with different critical values must write to the same object $O$.

• Suppose that the critical operation $op_i$ of a process $p_i$ is to read an object $O$ and carries ALG from a critical configuration $C^*$ to an x-valent configuration. Since configuration $C^*$ is critical, there must be a process $p_j$ whose critical operation $op_j$ carries ALG from $C^*$ to a y-valent configuration, $y \ne x$. The configuration $C_1$ that immediately follows the execution $e_1 = op_i; op_j$ continuing from $C^*$ is x-valent since $p_i$ executes its critical operation $op_i$ first. Similarly, the configuration $C_2$ that immediately follows the execution $e_2 = op_j$ continuing from $C^*$ is y-valent. Due to the hypothesis that $op_i$ only reads object $O$, configurations $C_1$ and $C_2$ are indistinguishable to process $p_j$,⁴ a contradiction since $C_1$ is x-valent and $C_2$ is y-valent. Therefore, the critical operations of processes with different critical values must be write operations.
• Suppose that in a critical configuration $C^*$, there are two processes $p_i$ and $p_j$ whose critical operations $op_i$ and $op_j$ are to write $x$ and $y$, $x \ne y$, to different objects $O_i$ and $O_j$, respectively. The configuration $C_1$ that immediately follows the execution $e_1 = op_i; op_j$ continuing from $C^*$ is x-valent since $p_i$ executes its critical operation $op_i$ first. Similarly, the configuration $C_2$ that immediately follows the execution $e_2 = op_j; op_i$ (i.e., reversing the order of $op_i$ and $op_j$ in $e_1$) is y-valent. Due to the hypothesis that $op_i$ and $op_j$ write to different objects $O_i$ and $O_j$, configurations $C_1$ and $C_2$ are indistinguishable to processes $p_i$ and $p_j$, a contradiction since $C_1$ is x-valent and $C_2$ is y-valent. □

Definition 3.3. A one-writer (respectively, two-writer) unit, or 1W-unit (respectively, 2W-unit) for short, is a memory unit that is written by only one critical operation (respectively, two critical operations) in a critical configuration.
Lemma 3.2. In a critical configuration $C^*$ of ALG, the critical operation $op_i$ by each process $p_i$ must atomically write to
1. a one-writer unit $u_i$ written by $p_i$ and
2. two-writer units $u_{i,j}$ written by two processes $p_i$ and $p_j$, where $p_j$'s critical value is different from $p_i$'s, $\forall j \ne i$.
Proof. The proof is similar to the bivalency argument of [5, Theorem 13]. Due to Lemma 3.1, ALG must have a critical configuration $C^*$ and critical operations $op_i$ of processes $p_i$ with different critical values must be write operations. Let $x$ be $p_i$'s critical value in the critical configuration $C^*$. Since configuration $C^*$ is critical, there must be another process $p_j$ whose critical value $y$ is different from $x$. Let $op_j$ be $p_j$'s critical write operation.
• We first prove by contradiction that $op_i$ must write to a one-writer unit. Suppose that all of $op_i$'s units are overwritten by $p_j$'s and other processes' operations. ... $k$, configurations $C_1$ and $C_2$ are indistinguishable to processes $p_i$ and $p_j$, a contradiction since $C_1$ is x-valent and $C_2$ is y-valent. □
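The two requirements of Lemma 3.2 are necessary conditions on the sets of units written by the critical operations, and they can be checked mechanically. The sketch below is an illustration under stated assumptions: write sets are encoded as integer bitmasks, the function names are hypothetical, alignment is ignored, and the small exhaustive search only considers consecutive ranges of units (as svwrites produce). It is a sanity check on a bounded memory, not a replacement for the proofs.

```python
from itertools import product


def satisfies_lemma_3_2(masks, values) -> bool:
    """masks[i]: bitmask of units written by process i; values[i]: its critical value."""
    n = len(masks)
    for i in range(n):
        others = 0
        for k in range(n):
            if k != i:
                others |= masks[k]
        if masks[i] & ~others == 0:
            return False                  # requirement 1: no one-writer unit for i
        for j in range(n):
            if values[j] == values[i]:
                continue
            rest = 0
            for k in range(n):
                if k != i and k != j:
                    rest |= masks[k]
            if masks[i] & masks[j] & ~rest == 0:
                return False              # requirement 2: no two-writer unit for (i, j)
    return True


def interval_mask(first: int, last: int) -> int:
    """Bitmask of the consecutive units first..last (inclusive)."""
    return sum(1 << u for u in range(first, last + 1))


def exists_interval_assignment(values, num_units: int) -> bool:
    """Search all tuples of consecutive ranges within num_units units."""
    ranges = [interval_mask(f, l) for f in range(num_units) for l in range(f, num_units)]
    return any(satisfies_lemma_3_2(c, values) for c in product(ranges, repeat=len(values)))


# Ranges of Fig. 1a: 2-svwrite on 14..15, 3-svwrite on 18..20, 5-svwrite on 15..19.
fig_1a = [interval_mask(14, 15), interval_mask(18, 20), interval_mask(15, 19)]
```

With critical values `("x", "x", "y")`, the three ranges of Fig. 1a satisfy both requirements, whereas two writers of the same 2-unit word (the only option when $B = 2$) do not. On a memory of five units, the search finds a valid layout for three processes with values `("x", "x", "y")` and none for four processes, which is consistent with the svword bounds stated in the contributions.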
CONSENSUS NUMBER OF THE SVWORD MODEL
In this section, we first present a wait-free consensus algorithm for three processes using only the single-svword assignment with $B \ge 5$ and registers. Then, we prove that we cannot construct any wait-free consensus algorithms for more than three processes using only the single-svword assignment and registers regardless of how large $B$ is.

⁴ Two configurations $c$ and $c'$ are indistinguishable to a process $p_j$ if the internal state of process $p_j$ and the value of every shared object are the same in $c$ and $c'$ [20].
The new wait-free consensus algorithm SVW_CONSENSUS is presented in Algorithm 1. The main idea of the algorithm is to utilize the size-variation feature of the svwrite operation. A b-svwrite operation can atomically write up to $b$ values to $b$ consecutive memory units if each of the values can be stored in one memory unit. Therefore, keeping the values to be atomically written as small as possible will maximize the number of processes for which b-svwrite can solve the consensus problem. Unlike the wait-free consensus algorithm using the m-word assignment by Herlihy [5], which requires the word size to be large enough to accommodate a proposal value, the new algorithm stores proposal values in shared memory and uses only 2 bits (or one unit) to determine the preceding order between two processes. This allows a single-svword assignment to write atomically up to $B$ (or ...)

The SVW_CONSENSUS algorithm has two phases. In the first phase, two processes $p_0$ and $p_1$ will achieve an agreement on their proposal values (cf. Algorithm 2). The agreed value, PROPOSAL[first], is the proposal value of the preceding process, whose SVWRITE (line 2SF or 4SF) precedes that of the other process (cf. Lemma 4.1). Due to the memory alignment restriction, in order to be able to allocate memory for the $WR_1$ variable (cf. Algorithm 1) on which $p_0$'s and $p_1$'s SVWRITEs can partly overlap, $p_0$'s and $p_1$'s SVWRITEs are chosen as 2-svwrite and 3-svwrite, respectively. The $WR_1$ variable is located in a memory region consisting of four consecutive units $\{u_0, u_1, u_2, u_3\}$ of which $u_0$ is at an address that is a multiple of 2 and $u_1$ at an address that is a multiple of 3. This memory allocation allows $p_0$ and $p_1$ to write atomically to the first two units $\{u_0, u_1\}$ and the last three units $\{u_1, u_2, u_3\}$, respectively (cf. Fig. 3a). The $WR_1$ variable is the set $\{u_0, u_1, u_2\}$ (cf. the solid squares in Fig. 3a), namely, $p_1$ ignores $u_3$ (cf. line 4SF in Algorithm 2).
Subsequently, the agreed value will be used as the proposal value of both $p_0$ and $p_1$ in the second phase in order to achieve an agreement with the other process $p_2$ (cf. Algorithm 3). Let $p_{first}$ be the preceding process of $p_0$ and $p_1$ in the first phase. ... (cf. Fig. 1a). This memory allocation allows $p_0$, $p_1$, and $p_2$ to write atomically to the first two units $\{u_0, u_1\}$, the last three units $\{u_4, u_5, u_6\}$, and the five middle units $\{u_1, \ldots, u_5\}$, respectively. The $WR_2$ variable is the set $\{u_0, u_1, u_2, u_5, u_6\}$ (cf. the solid squares in Fig. 3b).

Proof. Without loss of generality, we consider the value returned by the SVW_FIRSTAGREEMENT procedure that is invoked by process $p_0$, i.e., $i = 0$.
If $p_0$ precedes $p_1$ (i.e., $p_0$'s SVWRITE (line 2SF) precedes $p_1$'s SVWRITE (line 4SF)), their unit $WR_1[1]$ is either Lower (when $p_1$ has not executed its SVWRITE yet) or Higher (when $p_1$'s SVWRITE has overwritten the value Lower written by $p_0$'s). In the former case, $WR_1[2] = \perp$ holds (line 6SF), making the procedure return 0 (line 7SF). In the latter case, $WR_1[2] \ne \perp$ and the procedure checks $WR_1[1]$ at line 8SF. Since predicate ($WR_1[1] = \text{Higher}$ and $i = 0$) holds, the procedure returns 0 (line 9SF).
If $p_1$ precedes $p_0$, their unit $WR_1[1]$ from line 8SF is either Higher (when $p_0$ has not executed its SVWRITE yet) or Lower (when $p_0$'s SVWRITE has overwritten the value Higher written by $p_1$'s). The former case cannot happen since $p_0$ executes its SVWRITE at line 2SF (i.e., before line 6SF) in the SVW_FIRSTAGREEMENT procedure and the procedure is assumed to be invoked by $p_0$. In the latter case, the procedure returns 1 (line 11SF) since the predicate at line 6SF fails, and subsequently, the predicate at line 8SF fails. Note that since 1) $p_0$ and $p_1$ invoke the SVW_SECONDAGREEMENT procedure (line 5V, Algorithm 1) only after getting first from the SVW_FIRSTAGREEMENT procedure (line 3V) and 2) the reference to first (instead of a value of first) is passed to SVW_SECONDAGREEMENT, the value first returned by SVW_SECONDAGREEMENT in these two cases is defined.
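The case analysis above can be replayed with a small simulation of the first phase. Algorithm 2 itself is not reproduced in this text, so the sketch below is a reconstruction from the prose and rests on several assumptions: the branch taken by $p_1$ is obtained by symmetry (it inspects $WR_1[0]$ where $p_0$ inspects $WR_1[2]$), each process stores its own marker in its one-writer unit, and `None` stands for $\perp$. Each svwrite and each single-unit read is one atomic step, and all interleavings of a given schedule length are enumerated.

```python
from itertools import product

BOTTOM, LOWER, HIGHER = None, "Lower", "Higher"


def first_agreement(i, wr1, write_log):
    """Process i's steps; every `yield` separates two atomic steps."""
    if i == 0:
        wr1[0], wr1[1] = LOWER, LOWER        # 2-svwrite on {u0, u1}
    else:
        wr1[1], wr1[2] = HIGHER, HIGHER      # 3-svwrite on {u1, u2, u3}; u3 is ignored
    write_log.append(i)
    yield
    other_unit = wr1[2] if i == 0 else wr1[0]   # read the other process's 1W-unit
    yield
    if other_unit is BOTTOM:
        return i                             # the other process has not written yet
    middle = wr1[1]                          # read the shared 2W-unit
    yield
    return 0 if middle == HIGHER else 1      # the last writer of u1 is the later one


def run(schedule):
    """Run both processes under a schedule; return (first writer, decisions)."""
    wr1, write_log, decided = [BOTTOM] * 3, [], {}
    procs = {i: first_agreement(i, wr1, write_log) for i in (0, 1)}
    for i in list(schedule) + [0] * 4 + [1] * 4:   # trailing steps let both finish
        if i in decided:
            continue
        try:
            next(procs[i])
        except StopIteration as done:
            decided[i] = done.value
    return write_log[0], decided


def all_schedules_agree(length: int = 6) -> bool:
    """True if, in every schedule, both processes return the preceding process."""
    for schedule in product((0, 1), repeat=length):
        first, decided = run(schedule)
        if decided != {0: first, 1: first}:
            return False
    return True
```

Under these assumptions, `all_schedules_agree()` confirms that both processes return the index of the process whose svwrite came first, in line with the argument given for $p_0$ above.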
If ...

Proof. It is obvious from the pseudocode in Algorithms 1, 2, and 3 that the SVW_CONSENSUS algorithm is wait-free. From Lemma 4.2, the SVW_CONSENSUS algorithm returns the same values for all invoking processes. The value is either PROPOSAL[2] (if $p_2$ precedes both $p_0$ and $p_1$) or PROPOSAL[first], first $\in \{0, 1\}$ (otherwise). □

Lemma 4.4. The single-svword assignment has consensus number at least 3, $\forall B \ge 5$.
Proof. Since there is a wait-free consensus algorithm for three processes using only registers and the single-svword assignment with $B \ge 5$ (Lemma 4.3), this lemma immediately follows. □

Lemma 4.5. The single-svword assignment has consensus number at most 3, $\forall B \ge 5$.
Proof. We prove the lemma by contradiction. Assume that there is a wait-free consensus algorithm ALG for four processes $p$, $q$, $r$, and $t$. From Lemma 3.1, ALG must have a critical configuration $C^*$ and the critical operations $op_i$ of processes $p_i$ with different critical values must be write operations. At the critical configuration $C^*$, we can always divide the set of the four processes into two nonempty subsets $S$ and $\bar{S}$, where $S$ consists of at most two processes with the same critical value called $V$ and $\bar{S}$ consists of processes with critical values different from $V$ (if three of the four processes have the same critical value, the other process is chosen as $S$). Since the svwrite operation writes to consecutive memory units in the conventional one-dimensional memory address space, let $[k_f, k_l]$ be the range of consecutive units to which a process $k \in \{p, q, r, t\}$ atomically writes using its critical operation $op_k$. For any pair of processes $\{h, k\}$, where $h$ and $k$ belong to different subsets $S$ and $\bar{S}$, $[h_f, h_l]$ and $[k_f, k_l]$ must partly overlap (due to the second requirement of Lemma 3.2) and none of them are completely covered by ranges $[v_f, v_l]$ of the other processes $v$ (due to the first requirement of Lemma 3.2).
Figs. 3c and 3d illustrate the main idea of the proof when $S$ consists of one and two processes, respectively. In Fig. 3c, the range $[t_f, t_l]$ of process $t$ cannot partly overlap with that of process $p$ without completely covering (or being covered by) the range of process $r$ or $q$. In Fig. 3d, $t$ and $r$ belong to different subsets $S$ and $\bar{S}$, respectively, but their ranges cannot partly overlap. The detailed proof is as follows:
• If $S$ consists of one process, let $S = \{p\}$. Since $p$'s critical value is different from those of the three other processes $q$, $r$, and $t$, process $p$'s critical operation must atomically write to four units $u_{p,q}$, $u_{p,r}$, $u_{p,t}$, and $u_p$ (cf. Lemma 3.2). The atomic write operation determines the relative ordering between $p$ and the three other processes with critical values different from $p$'s: if $p$'s operation precedes $q$'s, $p$ is considered preceding $q$. Without loss of generality, assume that $p_f < q_f \le p_l < q_l$ where the 2W-unit $u_{p,q}$ of $p$ and $q$ is between $q_f$ and $p_l$, $q_f \le u_{p,q} \le p_l$, and the 1W-units $u_p < q_f$ and $u_q > p_l$ (cf. Fig. 3c).
We prove that $r_f < p_f \le r_l < p_l$. Since ranges $[r_f, r_l]$ and $[p_f, p_l]$ must partly overlap, either $r_f < p_f \le r_l < p_l$ or $p_f < r_f \le p_l < r_l$ must hold. If the latter holds, $q_l < r_l$ must hold due to the first requirement of Lemma 3.2 for the process $r$. That means $p_f < q_f < q_l < r_l$, so $q$'s range $[q_f, q_l]$ is covered completely by the overlapping ranges $[p_f, p_l]$ and $[r_f, r_l]$, violating the first requirement of Lemma 3.2 for the process $q$.
Arguing similarly, we have $t_f < p_f \le t_l < p_l$. If $t_f \le r_f$, $r$'s range $[r_f, r_l]$ is covered completely by the overlapping ranges $[t_f, t_l]$ and $[p_f, p_l]$. Therefore, $r_f < t_f$ must hold, which leads to $t$'s range $[t_f, t_l]$ being covered completely by the overlapping ranges $[r_f, r_l]$ and $[p_f, p_l]$, a contradiction to the first requirement of Lemma 3.2 for the process $t$.
• If $S$ consists of two processes, let $S = \{p, t\}$. Since $p$'s and $t$'s critical value is different from those of the two other processes $q$ and $r$, processes $p$ and $t$ must atomically write to units $\{u_{p,q}, u_{p,r}, u_p\}$ and $\{u_{t,q}, u_{t,r}, u_t\}$, respectively (cf. Lemma 3.2). Similarly, $q$ and $r$ must atomically write to units $\{u_{p,q}, u_{t,q}, u_q\}$ and $\{u_{p,r}, u_{t,r}, u_r\}$, respectively.
Since $p$ must atomically write to units $\{u_{p,q}, u_{p,r}, u_p\}$, arguing similarly to the above case $S = \{p\}$, we have either $r_f < p_f \le r_l < q_f \le p_l < q_l$ (cf. Fig. 3c) or $q_f < p_f \le q_l < r_f \le p_l < r_l$ (i.e., exchange $r$ and $q$ in Fig. 3c). Without loss of generality, assume that the former holds.
Similarly, since 1) $q$ must atomically write to units $\{u_{p,q}, u_{t,q}, u_q\}$ and 2) $p_f < q_f \le p_l < q_l$, we have $p_f < q_f \le p_l < t_f \le q_l < t_l$ (cf. Fig. 3d). On the other hand, since $q_f < t_f \le q_l < t_l$ and $t$ must atomically write to units $\{u_{t,q}, u_{t,r}, u_t\}$, we have $q_f < t_f \le q_l < r_f \le t_l < r_l$. This contradicts the assumption $r_f < p_f \le r_l < q_f \le p_l < q_l$. □

Lemma 4.6. The single-svword assignment (svwrite) has consensus number 1 when $B = 2$.
Proof. We prove the lemma by contradiction. Assume that there is a wait-free consensus algorithm ALG for two processes $p_0$ and $p_1$ (with different proposal values) using only svwrites and registers. Algorithm ALG must have a critical configuration $C^*$ (Lemma 3.1) in which $p_i$'s critical operation must atomically write to both $p_i$'s 1W-unit $u_i$, $i \in \{0, 1\}$, and a 2W-unit $u_{0,1}$ (Lemma 3.2). Since $B = 2$, in order to atomically write to two units, both $p_0$'s and $p_1$'s critical operations must be 2-svwrites, which prevents the two critical operations from partly overlapping due to the memory alignment restriction. That means that if $p_0$'s critical operation is the first operation writing to the 2-svword containing $u_{0,1}$ and $u_0$, then $p_1$'s critical operation, which must write to $u_{0,1}$, will overwrite the 2-svword completely, violating the first requirement of Lemma 3.2 for process $p_0$. □

Lemma 4.7. The single-svword assignment (svwrite) has consensus number 2 when $B \in \{3, 4\}$.
Proof. Since the SVW_FIRSTAGREEMENT procedure (Algorithm 2) solves wait-free consensus for two processes using svwrites and registers (Lemma 4.1) when $B \ge 3$, the single-svword assignment (svwrite) has consensus number at least 2 when $B \in \{3, 4\}$. We now prove by contradiction that when $B \in \{3, 4\}$, there is no wait-free consensus algorithm for three processes using only svwrites and registers. Assume that there is a wait-free consensus algorithm ALG for three processes $p$, $q$, and $r$. Algorithm ALG must have a critical configuration (Lemma 3.1) in which we can always divide the set of these three processes into two nonempty subsets $S$ and $\bar{S}$, where $S$ consists of a process $p$ with critical value $V$ and $\bar{S}$ consists of processes $q$ and $r$ with critical values different from $V$. Since the svwrite operation writes to consecutive memory units in the conventional one-dimensional memory address space, let $[k_f, k_l]$ be the range of consecutive units to which a process $k \in \{p, q, r\}$ atomically writes using its critical operation $op_k$. For any pair of processes $\{h, k\}$, where $h$ and $k$ belong to different subsets $S$ and $\bar{S}$, the ranges $[h_f, h_l]$ and $[k_f, k_l]$ must partly overlap (due to the second requirement of Lemma 3.2), and neither of them is completely covered by the ranges $[v_f, v_l]$ of the other processes $v$ (due to the first requirement of Lemma 3.2).
Arguing similarly to the proof of Lemma 4.5, the ranges $[p_f, p_l]$, $[q_f, q_l]$, and $[r_f, r_l]$ must partly overlap each other, as shown in Fig. 3c.
• If $B = 3$, range $[p_f, p_l]$, which contains $u_{p,r}$, $u_p$, and $u_{p,q}$, must be a 3-svword starting at an address that is a multiple of 3, namely, $p_f = 3a$, $a \in \mathbb{N}$ (integers). In order to partly overlap with range $[p_f, p_l]$, ranges $[r_f, r_l]$ and $[q_f, q_l]$ must be 2-svwords due to the memory alignment restriction. That means $r_f$ and $q_f$ are multiples of 2: $r_f = 2b$ and $q_f = 2c$, where $b, c \in \mathbb{N}$. On the other hand, since range $[p_f, p_l]$ is a 3-svword, there is exactly one unit between $r_l$ and $q_f$ (cf. Fig. 3c), or $q_f = r_l + 2$. This is a contradiction, since $r_l = 2b + 1$ is odd and $q_f = 2c$ is even.
• If $B = 4$, range $[p_f, p_l]$ must be a 4-svword starting at an address that is a multiple of 4, namely, $p_f = 4a$, $a \in \mathbb{N}$. Indeed, if range $[p_f, p_l]$ is a 3-svword, arguing similarly to the case $B = 3$ results in a contradiction, since $r_l$ and $q_f$ must be odd and even, respectively, while $q_f = r_l + 2$.
Since range $[p_f, p_l]$ is a 4-svword, in order to partly overlap with range $[p_f, p_l]$, ranges $[r_f, r_l]$ and $[q_f, q_l]$ must be 3-svwords due to the memory alignment restriction. That means $r_f$ and $q_f$ are multiples of 3: $r_f = 3b$ and $q_f = 3c$, where $b, c \in \mathbb{N}$. Since range $[p_f, p_l]$ must not be covered completely by ranges $[r_f, r_l]$ and $[q_f, q_l]$, $c \ge (b + 2)$ must hold (cf. Fig. 3c). On the other hand, since range $[p_f, p_l]$ is a 4-svword, there are at most two units between $r_l$ and $q_f$ (cf. Fig. 3c). That means $q_f - r_l \le 3$, or $3(c - b) \le 5$, a contradiction to $c \ge (b + 2)$.
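The case analysis for B = 3 and B = 4 can also be checked mechanically. The following is a minimal sketch, not part of the original proof, under two stated assumptions: the memory alignment restriction is read as "a b-svword, 1 ≤ b ≤ B, starts at an address that is a multiple of b", and the layout of Fig. 3c is read as r sticking out to the left of p, q sticking out to the right of p, and at least one unit of p (its 1W-unit) untouched by both. The names `aligned_ranges` and `find_triples` and the address window `limit` are hypothetical.

```python
from itertools import product


def aligned_ranges(max_size, limit):
    """List (first, last) address pairs of every b-svword with 1 <= b <= max_size.

    Assumption: a b-svword must start at an address that is a multiple of b.
    Only start addresses below `limit` are enumerated.
    """
    return [(start, start + b - 1)
            for b in range(1, max_size + 1)
            for start in range(0, limit, b)]


def find_triples(max_size, limit=24):
    """Search for ranges (r, p, q) arranged like the assumed Fig. 3c pattern."""
    ranges = aligned_ranges(max_size, limit)
    hits = []
    for (r_f, r_l), (p_f, p_l), (q_f, q_l) in product(ranges, repeat=3):
        r_partly_overlaps_p = r_f < p_f <= r_l   # r extends past p on the left
        q_partly_overlaps_p = q_f <= p_l < q_l   # q extends past p on the right
        room_for_u_p = q_f - r_l >= 2            # a unit of p touched by neither r nor q
        if r_partly_overlaps_p and q_partly_overlaps_p and room_for_u_p:
            hits.append(((r_f, r_l), (p_f, p_l), (q_f, q_l)))
    return hits


if __name__ == "__main__":
    for size in (3, 4):
        print(size, len(find_triples(size)))
```

Under these assumptions the search returns no layout for B = 3 or B = 4, which agrees with the parity and divisibility arguments above. The window of 24 addresses covers two full periods of the alignment pattern for svword sizes up to 4.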
CONSENSUS NUMBER OF THE AIWORD MODEL
In this section, we prove that the single-aiword assignment (or aiwrite for short) has consensus number exactly $\lfloor \frac{A+1}{2} \rfloor$. We first present a wait-free consensus algorithm, AIW_CONSENSUS, for $N = \lfloor \frac{A+1}{2} \rfloor$ processes (cf. Algorithm 4) using only the aiwrite operation and registers. Subsequently, we prove that there is no wait-free consensus algorithm for $N + 1$ processes using only the aiwrite operation and registers. The main idea of the AIW_CONSENSUS algorithm is to gradually extend the set $S$ of processes agreeing on the same value by one process at a time. This minimizes the number of units that must be written atomically by the aiwrite operation (cf. Lemma 5.4). The algorithm consists of $N$ rounds, and a process $p_i$, $i \in [1, N]$, participates in rounds $r_i, \ldots, r_N$. A process $p_i$ leaves a round $r_j$, $j \ge i$, and enters the next round $r_{j+1}$ when it reads the value upon which all processes in round $r_j$ (will) agree. A round $r_j$ starts with the first process that enters the round and ends when all $j$ processes $p_i$, $1 \le i \le j$, have left the round. At the end of a round $r_j$, the set $S$ consists of the $j$ processes $p_i$, $1 \le i \le j$.
A process $p_i$ participates in the consensus protocol from round $i$ by initializing its agreed value $A_i[i]$ in round $i$ to its proposal value $buf_i$ (line 1I). In order to determine whether it precedes all $(i - 1)$ other processes $p_k$ participating in round $i$, $k = 1, \ldots, (i - 1)$, $p_i$ atomically writes Higher to its unit U … Note that all processes $p_k$ participating in round $i$, $k < i$, have the same proposal value, which is their agreed value in the previous round $(i - 1)$ (line 12I and Lemma 5.1). Otherwise, $p_i$ keeps its proposal value $buf_i$ as its agreed value in round $i$ and subsequently enters the next round $(i + 1)$ (line 11I).

Process $p_i$ participates in rounds $j$, $j = (i + 1), \ldots, N$, by initializing its agreed value $A_j[i]$ in round $j$ to its agreed value $A_{j-1}[i]$ in the previous round $(j - 1)$ (line 12I). Since all processes $p_k$ participating in round $j$, $k < j$, have agreed on the same value in the previous round $(j - 1)$ (cf. Lemma 5.1), they have the same proposal value in round $j$. Therefore, $p_i$ needs to change its agreed value $A_j[i]$ only if $p_j$ precedes all processes $p_k$, $k < j$. After atomically writing Lower to its units $U^j_i$ and $U^j_{j,i}$, $p_i$ checks whether $p_j$ precedes all other processes $p_k$, $k < j$ (lines 14I-21I). If so, $p_i$ agrees on $p_j$'s proposal value $A_j[j]$ by writing this value to $A_j[i]$ (line 23I). After obtaining its agreed value $A_N[i]$ in round $N$, $p_i$ agrees with all other processes on the same value (Lemma 5.1) and finishes the consensus protocol (line 27I).
Definition 5.1. A correct process is a process that does not crash.
Definition 5.2. The agreed value $v$ of a correct process $p_i$ in round $r_j$, $j \ge i$, is the value of $A_j[i]$ when $p_i$ reaches either line 10I (if $i = j$) or line 25I in iteration $j$ of the for-loop 11I-26I (if $i < j$). We say that $p_i$ agrees on $v$ in $r_j$.
Lemma 5.1. All correct processes $p_i$ agree on the same value in round $r_j$, where $1 \le i \le j \le N$.
Proof. We prove this lemma by induction on $j$, the round index. The lemma is true for $j = 1$ since there is only one process $p_1$ in round $r_1$. Assuming that the lemma is true for $(j - 1)$, we need to prove that it is true for $j$. That is, we need to prove that if all correct processes $p_i$, $1 \le i \le j - 1$, agree on the same value in round $r_{j-1}$, then all correct processes $p_i$, $1 \le i \le j$, will agree on the same value in round $r_j$. Indeed, since all correct processes $p_i$, $1 \le i \le j - 1$, agree on the same value in round $r_{j-1}$, their proposal values in round $r_j$ are the same (line 12I), called $A^j_S$. Let $A^j_j$ be $p_j$'s proposal value in round $j$, which is its original proposal value (line 1I). The agreed value in round $r_j$ will be either $A^j_S$ or $A^j_j$. For now, we assume that AIWRITE (at line 3I or 13I) is atomic (cf. Fig. 4 for the layout of units $U^j_j$, $U^j_i$, and $U^j_{j,i}$ on an aiword when $j = N$). We will prove that in round $r_j$, the agreed value of the participating processes will be $A^j_j$ if $p_j$ precedes all the other processes $p_i$, $1 \le i < j$ (i.e., $p_j$'s AIWRITE precedes the AIWRITEs of all the other processes), or $A^j_S$ otherwise.
• If $p_j$ precedes all the other processes $p_i$, $1 \le i < j$, all processes will see $U^j_j \ne \bot$ after their AIWRITE (line 3I for $p_j$ or line 13I for $p_i$, $i < j$). Let $p_l$, $l = 1, \ldots, j$, be the process that is executing the AIW_CONSENSUS procedure.
If $l = j$, $p_j$ determines its relative ordering with processes $p_i$, $i < j$, using their unit $U^j_{j,i}$, which is only written by $p_j$'s and $p_i$'s AIWRITEs (lines 4I-9I). Since $p_j$ precedes all the other processes $p_i$, predicate U …

With the assumption that AIWRITE can atomically write to $p_j$'s units at line 3I or $p_i$'s units at line 13I, it follows directly from Lemma 5.1 that all the $N$ processes will achieve an agreement in round $r_N$.
Lemma 5.2. The AIW_CONSENSUS algorithm is wait-free and can solve the consensus problem for $N = \lfloor \frac{A+1}{2} \rfloor$ processes.

Proof. The time complexity for a process using AIW_CONSENSUS to achieve an agreement among $N$ processes is $O(N^2)$ due to the for-loops at lines 11I and 16I. Therefore, the AIW_CONSENSUS algorithm is wait-free.
From Lemma 5.1, the AIW_CONSENSUS algorithm can solve the consensus problem for $N = \lfloor \frac{A+1}{2} \rfloor$ processes if AIWRITE can atomically write to $p_j$'s $j$ units at line 3I or $p_i$'s 2 units at line 13I. Since AIWRITE can write to an arbitrary subset of the $A$ units of an aiword $AI$, if AIWRITE can atomically write to $a$ units of $AI$, $a \le A$, it can atomically write to $b$ units of $AI$, where $b \le a$. Therefore, we only need to prove that the requirement is satisfied for the case $j = N$.
Indeed, since $N = \lfloor$ …

Second, we prove that the ALG algorithm maximizes $N$ when $k$ is 2, the minimum. In order to maximize the number $N$ of processes, we need to minimize the number $M$ of the processes' 1W-/2W-units that must be located in the $A$-aiword $AI$, where $A$ is a constant. The number $M$ in the ALG algorithm is:
That means $M$ will be smaller if there are only two subsets $s_I$, $s_{II}$ of processes with the same critical value, where $n_I = \sum_{l=1}^{i-1} n_l$ and $n_{II} = \sum_{l=i}^{k} n_l$. In this case, we have
It follows that $M$ achieves the minimum $(2N - 1)$ when $n_I = 1$ or $n_I = N - 1$.
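The displayed formulas for $M$ are not legible in this copy. One reading that is consistent with the stated minimum, offered here as an assumption rather than as the authors' exact expressions, counts one 1W-unit per process and one 2W-unit per pair of processes with different critical values (cf. Lemma 3.2):

```latex
% Assumed reconstruction: N 1W-units plus one 2W-unit per pair of
% processes drawn from two different subsets s_a, s_b.
M = N + \sum_{1 \le a < b \le k} n_a n_b

% With only two subsets, n_I + n_{II} = N:
M = N + n_I \, n_{II} = N + n_I (N - n_I)

% For 1 \le n_I \le N - 1, the product n_I (N - n_I) is smallest at the endpoints:
M_{\min} = N + (N - 1) = 2N - 1
  \quad \text{at } n_I = 1 \text{ or } n_I = N - 1
```

Under this reading, fitting all $2N - 1$ units into an aiword of $A$ units requires $2N - 1 \le A$, that is, $N \le \lfloor \frac{A+1}{2} \rfloor$, which matches the number of processes handled by AIW_CONSENSUS in Lemma 5.2.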
Since N ! b 
CONSENSUS NUMBER OF THE ASVWORD MODEL
We will prove that the single-asvword assignment has consensus number exactly N, where
The intuition behind the higher consensus number $N$ of the asvword model (cf. (4)) compared with the aiword model is that process $p_N$ in Algorithm 4 can atomically write to $A \cdot B$ units using an $A \times B$-asvwrite instead of only $A$ units using an $A$-aiwrite (line 3I). As illustrated in Fig. 2, an $8 \times 2$-asvwrite can atomically write to 16 memory units (i.e., to eight consecutive 2-svwords, each of which is composed of two memory units) (cf. row $b = 2$), whereas an $8 \times 1$-asvwrite (or 8-aiwrite) can atomically write to only eight memory units (cf. row $b = 1$). However, since an $8 \times 2$-asvwrite uses 2-svwords as its minimum units, it cannot write to only one memory unit. For instance, using an $8 \times 2$-asvwrite, SIMD core 4 cannot write to only memory unit 14 (cf. row $b = 1$), but must write to both memory units 14 and 15, which comprise 2-svword 7 (cf. row $b = 2$). Therefore, to prevent $p_N$ from overwriting unintended memory units when using an $A \times B$-asvwrite, each $B$-svword located in $A_l$, $1 \le l \le B$, contains either units U …

We first prove that the asvwrite operation has consensus number at least $N$ (cf. (4)). We prove this by presenting a wait-free consensus algorithm ASVW_CONSENSUS for $N$ processes using only the asvwrite operation and registers. Subsequently, we prove that there is no wait-free consensus algorithm for $N + 1$ processes using only the asvwrite operation and registers. The rest of this section presents a complete proof of the exact consensus number.

Proof. Since the asvword model is an extension of the aiword model, the asvwrite operation has consensus number at least $N = \lfloor \frac{A+1}{2} \rfloor$ (cf. Theorem 5.1). When the size $b$ of svwords is the same for all asvwrites, the asvword model degenerates to the aiword model. The asvwrite operation can achieve a higher consensus number when the size $b$ of svwords is allowed to differ between asvwrites, namely, when both $A \times 1$-asvwrites and $A \times B$-asvwrites are utilized.
However, we will prove by contradiction that when $B = tA$, the combination of $A \times 1$-asvwrites and $A \times B$-asvwrites does not provide any additional strength. Assume that there is a wait-free consensus algorithm ALG for $N$ processes, where $N \ge \lfloor \frac{A+1}{2} \rfloor + 1$, using asvwrites and registers. Due to Lemma 3.1, ALG must have a critical configuration $C^*$, and the critical operations $op_i$ of processes $p_i$ with different critical values must be write operations. At the critical configuration $C^*$, assume that there are two processes $p$ and $q$ that have different critical values and use the $A \times 1$-asvwrite and the $A \times B$-asvwrite, respectively, as their critical operations to write to their 2W-unit $u_{p,q}$ (cf. Lemma 3.2). Since $p$ must atomically write to its 1W-unit $u_p$ and 2W-unit $u_{p,q}$ using an $A \times 1$-asvwrite, the two units must be located in the same $A \times 1$-asvword $AS_p$ that starts and ends at addresses $kA$, $k \in \mathbb{N}$, and $(k + 1)A - 1$, respectively, due to the memory alignment restriction, as illustrated in Fig. 5a. The two rows $b = 1$ and $b = B$ in Fig. 5a illustrate the memory alignment corresponding to the size $b$ of svwords on the same 32 consecutive memory units with addresses from 0 to 31. Let $k = at + b$, where $a, b \in \mathbb{N}$, $t = \frac{B}{A}$, and $b \le t - 1$. Since an $A \times B$-asvwrite uses $B$-svwords as its working units, $q$, whose write operation is an $A \times B$-asvwrite, must write to the $B$-svword $SV_q$ that overlaps $u_{p,q}$. The starting address and ending address of $SV_q$ are $aB$ and $(a + 1)B - 1$, respectively, due to the memory alignment restriction (cf. Fig. 5a). We have $aB = atA \le kA$ and $(a + 1)B = (ta + t)A \ge (k + 1)A$, namely, $SV_q$ overlaps the whole $AS_p$. That means $q$ overwrites the whole $AS_p$, including $p$'s 1W-unit $u_p$, a contradiction to the first requirement of Lemma 3.2 for process $p$. □

Lemma 6.2. The single-asvword assignment has consensus number at least
Proof. We prove this lemma by presenting a wait-free consensus algorithm ASVW_CONSENSUS for $m$ processes using only asvwrites and registers. The ASVW_CONSENSUS algorithm is similar to the AIW_CONSENSUS algorithm (Algorithm 4) except that the AIWRITE operations used at lines 3I and 13I are replaced by ASVWRITE operations.
Similarly to the proof of Lemma 5.2, we will prove that in round $m$: 1) $p_m$'s ASVWRITE can atomically write to only $m$ units {U …
Proof. We prove this lemma by contradiction. Assume that there is a wait-free consensus algorithm ALG for $N$ processes, where $N > M$. Due to Lemma 3.1, ALG must have a critical configuration $C^*$, and the critical operations $op_i$ of processes $p_i$ with different critical values must be write operations. At the critical configuration $C^*$, we divide the $N$ processes into $k \ge 2$ subsets $s_1, \ldots, s_k$, each of which consists of processes with the same critical value. Let $n_1, \ldots, n_k$ be the sizes of the subsets; we have $\sum_{l=1}^{k} n_l = N$. Let $p^i_j$ be a process in $s_i$, $1 \le j \le n_i$. Since $p^i_j$'s critical value is different from that of the $(\sum_{l \ne i} n_l)$ processes in the $(k - 1)$ other subsets, $p^i_j$'s critical operation must atomically write to its 1W-unit and $(\sum_{l \ne i} n_l)$ 2W-units (cf. Lemma 3.2). The operation determines the relative ordering between p …

Second, we prove that the ALG algorithm maximizes $N$ when $k$ is 2, the minimum. Since all the 1W-units and 2W-units of the $N$ processes must be located in $AS$ of a fixed size, in order to maximize $N$, we need to minimize the number $M$ of the 1W-units and 2W-units used by the $N$ processes. Using an argument similar to the proof of Lemma 5.4, it follows that $M$ achieves the minimum $(2N - 1)$ when there are only two subsets: one containing $(N - 1)$ processes $p$ with the same critical value and the other containing only one process $q$.
Lastly, we prove that $N$ cannot be larger than $M$ defined in (6).

If $q$ uses an $A \times 1$-asvwrite, let $A_q$ be the $A \times 1$-asvword written by $q$. $A_q$ contains $q$'s 1W-unit $u_q$ and all $(N - 1)$ 2W-units $u_{p,q}$. We prove that the number of $A_q$'s 1-svwords required by a process $p$ is at least 2. Indeed, if $p$ uses an $A \times 1$-asvwrite, both its 1W-unit $u_p$ and its 2W-unit $u_{p,q}$ must be located in $A_q$, since $A \times 1$-asvwrites cannot atomically write to two 1-svwords located in different $A \times 1$-asvwords. Therefore, $p$ requires two 1-svwords of $A_q$. If $p$ uses an $A \times B$-asvwrite, its 2W-unit $u_{p,q}$ must be a $B$-svword so that $p$'s $A \times B$-asvwrite does not overwrite 1W-units or 2W-units that belong to other processes. Since $B \ge 2$, $p$'s 2W-unit $u_{p,q}$ requires at least two 1-svwords of $A_q$. That means the number of $A_q$'s 1-svwords required by the $N$ processes, including $q$, is at least $2(N - 1) + 1 = 2N - 1$. Since the $A \times 1$-asvword $A_q$ has $A$ 1-svwords, it follows that $N = \lfloor$ …
Otto J. Anshus is a professor of computer science at the University of Tromsø. His research interests include operating systems, parallel and distributed architectures and systems, scalable display systems, data-intensive computing, highresolution visualizations, and human-computer interfaces. He is a member of the IEEE Computer Society, the ACM, and the Norwegian Computer Society. More information about his research can be found at otto.anshus@uit.no.
