We prove two new space lower bounds for the problem of implementing a large shared register using smaller physical shared registers. We focus on the case where both the implemented and physical registers are single-writer, which means they can be accessed concurrently by multiple readers but only by a single writer. To strengthen our lower bounds, we let the physical registers be atomic and we only require the implemented register to be regular. Furthermore, the lower bounds hold for obstruction-free implementations, which means they also hold for lock-free and wait-free implementations.
Introduction
In most shared memory multi-processor systems, processes communicate with each other by reading from and writing to shared registers. Modern systems typically allow you to atomically read and write some constant number of bits. To read and write larger amounts of data, you would have to implement a larger register using the smaller physical ones provided by the system. This paper studies the space complexity required by such implementations. We define the space complexity of an implementation to be the number of physical registers it uses. The step complexity of an operation is defined to be the worst case number of steps needed to complete the operation.
This problem varies in several dimensions. There are three common correctness conditions for shared registers: safe, regular, and atomic, which were introduced by Lamport in 1986 [7]. Atomicity is the strongest of the three conditions and safety is the weakest. All atomic registers are regular and all regular registers are safe. In this paper we only consider regular and atomic registers. Shared registers can also differ in the number of readers and the number of writers allowed to access the register concurrently. In this paper we only consider single-writer (SW) registers, which are registers that can be accessed concurrently by multiple readers but only a single writer. Finally, we restrict our attention to non-blocking implementations. This excludes the use of locks and other blocking techniques. There are three common non-blocking progress guarantees that appear in the literature: obstruction-freedom, lock-freedom, and wait-freedom. Obstruction-freedom is the weakest natural non-blocking guarantee and it includes all lock-free and wait-free algorithms. Wait-freedom is the strongest guarantee and it ensures that every process makes progress regardless of how the processes are scheduled. The terms obstruction-freedom, regular, and atomic are defined formally in Section 2.
Table 1 lists some previous implementations of an m-value SW register from b-value SW registers. The number of readers is represented by r. A reader is considered to be invisible if it never writes to shared registers. The 'Invisible' column contains a 'yes' whenever the implementation has at least one invisible reader. All implementations listed in the table are wait-free. The register type of an implementation is atomic if it implements a large atomic register using smaller atomic registers. Similarly, we say it is regular if it is a regular from regular implementation.
Some papers [8] assume additional atomic primitives such as swap, fetch-and-add, and compare-and-swap, but we focus on implementations without any additional primitives.
Prior Work
[Table 1: previous implementations of an m-value SW register from b-value SW registers. Columns: Register Type, Invisible?, Space, Read, Write.]
The implementation in [4] shows that our lower bound is asymptotically tight for the invisible reader case. Their implementation was first introduced for the b = 2 case. Later, Chen and Wei [5] showed how it can be generalized to any b ≥ 2. When m is a power of b, the number of registers used by the implementation is (m − 1)/(b − 1), which matches our lower bound exactly.
There are some previous space lower bounds for this problem. Chaudhuri and Welch prove multiple lower bounds in [4]. The one that is most relevant to this paper says that any regular from regular implementation where b = 2 requires at least ⌈max(log m + 1, 2 log m − log log m − 2)⌉ space. Chaudhuri, Kosa and Welch [3] prove that any regular from regular implementation where b = 2 and the writer only performs a single operation requires Ω(m^2) space. This shows that their one-write algorithm is space optimal. They also prove a space lower bound of 2m − 1 − ⌈log m⌉ for a slightly more general case. Berger, Keidar and Spiegelman [2] consider a class of algorithms where each read operation has to see at least τ ≥ 2 values written by the same write operation before the read operation is allowed to return. They show that in this setting, any wait-free, regular from atomic implementation requires τm space for the invisible reader case and τ + (τ − 1) min(m − 1, r) space for the general case. This lower bound helps explain the space complexity of Peterson's [9] as well as Chen and Wei's [5] implementation because τ = (log m)/(log b) in both implementations. However, the lower bound does not apply to any of the invisible reader algorithms from Table 1 because τ equals 0 or 1 in all those algorithms.
We define some important terms in Section 2 and we prove both our lower bounds in Section 3.
Model
A single-writer (SW) register R is a shared register where only one process can perform write operations and any number of processes can perform read operations. We say that a process owns R if it can write to R. We will work in the standard asynchronous shared memory model [1] with r readers and one writer, which communicate through shared physical registers. Processes may fail by crashing.
In our model, an execution is an alternating sequence of configurations and steps C_0, e_1, C_1, e_2, C_2, . . . , where C_0 is an initial configuration. Each step is either a read or write of a physical register. Configuration C_i consists of the state of every register and every process after the step e_i is applied to configuration C_{i−1}.
A register is atomic if its operations are linearizable [6] . A register is regular if the value returned by each read is either the value written by the last write operation completed before the first step of the read or the value written by a write operation concurrent with the read operation. Note that every atomic register is also regular.
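As an illustration of the regularity condition, the following Python sketch computes the set of values that a single read may return. It is only a sketch under stated assumptions: operations are modeled by hypothetical integer timestamps for their first and last steps, all steps have distinct timestamps, and the names `Write` and `regular_return_values` are not part of the model above.

```python
from typing import List, NamedTuple, Optional, Set


class Write(NamedTuple):
    start: int            # timestamp of the write's first step (assumed model)
    end: Optional[int]    # timestamp of its last step, or None if still pending
    value: int


def regular_return_values(initial: int, writes: List[Write],
                          read_start: int, read_end: int) -> Set[int]:
    """Values a read spanning [read_start, read_end] may return under regularity."""
    allowed: Set[int] = set()

    # The last write that completed before the first step of the read.
    # With a single writer, writes are sequential, so "last" is well defined.
    completed = [w for w in writes if w.end is not None and w.end < read_start]
    if completed:
        allowed.add(max(completed, key=lambda w: w.end).value)
    else:
        # No write has completed yet, so the initial value is still current.
        allowed.add(initial)

    # Any write whose interval overlaps the read is concurrent with it.
    for w in writes:
        if w.start <= read_end and (w.end is None or w.end >= read_start):
            allowed.add(w.value)
    return allowed
```

In this sketch, a read that overlaps a write may return either the old or the new value, while a read that overlaps no write has exactly one allowed value.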
The rest of this paper will focus on the obstruction-free progress guarantee, which says that if at any point in the execution an operation is allowed to run in isolation (with all other processes suspended), then it will terminate in a finite number of steps. Any wait-free or lock-free algorithm is also obstruction-free.
Space Lower Bounds
This section proves two new lower bounds on the number of atomic registers needed to implement a large regular register. The term 'implementation' will frequently be used as a shorthand which means 'obstruction-free implementation of a regular SW register from smaller atomic SW registers'. The first lower bound, Theorem 3.5, applies to all implementations with an invisible reader. This lower bound can be used to easily prove a more general lower bound that holds for the visible reader case as well. This is done in Theorem 3.6.
Throughout the proofs, there are three important algorithm parameters that come up repeatedly: m, the number of values that can be represented by the simulated register; n, the number of physical registers in the implementation; and S, the total fanout of the implementation. If each of the physical registers can represent b values, then the total fanout is simply nb. However, it greatly simplifies the proofs to consider implementations that use physical registers of different sizes. Below is the definition of 'total fanout' for this more general setting.
Here is an overview of the proofs in this section. The first proof is for the main technical lemma, Lemma 3.3, which says that if there exists an m-value register implementation with S fanout and an invisible reader, then there exists an (m − 1)-value register implementation with S − 1 fanout and an invisible reader. Once this lemma is established, the rest of the proof is straightforward. Lemma 3.4 uses Lemma 3.3 inductively to argue that S must be large when m is large. The invisible reader lower bound, Theorem 3.5, is essentially a special case of Lemma 3.4 where all the physical registers have the same size. Finally, a short proof of the general lower bound, Theorem 3.6, can be derived using the invisible reader lower bound.
The main idea behind Lemma 3.3 is to look at the decision tree of the invisible reader. The internal nodes of the decision tree are labeled by physical registers and the leaves are labeled by return values. We keep minimizing the decision tree until we find a leaf with value v and a parent of that leaf such that there exists a configuration C where the invisible reader is at the parent (i.e. it is just about to read the register at the parent) and it is 'unsafe' for the invisible reader to return v. It is 'safe' to return a value at a configuration if the reader can do so without violating the semantics of regular registers. It is 'unsafe' otherwise. After configuration C, if we never write the value v again, then it will forever be unsafe for the invisible reader to return the value v. This means that the register at the parent node can never again point to the leaf with value v because if it did, the invisible reader paused at the parent might execute and return v, an unsafe value. So the register at the parent can take on one less value. This register could also appear in other parts of the decision tree, and it would have a reduced value set everywhere it appears. Therefore, by removing the value v, we can reduce the total fanout of the implementation by 1. A more detailed version of this argument appears in the proof.
Before diving into the main technical lemma, we first define some useful notation. Note that the definition doesn't care how many readers there are as long as there is at least one invisible reader.
Proof. Suppose E(m, n, S) is true. Then there exists an algorithm A_m which satisfies the conditions from Definition 3.2. Our goal is to construct an algorithm A_{m−1} to show that E(m − 1, n, S − 1) is also true. We begin by setting A′_m = A_m and running the following process on A′_m. The goal of this process is to minimize A′_m until we find a decision tree node and a configuration with desirable properties. We say that it is 'safe' for a reader to return a value at a configuration C if the reader can return the value without violating regular register semantics.
1. Let T be the decision tree of an invisible reader in algorithm A′_m. Let r be the reader process that runs this decision tree.
2. Consider the set of leaves in T that are closest to the root. Let ℓ be any leaf in this set and let v be the value of ℓ.
3. Since m ≥ 2, ℓ can't be the root of T, so ℓ must have some parent node p.
4. If it is safe for r to return v in all configurations where r is at node p, then replace the subtree rooted at p with the leaf ℓ (this replacement maintains the correctness of the decision tree). Repeat from step 1 using this new algorithm A′_m.
5. Otherwise, we know that there exists a configuration C where reader r is at node p and it is not safe for r to return v. We have found the decision tree node p and the configuration C that we were looking for, so the process terminates.
This process is guaranteed to terminate within a finite number of iterations because A′_m is initially obstruction-free. This means that there is a finite number of nodes between the root of T and its closest leaf in the initial iteration. Each iteration reduces this distance by 1, so the process will eventually terminate.
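The minimization process above can be sketched in Python as follows. This is a minimal sketch, not the construction itself: the classes `Leaf` and `Node`, the function names, and in particular the oracle `always_safe(p, v)` are hypothetical. The oracle stands in for the condition in step 4 (it is safe for r to return v in every configuration where r is at node p), which depends on the whole algorithm and is not computed here.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple, Union


@dataclass
class Leaf:
    value: int                      # value returned by the reader


@dataclass
class Node:
    register: str                   # physical register read at this node
    # Maps each possible content of the register to the next subtree.
    children: Dict[int, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, Node]


def closest_leaf(root: Node) -> Tuple[Optional[Node], Optional[int], Node, Leaf]:
    """Breadth-first search for a leaf nearest the root (step 2).

    Returns (parent of p, key leading to p, p, leaf), where p is the
    parent of the leaf (step 3). The first two entries are None if p is the root.
    """
    queue = deque([(root, None, None)])
    while queue:
        node, parent, key = queue.popleft()
        for child in node.children.values():
            if isinstance(child, Leaf):
                return parent, key, node, child
        # No leaf child here, so every child is an internal node.
        for k, child in node.children.items():
            queue.append((child, node, k))
    raise ValueError("decision tree has no leaf")


def find_unsafe_parent(root: Node,
                       always_safe: Callable[[Node, int], bool]) -> Tuple[Node, Leaf]:
    """Repeat steps 1 to 4 until step 5 applies; return the node p and leaf found."""
    while True:
        grandparent, key, p, leaf = closest_leaf(root)
        if not always_safe(p, leaf.value):
            return p, leaf          # step 5: some configuration at p is unsafe for v
        if grandparent is None:
            # The reader would always return the same value, impossible for m >= 2.
            raise ValueError("reader returns a constant value")
        # Step 4: replace the subtree rooted at p with the leaf.
        grandparent.children[key] = Leaf(leaf.value)
```

Each pass through the loop either stops or moves the closest leaf one level nearer to the root, which mirrors the termination argument given above.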
Before we get to the main part of the proof, we fix a minor technical issue with A′_m. In step 4 of the process we may have deleted some registers and reduced the total fanout of A′_m. Ideally we would like A′_m to have the same register count and total fanout as the original A_m. This can be achieved by 'padding' A′_m with dummy registers until it reaches n registers and S total fanout. These registers do not impact the algorithm; they are just there to increase the space complexity and total fanout. In general, if an algorithm uses x registers and has y total fanout, we can pad the algorithm so that it uses x′ > x registers and has y′ > y total fanout as long as y′ − y ≥ x′ − x (since adding a register increases the total fanout by at least 1). Now we can say that A′_m uses n registers and has S total fanout.
Everything is in place for our main argument. If there is a write in progress at configuration C, then run it to completion and call the resulting configuration C′. Otherwise, there is no pending write, so we let C′ equal C. C′ will be the initial state of our (m − 1)-value register implementation.
Since it is not safe for reader r to return value v at configuration C, we know that there is no partial write of v at configuration C. Therefore it is also not safe for r to return v at configuration C′. We will keep the reader r paused at node p. Suppose there are no more writes of v after configuration C′. Then, after configuration C′, r will never be allowed to return v (if it did, it would violate regular register semantics). This means that the node p will never be allowed to point to the leaf ℓ after configuration C′ (if it did, then we would resume r and r would read p and return v). Note that we do not actually need to pause the reader r at node p. Since the reader r is invisible, the other processes do not know whether or not the reader is paused, so p cannot be changed to point to ℓ. Therefore, if we remove v from the value set starting from configuration C′, the algorithm A′_m actually implements an (m − 1)-value regular register using n space and S − 1 fanout (the fanout of register p is reduced by 1 since it never again points to ℓ). This algorithm is also obstruction-free and has an invisible reader r, so it proves that E(m − 1, n, S − 1) is true.
The next lemma is proven by inductively applying the previous lemma. Intuitively, it says that S must be large if m is large. The S − n + 1 term in the lemma statement looks mysterious at first, but it is actually just the number of leaves in a rooted tree with n internal nodes and total fan-out S. In the rooted tree context, total fan-out just means the sum of the number of children at each internal node.
Proof. This proof is by induction on m. In the base case where m = 1, this lemma holds because the total fanout S is always at least as large as the number of registers n. This means that S − n + 1 ≥ 1 = m. Now suppose that the lemma holds for some m − 1 ≥ 1. We want to show that it holds for m as well. Pick any n and S such that E(m, n, S) is true. Since m ≥ 2, by Lemma 3.3, we know that E(m − 1, n, S − 1) is true as well. By the inductive hypothesis, we know that (S − 1) − n + 1 ≥ (m − 1), which means that S − n + 1 ≥ m as required.
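To make the inequality S − n + 1 ≥ m concrete, the following Python sketch evaluates it for given register sizes and specializes it to the case where every physical register can represent b values (so S = nb). It assumes that the total fanout is the sum of the number of values each physical register can represent, which is consistent with S = nb above; the function names are illustrative only.

```python
from typing import Sequence


def total_fanout(register_sizes: Sequence[int]) -> int:
    """Assumed definition: sum of the number of values each register can hold."""
    return sum(register_sizes)


def max_values(register_sizes: Sequence[int]) -> int:
    """Largest m permitted by S - n + 1 >= m for these physical registers."""
    n = len(register_sizes)
    return total_fanout(register_sizes) - n + 1


def min_registers(m: int, b: int) -> int:
    """Smallest n satisfying n*b - n + 1 >= m, i.e. ceil((m - 1) / (b - 1))."""
    if m < 1 or b < 2:
        raise ValueError("need m >= 1 and b >= 2")
    # Ceiling division using integer arithmetic only.
    return -(-(m - 1) // (b - 1))
```

For example, `min_registers(8, 2)` evaluates to 7, which agrees with the (m − 1)/(b − 1) register count quoted earlier for the case where m is a power of b.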
Acknowledgements
Special thanks to Faith Ellen and Peter (Tian Ze) Chen for the many helpful discussions. I would also like to thank Alexander Spiegelman for noticing that this proof works for more than just wait-free algorithms.
