In this work, we propose an adder for the 2-Dimensional Nearest-Neighbor, Two-Qubit gate, Concurrent (2D NTC) architecture, designed to match the architectural constraints of many quantum computing technologies. The chosen architecture allows the layout of logical qubits in two dimensions with √n columns, where each column has √n qubits, and the concurrent execution of one- and two-qubit gates with nearest-neighbor interaction only. The proposed adder works in three phases. In the first phase, the first column generates the summation output and the other columns do the carry-lookahead operations. In the second phase, these intermediate values are propagated from column to column, preparing for computation of the final carry for each register position. In the last phase, each column, except the first one, generates the summation output using this column-level carry. The depth and the number of qubits of the proposed adder are Θ(√n) and O(n), respectively. The proposed adder executes faster than the adders designed for the 1D NTC architecture when the length of the input registers n is larger than 51.
INTRODUCTION
A two-page short abstract was presented at AQIS 2010. This research is supported in part by the Japan Society for the Promotion of Science (JSPS) through its Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program), and in part by the National Research Foundation of Korea Grant funded by the Korean Government (Ministry of Education, Science and Technology) [NRF-2010-359-D00012]. Authors' addresses: B.-S. Choi (corresponding author), Center for Quantum Information Processing, Department of Electrical and Computer Engineering, University of Seoul, Seoul, 130-743, Republic of Korea; email: bschoi3@gmail.com; R. Van Meter, Faculty of Environment and Information Studies, Keio University, Fujisawa, Japan.
Quantum computers have been proposed to exploit the exotic properties of quantum mechanics for information processing. Among many potential uses, two quantum algorithms have received the bulk of the attention. One is Shor's large number factoring algorithm [Shor 1997], and the other is Grover's unstructured database search algorithm [Grover 1996], though there has also been much progress recently on other algorithms [Mosca 2008; Bacon and van Dam 2010; Brown et al. 2010]. Quantum algorithms are often shown to be more efficient than classical ones by analyzing the number of queries to an oracle.
However, for a more exact performance analysis, we need to analyze the quantum algorithms in terms of the detailed quantum circuits necessary to implement them. As in classical computation, arithmetic forms a core set of subroutines whose behavior strongly affects the performance of the overall algorithm; hence we focus on the adder in this work.
Numerous quantum addition circuits have been proposed using abstract models of the computer itself [Vedral et al. 1996; Beckman et al. 1996; Fredkin and Toffoli 1982; Feynman 1996; Glassner 2001; Cheng and Tseng 2002; Cuccaro et al. 2004; Draper 2000; Draper et al. 2006; Takahashi and Kunihiro 2005; Van Meter and Itoh 2005] . Incorporating the behavior of these circuits, we can estimate the overall quantum speedup more accurately than simply addressing the issue at the query level, and confirm again that the quantum speedup is very high. However, it is not possible to determine the exact performance gain unless the practical issues of architecture are considered; both the constant factors and the leading order of both the computational complexity and minimum execution time (or circuit depth) depend on the assumed underlying machine. Hence, we have to consider many issues such as error correction, communication, gate, and qubit technologies [Metodi and Chong 2006] . For example, Maslov et al. [2008] pointed out the importance of the problem of placing circuit variables on the underlying qubit layout. Unfortunately, it is impossible to consider all practical issues at the same time. To avoid this problem, we usually define a model quantum computer architecture incorporating as many practical constraints as possible. For many quantum computer architectures, the 2D NTC architecture is a reasonable model capturing the key factors that impact performance, which will be explained precisely in Section 3. The 2D structure allows a single qubit to interact with four neighboring qubits. With more neighboring qubits than the 1D case, the 2D layout should show higher performance, thanks to reduced distance between many pairs of qubits and the potential for more concurrent movement of qubits. Likewise, a 3D layout should show higher performance than the 2D case, but the complexity of fabricating and controlling qubits in three dimensions likely makes it impractical. 
Therefore, we believe that the 2D layout is the most reasonable choice at the middle level of performance and control overhead. Thus, it would be interesting to understand the quantum speedup in this context. Surprisingly, given the likely importance of two-dimensional physical and logical layouts for real-world systems, as far as we know, there are no quantum addition circuits designed specifically for the 2D NTC architecture. Van Meter and Oskin indicated that an adder could be constructed with O( √ n) time complexity on a 2D architecture, but no circuit was provided [Van Meter and Oskin 2006] . Therefore, it would be worthwhile to design explicitly a quantum addition circuit for the 2D NTC architecture and estimate the performance gain. Based on this, our contributions are as follows.
-Propose a quantum adder on the 2D NTC architecture. First, we lay out the qubits in a √ n × √ n array where n is the input size, in qubits. Based on this layout, we propose a three-phase quantum addition algorithm. In the first phase, the first column does a ripple-carry addition and the other columns do carry-lookahead operations. In the second phase, the column-level carry is propagated in ripple fashion between the columns. In the last phase, each column transports its column-level carry input into the cells to generate the final summation value.
While this circuit builds on known classical techniques, the management of information flow in the 2D structure results in a unique hybrid structure with distinct advantages over existing circuits.
A Θ(√n)-Depth Quantum Adder on the 2D NTC Quantum Computer Architecture 24:3
-Analyze the proposed adder. We decompose the necessary quantum circuit blocks using only one- and two-qubit gates. Next, we add SWAP operations as necessary to transport qubits in order to satisfy the NTC constraint. We found that the depth of the proposed adder is 140√n − 72, in terms of one- and two-qubit gates. Asymptotically, the depth is Θ(√n), meeting the depth lower bound we established in earlier work [Choi and Van Meter 2011]. To execute many quantum gates in parallel, the proposed adder utilizes 2n − √n working qubits.
-Compare to other adders. Since the 2D NTC layout generalizes the 1D NTC architecture, the adders designed for the 1D NTC architecture can also be implemented on the 2D NTC architecture without modification. After reevaluating the depth of the adders for the 1D NTC architecture, we find that our new 2D adder works faster when n ≥ 51, and is about six times faster for n = 2048.
This article is organized as follows. Fundamental background on universal gates, fault tolerance, and the chosen quantum computer architectures is given in Section 2. Some related work is discussed in Section 3. The proposed adder is explained in Section 4, including the layout of qubits, information and circuit flows, block- and circuit-level decompositions, and clearing ancilla qubits. The temporal and spatial resources are analyzed and compared with other designs in Section 5. Finally, we conclude this work and point out some constraints and future work in Section 6.
BACKGROUND
In this section, we discuss our choice of computationally universal quantum gates, connecting that first to the issues of fault tolerance in quantum computers, then to quantum computer architectures. In this article, we assume that readers are familiar with the basic concepts of qubits and gates, and the graphical notation for representing gates and circuits. Those unfamiliar with quantum computation may wish to refer to popular [Williams and Clearwater 2000] or technical [Nielsen and Chuang 2000] texts for more in-depth discussion.
Computationally Universal Gates
Like NAND or NOR, which (when combined with fanout) are universal gates for classical computation, there are sets of elementary gates that can be used to execute any arbitrary quantum computation. Any given quantum circuit can be decomposed into a sequence of these gates. A typical set is the H, T, and CNOT gates, defined as follows and represented graphically as in Figure 1.
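\[ H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix} \tag{1} \]
\[ \mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \tag{2} \]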
Although the preceding three gates are a minimal set of universal gates, it is convenient to add another gate to our set. The SWAP gate exchanges the contents of two qubits and is used to move variables around inside the system.
Fault Tolerance
In contrast to modern classical digital circuits, which are very robust, quantum computers are very sensitive to errors induced by both the environment and imperfect control processes. Therefore, Quantum Error Correction (QEC) is applied to encode a logical qubit in multiple physical qubits, and fault-tolerant methods for executing logical gates on the encoded logical states without propagating errors have also been proposed [Aharonov and Ben-Or 2008]. In many QEC schemes, certain logical gates are straightforward, while others require complex constructions. In Calderbank-Steane-Shor (CSS) codes, the H gate can be executed on the encoded state simply by applying H gates to each of the component qubits. Similarly, the CNOT and SWAP can be executed in a transversal fashion, in which the same physical gate is applied to corresponding pairs of qubits in the two logical states. The T gate, however, requires the use of a specially prepared ancilla state; further details are shown in Gottesman [1998, 2009]. The Toffoli, or CCNOT, gate cannot be executed directly, but instead requires decomposition into smaller components. This can be done using CNOTs and T gates. While other papers on quantum adders have used Toffoli gate count or a five-qubit decomposition, these approaches are difficult to implement directly in fault-tolerant systems, so we have instead chosen a T gate-based decomposition consisting of 17 gates (14 unit-time steps), as shown in Figure 2, in order to provide more realistic performance numbers.
Thus, although the exact execution time depends on the details of the logical gates and their physical implementations, treating CNOT, SWAP, H, and T gates as unit time is a reasonable approximation at the logical level for a broad range of possible systems.
Quantum Computer Architectures
Because a quantum computing system is more than just a collection of devices, the quantum computer architecture also has to be defined in conjunction with the chosen quantum technology. When a quantum technology is described, we can classify the architectures that can be built based on the achievable physical operations. Among many physical constraints we usually focus on four parameters: the gate distance, the gate width, the possibility of concurrent execution of gates, and the physical and logical layout of qubits. The first issue, gate distance, determines the impact of the physical connectivity graph. We can roughly divide architectural models into two classes: those that assume that a connectivity graph exists and gates can be executed only between qubits that are connected, and those that assume that gates can be executed over any distance without penalty, ignoring communication costs. The former we refer to as nearest-neighbor architectures, the latter as arbitrary-distance architectures. Möttönen and Vartiainen [2006] and Shende et al. [2006] studied the effect of the interaction distance on the circuit decomposition, showing the necessity of introducing additional gates, resulting in some overhead. An arbitrary-distance architecture supporting concurrent gate operations, if feasible, would exhibit better performance on many problems. Arbitrary-distance interaction can be approximated by using "flying qubits" or measurement-based quantum computing [Raussendorf et al. 2003]. Unfortunately, truly arbitrary-distance interaction is hard to implement, hence the nearest-neighbor approach is generally used with an appropriate connectivity graph, discussed further shortly.
Second, the number of qubits that a gate affects impacts the performance of the computer. Although we can imagine any arbitrary gate with many input qubits, abstract algorithmic papers frequently assume a three-qubit Toffoli gate is available, while papers concentrating on a physical design usually match their gate set to the technology, generally allowing only one- or two-qubit gates. Barenco et al. [1995] showed one method for decomposing a given quantum circuit into two-qubit gates.
Third, the possibility of concurrent execution of gates affects the speed of the computer. Steane [1998] investigated the necessity of concurrent execution for error correction and fault tolerance. Scalable error correction of quantum memory requires (n) concurrent gates. On top of this physical structure, concurrent execution of logical gates is generally straightforward.
Fourth, the layout of qubits affects the performance. With these basic constraints in place, we can more fully discuss the connectivity graph. The physical structure will allow certain two-qubit interactions to take place, which can be represented as edges connecting the nodes (representing qubits) in a graph. Any quantum technology based on lithographic techniques (e.g., most quantum dots [Shin et al. 2010] , Josephson junction circuits of all forms , and some scalable ion traps [Kielpinski et al. 2002] ) is implemented on a plane. The interactions are generally limited to nearest neighbors on that plane, with the qubits organized in a line, a two-dimensional mesh, or perhaps a recursive H structure [Copsey et al. 2003 ]. Physically, we can consider a one-, two-, and three-dimensional square lattice layout of qubits as shown in Figure 3 . In the figure, a qubit can interact with two, four, or six neighboring qubits on the 1D, 2D, and 3D NTC architectures, respectively. Some trapped-ion systems [Häffner et al. 2005 ] and liquid Nuclear Magnetic Resonance (NMR) [Laforest et al. 2007] technologies are experimental systems based on a one-dimensional qubit layout. The original Kane proposal [Kane 1998 ] is also based on this model. Arrays of trapped ions [Häffner et al. 2005 ] and Josephson junctions [Helmer et al. 2007; Douçot et al. 2004 ] are being designed in two-dimensional arrangements. By stacking two-dimensional planes, we can create a three-dimensional layout [Pérez-Delgado et al. 2006 ]. 
RELATED WORK
In large systems, the physical structure is first used to implement quantum error correction. Szkopek et al. [2006], Svore et al. [2005], and others have investigated the effect of the limited interaction distance on Calderbank-Steane-Shor (CSS) code [Calderbank and Shor 1996; Steane 1996] threshold values. Any CSS code-based error correction mechanism [Devitt et al. 2009] will result in a layout of logical qubits that is of similar or lower dimension than the physical system on which it resides. Thus, we can see that the logical and physical connectivity graphs are generally related, and that one- and two-dimensional layouts are likely to figure prominently in physically realizable designs.
Among prominent error correction approaches, topological error correction mechanisms [Raussendorf and Harrington 2007] behave differently, and may require different conceptual tools. Like the measurement-based approaches mentioned before, they can approximate long-distance interactions but at the expense of significant "wiring" resources consuming space in the system, giving this problem more of the flavor of a classical place-and-route circuit design problem.
Although we can conceive of several architectures with the aforesaid parameters, we focus on a specific logical architecture with the nearest-neighbor only limitation, two-qubit gates, concurrent execution of gates, and two-dimensional layout of qubits, which we call the 2D NTC architecture [Van Meter and Itoh 2005] .
At the application level, a few researchers have investigated circuits within a particular architectural context. Maslov [2007] and Takahashi et al. [2007], for example, studied the quantum Fourier transform and the stabilizer code with the nearest-neighbor constraint.
Numerous arithmetic circuits have been proposed, mostly without explicit reference to an architectural model. The basic elementary quantum arithmetic operations including addition have been proposed by Vedral et al. [1996] and Beckman et al. [1996], following seminal work on elementary reversible full- and half-adders by Fredkin and Toffoli [1982] and Feynman [1996]. Glassner proposed a one-qubit full adder [Glassner 2001]. Subsequently, Cheng and Tseng proposed an n-qubit full adder and subtractor based on the work of Glassner [Cheng and Tseng 2002]. Reducing the space requirements for those earlier adders [Vedral et al. 1996
ADDER ON THE 2D NTC ARCHITECTURE
In this section, we first briefly explain the idea of the proposed adder. Second, we show how the qubits are laid out on the 2D structure. Third, we explain an addition algorithm based on a slight modification of carry-lookahead addition. Fourth, we discuss how the addition algorithm is mapped to the circuit blocks. Finally, we show how the ancilla qubits can be initialized.
To exploit the 2D NTC architecture, we propose a three-phase method. The given 2D layout can be grouped into √n columns. In the first phase, the first column does a simple ripple-carry addition. At the same time, the other columns do the prefix addition. In the second phase, the carry output of the first column propagates into the following columns sequentially to generate the column-level carry inputs. In the third phase, each column, except the first, generates the summation output sequentially.
Proposed Layout of Qubits
On the 2D NTC structure, we can lay out the qubits as shown in Figure 4. In the figure, the two input registers are
\[ A = a_n \cdot 2^{n-1} + a_{n-1} \cdot 2^{n-2} + \cdots + a_1, \qquad B = b_n \cdot 2^{n-1} + b_{n-1} \cdot 2^{n-2} + \cdots + b_1. \]
As shown in the figure, the two inputs $a_i$ and $b_i$ are interleaved, where $1 \le i \le n$. The numbers of rows and columns are $2\sqrt{n}$ and $\sqrt{n}$, respectively. We group $a_i$ and $b_i$ together as a cell and define a cell location notation (k-th column, j-th row), where $k = \lceil i/\sqrt{n} \rceil$ and $j = i - (k-1)\sqrt{n}$. The figure shows only the input qubits for clarity. For simplicity, we assume without loss of generality that $\sqrt{n}$ is an integer. Note that the chosen layout of qubits is one of many possibilities. Since practical constraints such as those imposed by the NTC architecture affect the overall circuit decomposition, the layout of qubits has to be chosen properly. While we believe our layout is good, finding the single optimal solution is a classical place-and-route problem. Software tools for this remain to be developed. Therefore, we restrict ourselves to a single arrangement for this article, but leave the further optimization as future work. Recent research results may point the way to additional improvements in this area [Hirata et al. 2009; Saeedi et al. 2010].
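The cell location rule can be checked with a few lines of Python. This is only an illustrative sketch: the function name `cell_location` is hypothetical, and it assumes the column index is rounded up, k = ⌈i/√n⌉, which is the reading consistent with the nine-qubit example given later (positions 4 to 6 in the second column).

```python
import math


def cell_location(i, n):
    """Return (k, j): the column k and row j of the cell holding a_i and b_i.

    Assumes n is a perfect square and bit positions run from 1 to n.
    """
    root = math.isqrt(n)
    if root * root != n:
        raise ValueError("n must be a perfect square")
    if not 1 <= i <= n:
        raise ValueError("bit position i must satisfy 1 <= i <= n")
    k = -(-i // root)          # ceil(i / sqrt(n)) in integer arithmetic
    j = i - (k - 1) * root     # row inside column k
    return k, j


if __name__ == "__main__":
    # Layout for n = 9: three columns with three cells each.
    for i in range(1, 10):
        print(i, cell_location(i, 9))
```

With n = 9, positions 1 to 3 fall in column 1, positions 4 to 6 in column 2, and positions 7 to 9 in column 3.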
Three Phases
To set the stage for the later arithmetic discussions, let us first review the addition of two n-qubit input registers, a and b. Since the sum for the i-th position, $s_i$, is generated by $a_i \oplus b_i \oplus c_i$, where $a_i$ and $b_i$ are the i-th qubits in the input registers and $c_i$ is the carry input from the summation of the (i − 1)-th position, the time complexity of the addition depends on how fast the carry information can be transported between the bit positions.
Fig. 4. Two inputs are $A = a_n \cdot 2^{n-1} + a_{n-1} \cdot 2^{n-2} + \cdots + a_1$ and $B = b_n \cdot 2^{n-1} + b_{n-1} \cdot 2^{n-2} + \cdots + b_1$. The i-th qubit is located at the (k, j) position, where $k = \lceil i/\sqrt{n} \rceil$ and $j = i - (k-1)\sqrt{n}$. Ancilla qubits are not shown for simplicity.
The simplest circuit is the ripple-carry adder, which propagates the carry information stepwise from position to position. The carry output for the (i + 1)-th position, $c_{i+1}$, should be one if a majority of the bits $a_i$, $b_i$, and $c_i$ are one, and zero otherwise; it is generated by $a_i \cdot b_i \oplus a_i \cdot c_i \oplus b_i \cdot c_i$. Therefore, the final summation value $s_n$ is generated only after n ripple-carry time steps. To reduce this time, a carry-lookahead method was devised. In this method, two additional values are defined as follows.
\[ g_i = a_i \cdot b_i, \qquad p_i = a_i \oplus b_i. \]
Implicitly, $g_i$ and $p_i$ determine whether this bit position generates a carry out independent of the carry in, or propagates its incoming carry to its output carry, respectively. Only one of these may be true, though both may be false (a property called carry kill, though kill is not necessary in the actual circuit). The carry output for the (i + 1)-th position is generated as $c_{i+1} = g_i \oplus p_i \cdot c_i$. Therefore, if $g_i$ is one, $c_{i+1}$ has no dependence on $c_i$, and hence the carry chain is disconnected. However, if $g_i$ is zero and $p_i$ is one, $c_{i+1}$ is dependent on $c_i$. In the worst case, the longest chain is from $c_1$ to $c_n$. To decompose this long chain into subunits, two variables $G[i, j]$ and $P[i, j]$ are also defined as follows.
\[ P[i, j] = p_i \cdot p_{i+1} \cdots p_j, \qquad G[i, j] = g_j \oplus p_j \cdot G[i, j-1], \qquad G[i, i] = g_i. \]
Here $G[i, j]$ indicates that a carry is generated within the span from position i to position j, and $P[i, j]$ indicates that a carry is propagated from position i all the way to position j. By calculating these values concurrently and progressively increasing the span of G and P, the total time to create complete carry information for the entire register can be reduced to Θ(log n), provided that communication within the system is adequately fast. Unfortunately, this carry-lookahead addition algorithm is defined assuming no limitation of interaction distance, and hence cannot be applied for the 2D NTC architecture without modification. In this work, we modify the carry-lookahead, creating a circuit of three phases as follows.
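Before turning to the three phases, the relation between the majority-based ripple carry and the generate/propagate form can be checked classically. The following Python sketch is an illustration only: the function names are hypothetical, it works on classical bits rather than qubits, and it grows the span of G and P one position at a time (as a single column of the proposed adder does) rather than using the logarithmic-depth tree of a full carry-lookahead adder.

```python
def bits(x, n):
    """1-indexed bit list: bits(x, n)[i] is the bit of weight 2**(i-1)."""
    return [None] + [(x >> (i - 1)) & 1 for i in range(1, n + 1)]


def ripple_carries(a, b, n):
    """Carries c[1..n+1] from the majority rule, with c[1] = 0."""
    c = [None, 0] + [0] * n
    for i in range(1, n + 1):
        c[i + 1] = (a[i] & b[i]) ^ (a[i] & c[i]) ^ (b[i] & c[i])
    return c


def lookahead_carries(a, b, n):
    """Carries from spans anchored at position 1: c[j+1] = G[1,j] xor P[1,j]*c[1]."""
    g = [None] + [a[i] & b[i] for i in range(1, n + 1)]   # generate
    p = [None] + [a[i] ^ b[i] for i in range(1, n + 1)]   # propagate
    c = [None, 0] + [0] * n
    G, P = g[1], p[1]                                     # G[1,1], P[1,1]
    c[2] = G ^ (P & c[1])
    for j in range(2, n + 1):
        G = g[j] ^ (p[j] & G)                             # G[1,j]
        P = p[j] & P                                      # P[1,j]
        c[j + 1] = G ^ (P & c[1])
    return c


def add(x, y, n):
    """Add two n-bit integers using the lookahead carries."""
    a, b = bits(x, n), bits(y, n)
    c = lookahead_carries(a, b, n)
    total = sum((a[i] ^ b[i] ^ c[i]) << (i - 1) for i in range(1, n + 1))
    return total | (c[n + 1] << n)


if __name__ == "__main__":
    print(add(11, 6, 4))
```

Both carry computations agree for every pair of inputs, which is the property the three-phase circuit relies on.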
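Phase 1: Ripple-Carry Addition on the First Column, and Carry-Lookahead on the Other Columns. As shown in Figure 5(a), the first column does the typical ripple-carry addition. From the first position to the last position, each position generates a summation value and a carry output as follows. We have
\[ s_i = a_i \oplus b_i \oplus c_i, \qquad c_{i+1} = a_i \cdot b_i \oplus a_i \cdot c_i \oplus b_i \cdot c_i, \]
where $c_1 = 0$. Since the carry output of the i-th position must be used as input for the next (i + 1)-th position, there is an information dependency, causing this step to take Θ(√n) time for the √n qubits in this column. During this time, the other columns concurrently generate other necessary information for carry-lookahead operations as shown in Figure 5(b). For example, the k-th column works as follows. First, each (k, j) cell generates $g_{(k-1)\sqrt{n}+j}$ and $p_{(k-1)\sqrt{n}+j}$ concurrently,
\[ g_{(k-1)\sqrt{n}+j} = a_{(k-1)\sqrt{n}+j} \cdot b_{(k-1)\sqrt{n}+j} \]
and
\[ p_{(k-1)\sqrt{n}+j} = a_{(k-1)\sqrt{n}+j} \oplus b_{(k-1)\sqrt{n}+j}, \]
where $1 \le j \le \sqrt{n}$. After that, each (k, j) cell with $1 < j \le \sqrt{n}$ generates
\[ G[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+j] = g_{(k-1)\sqrt{n}+j} \oplus p_{(k-1)\sqrt{n}+j} \cdot G[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+j-1], \]
\[ P[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+j] = p_{(k-1)\sqrt{n}+j} \cdot P[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+j-1] \]
sequentially, where
\[ G[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+1] = g_{(k-1)\sqrt{n}+1}, \qquad P[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+1] = p_{(k-1)\sqrt{n}+1}. \]
The same process is applied for the other columns. After this phase, the first column generates its final summation output and also the carry output $c_{\sqrt{n}+1}$. The other columns generate the column-level carry-lookahead information $G[(k-1)\sqrt{n}+1, k\sqrt{n}]$ and $P[(k-1)\sqrt{n}+1, k\sqrt{n}]$.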
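Phase 2: Inter-column Carry Propagation. The final carry output of the first column, $c_{\sqrt{n}+1}$, is given as an initial input value for the column-level carry generation logic as shown in Figure 6(a). Each column, except the first, generates its column-level carry output as
\[ c_{k\sqrt{n}+1} = G[(k-1)\sqrt{n}+1, k\sqrt{n}] \oplus P[(k-1)\sqrt{n}+1, k\sqrt{n}] \cdot c_{(k-1)\sqrt{n}+1}, \]
where $2 \le k \le \sqrt{n}$.
In the third phase, each column, except the first, uses its column-level carry input $c_{(k-1)\sqrt{n}+1}$ to generate the carry input for each of its remaining positions,
\[ c_{(k-1)\sqrt{n}+j} = G[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+j-1] \oplus P[(k-1)\sqrt{n}+1, (k-1)\sqrt{n}+j-1] \cdot c_{(k-1)\sqrt{n}+1}, \]
where $1 < j \le \sqrt{n}$.
After that, each cell can generate the final summation value as
\[ s_{(k-1)\sqrt{n}+j} = a_{(k-1)\sqrt{n}+j} \oplus b_{(k-1)\sqrt{n}+j} \oplus c_{(k-1)\sqrt{n}+j}. \]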
4.2.4. Example for 9 Qubits. To make our concepts concrete, we explain an example for nine qubits for each input arranged in three columns. The overall circuit at block level is shown in Figure 8 . In the figure, T i is time in units of circuit blocks rather than individual gates. The circuit blocks and their circuit-level decompositions will be explained in the following sections.
In Phase 1, Column 1 does the ripple-carry addition as shown in Figure 5 (a) and (b). At the same time, the other columns, Column 2 and Column 3 generate g i and p i concurrently at T 1 , and G [ j,k] and P [ j,k] sequentially through T 2 and T 3 as shown in Figure 5 (b). After this phase, three summation values, s 1 , s 2 , and s 3 , are calculated. The final carry output of the first column, c 4 , is also generated. Using the incoming carry for each column, all carries for each position are generated sequentially. After that, the final summations are generated concurrently.
In Phase 2, as shown in Figure 6(b) , the carry output of the first column, c 4 , is transported to the next column to generate the carry output of the second column, c 7 , at time step T 4 . Likewise at time step T 5 , the carry output of the third column, c 10 , is generated. Note that c 10 is the final carry output of the whole addition.
In Phase 3, the column-level carry outputs, c 4 and c 7 , are used for generating carry inputs for each bit position as shown in Figure 7(b). At time step T 6 , c 6 and c 9 are generated from c 4 and c 7 by using the C blocks. After that, c 4 and c 7 are transported to the lower row. In this example, the lower row is the first row, and hence the C1 blocks are used for generating c 5 and c 8 at time step T 7 . Finally, c 4 and c 7 are transported to the block for summation. After these steps, all of the inputs for the sum blocks have been prepared, and hence the sum blocks can generate the final summation outputs from s 4 to s 9 concurrently at time step T 8 .
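The information flow of the three phases can be traced with a purely classical simulation. The sketch below is an assumption-laden illustration, not the quantum circuit itself: `three_phase_add` is a hypothetical name, it operates on ordinary bits, it ignores qubit placement, SWAPs, and timing, and it uses the span recurrences G[lo, i] = g_i ⊕ p_i · G[lo, i − 1] and P[lo, i] = p_i · P[lo, i − 1] anchored at the first position lo of each column.

```python
import math


def three_phase_add(A, B, n):
    """Classically trace the three phases; returns (A + B, carry list c[1..n+1])."""
    r = math.isqrt(n)
    if r * r != n:
        raise ValueError("n must be a perfect square")
    a = [0] + [(A >> (i - 1)) & 1 for i in range(1, n + 1)]   # 1-indexed bits
    b = [0] + [(B >> (i - 1)) & 1 for i in range(1, n + 1)]
    g = [x & y for x, y in zip(a, b)]                         # generate bits
    p = [x ^ y for x, y in zip(a, b)]                         # propagate bits
    c = [0] * (n + 2)                                         # c[1] = 0
    s = [0] * (n + 1)

    # Phase 1 (first column): ripple-carry addition of positions 1..sqrt(n).
    for i in range(1, r + 1):
        s[i] = a[i] ^ b[i] ^ c[i]
        c[i + 1] = (a[i] & b[i]) ^ (a[i] & c[i]) ^ (b[i] & c[i])

    # Phase 1 (other columns): grow G and P from the first cell of each column.
    G, P = {}, {}
    for k in range(2, r + 1):
        lo = (k - 1) * r + 1
        G[lo, lo], P[lo, lo] = g[lo], p[lo]
        for i in range(lo + 1, k * r + 1):
            G[lo, i] = g[i] ^ (p[i] & G[lo, i - 1])
            P[lo, i] = p[i] & P[lo, i - 1]

    # Phase 2: ripple the column-level carry from column to column.
    for k in range(2, r + 1):
        lo, hi = (k - 1) * r + 1, k * r
        c[hi + 1] = G[lo, hi] ^ (P[lo, hi] & c[lo])

    # Phase 3: per-position carries inside each column, then the sums.
    for k in range(2, r + 1):
        lo, hi = (k - 1) * r + 1, k * r
        for i in range(lo + 1, hi + 1):
            c[i] = G[lo, i - 1] ^ (P[lo, i - 1] & c[lo])
        for i in range(lo, hi + 1):
            s[i] = p[i] ^ c[i]

    total = sum(s[i] << (i - 1) for i in range(1, n + 1)) | (c[n + 1] << n)
    return total, c


if __name__ == "__main__":
    total, c = three_phase_add(0b101101011, 0b011011101, 9)
    print(total, c[4], c[7], c[10])
```

For nine-bit inputs, `c[4]`, `c[7]`, and `c[10]` correspond to the column-level carries of the example, with `c[10]` being the final carry output.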
Block-Level Decomposition
In the first phase, the first column and the other columns use different circuit blocks. The circuit blocks for the first column are shown in Figure 5(a) and (b). To do the ripple-carry addition, a half-adder (HA) for the first position and √n − 1 full-adders (FA) are used. The circuit blocks for the other columns are shown in Figure 5(b). As explained in the previous part, each of these columns first generates g (k−1) √ n+ j and p (k−1) √ n+ j concurrently by using the g, p circuit blocks, and then G[(k−1) √ n+1, (k−1) √ n+ j] and P[(k−1) √ n+1, (k−1) √ n+ j] sequentially by using the G, P circuit blocks.
The circuit block for the second phase is shown in Figure 6(b) . The circuit block Colcarry has three inputs: G and P from the corresponding column and Column carry from the lower column. Figure 7(b) shows the circuit blocks for the third phase. In the figure, c and c1 represent the blocks for generating carry output for i-th position. Note for the first row, p and g are the same as P and G, and hence the circuit block is slightly different. SUM, SUM1, and SUM2 are for generating the final summation value for j-th position.
Circuit-Level Decomposition
In this subsection, we decompose the circuit blocks using H, CNOT, T, and SWAP. For a systematic approach, we first consider the AC architecture and hence generate gate arrays without NTC constraints. Since our target architecture is 2D NTC, we then modify them with NTC constraints by adding several SWAP gates.
4.4.1. Circuit without NTC Constraints. We decompose the circuit blocks for the three phases with the chosen elementary gates. The decomposed circuits for each block are shown in Figures 9 to 13. Note that in each figure, the circuit on the left is for arbitrary-distance interaction (i.e., without NTC constraints), and on the right is for NTC-constrained systems.
The circuit of the HALF ADDER is shown in Figure 9 (a). Figure 9 (b)(left) [Cheng and Tseng 2002] shows a decomposition of FULL ADDER into elementary gates. The circuits for g and p, and the generalized G and P are shown in Figure 10 (a) and (b)(left), respectively. The circuit of Column carry is shown in Figure 11 (left). The initial circuit for Carry is shown in Figure 12 (a)(left). After this circuit, the Col carry is transported to the top position, and the other values in the column are shifted down one row. Since the carry for the first row is different from the other rows, Figure 12 (b)(left) shows its circuits. The circuits for SUM are shown in Figures 13(a) . For the second and the first rows, we have to use slightly different circuits as shown in Figures 13(b) (left) and 13(c), respectively.
Circuits with NTC Constraints. Now we modify the circuit blocks to satisfy the NTC constraints. The modified blocks for the NTC architecture are also shown in Figures 9 to 13. For example, to satisfy the NTC constraint, we redesign the circuit in Figure 9(b)(left) into Figure 9(b)(right) by adding several SWAP gates to move the qubits to neighboring positions. For generating |Col Carry k+1 ⟩, a single CCNOT is enough. However, to propagate it to the next column and to propagate |Col Carry k ⟩ to the rows, a SWAP is necessary. For implementing the last SWAP gate in the neighbor interaction only case, several SWAPs are necessary, as shown in Figure 11(right). Since the Col carry has to be moved to the upper row, several SWAPs are necessary as shown in Figure 12(a)(right). In this work, we have added the SWAPs by hand, but it is possible for compilation tools to perform this step automatically. This approach is also applied for the remaining circuit blocks.
Fig. 10. g, p, G, and P.
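To give a feel for the overhead that the nearest-neighbor constraint introduces, the following Python sketch counts the SWAPs needed to bring two qubits on a square grid next to each other. It is a simplified model under stated assumptions, not the hand placement used for the figures: the function names are hypothetical, only one of the two qubits is moved, the SWAPs are applied one after another, and no other traffic competes for the same sites.

```python
def manhattan(p, q):
    """Grid distance between sites p = (column, row) and q = (column, row)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def swaps_to_adjacent(p, q):
    """SWAPs needed to move the qubit at p until it neighbors the qubit at q."""
    d = manhattan(p, q)
    if d == 0:
        raise ValueError("p and q must be distinct grid sites")
    return d - 1          # each SWAP moves the qubit one site closer


def routed_gate_steps(p, q, restore=True):
    """Unit-gate steps for: move, apply one two-qubit gate, optionally move back."""
    s = swaps_to_adjacent(p, q)
    return s + 1 + (s if restore else 0)


if __name__ == "__main__":
    print(swaps_to_adjacent((1, 1), (3, 2)))   # distance 3
    print(routed_gate_steps((1, 1), (3, 2)))
```

Neighboring qubits need no SWAPs, while every additional site of separation adds one SWAP in each direction when the original layout must be restored.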
Clearing Ancillae Qubits
In general, qubits in quantum circuits serve three roles: input, output, and auxiliary qubits known as ancillae. The input and output are as in classical circuits. The ancilla qubits are used for holding the temporary results during computation. The overall circuit must have the same number of inputs and outputs in order to satisfy the reversible unitary condition. To both eliminate unwanted entanglement and free the ancillae for reuse, we must clear the ancillae. A typical way to do that is to apply the inverse circuit after copying the output as shown in Figure 14(a) . For adders, a slightly more efficient method is possible, as follows.
As shown in Table II, three types of ancilla qubits are used: $c_i$, $P[i, j]$, and Column carry$_k$. To clear these ancillae, we have used the strategy proposed by Draper et al. [2006]. The key idea of this approach is based on the observation that in two's complement arithmetic
\[ -x \equiv \bar{x} + 1 \pmod{2^n}, \]
where $\bar{x}$ is the bitwise inversion of x. Let us consider an addition of A and B, ADD(A, B, 0) = (A, S, C), where S and C are the bitwise sum and carry vectors, respectively. In this addition, A, B, and 0 are inputs for the addition, and A, S, and C are outputs. To satisfy the unitary condition, A is unchanged, but the other two inputs B and 0 change to S and C, respectively.
Meanwhile, the vector $\bar{B}$ can be generated by the addition of A and $\bar{S}$ as follows.
\[ A + \bar{S} \equiv A + (-S - 1) \equiv A - (A + B) - 1 \equiv -B - 1 \equiv \bar{B} \pmod{2^n}. \]
Based on this result, we can derive another addition, ADD(A, $\bar{S}$, 0) = (A, $\bar{B}$, D), where $\bar{B}$ and D are outputs for sum and carry vectors, respectively.
It is worth noting that C must be equal to D because of
\[ c_i = a_i \oplus b_i \oplus s_i = a_i \oplus \bar{b}_i \oplus \bar{s}_i = d_i. \]
Now we follow the circuit shown in Figure 14. Conceptually, any addition circuit can be divided into two parts, CARRY generation ($C_i$) and SUM generation ($S_i$). As shown in the figure, we first apply CARRY as follows.
As the second step, we apply SUM as follows.
Next, apply the two operations
Meanwhile,
Since the two carry vectors C and D for A + B and A + $\bar{S}$ are the same, the previous line changes to
Table I (fragment; depth in unit-gate steps). … : 1 CNOT + 2 SWAPs, depth 3. SUM2: 1 CNOT, depth 1.
Therefore, running the inverse operation,
Finally, apply NOT 2 as follows
to generate the final sum and clean ancillae.
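The clearing sequence can be checked on classical bit strings. The following is a minimal bit-level sketch, not the circuit of Figure 14. It assumes that CARRY writes the carry vector into a zeroed register by XOR, so that its inverse is the same XOR, and it does not model the CNOT step or any qubit layout. The function names `carry_vector` and `add_and_clear` are illustrative.

```python
def carry_vector(a, b, n):
    """Carries of the n-bit addition a + b; bit i holds the carry into position i."""
    carry, vec = 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # carry out of position i
        vec |= carry << (i + 1)
    return vec


def add_and_clear(a, b, n):
    """Track the registers (A, B-register, carry-register) through the clearing steps."""
    mask = (1 << n) - 1
    carry = 0
    carry ^= carry_vector(a, b, n)    # CARRY: (A, B, 0) -> (A, B, C)
    reg = a ^ b ^ (carry & mask)      # SUM:   (A, B, C) -> (A, S, C)
    reg ^= mask                       # NOT:   S -> bitwise inverse of S
    carry ^= carry_vector(a, reg, n)  # inverse CARRY: removes D, which equals C
    reg ^= mask                       # final NOT restores S
    return a, reg, carry


if __name__ == "__main__":
    n = 4
    for a in range(1 << n):
        for b in range(1 << n):
            a_out, s, c = add_and_clear(a, b, n)
            assert (a_out, s, c) == (a, (a + b) % (1 << n), 0)
    print("carry register is cleared for all", n, "bit inputs")
```

The assertion in the loop holds only because the carries of A + B and of A plus the inverted sum coincide; if they differed, the carry register would not return to zero.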
ANALYSIS

Depth Analysis
To analyze the depth of the proposed adder, we have to decompose the circuit blocks into elementary gates such as H, T, CNOT, and SWAP gates having unit delay. In addition, we also use a CCNOT gate which can be decomposed as shown in Figure 2 .
Since these gates can be fault-tolerantly implemented, we expect the proposed adder can be implemented fault-tolerantly as well. Based on the revised circuits satisfying the NTC constraint, we can summarize the depth of each elementary gate and circuit block as shown in Table I .
The proposed adder works in three sequential phases, and hence the overall depth is the sum of the depths of the phases. The depth of each phase is set by its "long pole", the longest delay among the parallel execution paths.

In the first phase, the first column executes one HA and (√n − 1) FA operations sequentially. Since an HA needs 15 unit-gate steps and an FA needs 32 unit-gate steps, the first column needs 15 + 32(√n − 1) = 32√n − 17 unit-gate steps. The other columns each need one g, p operation and (√n − 1) G, P operations, which takes 34√n − 19 unit-gate steps. The depth of the first phase is the longer of the two column types, hence 34√n − 19. The second phase consists of (√n − 1) Column carry operations, requiring a total of 18√n − 18 unit-gate steps. The longest path of the third phase consists of (√n − 1) Carry + Carry1 and SUM1 operations. Hence, its depth is 18√n + 1 unit-gate steps. Summing the depths of the three phases, the total depth is 70√n − 36.

This depth covers only the generation of the summation output, without clearing the ancillae. To clear the ancillae, we apply additional circuits as shown in Figure 14. Following this figure, we decompose the three phases into the carry generation flow and the sum generation flow. The first and second phases belong to the carry generation flow, while the third phase is divided between the two flows. The depth of 70√n − 36 is thus apportioned as 70√n − 39 for carry generation and 3 for sum generation. As shown in Figure 14, we then apply NOT and CNOT gates, followed by the inverse of the carry generation flow and the final NOT gate. Hence, the overall depth is (70√n − 39) + 3 + 1 + 1 + (70√n − 39) + 1 = 140√n − 72.

Table III (fragment). Adder | Depth | Size | Crossover 1) | KQ
RCA+CLA-based [Kawata et al. 2008] | 10 log n + 6n/log n | 4n/log n − n − 1 | N/A | 10n log n + O(n²/log n)
1) When is the present adder faster than the corresponding adder?
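The depth bookkeeping of this subsection can be re-derived mechanically. The sketch below takes the per-phase expressions stated in the text as given, with `r` standing for √n, and checks that they sum to the quoted totals. The function names are illustrative, and the per-block depths of Table I other than HA = 15 and FA = 32 are not assumed.

```python
def phase_depths(r):
    """Depths of the three phases in unit-gate steps, for r = sqrt(n) rows per column."""
    first_column = 15 + 32 * (r - 1)   # one HA followed by (r - 1) FAs
    other_columns = 34 * r - 19        # one g,p followed by (r - 1) G,P (as stated)
    phase1 = max(first_column, other_columns)
    phase2 = 18 * r - 18               # (r - 1) Column carry operations
    phase3 = 18 * r + 1                # longest path of the third phase
    return phase1, phase2, phase3


def depth_without_clearing(r):
    return sum(phase_depths(r))


def depth_with_clearing(r):
    sum_flow = 3
    carry_flow = depth_without_clearing(r) - sum_flow
    # carry flow, sum flow, NOT, CNOT, inverse carry flow, final NOT
    return carry_flow + sum_flow + 1 + 1 + carry_flow + 1


if __name__ == "__main__":
    for r in (4, 8, 16, 32):
        print(r * r, depth_without_clearing(r), depth_with_clearing(r))
```

For every r ≥ 1 the "other columns" path is at least as long as the first column, which is why the first phase is governed by 34√n − 19.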
Required Space
The number of qubits for the adder is shown in Table II. As shown in the first column, some qubits are used for multiple purposes. The number of additional qubits is 2n − √n, which is less than twice the minimum 2n qubits [Cuccaro et al. 2004; Takahashi and Kunihiro 2005].
Comparison to Other Adders
Beyond the asymptotic behavior, it seems more interesting and important to compare with other adders in practical cases. Specifically, it is necessary to compare against adders designed for the 1D NTC architecture, since they can be implemented on the 2D NTC architecture without modification, using a simple serpentine qubit layout. The overall analysis and the comparison between the types of adders are shown in Table III. The first column distinguishes the architecture and the second column lists the adder type. For the 1D NTC architecture, we choose three typical adders. Vedral et al. proposed a plain ripple-carry adder [Vedral et al. 1996]. Kawata et al. also proposed an adder based on the combination of a ripple-carry adder and a carry-lookahead adder [Kawata et al. 2008], labeled RCA+CLA-based.
For comparison, the depth and the size of each adder are shown in the third column. In this work, the depth is measured in units of one- and two-qubit gates for the 1D and 2D NTC architectures. The depth for the AC architecture is based on one-qubit, two-qubit, and CCNOT gates. The size is the number of qubits for input, output, and ancillae. The fourth column lists the input size at which our adder becomes faster than the corresponding adder. In the fifth column, we calculate KQ, the product of the number of logical qubits (K) and the number of elementary steps (Q) [Steane 2003]. KQ is used to estimate the strength of error correction required. KQ can also be used to evaluate trade-offs in space and time. A circuit that uses twice as many qubits can be considered favorable for error correction if its circuit depth is less than half the original. Among the adders for the 1D NTC architecture that we surveyed, CDKM has the lowest KQ at ∼36n².
From this table we can point out three key results. First, when the input size is larger than 51, the present adder works faster than all known 1D NTC adders. Second, the present adder needs about twice the number of qubits that the smallest 1D NTC adders use. Lastly, the present adder has a smaller KQ factor than the CDKM adder when the input size is larger than 215.
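The KQ comparison can be approximated from the figures quoted in the text. The sketch below is a rough check under stated assumptions: the depth of the present adder is taken as 140√n − 72, its qubit count as 2n input qubits plus 2n − √n additional qubits, and the CDKM figure as exactly 36n². The function names are illustrative. Because ∼36n² is itself an approximation, the crossover found this way should land near, but not necessarily exactly at, the value of 215 reported in Table III.

```python
import math


def proposed_depth(n):
    return 140 * math.sqrt(n) - 72      # depth including ancilla clearing


def proposed_qubits(n):
    return 2 * n + (2 * n - math.sqrt(n))  # assumed: inputs plus additional qubits


def proposed_kq(n):
    return proposed_qubits(n) * proposed_depth(n)


def cdkm_kq(n):
    return 36 * n * n                   # leading-order KQ quoted for CDKM


def kq_crossover(limit=10_000):
    """Smallest n below limit at which the proposed KQ drops under 36 n^2."""
    for n in range(1, limit):
        if proposed_kq(n) < cdkm_kq(n):
            return n
    return None


if __name__ == "__main__":
    print("KQ crossover at n =", kq_crossover())
```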
CONCLUSION AND OPEN PROBLEMS
In this work, we proposed a quantum adder for the 2D NTC architecture. Our adder has a depth complexity of Θ(√n) using O(n) qubits, making it asymptotically optimal for such a system. We found that the proposed adder works faster than CDKM, the fastest 1D NTC adder, when the length of the input registers is larger than 51, in exchange for about twice the number of qubits. This performance advantage continues to grow as the problem size increases, reaching a factor of six for n = 2048, the target size for systems to execute Shor's factoring algorithm.
Although this adder is, to the best of our knowledge, the first one specifically designed for a 2D NTC architecture, we suspect it will not be the last; we anticipate that several improvements are possible. First, the number of gates added to transport qubits to neighboring positions so that gates can be executed is larger than we would like. A different qubit layout may reduce the necessary propagation operations. Second, as with all known reversible addition circuits, the phase for clearing the ancilla qubits roughly doubles the total number of quantum operations. Perhaps there is some way to reduce this drawback by exploiting some overlap of the clearing phase with the computation phase. Third, as with other carry-lookahead adders, the number of ancillae is larger than for ripple-carry adders; this may prove to be a fundamental trade-off of space for time. The proposed design attempts to achieve the highest parallel execution at the expense of requiring more ancillae, but this trade-off may prove less than optimal for two reasons. First, qubits themselves are expensive resources, and in many applications could be allocated to other work if not used directly in the adder; second, inserting the ancillae into our layout increases the distance between qubits, forcing the addition of more SWAPs and slowing down the circuit. Last but not least, the place-and-route issue of qubits and gates has to be considered to achieve global optimization.
This circuit is one of only a handful that are designed to a set of explicit architectural constraints. We believe that the results shown here will be useful for a large number of realizable quantum architectures, and that the approach of architecture-aware design is critical for combining quantum technology, error correction, and applications into high-performance systems.
