We first show how to construct an O(n)-depth O(n)-size quantum circuit for addition of two n-bit binary numbers with no ancillary qubits. The exact size is 7n − 6, which is smaller than that of any other quantum circuit ever constructed for addition with no ancillary qubits. Using the circuit, we then propose a method for constructing an O(d(n))-depth O(n)-size quantum circuit for addition with O(n/d(n)) ancillary qubits for any d(n) = Ω(log n). If we are allowed to use unbounded fan-out gates with length O(n^ε) for an arbitrarily small positive constant ε, we can modify the method and construct an O(e(n))-depth O(n)-size circuit with o(n) ancillary qubits for any e(n) = Ω(log* n). In particular, these methods yield efficient circuits with depth O(log n) and with depth O(log* n), respectively. We apply our circuits to constructing efficient quantum circuits for Shor's discrete logarithm algorithm.
Introduction
Since Shor's discovery of quantum algorithms for factoring and discrete logarithm problems [1] , many studies have investigated ways of constructing quantum circuits for the algorithms [2, 3, 4, 5, 6, 7] . The resulting circuits are important not only for implementing the algorithms on a quantum computer but also for understanding the computational power of small quantum circuits. These studies have shown that addition of two binary numbers is a key operation for constructing quantum circuits for Shor's algorithms.
We consider the problem of constructing quantum circuits for addition of two binary numbers with better complexity. The complexity measures of a quantum circuit are its size and depth, and the number of qubits in it. Roughly speaking, the size and depth correspond to computation time, while the number of qubits corresponds to the size of memory. We regard the number of qubits as a primary consideration since it seems difficult to realize a quantum computer with many qubits. It is not obvious whether the number of qubits in a quantum circuit for addition can be decreased by using efficient classical ones, though the size or depth can be decreased simply by using them.
An unbounded fan-out gate on n + 1 qubits copies a classical source bit into n copies. In particular, the gate on two qubits is a CNOT gate. If unbounded fan-out gates are available, sublogarithmic-depth quantum circuits for various operations can be constructed [8, 9] . This is because the gate performs the copy operation on an unbounded number of qubits in a constant time. However, it seems difficult to realize such a gate practically. Thus, it is important to minimize the number of target qubits of the gate in a circuit without increasing the complexity of the circuit. When we use unbounded fan-out gates, we consider the complexity measures (size, depth, and the number of qubits) for the number of target qubits of the gate. We call the number of target qubits the length of an unbounded fan-out gate.
There have been many studies of efficient quantum circuits for addition of two n-bit binary numbers. These circuits can be classified according to depth complexity. Draper's and Takahashi quantum circuits for addition. The generality allows us to construct quantum circuits appropriate for various situations we will have to consider practically. For example, if we want to save the number of qubits, we can obtain a qubit-efficient circuit by setting d(n) = n in our method. We can decrease the depth by setting d(n) = log n. Moreover, we can choose an "intermediate" circuit by setting d(n) = √ n.
2 Circuit with Depth O(n)
Ripple-Carry Approach
We use the standard notation for quantum states and the standard diagrams for quantum circuits [17]. As mentioned earlier, the measures of the complexity of a quantum circuit are the number of qubits and its size and depth. The meaning of the number of qubits is obvious. The size of a circuit is defined as the total number of elementary gates in it. The elementary gates are one-qubit unitary gates, CNOT gates, controlled-$R_t$ gates, and Toffoli gates, where $R_t|x\rangle = e^{2\pi i x/2^t}|x\rangle$ for t ≥ 1 and x ∈ {0, 1}. In Section 4, we use the gate for an unbounded fan-out operation $F_t$ as an elementary gate, where $F_t$ (on t + 1 qubits) is defined as $F_t\,|y\rangle\bigotimes_{i=0}^{t-1}|x_i\rangle = |y\rangle\bigotimes_{i=0}^{t-1}|x_i \oplus y\rangle$ for $y, x_i \in \{0, 1\}$. The symbol ⊕ denotes addition modulo 2. The depth of a circuit is defined as follows. Input qubits are considered to have depth 0. For each gate G, the depth of G is equal to 1 plus the maximal depth of a gate on which G depends. The depth of a circuit is equal to the maximal depth of a gate in it. Intuitively, the depth is the number of layers in the circuit, where a layer consists of gates that can be performed simultaneously. A quantum circuit can use ancillary qubits, which start and end in the state $|0\rangle$. We usually count the number of ancillary qubits instead of the number of all qubits used in the circuit. We consider the problem of constructing quantum circuits for the operation $\mathrm{ADD}_n$ defined as

$$\mathrm{ADD}_n\,|a_{n-1}\cdots a_0\rangle|b_{n-1}\cdots b_0\rangle|z\rangle = |a_{n-1}\cdots a_0\rangle|s_{n-1}\cdots s_0\rangle|z \oplus s_n\rangle,$$
where $a_{n-1}\cdots a_0$ and $b_{n-1}\cdots b_0$ are the input binary numbers, z ∈ {0, 1}, and $s_n\cdots s_0$ is the sum of the input binary numbers. Our linear-depth circuit and most of the previous ones with a small number of qubits are based on the ripple-carry approach. To explain the approach, we define the carry bit $c_i$ (0 ≤ i ≤ n) as follows:

$$c_0 = 0, \qquad c_i = \mathrm{MAJ}(a_{i-1}, b_{i-1}, c_{i-1}) \quad (1 \le i \le n),$$
where MAJ is the majority function for three bits defined as MAJ(a, b, c) = ab ⊕ bc ⊕ ca. In the ripple-carry approach, the first step is to compute the carry bit $c_1$ by using $a_0$, $b_0$, and $c_0$. Then, $c_2$ is computed by using $a_1$, $b_1$, and $c_1$. This procedure is repeated until all carry bits are computed. After that, $s_i$ (0 ≤ i ≤ n) is computed by the relationship

$$s_i = a_i \oplus b_i \oplus c_i \quad (0 \le i \le n-1), \qquad s_n = c_n.$$
When the ripple-carry approach is used, the key issue for constructing a quantum circuit with a small number of qubits is how to store carry bits. Cuccaro et al.'s circuits, which are based on the approach, use one ancillary qubit to store $c_0 = 0$ [18]. The carry bit $c_i$ is stored in the qubit initially storing $a_{i-1}$ for 1 ≤ i ≤ n. To do this, they defined the gate for MAJ depicted in Fig. 1, which is the main component of their circuits. The gate maps $|c_i\rangle|b_i\rangle|a_i\rangle$ to $|c_i \oplus a_i\rangle|b_i \oplus a_i\rangle|c_{i+1}\rangle$. Takahashi et al.'s circuit, which is also based on the ripple-carry approach, uses no ancillary qubits [11]. All the carry bits are stored in the qubit initially storing z. The main component of their circuit is also the MAJ gate. They use the property that the gate maps $|z \oplus b_i\rangle|z \oplus a_i\rangle|z \oplus c_i\rangle$ to $|b_i \oplus c_i\rangle|a_i \oplus c_i\rangle|z \oplus c_{i+1}\rangle$.
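The ripple-carry recurrence and the action of the MAJ gate on basis states can be checked classically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, bits are stored little-endian, and the order of the two CNOT gates inside `maj_gate` is an assumption (the text only states that the gate consists of two CNOT gates and one Toffoli gate).

```python
def maj(a, b, c):
    """Majority of three bits: ab XOR bc XOR ca."""
    return (a & b) ^ (b & c) ^ (c & a)

def to_bits(x, n):
    """Little-endian bit list: index i holds bit i of x."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_sum(a_bits, b_bits):
    """Return [s_0, ..., s_n] using c_0 = 0 and c_{i+1} = MAJ(a_i, b_i, c_i)."""
    n = len(a_bits)
    c = [0] * (n + 1)
    for i in range(n):
        c[i + 1] = maj(a_bits[i], b_bits[i], c[i])
    s = [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(n)]
    return s + [c[n]]  # s_n = c_n

def maj_gate(c, b, a):
    """Basis-state action of a MAJ gate built from two CNOTs and one Toffoli.

    Maps (c, b, a) to (c XOR a, b XOR a, MAJ(a, b, c)).
    The CNOT order is an assumption; both orders give the same result.
    """
    b ^= a        # CNOT with control a, target b
    c ^= a        # CNOT with control a, target c
    a ^= b & c    # Toffoli with controls b, c and target a
    return c, b, a
```

For example, `from_bits(ripple_carry_sum(to_bits(5, 3), to_bits(6, 3)))` evaluates the 3-bit sum 5 + 6 through the carry recurrence.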
Our Circuit
We store the carry bit $c_i$ in the qubit initially storing $a_i$ for 0 ≤ i ≤ n − 1 and store the high-order bit $c_n$ in the qubit initially storing z. This would be difficult to do if we used the MAJ gate directly. Our idea is to divide the MAJ gate into two parts. The first part consists of two CNOT gates and the second one consists of one Toffoli gate. It is easy to verify that a Toffoli gate maps
where we consider a n as z. Thus, using CNOT gates (the first parts of the MAJ gate) and a Toffoli gate, we first prepare the state
By applying Toffoli gates (the second parts of the MAJ gate), we can compute c i and store it in the qubit initially storing a i . The final Toffoli gate computes c n and stores it in the qubit initially storing z. The detailed construction is described below.
Let A i and B i denote the memory locations initially storing a i and b i , respectively, for 0 ≤ i ≤ n − 1. Let A n be the memory location initially storing z. Location A i (0 ≤ i ≤ n − 1) will store a i , B i (0 ≤ i ≤ n − 1) will store s i , and A n will store z ⊕ s n at the end of the computation. Our circuit is constructed in the following six steps.
1. For i = 1, . . . , n − 1:
Apply a CNOT gate to a pair of memory locations B i and A i where A i is used for the control qubit.
For
Apply a CNOT gate to a pair of memory locations A i and A i+1 where A i is used for the control qubit. Apply a CNOT gate to a pair of memory locations A i and A i+1 where A i is used for the control qubit.
6. For i = 0, . . . , n − 1:
The circuit for ADD 5 is depicted in Fig. 2 . We describe the changes of the input state of ADD n to show that the circuit works correctly. In Step 1, the input state is transformed into
In Step 2, the state is transformed into
The first Toffoli gate in Step 3 transforms the state into
This is repeated by using a Toffoli gate. The state after Step 3 is
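In Step 4, the state is transformed into

Table 1 (partial).
Circuit | Ancillary qubits | Size | Toffoli gates | Depth | LNN
Takahashi et al. [11] | 0 | 10n − 9 | 4n − 5 | 8n − 7 | -
Our circuit | 0 | 7n − 6 | 2n − 1 | 5n − 3 | √

In Step 5, the state is transformed into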
, the final step gives us the desired output state.
Complexity Analysis
From the construction, it is obvious that our circuit uses no ancillary qubits. We compute the depth and size of the circuit for n ≥ 3 precisely. In Step 1, the number of CNOT gates is n − 1 and these gates can be performed simultaneously. Thus, the depth and size of Step 1 are 1 and n − 1, respectively. In Step 2, the number of CNOT gates is n − 1 and thus the depth and size of Step 2 are both n − 1. In Step 3, the number of Toffoli gates is n and thus the depth and size of Step 3 are both n. In Step 4, the number of CNOT gates is n − 1 and the number of Toffoli gates is n − 1. Thus, the depth and size of Step 4 are both 2n − 2. In Step 5, the number of CNOT gates is n − 2 and thus the depth and size of Step 5 are both n − 2. In Step 6, the number of CNOT gates is n and these gates can be performed simultaneously. Thus, the depth and size of Step 6 are 1 and n, respectively. Summing over the six steps, the depth and size of the whole circuit are 5n − 3 and 7n − 6, respectively. The numbers of CNOT and Toffoli gates are 5n − 5 and 2n − 1, respectively.

As discussed in [6], many proposed quantum computer architectures deal with a unidimensional array of qubits with nearest neighbor interactions only. Thus, it is important for a circuit to work on such a linear nearest neighbor (LNN) architecture. When the input and output binary numbers are arranged on an LNN architecture in an interleaved manner (as in Fig. 2), our circuit can be used directly on an LNN architecture in the sense that the circuit can be transformed into one on an LNN architecture without increasing the size or depth asymptotically.
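The gate counts and depth above can be cross-checked with a classical simulation on basis states. The sketch below is an illustration under stated assumptions, not the authors' construction: the exact gate sequence used for Steps 2 to 6 (loop directions, controls, and targets) is an assumption chosen to be consistent with the per-step gate counts given above, and the names `simulate_adder` and `circuit_depth` are hypothetical. Qubits are labelled `("A", i)` and `("B", i)`, with `("A", n)` holding z.

```python
def simulate_adder(a, b, z, n):
    """Apply an assumed six-step gate sequence to the basis state |a>|b>|z>.

    Returns the final bit assignment and the list of gates, each gate being
    the tuple of qubits it acts on (2 qubits: CNOT, 3 qubits: Toffoli).
    """
    q = {("A", i): (a >> i) & 1 for i in range(n)}
    q.update({("B", i): (b >> i) & 1 for i in range(n)})
    q[("A", n)] = z
    gates = []

    def cnot(ctrl, tgt):
        q[tgt] ^= q[ctrl]
        gates.append((ctrl, tgt))

    def toffoli(c1, c2, tgt):
        q[tgt] ^= q[c1] & q[c2]
        gates.append((c1, c2, tgt))

    for i in range(1, n):              # Step 1: n - 1 CNOTs, A_i -> B_i
        cnot(("A", i), ("B", i))
    for i in range(n - 1, 0, -1):      # Step 2 (assumed order): A_i -> A_{i+1}
        cnot(("A", i), ("A", i + 1))
    for i in range(n):                 # Step 3 (assumed): n Toffoli gates
        toffoli(("B", i), ("A", i), ("A", i + 1))
    for i in range(n - 1, 0, -1):      # Step 4 (assumed): CNOT then Toffoli
        cnot(("A", i), ("B", i))
        toffoli(("B", i - 1), ("A", i - 1), ("A", i))
    for i in range(1, n - 1):          # Step 5 (assumed order): n - 2 CNOTs
        cnot(("A", i), ("A", i + 1))
    for i in range(n):                 # Step 6: n CNOTs, A_i -> B_i
        cnot(("A", i), ("B", i))
    return q, gates

def circuit_depth(gates):
    """Depth as defined in the text: 1 plus the deepest gate a gate depends on."""
    depth = {}
    for g in gates:
        d = 1 + max(depth.get(x, 0) for x in g)
        for x in g:
            depth[x] = d
    return max(depth.values())
```

Under these assumptions, the simulation leaves a unchanged, writes the low n bits of a + b into the B register, and XORs the high-order bit into z; counting gates and layers reproduces 7n − 6, 2n − 1, and 5n − 3 for small n.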
A comparison of our circuit and the previous ones with a small number of qubits is summarized in Table 1. The symbol "√" in the LNN column means that the circuit can be used directly on an LNN architecture in the sense described above. The symbol "-" means that we do not know whether this is the case for the circuit. The size of our circuit is less than that of any other quantum circuit ever constructed for ADD n with no ancillary qubits. When we regard the number of qubits as a primary consideration, our circuit is more efficient than the previous circuits in Table 1.
Though there exists a size-efficient or depth-efficient circuit with one ancillary qubit [18] , it is worth noting that the difference between the total number of ancillary qubits used by parallel applications of our circuit (as in the next section) and that of the previous circuit with one ancillary qubit depends on the number of circuits applied in parallel and may become large. Moreover, since Toffoli gates are on three qubits and thus may be harder to implement than the other gates (on a smaller number of qubits), it is worth noting that the number of Toffoli gates in our circuit is 2n − 1, which is less than or equal to those of the previous circuits in Table 1 (excluding Draper's O(n 2 )-size circuit).
General Method

Combination Method
The ripple-carry approach decreases the number of ancillary qubits but requires large depth. The carry-lookahead approach decreases the depth but requires many qubits [12] . Our method is based on the combination of these methods and is a generalized and simplified version of Takahashi et al.'s method for constructing a logarithmic-depth circuit with a small number of qubits [13] . In this section, we review the previous method. The carry-lookahead approach is described by using two bits p[i, j] (1 ≤ i < j ≤ n) and g[i, j] (0 ≤ i < j ≤ n) [12] . The bit p[i, j] is 1 if a carry bit is propagated from bit position i to bit position j, and g[i, j] is 1 if a carry bit is generated between bit positions i and j. The p[i, j] and g[i, j] are computed by the following relations:
• For any i, j such that 1
• For any i such that 0
• For any i, j such that 0
by successively doubling the sizes of the intervals under consideration. Lastly, it computes
, and s n = g[0, n]. The key circuit is the one for the second step. We call this circuit the CARRY 1 gate. In general, the CARRY l gate is a circuit for the operation
(⌊n/2^t⌋ − 1) ancillary qubits and its depth and size are O(log n − l) and $O\bigl(\sum_{t=l}^{\lfloor \log n \rfloor - 1}(\lfloor n/2^t \rfloor - 1)\bigr)$, respectively. Draper et al.'s quantum carry-lookahead adder uses O(n) ancillary qubits and its depth and size are O(log n) and O(n), respectively.
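The propagate/generate bookkeeping behind the carry-lookahead approach can be illustrated classically. This is a minimal sketch assuming the standard recurrences p[i, i+1] = a_i ⊕ b_i, g[i, i+1] = a_i b_i, p[i, j] = p[i, m] p[m, j], and g[i, j] = g[m, j] ⊕ p[m, j] g[i, m] for i < m < j; the exact index ranges used in the paper are not reproduced here, the function name is hypothetical, and the recursion splits each interval at its midpoint rather than following the layered quantum circuit.

```python
from functools import lru_cache

def carry_lookahead_add(a, b, n):
    """Add two n-bit numbers using interval propagate/generate bits."""
    abit = [(a >> i) & 1 for i in range(n)]
    bbit = [(b >> i) & 1 for i in range(n)]

    @lru_cache(maxsize=None)
    def pg(i, j):
        """Return (p[i, j], g[i, j]) for 0 <= i < j <= n."""
        if j == i + 1:
            return abit[i] ^ bbit[i], abit[i] & bbit[i]
        m = (i + j) // 2  # halve the interval, so the recursion depth is O(log n)
        p_lo, g_lo = pg(i, m)
        p_hi, g_hi = pg(m, j)
        return p_lo & p_hi, g_hi ^ (p_hi & g_lo)

    # The carry into position j is g[0, j]; the carry into position 0 is 0.
    carry = [0] + [pg(0, j)[1] for j in range(1, n + 1)]
    s = [abit[i] ^ bbit[i] ^ carry[i] for i in range(n)] + [carry[n]]  # s_n = g[0, n]
    return sum(bit << i for i, bit in enumerate(s))
```

The sketch computes p[0, j] as a by-product, which is harmless classically; it only serves to show that g[0, j] is the carry bit c_j.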
In Takahashi et al.'s combination method, the input binary number $a_{n-1}\cdots a_0$ is divided into n/k blocks of length k, where we assume that n is a power of two for simplicity and set $k = 2^{\lfloor \log\log n \rfloor}$ and $l = \lfloor \log\log n \rfloor + 1$. Note that k = Θ(log n) and n is divisible by k. That is, we consider a k-bit binary number $a(j) = a_{(j+1)k-1}\cdots a_{jk}$ for 0 ≤ j ≤ n/k − 1. Similarly, we consider b(j) for $b_{n-1}\cdots b_0$. Roughly speaking, the previous method is described as follows:
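1. Compute the high-order bit of a(j) + b(j), which is $g_{l-1}[j] = g[jk, (j+1)k]$, using the ripple-carry approach [11] for 0 ≤ j ≤ n/k − 1.
2. Compute the value p[jk, (j + 1)k] using Barenco et al.'s circuit for $T_k$ [19] for 0 ≤ j ≤ n/k − 1, where $T_t$ (on t + 1 qubits) is defined as $T_t\,|x_{t-1}\cdots x_0\rangle|y\rangle = |x_{t-1}\cdots x_0\rangle|y \oplus \bigwedge_{i=0}^{t-1} x_i\rangle$.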
3. Compute the carry bit c jk = g[0, jk] using the values computed in Steps 1 and 2 for 1 ≤ j ≤ n/k. This is done by using the CARRY l gate.
4. Compute the carry bit g[0, i] using the carry bits computed in Step 3 for 1 ≤ i ≤ n and obtain s i for 0 ≤ i ≤ n. This is done by a circuit based on the ripple-carry approach as in Step 1.
The whole circuit uses O(n/k) (= O(n/ log n)) ancillary qubits and its depth and size are O(k) (= O(log n)) and O(n), respectively.
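The four steps can be mirrored by a classical sketch that splits the inputs into blocks of length k. This is only an illustration under assumptions: the helper names are hypothetical, the block carries in Step 3 are computed here by a simple sequential recurrence rather than by the logarithmic-depth CARRY gate, and no attempt is made to model qubit reuse.

```python
def maj(a, b, c):
    return (a & b) ^ (b & c) ^ (c & a)

def block_add(a, b, n, k):
    """Add two n-bit numbers block by block; n must be divisible by k."""
    assert n % k == 0
    abit = [(a >> i) & 1 for i in range(n)]
    bbit = [(b >> i) & 1 for i in range(n)]
    blocks = n // k

    # Steps 1 and 2: per-block generate g[jk, (j+1)k] and propagate p[jk, (j+1)k].
    g_blk, p_blk = [], []
    for j in range(blocks):
        carry, prop = 0, 1
        for i in range(j * k, (j + 1) * k):
            prop &= abit[i] ^ bbit[i]
            carry = maj(abit[i], bbit[i], carry)  # ripple with carry-in 0
        g_blk.append(carry)
        p_blk.append(prop)

    # Step 3: block carries c_{jk} = g[0, jk] (sequential here for simplicity).
    c_blk = [0]
    for j in range(blocks):
        c_blk.append(g_blk[j] ^ (p_blk[j] & c_blk[j]))

    # Step 4: ripple inside each block, starting from its block carry.
    s = []
    for j in range(blocks):
        carry = c_blk[j]
        for i in range(j * k, (j + 1) * k):
            s.append(abit[i] ^ bbit[i] ^ carry)
            carry = maj(abit[i], bbit[i], carry)
    s.append(c_blk[blocks])  # high-order bit s_n
    return sum(bit << i for i, bit in enumerate(s))
```

Since the blocks in Steps 1, 2, and 4 are independent of one another, each of those steps has depth proportional to k when the blocks are processed in parallel, which is the source of the O(k) depth bound.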
Our Method
Our idea is to divide the input binary numbers into n/d(n) blocks of length d(n) in Takahashi et al.'s method, where d(n) = Ω(log n). By using the CARRY log d(n)+1 gate, we can construct an O(d(n))-depth O(n)-size circuit with O(n/d(n)) ancillary qubits. This is a simple generalization of the previous method. Though this allows us to construct an O(d(n))-depth circuit for any d(n) = Ω(log n) in contrast to the previous method, it, of course, does not improve the previous O(log n)-depth circuit.
To obtain an efficient circuit, we simplify Steps 1, 2, and 4 in the previous method using the circuit for addition in Section 2. The simplification of Step 4 is a direct application of the circuit for addition. To simplify Steps 1 and 2, we use only the first halves of our circuit for addition and of Barenco et al.'s circuit for T n [19]. The first half of the circuit for addition outputs the high-order bit of a(j) + b(j) and appropriate inputs to Barenco et al.'s circuit. Because we use only the first half, we save Toffoli gates, but some qubits are left holding values that are not needed. An important point is that Barenco et al.'s circuit can use these qubits as uninitialized ancillary qubits. Likewise, because we use only the first half of Barenco et al.'s circuit, we again save Toffoli gates, but some qubits are left holding unneeded values. This is not a problem since these qubits are reset to their initial values in later steps. The details are described below.
To simplify Steps 1 and 2, since we need to compute only the two bits g[i, j] and p[i, j] for some i, j, it suffices to construct an efficient quantum circuit for the operation
where a w−1 · · · a 0 and b w−1 · · · b 0 are the input binary numbers, r 0 = a 0 , and
. Let A i and B i denote the memory locations initially storing a i and b i , respectively. Let G and P be the memory locations initially storing 0. Location A i will store r i , B i will store p[i, i + 1], G will store g[0, w], and P will store p[0, w] at the end of the computation. The circuit is defined as follows:
1. Apply the first half of the circuit (for two w-bit binary numbers) in Section 2 to a tuple of memory locations A i (0 ≤ i ≤ w − 1) and B i (0 ≤ i ≤ w − 1) and G.
2. Apply a CNOT gate to a pair of memory locations A 0 and B 0 , where A 0 is used for the control bit.
3. Apply the first half of Barenco et al.'s circuit for T w to a tuple of memory locations A i (0 ≤ i ≤ w − 1) and B i (0 ≤ i ≤ w − 1) and P , where A i is used as an uninitialized ancillary memory location.
Step 1 writes the value g[0, w] into the memory location G. The memory location A i stores the value r i . Step 2 writes p[0, 1] into the memory location B 0 . Step 3 uses the memory location A i as an uninitialized ancillary memory location and writes the value p[0, w] into the memory location P . The whole circuit uses no ancillary qubits and its depth and size are O(w). We call the circuit the INIT w gate. The INIT 5 gate is depicted in Fig. 3.

To simplify Step 4, it suffices to construct an efficient quantum circuit for the operation
where c ∈ {0, 1}, a w−1 · · · a 0 and b w−1 · · · b 0 are the input binary numbers,
, and d j is defined as
We can directly apply the circuit in Section 2 to constructing such a circuit and thus omit the details. The circuit uses no ancillary qubits and its depth and size are O(w). We call the circuit the SUM w gate. The SUM 5 gate is depicted in Fig. 4. 
The Whole Circuit
We construct a quantum circuit for ADD n . For simplicity, we assume that n is a power of two. Let d(n) = Ω(log n). We set $k = 2^{\lfloor \log d(n) \rfloor}$ and $l = \lfloor \log d(n) \rfloor + 1$. Note that k = Θ(d(n)) and n is divisible by k. As described in Section 3.1, we consider k-bit binary numbers a(j) and b(j). Let A i and B i denote the memory locations initially storing a i and b i , respectively. Let Z be the memory location initially storing z ∈ {0, 1}. Location A i will store a i , B i will store s i , and Z will store z ⊕ s n at the end of the computation. We assume that there are ancillary memory locations initially storing 0. The first half of our circuit is defined as follows:
3. Apply the gates in Step 1 in reverse order, where we exclude the gates applied to memory locations storing $c_{(j+1)k}$ for 0 ≤ j ≤ n/k − 1 since we do not erase the value.
4. Apply the SUM k gate to memory locations storing a(j + 1) and b(j + 1) and to a memory location storing c k(j+1) to obtain s k(j+1) , . . . , s k(j+2)−1 for 0 ≤ j ≤ n/k −2. Apply a simplified gate of the SUM k gate to memory locations storing a(0) and b(0) to obtain s 0 , . . . , s k−1 .
The second half deletes unnecessary carry bits using the fact that the carry bits generated for computing a + s ′ are the same as those for computing a + b, where s ′ is the bitwise complement of s [12].
Apply a NOT gate to
6. Apply the first half of our circuit excluding Step 4 in reverse order, where we exclude the gates applied to memory locations storing a(n/k − 1) and b(n/k − 1) since we do not erase the last carry bit. The gate writes 0 into a memory location storing c k(j+1) for 0 ≤ j ≤ n/k − 1.
7. Apply a NOT gate to B i to write s i into B i for 0 ≤ i ≤ n − k − 1.
The whole circuit for d(n) = log n and n = 8 (and thus k = l = 2) is depicted in Fig. 5 . We compute the number of ancillary qubits, the depth, and the size precisely. For simplicity, we count only Toffoli gates as in [12, 13] .
Step 1 requires k − O(log n) ancillary qubits and its depth and size are 14k + 4 log(n/k) + O(1) and 14n − O(n/k), respectively, where n/k ≥ 4. Thus, the circuit uses O(n/d(n)) ancillary qubits and its depth and size are O(d(n)) and O(n), respectively. For example, for d(n) = log n and n ≥ 16, the number of ancillary qubits, the depth, and the size are approximately 3n/ log n, 18 log n, and 14n, respectively. The corresponding previous bounds are 3n/ log n, 30 log n, and 29n. That is, in this case, the number of ancillary qubits in our circuit is the same as that in Takahashi et al.'s [13] and the leading coefficient of the expression of the size in our circuit is less than half that in Takahashi et al.'s.

Figure 5: The circuit for ADD 8 , where d(n) = log n. The first and third dashed-line boxes represent the carry-lookahead part [12, 13]. The second one represents the parallel applications of the SUM 2 gate.
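The parameter choices and the quoted leading terms can be evaluated numerically. The sketch below simply plugs numbers into the formulas stated in the text (k = 2^⌊log d(n)⌋, l = ⌊log d(n)⌋ + 1, and the approximate counts for d(n) = log n); the function names are hypothetical and logarithms are taken to base 2.

```python
import math

def block_parameters(d):
    """Return (k, l) with k = 2^floor(log2 d) and l = floor(log2 d) + 1."""
    t = math.floor(math.log2(d))
    return 2 ** t, t + 1

def leading_terms(n):
    """Approximate Toffoli-count figures quoted for d(n) = log n."""
    log_n = math.log2(n)
    ours = {"ancillary": 3 * n / log_n, "depth": 18 * log_n, "size": 14 * n}
    previous = {"ancillary": 3 * n / log_n, "depth": 30 * log_n, "size": 29 * n}
    return ours, previous
```

For n = 8 and d(n) = log n = 3, `block_parameters(3)` gives k = l = 2, matching the instance depicted in Fig. 5.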
Circuit with Depth o(log n)
Chandra et al.'s Classical Circuit
If we use only one-qubit and two-qubit gates as elementary gates, we cannot construct an o(log n)-depth circuit for ADD n . This is simply shown by using the logarithmic lower bound for the depth of the circuit for F n [15] . To construct an o(log n)-depth circuit, we decrease the depth of the carry-lookahead part of our method in Section 3 by using a quantum version of Chandra et al.'s efficient classical circuit for addition with (classical) unbounded fan-out gates [14] . We assume that we have unbounded fan-out gates (described in Section 2) as elementary gates. We first consider the simple case where we have unbounded fan-out gates with a long length and then reduce the length. Chandra et al.'s method for constructing the circuit is a generalization of the carry-lookahead approach. Besides the (classical) unbounded fan-out gates, the circuit uses unbounded fan-in gates that compute logical AND (or OR) of an unbounded number of input bits. The depth and size of the circuit for two m-bit binary numbers are O(1) and O(m log * * m), respectively, where
It can be shown that log * * m = o(log * m). Though the definition of the depth of a classical circuit is similar to that of a quantum circuit, the definition of the size of a classical circuit in [14] is different from that of a quantum circuit. More precisely, a classical circuit is defined as a directed acyclic graph and the size is the number of edges in the circuit and the depth is the length of a longest path from an input node to an output node. Chandra et al. give a tighter bound on the size of the circuit, but we use the above bound since it is sufficient for showing that our circuits in Sections 4.2 and 4.3 use a sublinear number of ancillary qubits.
Simple Case
Quantum Version of Chandra et al.'s Circuit
We transform Chandra et al.'s classical circuit for two m-bit binary numbers into its quantum version. Since the size (that is, the number of edges) of the circuit is O(m log * * m), it suffices to consider an unbounded fan-out gate with length O(m log * * m) and a T t gate (corresponding to an unbounded fan-in gate with t inputs in the classical circuit) with t = O(m log * * m). We assume that we have unbounded fan-out gates with length O(m log * * m). If we have one-qubit gates, CNOT gates, T t gates, and unbounded fan-out gates with length O(m log * * m), Chandra et al.'s classical circuit can be simply transformed into its quantum version. Note that an OR gate in Chandra et al.'s circuit is transformed into a T t gate with NOT gates. However, in our setting, we have only one-qubit gates, CNOT gates, and unbounded fan-out gates with length O(m log * * m). Thus, we require a quantum circuit for T t (consisting of one-qubit gates, CNOT gates, and unbounded fan-out gates with length O(m log * * m)). We use Høyer et al.'s circuit for the T t operation (defined in Section 3.1) as the T t gate [9] . They showed that, if unbounded fan-out gates with length O(t) are available, an O(log * t)-depth O(t)-size quantum circuit for T t can be constructed. We can show that Høyer et al.'s circuit uses O(t) ancillary qubits. Since we have unbounded fan-out gates with length O(m log * * m), we can directly use Høyer et al.'s circuit for T t with t = O(m log * * m). Thus, we obtain a quantum version of Chandra et al.'s circuit. We call the circuit the GCLA m circuit, which stands for the generalized carry-lookahead approach for two m-bit binary numbers.
The complexity of the GCLA m circuit is analyzed as follows. To compute the depth of the circuit, since the depth of the original circuit is O(1), it suffices to consider a $T_{t_1}$ gate, where $t_1$ is the maximum number of inputs of T t gates in the GCLA m circuit. The depth of the $T_{t_1}$ gate is $O(\log^* t_1)$. Since $t_1 = O(m \log^{**} m)$, the depth of the $T_{t_1}$ gate is $O(\log^*(m \log^{**} m))$ and thus the depth of the GCLA m circuit is $O(\log^*(m \log^{**} m))$. To compute the size of the circuit, we define $A_t$ as the number of unbounded fan-in gates with t inputs in Chandra et al.'s circuit, which is equal to the number of T t gates in the GCLA m circuit. Since the size of Chandra et al.'s circuit is $O(m \log^{**} m)$, $\sum_t t A_t = O(m \log^{**} m)$. The size of a T t gate is O(t). The number of the other gates in the GCLA m circuit is $O(m \log^{**} m)$ (and the size of each gate is 1). Thus, the size of the GCLA m circuit is $O(\sum_t t A_t + m \log^{**} m) = O(m \log^{**} m)$. A similar argument shows that the number of ancillary qubits in the GCLA m circuit is $O(m \log^{**} m)$. That is, the GCLA m circuit uses $O(m \log^{**} m)$ ancillary qubits and its depth and size are $O(\log^*(m \log^{**} m))$ and $O(m \log^{**} m)$, respectively.
Modification of Our Method
We modify our method in Section 3.3 by using the GCLA m circuit as the CARRY l gate. Let e(n) = Ω(log * n). We set k and l as in Section 3.3. Note that k = 2 l−1 = Θ(e(n)). We assume that we are allowed to use unbounded fan-out gates with length O(n). Chandra et al.'s circuit for two ⌊n/2 l−1 ⌋-bit binary numbers is directly applied to perform the operation performed by the CARRY l gate. Thus, we set m = ⌊n/2 l−1 ⌋. In this case, O(m log * * m) = O(n log * * (n/2 l−1 )/2 l−1 ), which is bounded by O(n). Since we have unbounded fan-out gates with length O(n), we can use the complexity analysis described in Section 4.2.1. The GCLA m circuit, which is the CARRY l gate, uses O(n log * * (n/2 l−1 )/2 l−1 ) ancillary qubits and its depth and size are O(log * (n log * * (n/2 l−1 )/2 l−1 )) and O(n log * * (n/2 l−1 )/2 l−1 ), respectively. For simplicity, we consider slightly weaker bounds; it uses O(n log * * n/2 l−1 ) ancillary qubits and its depth and size are O(log * (n log * * n/2 l−1 )) and O(n log * * n/2 l−1 ), respectively. The complexity of the whole circuit obtained by the modified method is analyzed as in the original method.
Step 1 uses O(n/k) ancillary qubits and its depth and size are O(k) and O(n), respectively. Step 2 uses O(n log** n/k) ancillary qubits and its depth and size are O(log*(n log** n/k)) and O(n log** n/k), respectively. Step 4 requires no new ancillary qubits and its depth and size are O(k) and O(n), respectively. The other steps are similar to the above steps. Thus, the whole circuit uses O(n log** n/e(n)) (= o(n)) ancillary qubits and its depth and size are O(e(n)) and O(n), respectively. In particular, for e(n) = log* n, the modified method yields an O(log* n)-depth O(n)-size circuit with O(n log** n/ log* n) (= o(n)) ancillary qubits.
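To give a feel for how slowly the depth bound O(log* n) grows, the iterated logarithm can be computed directly. The sketch below assumes the common convention that log* n is the number of times the base-2 logarithm must be applied before the value drops to at most 1; the base and the stopping threshold are assumptions, and the function name is hypothetical.

```python
import math

def log_star(n):
    """Iterated logarithm: count base-2 logs until the value is at most 1."""
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)
        count += 1
    return count
```

For instance, applying the logarithm to 16 gives 4, then 2, then 1, so `log_star(16)` counts three applications.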
Reduction of the Length of an Unbounded Fan-Out Gate
We prove that the length of an unbounded fan-out gate can be restricted to O(n ε ) in the modified method without increasing the complexity of the circuit, where ε is any small positive constant. Suppose that we are allowed to use unbounded fan-out gates with length f (n). An unbounded fan-out gate with length t = O(m log * * m) (and m = ⌊n/2 l−1 ⌋) can be simply simulated by using an O(log t/ log f (n) + 1)-depth O(t/f (n) + 1)-size circuit with no ancillary qubits that consists only of unbounded fan-out gates with length f (n). In the following, using this simulation, we reconsider the complexity of the T t gate, the GCLA m circuit, and the circuit our method in Section 4.2 yields.
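One way to realize such a simulation without ancillary qubits is to arrange the source and the t targets in a tree in which every node has at most f(n) children. The sketch below is an assumption about how the simulation could be done, not the construction from the paper: it applies the non-root tree gates bottom-up, then the root gate, then the non-root gates top-down, so that the intermediate target values cancel. The function names are hypothetical and only the classical action on basis states is modelled.

```python
def children(v, t, f):
    """Children of node v in an f-ary tree on nodes 0..t (heap numbering)."""
    return [c for c in range(f * v + 1, f * v + f + 1) if c <= t]

def node_depth(v, f):
    """Distance from node v to the root (node 0)."""
    d = 0
    while v > 0:
        v = (v - 1) // f
        d += 1
    return d

def long_fanout(state, f):
    """XOR state[0] into state[1..t] using fan-out gates of length at most f.

    state is modified in place. Returns (number of gates, number of layers).
    """
    t = len(state) - 1
    internal = [v for v in range(1, t + 1) if children(v, t, f)]

    def gate(v):
        # One fan-out gate: control v, targets are the children of v.
        for c in children(v, t, f):
            state[c] ^= state[v]

    for v in reversed(internal):  # bottom-up pass over non-root gates
        gate(v)
    gate(0)                       # root gate controlled by the source qubit
    for v in internal:            # top-down pass, same gates in reverse order
        gate(v)

    levels = len({node_depth(v, f) for v in internal})
    return 2 * len(internal) + 1, 2 * levels + 1
```

In this sketch the number of gates is 2⌈t/f⌉ − 1 and the number of layers is proportional to the tree height, in line with the O(t/f(n) + 1) size and O(log t/ log f(n) + 1) depth stated above.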
T t gate
The T t gate, which is Høyer et al.'s circuit for the T t operation, is constructed as follows:
1. Construct an O(1)-depth O(t log t)-size circuit with O(t log t) ancillary qubits for reducing the computation of OR of t bits to that of O(log t) bits.
2. Using the circuit in Step 1, for any
ancillary qubits, where $\log^{(d)} t$ is the d-times iterated logarithm log · · · log t.
3. Using the circuit in Step 2, construct an O(log* t)-depth O(t)-size circuit for T t with O(t) ancillary qubits.
We can modify the above steps using unbounded fan-out gates with length f (n) as follows:
1. Construct an O(log t/ log f (n) + 1)-depth O(t log t)-size circuit with O(t log t) ancillary qubits for reducing the computation of OR of t bits to that of O(log t) bits.
2. Using the circuit in Step 1, for any
3. Using the circuit in Step 2, construct an O(log t/ log f (n) + log * t)-depth O(t)-size circuit for T t with O(t) ancillary qubits.
To see this, we first analyze Step 1 in Høyer et al.'s construction. In this step, an unbounded fanout gate with length O(log t) is used in parallel to make O(log t) copies of each of the t input bits. Moreover, an unbounded fan-out gate with length O(t) is used in parallel to prepare appropriate ancillary qubits O(log t) times. As described above, an unbounded fan-out gate with length O(log t) can be simulated by using an O(log log t/ log f (n) + 1)-depth O(log t/f (n) + 1)-size circuit with no ancillary qubits. Similarly, an unbounded fan-out gate with length O(t) can be simulated by using an O(log t/ log f (n) + 1)-depth O(t/f (n) + 1)-size circuit. Thus, the depth of the T t gate is O(log t/ log f (n) + 1). The size is O(t · (log t/f (n) + 1) + (log t) · (t/f (n) + 1)) = O(t log t). These simulations do not require any ancillary qubits. That is, in Step 1, the number of ancillary qubits and size remain unchanged even if we consider unbounded fan-out gates with length f (n). Thus, they also do so in Steps 2 and 3.
Step 2 of Høyer et al.'s construction is done by using Step 1 O(log * t) times to reduce the computation of OR of t bits to that of a constant number of bits.
Step 3 is done by reducing the computation of OR of t bits to that of t/ log * t bits and by using Step 2 with d = log * t. These procedures can be simply applied to the case where we use unbounded fan-out gates with length f (n) and imply the desired depth bound.
The GCLA m circuit
To compute the depth of the GCLA m circuit, it suffices to consider a $T_{t_1}$ gate for some $t_1$ and an unbounded fan-out gate with some length $t_2$. The depth of the $T_{t_1}$ gate is $O(\log t_1/\log f(n) + \log^* t_1)$ and the depth of an unbounded fan-out gate with length $t_2$ is $O(\log t_2/\log f(n) + 1)$. Since $t_1$ and $t_2$ cannot be greater than the size of Chandra et al.'s circuit, the depth of the GCLA m circuit is $O(\log m/\log f(n) + \log^*(m \log^{**} m))$. To compute the size, we define $B_t$ as the number of unbounded fan-out gates with length t used (implicitly) in Chandra et al.'s original circuit, which is equal to the number of unbounded fan-out gates with length t (that are not used in T s gates for any s) in the GCLA m circuit. Since the size of Chandra et al.'s circuit is $O(m \log^{**} m)$, $\sum_t t B_t = O(m \log^{**} m)$. If t ≥ f(n), an unbounded fan-out gate with length t can be simulated by an O(t/f(n))-size circuit. Thus, the size related to unbounded fan-out gates with length greater than or equal to f(n) in the GCLA m circuit (that is, $\sum_{t \ge f(n)} (t/f(n)) B_t$) is $O(m \log^{**} m)$ since $\sum_t t B_t = O(m \log^{**} m)$. The size related to the T t gates (that is, $O(\sum_t t A_t)$) is $O(m \log^{**} m)$. The number of the other gates is $O(m \log^{**} m)$ (and the size of each gate is 1). Thus, the size of the GCLA m circuit is $O(m \log^{**} m)$. The number of ancillary qubits is the same as the size. That is, the GCLA m circuit uses $O(m \log^{**} m)$ ancillary qubits and its depth and size are $O(\log m/\log f(n) + \log^*(m \log^{**} m))$ and $O(m \log^{**} m)$, respectively. Since $m = \lfloor n/2^{l-1} \rfloor$, the circuit uses $O(n \log^{**}(n/2^{l-1})/2^{l-1})$ ancillary qubits and its depth and size are $O(\log(n/2^{l-1})/\log f(n) + \log^*(n \log^{**}(n/2^{l-1})/2^{l-1}))$ and $O(n \log^{**}(n/2^{l-1})/2^{l-1})$, respectively.
For simplicity, we consider slightly weaker bounds; it uses O(n log * * n/2 l−1 ) ancillary qubits and its depth and size are O(log n/ log f (n)+log * (n log * * n/2 l−1 )) and O(n log * * n/2 l−1 ), respectively.
Our Circuit
We set $f(n) = n^{\varepsilon}$ and use the $\mathrm{GCLA}_m$ circuit as the $\mathrm{CARRY}_l$ gate, where $\varepsilon$ is any small positive constant. In this case, $\log n/\log f(n) = 1/\varepsilon = O(1)$, so the $\mathrm{CARRY}_l$ gate uses $O(n \log^{**} n/2^{l-1})$ ancillary qubits and its depth and size are $O(\log^*(n \log^{**} n/2^{l-1}))$ and $O(n \log^{**} n/2^{l-1})$, respectively. This is the same situation as that in Section 4.2 except that the length of an unbounded fan-out gate in the $\mathrm{CARRY}_l$ gate is at most $n^{\varepsilon}$. Thus, the whole circuit uses $O(n \log^{**} n/e(n))$ $(= o(n))$ ancillary qubits and its depth and size are $O(e(n))$ and $O(n)$, respectively. If we set $e(n) = \log^* n$, we obtain an $O(\log^* n)$-depth $O(n)$-size circuit with $o(n)$ ancillary qubits.
It is worth noting that the above method for constructing a circuit for $\mathrm{ADD}_n$ yields an $o(\log n)$-depth $O(n)$-size circuit with $o(n)$ ancillary qubits using unbounded fan-out gates with a small length. For example, we set $f(n) = \log n$ and $d(n) = \log n/\log\log n$. In this case, the $\mathrm{CARRY}_l$ gate uses $O(n \log^{**} n \log\log n/\log n)$ ancillary qubits and its depth and size are $O(\log n/\log\log n)$ and $O(n \log^{**} n \log\log n/\log n)$, respectively. This yields an $O(\log n/\log\log n)$-depth $O(n)$-size circuit with $O(n \log^{**} n \log\log n/\log n)$ ancillary qubits. Such an $o(\log n)$-depth circuit cannot be constructed by using a quantum circuit only with gates on a bounded number of qubits [15] or by using a classical circuit only with bounded fan-in and unbounded fan-out gates [16]. Hence, unbounded fan-out gates even with a small length are useful for constructing efficient quantum circuits for addition.
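To get a feeling for the two choices of $f(n)$, the term $\log n/\log f(n)$ in the depth bound can be evaluated numerically. The following is a minimal sketch, not part of the construction: the function names are hypothetical, logarithms are taken to base 2, $n$ is represented by its exponent $k$ (so $n = 2^k$), and the constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math


def log_star(x: float) -> int:
    """Iterated logarithm: how many times log2 is applied until the value is <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


def fanout_depth_term(k: int, log2_f: float) -> float:
    """The term log n / log f(n), where n = 2**k and log2_f = log2 f(n)."""
    return k / log2_f


def term_for_power_length(k: int, eps: float) -> float:
    """f(n) = n**eps, so log2 f(n) = eps * k and the term equals 1/eps."""
    return fanout_depth_term(k, eps * k)


def term_for_log_length(k: int) -> float:
    """f(n) = log2 n = k, so the term equals k / log2(k) = log n / log log n."""
    return fanout_depth_term(k, math.log2(k))


if __name__ == "__main__":
    for k in (64, 1024, 2**20):
        # For n = 2**k, one application of log2 gives k, so log* n = 1 + log* k.
        print(
            f"n = 2^{k}: log* n = {1 + log_star(k)}, "
            f"f = n^0.25 -> {term_for_power_length(k, 0.25):.1f}, "
            f"f = log n -> {term_for_log_length(k):.1f}"
        )
```

With $f(n) = n^{\varepsilon}$ the term stays at the constant $1/\varepsilon$, whereas with $f(n) = \log n$ it grows as $\log n/\log\log n$, which matches the two depth bounds stated above.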
Application
We consider the prime field $\mathrm{GF}(p)$ for some prime $p > 3$. An elliptic curve $E$ over $\mathrm{GF}(p)$ is the set of points $(x, y) \in \mathrm{GF}(p) \times \mathrm{GF}(p)$ satisfying $y^2 = x^3 + ax + b$, where the constants $a, b \in \mathrm{GF}(p)$ satisfy $4a^3 + 27b^2 \neq 0$, together with the point at infinity $O$. It is known that the addition operation in $E$ can be defined and that $E$ with the addition operation forms an abelian group with $O$ serving as its identity [20]. Let $P \in E$, let $\langle P \rangle$ be the subgroup of $E$ generated by $P$, and let $|\langle P \rangle|$ be the order of $\langle P \rangle$. The discrete logarithm problem over the elliptic curve $E$ with respect to the base $P$ is defined as follows: Given a point $Q \in \langle P \rangle$, find the integer $0 \le d \le |\langle P \rangle| - 1$ such that $Q = dP$. Shor's discrete logarithm algorithm solves the problem with high probability in time polynomial in the length of the binary representation for $|\langle P \rangle|$ [1]. As in [5], we assume that the length of the binary representation for $|\langle P \rangle|$ is equal to that of the binary representation for $p$.
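The problem definition can be made concrete with a small classical example. The sketch below is an illustration only: it uses the standard affine formulas for the group law, a toy curve $y^2 = x^3 + 2x + 2$ over $\mathrm{GF}(17)$ with base point $(5, 1)$ chosen here for illustration, and hypothetical function names. The exhaustive search takes time exponential in the bit length of $p$ and has nothing to do with how Shor's algorithm works.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]  # None stands for the point at infinity O


def ec_add(p1: Point, p2: Point, a: int, p: int) -> Point:
    """Add two points of y^2 = x^3 + a*x + b over GF(p); b does not appear in the formulas."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P) = O (also covers doubling a point with y = 0)
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p  # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p  # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def scalar_mul(d: int, point: Point, a: int, p: int) -> Point:
    """Compute d * point by double-and-add, for d >= 0."""
    result: Point = None
    addend = point
    while d > 0:
        if d & 1:
            result = ec_add(result, addend, a, p)
        addend = ec_add(addend, addend, a, p)
        d >>= 1
    return result


def discrete_log_bruteforce(q: Point, base: Point, a: int, p: int) -> Optional[int]:
    """Return the smallest d >= 0 with q = d * base, or None if q is not in <base>."""
    current: Point = None  # 0 * base = O
    d = 0
    while True:
        if current == q:
            return d
        current = ec_add(current, base, a, p)
        d += 1
        if current is None:
            return None  # wrapped around to O without meeting q


if __name__ == "__main__":
    A, PRIME, BASE = 2, 17, (5, 1)  # toy parameters, assumed for illustration
    q = scalar_mul(13, BASE, A, PRIME)
    print(q, discrete_log_bruteforce(q, BASE, A, PRIME))
```

Python 3.8 or later is assumed for the modular inverse `pow(x, -1, p)`.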
Proos et al. constructed an efficient quantum circuit for Shor's discrete logarithm algorithm for elliptic curves over $\mathrm{GF}(p)$ [5]. Let $n$ be the length of the binary representation for $p$. The depth and size of the circuit are $O(n^3)$. The dominant cost is $O(n^2)$ applications of an $O(n)$-depth $O(n)$-size quantum circuit for $\mathrm{ADD}_n$ with $n$ ancillary qubits. For counting the number of qubits in the circuit, it suffices to count the number of qubits in the circuit for division in $\mathrm{GF}(p)$ that maps $|x\rangle|y\rangle$ to $|x\rangle|y/x\rangle$ for $x\ (\neq 0),\ y \in \mathrm{GF}(p)$. The circuit for division in $\mathrm{GF}(p)$ uses about $5n$ qubits: $2n$ qubits are used for the input register and about $3n$ qubits are used in the circuit for the extended Euclidean algorithm. In the circuit for the extended Euclidean algorithm, about $2n$ qubits are used for the input binary numbers and intermediate results, and $n$ qubits are used for ancillary qubits during $\mathrm{ADD}_n$.
By simply replacing Proos et al.'s circuit for $\mathrm{ADD}_n$ with our circuit in Section 2, we can eliminate the $n$ ancillary qubits during $\mathrm{ADD}_n$ since our circuit for $\mathrm{ADD}_n$ does not use any ancillary qubits. The resulting circuit uses about $4n$ qubits. Since Proos et al. do not describe the precise depth or size of their circuit for $\mathrm{ADD}_n$, we cannot compare the depth or size of the resulting circuit with that of the original one precisely. However, the depth and size of our circuit for $\mathrm{ADD}_n$ are asymptotically the same as those of Proos et al.'s. Thus, the depth and size of the resulting circuit are asymptotically the same as those of the original circuit.
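The qubit accounting behind the two totals can be laid out term by term. All counts are approximate, as in the text:

\[
\underbrace{2n}_{\text{input register}}
+ \underbrace{2n}_{\text{Euclid: inputs and intermediate results}}
+ \underbrace{n}_{\text{ancillas for } \mathrm{ADD}_n}
\approx 5n
\quad\longrightarrow\quad
2n + 2n + 0 \approx 4n .
\]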
By adding $o(n)$ ancillary qubits to the circuit obtained above, we can decrease the depth asymptotically. As shown in Section 3, for any $d(n) = \Omega(\log n)$, we have an $O(d(n))$-depth $O(n)$-size circuit for $\mathrm{ADD}_n$ with $O(n/d(n))$ ancillary qubits. If we use this circuit as above, we obtain an $O(n^2 d(n))$-depth $O(n^3)$-size circuit for Shor's discrete logarithm algorithm with $4n + O(n/d(n))$ qubits. Moreover, as shown in Section 4, if we are allowed to use unbounded fan-out gates with length $O(n^{\varepsilon})$ for an arbitrary small positive constant $\varepsilon$, we have an $O(e(n))$-depth $O(n)$-size circuit for $\mathrm{ADD}_n$ with $o(n)$ ancillary qubits for any $e(n) = \Omega(\log^* n)$. This circuit yields an $O(n^2 e(n))$-depth $O(n^3)$-size circuit for Shor's discrete logarithm algorithm with $4n + o(n)$ qubits. We can also use the previous circuits for $\mathrm{ADD}_n$ to improve Proos et al.'s circuit. However, they do not yield more efficient quantum circuits for Shor's discrete logarithm algorithm than our circuits described above, simply because our circuits for $\mathrm{ADD}_n$ are more efficient than the previous ones.
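These bounds follow by multiplying the cost of one addition by the number of additions. The display below is a sketch under the assumption, taken from the description of Proos et al.'s circuit, that the $O(n^2)$ applications of $\mathrm{ADD}_n$ dominate the cost and are executed one after another:

\[
\text{depth} = O(n^2) \cdot O(d(n)) = O(n^2 d(n)), \qquad
\text{size} = O(n^2) \cdot O(n) = O(n^3), \qquad
\text{qubits} \approx 4n + O(n/d(n)).
\]

For instance, $d(n) = \log n$ gives depth $O(n^2 \log n)$ with $4n + O(n/\log n)$ qubits, and replacing $d(n)$ by $e(n) = \log^* n$ (with unbounded fan-out gates of length $O(n^{\varepsilon})$) gives depth $O(n^2 \log^* n)$ with $4n + o(n)$ qubits.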
Conclusions and Future Work
We constructed an $O(n)$-depth $O(n)$-size quantum circuit for $\mathrm{ADD}_n$ with no ancillary qubits. The size is less than that of any other quantum circuit ever constructed for $\mathrm{ADD}_n$ with no ancillary qubits. Using the circuit, we proposed a method for constructing a small-size quantum circuit for $\mathrm{ADD}_n$ with a small number of qubits that has a given depth. In particular, we showed that, if we are allowed to use unbounded fan-out gates with length $O(n^{\varepsilon})$ for an arbitrary small positive constant $\varepsilon$, we can construct an $O(\log^* n)$-depth $O(n)$-size circuit with $o(n)$ ancillary qubits. We applied our circuits to constructing efficient quantum circuits for Shor's discrete logarithm algorithm.
An interesting challenge is to find ways of improving the quantum circuits described in this paper. For example, can we construct an $O(\log n)$-depth $O(n)$-size quantum circuit for $\mathrm{ADD}_n$ with $O(1)$ ancillary qubits? Can we construct an $O(1)$-depth $O(n)$-size quantum circuit for $\mathrm{ADD}_n$ with $O(n)$ ancillary qubits using unbounded fan-out gates? In the classical case, we cannot construct an $O(1)$-depth circuit for addition with unbounded fan-in and fan-out gates whose size (that is, the number of edges) is $O(n)$ [21].
