We evaluate the exact number of gates in circuits for Shor's factoring algorithm. We estimate the running time for factoring large composites, such as 576-bit and 1024-bit numbers, by setting the gate operation time appropriately. For example, we show that if each elementary gate is operated within 50 µsec, factoring a 576-bit number takes about 1 month even when the most efficient circuit is adopted. Consequently, we find that if we adopt devices with long gate operation times or qubit-saving circuits, factorization will not be completed within a feasible time unless a new, more efficient modular exponentiation algorithm is proposed. Furthermore, we point out that long gate operation time may become a new obstacle to realizing quantum computers. key words: Shor's factoring algorithm, quantum circuit, practical running time
Introduction
The security of the RSA cryptosystem is based on the difficulty of factoring a large composite integer [1]. The largest composite factored to date is RSA-576, one of the RSA Challenge numbers [2]; its bit length is 576. The running time for factoring this number has not been reported yet, but it is estimated to be about 1 month. Therefore, 1024-bit numbers are recommended for use in RSA cryptosystems.
P. Shor proposed a quantum algorithm that factors a composite number in polynomial time [3]. The quantum part of this algorithm consists of modular exponentiation followed by the inverse quantum Fourier transform. The former is known to be more difficult than the latter.
Chuang et al. [4] implemented Shor's factoring algorithm using NMR with 7 qubits to factor 15. This is the largest number factored by a quantum computer so far.
a) E-mail: kunihiro@ice.uec.ac.jp

For classical computers, if a problem is theoretically solvable in polynomial time, we can conclude that it can be solved within a feasible time. For quantum computers, however, this statement is not necessarily true because of the long gate operation time, that is, the time needed to operate each elementary gate. Hence, factoring a large composite may not be completed within a feasible time. In order to estimate the actual time for factoring with Shor's algorithm, we need to evaluate the number of elementary gates, such as Toffoli, C-NOT or single-qubit operation gates. Although the number of qubits and the order of the number of gates have been studied so far, this purpose requires the exact number of gates in factoring circuits. DiVincenzo [5] described the following five requirements for implementing quantum computation.
1. A scalable physical system with well-characterized qubits.
2. The ability to initialize the state of the qubits to a simple fiducial state, such as |000...⟩.
3. Long relevant decoherence times, much longer than the gate operation time.
4. A "universal" set of quantum gates.
5. A qubit-specific measurement capability.
His paper does not specify how many qubits are needed, how long the decoherence time must be, or how fast the gates must be operated. In this paper, we focus on Requirement 3. The maximal number of operations can be estimated by dividing the decoherence time by the gate operation time. This value differs among candidate physical realizations; Nielsen and Chuang [6] report that it ranges from about 10^3 to 10^13. The exact number of operations needed to factor a large composite has not been clear, although rough estimates have been studied [7]-[9].
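The ratio just described is simple arithmetic; the Python sketch below only makes the units explicit. The function name and the sample times are hypothetical illustrations, not measured values for any particular device.

```python
def max_operations(decoherence_time_s: float, gate_time_s: float) -> float:
    """Rough upper bound on the number of sequential gates that fit
    within one decoherence time (both arguments in seconds)."""
    if gate_time_s <= 0:
        raise ValueError("gate time must be positive")
    return decoherence_time_s / gate_time_s


# Hypothetical example: 1 s of coherence and 1 microsecond per gate.
print(f"{max_operations(1.0, 1e-6):.0e}")  # 1e+06
```

A circuit whose gate count exceeds this ratio cannot be run to completion on such a device without error correction.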
In this paper, we evaluate the exact number of gates (not only its order) for three previously proposed circuits for modular exponentiation. First, we show that, except for a few candidate devices, the number of gates is much larger than the maximal number of operations. Next, we estimate the running time needed to factor a large composite. Since the gate operation time differs among devices, the actual factoring time depends on the choice of device; hence, we estimate the running time for several appropriately chosen gate operation times. We show that if we adopt a device with a long gate operation time, or a circuit that reduces the number of qubits, factoring a large composite may not be completed within a feasible time. Concretely, if we adopt devices whose gate operation times are longer than 50 µsec for the basic circuit (referred to later as R-ADD) or 7.8 µsec for the qubit-saving circuit (referred to later as Q-ADD), factoring a 576-bit composite takes longer than it does on classical computers. This result is not pessimistic; rather, it leads to the conclusion that a short gate operation time may become a new requirement for implementing quantum computers.

We add some remarks. No lower bound on the computational cost of modular exponentiation is known, so the optimal algorithm is not known either. If an algorithm more efficient than those analyzed here is proposed, the situation may change drastically.
Circuits for Modular Exponentiation

Elementary Gates
In this paper, we regard the following as elementary gates:
1. Single-qubit operations, including the NOT gate
2. Controlled NOT (C-NOT)
3. Toffoli gate
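The NOT gate is defined by its truth table, in which |0⟩ → |1⟩ and |1⟩ → |0⟩. The C-NOT gate acts on basis states as follows.

$$|00\rangle \to |00\rangle, \quad |01\rangle \to |01\rangle, \quad |10\rangle \to |11\rangle, \quad |11\rangle \to |10\rangle$$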
Another way of describing C-NOT is

$$|c, t\rangle \to |c, c \oplus t\rangle,$$
where ⊕ is addition modulo 2. The first qubit is called the control qubit and the second the target qubit: if the control qubit is 1, the target qubit is flipped. It is well known that any quantum operation can be decomposed into single-qubit operations and C-NOT gates. In this paper, we also treat the Toffoli gate as an elementary gate. The Toffoli gate acts as follows.

$$|c_1, c_2, t\rangle \to |c_1, c_2, t \oplus c_1 c_2\rangle$$
If both c_1 and c_2 are equal to 1, the third qubit is flipped.
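On computational basis states, the three gates above are reversible bit maps. A minimal Python sketch, treating basis states as tuples of bits (a classical simplification that ignores superposition and phases); the function names are illustrative:

```python
from itertools import product


def not_gate(t):
    return 1 - t


def cnot(c, t):
    # Target becomes t XOR c, i.e. it is flipped when the control is 1.
    return c, t ^ c


def toffoli(c1, c2, t):
    # Target is flipped only when both controls are 1.
    return c1, c2, t ^ (c1 & c2)


# Each gate is its own inverse: applying it twice restores every basis state.
for c, t in product((0, 1), repeat=2):
    assert cnot(*cnot(c, t)) == (c, t)
for bits in product((0, 1), repeat=3):
    assert toffoli(*toffoli(*bits)) == bits

print([cnot(c, t) for c, t in product((0, 1), repeat=2)])
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```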
Modular Exponentiation
Let N be a large composite number to be factored and n be the bit length of N. Shor's factoring algorithm consists of five steps: (1) construction of a flat superposition, (2) modular exponentiation, (3) the inverse quantum Fourier transform, (4) measurement of quantum states, and (5) classical computation. The aim of modular exponentiation is to construct the state

$$\frac{1}{\sqrt{2^m}} \sum_{x=0}^{2^m-1} |x\rangle\, |a^x \bmod N\rangle \qquad (1)$$

from the initial state

$$\frac{1}{\sqrt{2^m}} \sum_{x=0}^{2^m-1} |x\rangle\, |1\rangle,$$

where a is a randomly chosen integer less than N and m = 2n. The aim of the inverse quantum Fourier transform is to obtain the period of the function a^x mod N from Eq.(1). Modular exponentiation is known to be the most difficult of all the steps.
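To make Eq.(1) concrete, the Python sketch below lists the basis terms of that state classically and finds, by brute force, the period r of a^x mod N that the inverse QFT is meant to reveal. The toy values N = 15 and a = 7 and the function names are illustrative choices; the brute-force search is exponential in n and only shows what is being computed.

```python
import math


def exp_state_terms(a: int, N: int, m: int):
    """Basis terms (x, a^x mod N) of Eq.(1), each with amplitude 1/sqrt(2^m)."""
    amp = 1 / math.sqrt(2 ** m)
    return [(x, pow(a, x, N), amp) for x in range(2 ** m)]


def period(a: int, N: int) -> int:
    """Smallest r > 0 with a^r = 1 (mod N); assumes gcd(a, N) = 1."""
    if math.gcd(a, N) != 1:
        raise ValueError("a must be coprime to N")
    r, value = 1, a % N
    while value != 1:
        value = value * a % N
        r += 1
    return r


N, a = 15, 7
n = N.bit_length()                 # n = 4
terms = exp_state_terms(a, N, m=2 * n)
r = period(a, N)
print(r)                           # 4
# Step (5), classical post-processing: with r even, these gcds give the factors.
print(math.gcd(pow(a, r // 2) - 1, N), math.gcd(pow(a, r // 2) + 1, N))  # 3 5
```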
We now describe how to construct a quantum circuit for modular exponentiation [7], [8].
• The modular exponentiation Mod-EXP is composed of m controlled modular multiplications.
• The modular multiplication Mod-MUL is composed of two modular product-sum operations and one SWAP operation.
• The modular product-sum operation Mod-PS is composed of n controlled modular additions.
• The modular addition Mod-ADD is composed of several additions ADD.
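The nesting in the list above fixes how many times each sub-operation is invoked. A small Python sketch of that bookkeeping (the function name is ours, for illustration), assuming m controlled multiplications, two product-sum operations and one SWAP per multiplication, and n controlled modular additions per product-sum, exactly as listed:

```python
def operation_counts(n: int, m: int) -> dict:
    """Invocation counts implied by Mod-EXP -> Mod-MUL -> Mod-PS -> Mod-ADD."""
    mod_mul = m                 # m controlled modular multiplications
    mod_ps = 2 * mod_mul        # two product-sum operations per Mod-MUL
    swap = mod_mul              # one SWAP per Mod-MUL
    mod_add = n * mod_ps        # n controlled modular additions per Mod-PS
    return {"Mod-MUL": mod_mul, "Mod-PS": mod_ps, "SWAP": swap, "Mod-ADD": mod_add}


n = 576
print(operation_counts(n, m=2 * n))
# {'Mod-MUL': 1152, 'Mod-PS': 2304, 'SWAP': 1152, 'Mod-ADD': 1327104}
```

The Mod-ADD count is 2nm, which is the number of doubly controlled modular additions that enters the Mod-EXP gate totals.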
The operations Mod-EXP, Mod-MUL, Mod-PS, Mod-ADD and ADD are formulated as follows.
Construction of Mod-EXP from Mod-ADD
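First, we construct the Mod-EXP operation from Mod-MUL.
Writing x = Σ_{i=0}^{m−1} x_i 2^i, we apply the controlled multiplications C(x_i)-Mod-MUL(a^{2^i}), for i = 0, 1, …, m − 1, to the above initial state. This sequence of controlled multiplications constitutes Mod-EXP. Here, C(x)-U denotes the operation U controlled on x. The following equation justifies this sequence of operations:

$$a^x \bmod N = \prod_{i=0}^{m-1} \left(a^{2^i}\right)^{x_i} \bmod N.$$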
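The identity behind this sequence can be checked classically. In the Python sketch below (function name ours), each loop iteration plays the role of one controlled multiplication C(x_i)-Mod-MUL(a^{2^i}) acting on the second register; the multipliers a^{2^i} mod N are assumed to be precomputed classically, as is usual for this construction.

```python
def mod_exp_by_controlled_muls(a: int, x: int, N: int, m: int) -> int:
    """Compute a^x mod N using only multiplications controlled on the bits of x."""
    register = 1                       # second register starts in |1>
    multiplier = a % N                 # a^(2^0) mod N
    for i in range(m):
        x_i = (x >> i) & 1             # control bit x_i
        if x_i:                        # C(x_i)-Mod-MUL(a^(2^i))
            register = register * multiplier % N
        multiplier = multiplier * multiplier % N   # a^(2^(i+1)) mod N
    return register


N, a, m = 15, 7, 8
assert all(mod_exp_by_controlled_muls(a, x, N, m) == pow(a, x, N) for x in range(2 ** m))
```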
The operation Mod-MUL(d) can be described as follows by appropriately supplementing ancilla qubits. 
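One common way to realize Mod-MUL(d) with two product-sum operations and one SWAP, consistent with the decomposition listed earlier and assuming gcd(d, N) = 1 so that d has an inverse modulo N, is: Mod-PS(d) maps |y⟩|0⟩ to |y⟩|dy mod N⟩, SWAP exchanges the two registers, and Mod-PS(−d^{−1}) returns the ancilla register to 0. The Python sketch below traces this on classical values; it is an illustration of that assumed construction, and the function names are ours.

```python
def mod_ps(d: int, y: int, z: int, N: int):
    """Product-sum: (y, z) -> (y, z + d*y mod N)."""
    return y, (z + d * y) % N


def mod_mul(d: int, y: int, N: int):
    """Reversible y -> d*y mod N using an ancilla register that starts and ends at 0."""
    d_inv = pow(d, -1, N)                  # needs gcd(d, N) == 1 (Python 3.8+)
    y, anc = mod_ps(d, y, 0, N)            # (y, d*y)
    y, anc = anc, y                        # SWAP -> (d*y, y)
    y, anc = mod_ps(-d_inv, y, anc, N)     # (d*y, y - d^-1*d*y) = (d*y, 0)
    assert anc == 0, "ancilla must be returned clean"
    return y, anc


N = 15
print([mod_mul(7, y, N)[0] for y in range(N)])
```

Because the ancilla ends in 0 for every input, the map y → dy mod N is a permutation of {0, …, N − 1}, which is what makes it implementable as a unitary.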
Construction of Mod-ADD from ADD
Two control qubits, x_i and y_j, are needed in the Mod-ADD operation, since one control qubit is used in each of Mod-EXP and Mod-PS. We describe how to construct C(x_i, y_j)-Mod-ADD(d) from ADD. There are two strategies for constructing this operation, called Type 1 and Type 2. The ADD operation uses three registers, R_1, R_2 and R_3, consisting of 1 qubit, n qubits and 1 qubit, respectively. The ADD operator is applied to the register formed by concatenating R_1 and R_2.
The Type1 construction consists of one C^3-ADD, three C^2-ADD, two C^3-NOT and four C^2-NOT gates. The Type2 construction consists of three C^2-ADD, one C-ADD, one ADD, one C^2-NOT, two C-NOT and three NOT gates. Note that the control C(x_i, y_j) is unnecessary for the operators in Steps 1-2, 2, 3, 6 and 8. Which type is more efficient depends on how ADD is constructed.
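For orientation only, the Python sketch below traces the generic pattern for building a modular addition out of plain additions; it is not necessarily the exact step order of Type1 or Type2. It adds d, subtracts N, records an underflow flag from the top qubit, conditionally adds N back, and then clears the flag by comparing against d. It assumes 0 ≤ y, d < N < 2^n and an (n+1)-bit work register (like R_1R_2) whose top bit serves as the sign qubit; the function name is ours.

```python
def mod_add(y: int, d: int, N: int, n: int) -> int:
    """Trace y -> (y + d) mod N with plain (n+1)-bit additions and one flag qubit."""
    assert 0 <= y < N and 0 <= d < N and N < 2 ** n
    size = 2 ** (n + 1)                  # (n+1)-bit register: n bits plus top qubit
    top = 2 ** n
    r = (y + d) % size                   # ADD(d)
    r = (r - N) % size                   # ADD(-N), i.e. subtract N
    flag = 1 if r & top else 0           # top bit set <=> y + d < N (underflow)
    if flag:                             # flag-controlled ADD(N)
        r = (r + N) % size
    # Uncompute the flag: after subtracting d, the top bit is clear exactly
    # when N was added back (flag = 1), so the flag is flipped in that case.
    r = (r - d) % size                   # ADD(-d)
    if not (r & top):
        flag ^= 1
    r = (r + d) % size                   # ADD(d) restores the result
    assert flag == 0, "flag qubit must end clean"
    return r


print(mod_add(9, 8, 13, 4))  # 4
```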
Construction of ADD Circuits
Three constructions of ADD are known: regular ADD (R-ADD), ADD using generalized Toffoli gates (GT-ADD), and quantum addition (Q-ADD). We describe how to construct each of them and evaluate the exact number of gates.
Our Premises
In this paper, we analyze the computational time for factoring under the following two premises.
1. We do not consider any parallel computation.
2. The operation times of all kinds of elementary gates, such as Toffoli, C-NOT, NOT and single-qubit operation gates, are assumed to be equal.
First, we discuss the first premise. It is difficult to execute quantum operations even sequentially with state-of-the-art technologies. Hence, it seems reasonable to suppose that parallel quantum computation will not be realized in the near future, so we do not consider any parallel computation in our analysis.
Next, we discuss the second premise. The operation time differs among gate types and among physical devices. Furthermore, different devices may call for different choices of elementary gates. However, there is currently no consensus about the operation time of any kind of gate; for example, nobody knows which kind of gate is the fastest or how fast the fastest gate is. Hence, in this paper, we assume that the operation times of all kinds of gates are equal.
As research on physical devices progresses, this premise may have to be changed drastically, in which case the computation time for factoring should be reevaluated.
Notation
In Sections 3.3 and 3.4, we represent the number of gates by the following vector:

$$(b_k, b_{k-1}, \ldots, b_1, b_0)$$

The (i+1)-st element from the end, b_i, is the number of C^i-NOT gates for i ≥ 1, and the last element, b_0, is the number of NOT gates.
For example, consider a five-element vector (b_4, b_3, b_2, b_1, b_0). The first element, b_4, is the number of C^4-NOT gates, the second, b_3, is the number of C^3-NOT gates, and so on; the last element, b_0, is the number of NOT gates.
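Because the vector is indexed from the end, it is easy to misread. A small Python helper (hypothetical, for illustration) that attaches gate names to such a vector; the sample entries are made-up numbers, not values from this paper:

```python
def label_gate_vector(vector):
    """Map (b_k, ..., b_1, b_0) to gate names; b_i counts C^i-NOT, b_0 counts NOT."""
    k = len(vector) - 1
    labels = {}
    for position, count in enumerate(vector):
        i = k - position                        # the first element is b_k
        name = "NOT" if i == 0 else f"C^{i}-NOT"
        labels[name] = count
    return labels


print(label_gate_vector((1, 2, 3, 4, 5)))
# {'C^4-NOT': 1, 'C^3-NOT': 2, 'C^2-NOT': 3, 'C^1-NOT': 4, 'NOT': 5}
```

Here C^1-NOT is the ordinary C-NOT and C^2-NOT is the Toffoli gate.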
Regular ADD (R-ADD)
The regular ADD [7], [8] is based on classical addition circuits. It consists of three elementary circuits, CARRY, SUM and the inverse of CARRY, from which the regular ADD circuit is assembled. This circuit needs n − 1 clean ancilla qubits as CARRY bits. The average number of gates for R-ADD is given as follows.
The first element is the number of C^2-NOT (Toffoli) gates, and the second and last elements are the numbers of C-NOT and NOT gates, respectively. The number of gates for one C^2-Mod-ADD is given as follows. In the case of Type1,
where the first element is the number of C^5-NOT gates. In the case of Type2,
where the first element is the number of C^4-NOT gates. Since Mod-EXP consists of 2nm C^2-Mod-ADD and m C-SWAP operations, the total number of gates for Mod-EXP is given as follows.
Type1 m(4n
Next, we decompose the C^5-, C^4- and C^3-NOT gates into Toffoli gates and estimate the number of elementary gates for constructing Mod-EXP. The following facts are known about decomposing C^k-NOT into Toffoli gates [10].
• If there are k − 2 clean ancilla qubits, C^k-NOT can be decomposed into 2k − 3 Toffoli gates for k ≥ 2.
• If there are k − 2 unclean ancilla qubits, C^k-NOT can be decomposed into 4k − 8 Toffoli gates for k > 2.
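The clean-ancilla count follows from the usual compute-uncompute ladder: k − 2 Toffoli gates accumulate the AND of the controls into the clean ancillas, one Toffoli flips the target, and k − 2 more Toffoli gates return the ancillas to 0, for 2k − 3 in total. A classical Python sketch of that ladder (function name ours; the unclean-ancilla 4k − 8 variant is not shown):

```python
from itertools import product


def ck_not_clean(controls, target):
    """C^k-NOT built from Toffoli gates with k - 2 clean ancillas.

    Returns (new_target, ancillas_after, toffoli_count)."""
    k = len(controls)
    assert k >= 2
    anc = [0] * (k - 2)
    count = 0

    def toffoli(x, y, value):
        nonlocal count
        count += 1
        return value ^ (x & y)

    # Compute: anc[j] = AND of controls[0..j+1] (k - 2 Toffolis).
    for j in range(k - 2):
        left = controls[0] if j == 0 else anc[j - 1]
        anc[j] = toffoli(left, controls[j + 1], anc[j])
    # One Toffoli onto the real target.
    last = controls[0] if k == 2 else anc[-1]
    target = toffoli(last, controls[-1], target)
    # Uncompute the ladder in reverse order (k - 2 Toffolis).
    for j in reversed(range(k - 2)):
        left = controls[0] if j == 0 else anc[j - 1]
        anc[j] = toffoli(left, controls[j + 1], anc[j])
    return target, anc, count


# Exhaustive check for small k: correct target, clean ancillas, 2k - 3 gates.
for k in (2, 3, 4, 5):
    for bits in product((0, 1), repeat=k):
        new_t, anc, count = ck_not_clean(list(bits), 0)
        assert new_t == int(all(bits)) and anc == [0] * (k - 2) and count == 2 * k - 3

print({k: ck_not_clean([1] * k, 0)[2] for k in (3, 4, 5)})  # {3: 3, 4: 5, 5: 7}
```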
If we use R-ADD, the first decomposition applies to almost every C^k-NOT gate with k = 3, 4, 5, since unused CARRY qubits can serve as clean ancilla qubits. Then C^k-NOT for k = 3, 4, 5 decomposes into 3, 5 and 7 Toffoli gates, respectively. Hence, the numbers of gates for constructing Mod-EXP are given as follows.
Type1: m(162n
Hence, the total numbers of elementary gates are given as follows.
We conclude that Type2 is more effective if R-ADD is selected. The number of qubits is m + 3n + 1.
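To show how CARRY, SUM and the inverse of CARRY fit together, the Python sketch below simulates a ripple-carry adder of that kind bit by bit, in the spirit of the construction cited in [7], [8]. It is an uncontrolled toy version with our own register layout and names, not the exact circuit whose gate counts are given above.

```python
from collections import Counter


def toffoli(bits, c1, c2, t, counts):
    bits[t] ^= bits[c1] & bits[c2]
    counts["Toffoli"] += 1


def cnot(bits, c, t, counts):
    bits[t] ^= bits[c]
    counts["C-NOT"] += 1


def carry(bits, c, a, b, c_next, counts):
    # c_next ^= majority(a, b, c); leaves b = a XOR b
    toffoli(bits, a, b, c_next, counts)
    cnot(bits, a, b, counts)
    toffoli(bits, c, b, c_next, counts)


def carry_inv(bits, c, a, b, c_next, counts):
    # Exact reverse of carry(): restores b and clears c_next
    toffoli(bits, c, b, c_next, counts)
    cnot(bits, a, b, counts)
    toffoli(bits, a, b, c_next, counts)


def sum_(bits, c, a, b, counts):
    # b ^= a XOR c
    cnot(bits, a, b, counts)
    cnot(bits, c, b, counts)


def ripple_add(a_val, b_val, n):
    """Return (a + b, gate counts) for n-bit a and b; the sum has n + 1 bits."""
    A = list(range(n))                       # a_0 .. a_{n-1}
    B = list(range(n, 2 * n + 1))            # b_0 .. b_n (b_n receives the top carry)
    C = list(range(2 * n + 1, 3 * n + 1))    # c_0 .. c_{n-1}; c_0 stays 0
    bits = [0] * (3 * n + 1)
    for i in range(n):
        bits[A[i]] = (a_val >> i) & 1
        bits[B[i]] = (b_val >> i) & 1
    counts = Counter()
    for i in range(n):                       # forward CARRY ripple
        target = C[i + 1] if i + 1 < n else B[n]
        carry(bits, C[i], A[i], B[i], target, counts)
    cnot(bits, A[n - 1], B[n - 1], counts)   # undo a XOR b on the top bit
    sum_(bits, C[n - 1], A[n - 1], B[n - 1], counts)
    for i in range(n - 2, -1, -1):           # inverse CARRY then SUM, top down
        carry_inv(bits, C[i], A[i], B[i], C[i + 1], counts)
        sum_(bits, C[i], A[i], B[i], counts)
    assert all(bits[q] == 0 for q in C), "carry ancillas must be clean"
    return sum(bits[B[i]] << i for i in range(n + 1)), counts
```

With c_0 fixed at 0, the qubits c_1, …, c_{n−1} are the n − 1 clean CARRY ancillas; gates controlled by c_0 act trivially and could be dropped. This toy version uses 4n − 2 Toffoli and 4n C-NOT gates.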
ADD Using Generalized Toffoli (GT-ADD)
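First, we describe the circuit for adding 2^i to |b_{n−1} b_{n−2} … b_1 b_0⟩. This operation is realized by a sequence of generalized Toffoli gates: for j running from the most significant bit down to i, b_j is flipped if b_i, …, b_{j−1} are all 1 (for j = i, this is a NOT gate on b_i).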
To add a = a_{n−1} … a_0 to |b_{n−1} … b_0⟩, we run the above sequence for each i such that a_i = 1. This circuit needs no ancilla qubits [8]. The average number of gates for GT-ADD is given as follows.
$$\left(\tfrac{1}{2}, 1, \tfrac{3}{2}, \ldots, \tfrac{n}{2}, \tfrac{n}{2}\right)$$
The i-th element is the number of C^{n+1−i}-NOT gates, and the last one is the number of NOT gates. Note that this average is obtained by assuming that a_i = 1 with probability 1/2 for each i.
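The Python sketch below traces this construction classically and tallies the gates by number of controls, which reproduces the average above. It assumes, for illustration, that the n-bit value a is added into an (n+1)-bit register (so the carry-out has a place to go, as in R_1R_2) and that each cascade flips bits from the most significant one downward; the function names are ours.

```python
from collections import Counter
from fractions import Fraction


def gt_add(a: int, b: int, n: int):
    """Add n-bit a into (n+1)-bit b with generalized Toffoli cascades.
    Returns (result, Counter mapping k -> number of C^k-NOT gates)."""
    bits = [(b >> j) & 1 for j in range(n + 1)]
    counts = Counter()
    for i in range(n):
        if not (a >> i) & 1:
            continue                      # the cascade for 2^i runs only when a_i = 1
        for j in range(n, i - 1, -1):     # highest target first
            # C^(j-i)-NOT: flip b_j if b_i .. b_{j-1} are all 1 (j = i is a NOT)
            if all(bits[i:j]):
                bits[j] ^= 1
            counts[j - i] += 1
    return sum(bit << j for j, bit in enumerate(bits)), counts


def average_counts(n: int):
    """Average C^k-NOT counts over all 2^n values of a, listed from k = n down to 0."""
    total = Counter()
    for a in range(2 ** n):
        total.update(gt_add(a, 0, n)[1])
    return [Fraction(total[k], 2 ** n) for k in range(n, -1, -1)]


print([str(f) for f in average_counts(4)])  # ['1/2', '1', '3/2', '2', '2']
```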
First, we show the total number of gates for Mod-EXP in the case of Type1 construction.
We decompose each C^k-NOT into Toffoli gates. All C^k-NOT gates except C^{n+3}-NOT can be decomposed into 4(k − 2) Toffoli gates, since more than k − 2 qubits are available as unclean ancilla qubits. By adding one more ancilla qubit, we can also decompose C^{n+3}-NOT. Hence, the total number of Toffoli gates is given as follows.
The number of qubits is m + 2n + 3. Next, we show the total number of gates for Mod-EXP in the case of the Type2 construction.
• #C^{n+2}-NOT = 3mn.
We decompose each C^k-NOT into Toffoli gates. All C^k-NOT gates can be decomposed into 4(k − 2) Toffoli gates. Hence, the total number of Toffoli gates is given as follows.
The number of qubits is m + 2n + 2. The total numbers of gates are given as follows.
Type1: m(
We can conclude that Type1 is more effective if GT-ADD is selected.
Quantum ADD (Q-ADD)
By applying the quantum addition q-ADD(a) [9], [11] to the input state |φ(y)⟩, we obtain |φ(y + a)⟩, where φ(y) denotes the quantum Fourier transform (QFT) of y. Hence, ADD can be realized as follows. First, we apply QFT to |y⟩ to obtain |φ(y)⟩. Second, we apply q-ADD(a) to obtain |φ(y + a)⟩. Finally, we apply QFT^{−1} to obtain |y + a⟩. This circuit also needs no ancilla qubits.
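The key fact is that addition becomes a diagonal phase in the Fourier basis: multiplying the k-th Fourier amplitude by e^{2πi·ak/2^n} turns φ(y) into φ(y + a) (mod 2^n), and that phase factors into one single-qubit phase per qubit of k, which is why q-ADD needs only rotation gates. A small state-vector sketch in pure Python, assuming the usual sign convention for the QFT; it uses a direct O(4^n) DFT, so it is only meant for a few qubits, and the function names are ours.

```python
import cmath


def qft(state):
    """Unitary DFT: |y> -> (1/sqrt(M)) sum_k exp(2*pi*i*y*k/M) |k>."""
    M = len(state)
    return [sum(state[y] * cmath.exp(2j * cmath.pi * y * k / M) for y in range(M)) / M ** 0.5
            for k in range(M)]


def iqft(state):
    M = len(state)
    return [sum(state[k] * cmath.exp(-2j * cmath.pi * y * k / M) for k in range(M)) / M ** 0.5
            for y in range(M)]


def q_add(fourier_state, a, n):
    """Add a in the Fourier basis using one phase per qubit of k."""
    M = 2 ** n
    out = []
    for k, amp in enumerate(fourier_state):
        phase = 1
        for j in range(n):
            if (k >> j) & 1:                     # qubit j of k is |1>
                phase *= cmath.exp(2j * cmath.pi * a * 2 ** j / M)
        out.append(amp * phase)
    return out


def add_via_fourier(y, a, n):
    """QFT, q-ADD(a), inverse QFT; returns the basis state holding the amplitude."""
    state = [0] * 2 ** n
    state[y] = 1
    result = iqft(q_add(qft(state), a, n))
    return max(range(2 ** n), key=lambda v: abs(result[v]))


print(add_via_fourier(5, 6, 3))  # 3, i.e. (5 + 6) mod 8
```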
Next, we evaluate the number of gates. Let R_k denote the rotation gate

$$R_k = \begin{pmatrix} 1 & 0 \\ 0 & e^{2\pi i/2^k} \end{pmatrix}.$$

QFT consists of n + 1 Hadamard gates H and, for each 2 ≤ i ≤ n + 1, n + 2 − i controlled rotation gates C-R_i. On average, the q-ADD operation consists of (n + 2 − i)/2 R_i gates for each i. First, we show the total number of gates for Mod-EXP in the case of Type2.
The C^2-R_i gate can be decomposed into six C-NOT gates and eight single-qubit operations [10], and the C-R_i gate into two C-NOT gates and four single-qubit operations. After this decomposition, the numbers of C-NOT gates and single-qubit operations are given as follows.
• #C-NOT = m(18n
Hence, the total number of operations is given as
Next, we show the case of Type 1.
The C^3-R_i gate can be decomposed into 14 C-NOT gates and 16 single-qubit operations [10]. After this decomposition, the numbers of C-NOT gates and single-qubit operations are given as follows.
• #C-NOT = m(72n
We can conclude that Type2 is more effective if Q-ADD is selected.
It is known that if i is large enough, R_i can be approximated by the identity [11]. In this case, the total number of gates can be reduced to O(mn^2 log n). The number of qubits is m + 2n + 2 in both cases.
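The QFT gate count stated above, and the effect of dropping small rotations, can be tallied directly. In the Python sketch below, an (n+1)-qubit QFT has n + 1 Hadamard gates and, for each 2 ≤ i ≤ n + 1, n + 2 − i gates C-R_i. The optional `cutoff` is a parameter we introduce for illustration: every C-R_i with i above it is treated as the identity, in the spirit of the approximation cited from [11]. The sketch only counts gates and says nothing about the resulting approximation error.

```python
import math


def qft_gate_counts(n, cutoff=None):
    """(Hadamards, controlled rotations) for an (n+1)-qubit QFT.

    For each 2 <= i <= n+1 there are n+2-i gates C-R_i; with a cutoff,
    rotations with i > cutoff are treated as identity and dropped."""
    hadamards = n + 1
    rotations = 0
    for i in range(2, n + 2):
        if cutoff is None or i <= cutoff:
            rotations += n + 2 - i
    return hadamards, rotations


n = 576
exact = qft_gate_counts(n)[1]                                  # n(n+1)/2
banded = qft_gate_counts(n, cutoff=math.ceil(math.log2(n)) + 2)[1]
print(exact, banded)
```

With a cutoff of order log n, each QFT keeps O(n log n) rotations instead of n(n + 1)/2, which is the source of the reduced asymptotic count.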
FFT-Based Multiplication
Using FFT-based multiplication, the computation time for modular exponentiation can be reduced; Schönhage-Strassen multiplication [12] is a good example. With Schönhage-Strassen multiplication, the asymptotic cost of modular exponentiation is O(n^2 log n log log n), so this approach can be more efficient than the three algorithms analyzed here. However, such algorithms have the following serious drawbacks.
• Schönhage-Strassen multiplication is the fastest known algorithm for very large integers, but it is not efficient for integers that are not so large. In this context, 576-bit and 1024-bit integers are small, so Schönhage-Strassen multiplication and its variants should not be used for them. This holds for classical computation, and we believe it also holds for quantum computation.
• Schönhage-Strassen multiplication needs many more qubits than the algorithms analyzed here.
Although we have not analyzed these two problems in detail, we believe that the above discussion is reliable. Hence, we conclude that FFT-based multiplication is not efficient for factoring our target integers, such as 576-bit integers. Detailed analyses are left for future work.

Table 1 summarizes the number of qubits and the dominant term of the number of gates. We set m = 1 in evaluating the number of qubits and m = 2n in evaluating the number of gates; m = 1 is possible by applying the one-control-qubit trick [9], [13]. Note that we show only the size of each circuit, not its depth, since we do not consider any parallel computation. Table 2 shows the number of gates for factoring 576-bit and 1024-bit numbers.
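The qubit counts derived in the preceding subsections can be evaluated directly with m = 1. A short Python sketch (the dictionary keys and function name are our labels); for n = 576 it reproduces the 1730 qubits for R-ADD and 1155 qubits for Q-ADD that appear in the running-time discussion.

```python
QUBIT_FORMULAS = {
    "R-ADD": lambda m, n: m + 3 * n + 1,
    "GT-ADD Type1": lambda m, n: m + 2 * n + 3,
    "GT-ADD Type2": lambda m, n: m + 2 * n + 2,
    "Q-ADD": lambda m, n: m + 2 * n + 2,
}


def qubit_counts(n: int, m: int = 1) -> dict:
    """Number of qubits per circuit; m = 1 assumes the one-control-qubit trick."""
    return {name: f(m, n) for name, f in QUBIT_FORMULAS.items()}


for n in (576, 1024):
    print(n, qubit_counts(n))
```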
Evaluation of the Number of Gates
Nielsen and Chuang [6] report that the maximal number of operations is about 10^3 to 10^13; we summarize these values in Table 3. We can verify that the necessary number of gates exceeds the maximal number of operations for most candidate devices. Note that the values in Table 3 are theoretical or ideal, so they will not be achievable in real circumstances, at least in the near future. Next, we estimate the running time for factoring 576-bit and 1024-bit numbers for four different gate operation times, where the gate operation time is the time needed to operate one elementary gate (Toffoli, C-NOT or single-qubit operation gate). Tables 4 and 5 show the running times for factoring 576-bit and 1024-bit composite numbers, respectively. We assume that all operations are executed serially and that the operation times of all kinds of elementary gates are equal.
A 576-bit number can be factored in about 1 month even with classical computers. Consider the conditions under which quantum computers outperform classical computers. If we adopt the regular ADD, each elementary gate must be operated within 50 µsec; if we adopt the quantum addition, within 7.8 µsec. In summary, to outperform classical computers, we must realize either a quantum computer with 1730 qubits and a 50 µsec gate operation time or one with 1155 qubits and a 7.8 µsec gate operation time. In other words, even a quantum computer with an infinite or very long decoherence time would not help to factor a large composite if its gate operations are slow. This is not a pessimistic result: if the gate operation time is short enough, this problem does not arise.
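Under the two premises (serial execution and equal gate times), the running time is the gate count multiplied by the gate operation time, and the break-even gate time is the classical running time divided by the gate count. A Python sketch of this arithmetic; the gate count used below is a hypothetical placeholder to be replaced by the values in Table 2, and "1 month" is taken as 30 days by assumption.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600        # assumption: 1 month = 30 days


def running_time_s(gate_count: float, gate_time_s: float) -> float:
    """Serial execution with equal gate times: total time = gates x gate time."""
    return gate_count * gate_time_s


def break_even_gate_time_s(gate_count: float, classical_time_s: float) -> float:
    """Longest gate time for which the quantum run is no slower than the classical one."""
    return classical_time_s / gate_count


hypothetical_gates = 1e11                 # placeholder, not a value from Table 2
t_max = break_even_gate_time_s(hypothetical_gates, SECONDS_PER_MONTH)
print(f"break-even gate time: {t_max * 1e6:.1f} microseconds")
print(f"running time at 1 ns per gate: {running_time_s(hypothetical_gates, 1e-9):.0f} s")
```

Because the running time is linear in both factors, halving the gate count and halving the gate time have the same effect on the break-even condition.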
Conclusions and Future Works
We evaluated the exact number of gates in circuits for Shor's factoring algorithm. We estimated the running time for factoring large composites, such as 576-bit and 1024-bit numbers, by setting the gate operation time appropriately. Consequently, we showed that if we adopt devices with slow gate operations or qubit-saving circuits, factorization will not be completed within a feasible time.
If a new, efficient algorithm for modular exponentiation is proposed, the situation may change drastically; in that case, the computational time for factoring will need to be reevaluated.
