Reversible circuits for modular multiplication Cx%M with x < M arise as components of modular exponentiation in Shor's quantum number-factoring algorithm. However, existing generic constructions focus on asymptotic gate count and circuit depth rather than actual values, producing fairly large circuits not optimized for specific C and M values. In this work, we develop such optimizations in a bottom-up fashion, starting with most convenient C values. When zero-initialized ancilla registers are available, we reduce the search for compact circuits to a shortest-path problem. Some of our modular-multiplication circuits are asymptotically smaller than previous constructions, but worstcase bounds and average sizes remain Θ(n 2 ). In the context of modular exponentiation, we offer several constant-factor improvements, as well as an improvement by a constant additive term that is significant for few-qubit circuits arising in ongoing laboratory experiments with Shor's algorithm.
Introduction
The pursuit of quantum computation [15] has generated both excitement and controversy, while producing few compelling empirical demonstrations so far. Adiabatic computing experiments by DWave Systems were sharply criticized for not demonstrating quantum entanglement and not solving hard problem instances that would confound best known problem-specific algorithms on non-quantum computers. Several academic groups implemented Shor's number-factoring algorithm on several qubits to factor the number 15, recalling that asymptotic worst-case complexity of Shor's algorithm is polynomial while best known number-factoring algorithms for non-quantum computers take more than polynomial time to run, both in theory and in practice. Experiments with photonic quantum gates [12, 11, 16] suggest the presence of entanglement, 1 but leave unclear how entanglement is going to scale in larger systems. Recent ion traps decrease per-gate error rates below the threshold estimate for fault-tolerant quantum computing [7] , making sophisticated quantum algorithms more practical if appropriate quantum error-correction is used. Shor's algorithm remains the best candidate for benchmarking quantum algorithms because (i) it solves a practical problem for which optimized non-quantum software is also available, (ii) it has been thoroughly studied, and (iii) it can be implemented with several known circuits.
Reducing the size of quantum circuits required by Shor's algorithm [5, 20] -the focus of our workdecreases resource requirements for future quantum computers in a non-linear way because larger circuits entail heavier overhead for quantum error-correction [21] . In comparisons to non-quantum numberfactoring software, smaller circuits can make quantum computers more competitive. However, quantum simulators [23] can also run faster on smaller circuits. The significance of simulation in benchmarking quantum algorithms is twofold. First, simulation can help studying intermediate states generated by a quantum algorithm and estimate the amount of quantum entanglement in these states. Second, simulators can be viewed as competing non-quantum algorithms. While this aspect of simulation is often dismissed a priori, an instructive example is given by the Quantum Fourier Transform (QFT). It was recently discovered that QFT can be efficiently simulated when used stand-alone [1, 24] (but not as part of Shor's algorithm) and thus does not offer a quantum speed-up, despite generating a significant amount of entanglement. This unexpected result was obtained independently by several researchers [1, 24] by optimizing approximate QFT circuits for a specific simulation technique [13] . 
Shor's algorithm
Shor's algorithm seeks to factor a given value M > 0, which we assume to be semiprime M = pq with unknown factors. The strategy is to consider the functions f b (x) = x b %M 2 , potentially with several different 1 < b < M values and determine their periods in case gcd(b, M ) = 1. When the period is determined to be even b 2π %M = 1, we have (b π − 1)(b π + 1)%M = 0, thus either (b π − 1) or (b π + 1) must share at least one prime factor with M . If b π %M = −1, such a factor can be found using gcd(b π ± 1, M ), otherwise it leads to the trivial factors 1 and M . When the period is determined to be odd, another b value is tried.
The period-finding procedure relies on a quantum circuit (Figure 1 ), instantiated for a given value 1 < b < M coprime with M . The circuit operates on two 0-initialized quantum registers [15] with
• a block of parallel Hadamard gates on Register 1,
• a circuit for modular exponentiation (mod-exp) evaluates f (y) = b y %M by mapping |y |0 → |y |f (y) , where y is read from Register 1 and f (y) is written to Register 2; Register 1 can be temporarily modified, but must be restored at the end,
• a circuit for the Quantum Fourier Transform (QFT) on Register 1,
• a block of parallel measurements on Register 1.
The first and last blocks cannot be optimized any further. QFT circuits are understood fairly well and are much smaller than circuits for modular exponentiation [15] . Therefore, our focus is on mod-exp circuits. They typically consist of reversible gates -NOT (N ), CNOT (C) and Toffoli (T ) -which can be modeled and optimized entirely in terms of Boolean logic [17] . However, in physical implementations, Toffoli gates must be decomposed into smaller gates directly implementable in a given technology [18] . Reversible circuits for modular exponentiation start with an inverter on Register 2 that changes the |000 · · · 0 value to |000 · · · 1 , and otherwise exhibit the following structure: each (i-th) bit of Register 1 enables (controls) a circuit block that multiplies Register 2 by C i = b 2 i %M and reduces the result %M . When b and M are known, C i can be pre-computed without quantum computation. Therefore, we refer to C i x%M -blocks below. They are typically implemented using shift and addition circuits, and a number of relevant quantum adders are known [9, 19] . The selection of appropriate adder types is discussed in [20, 10] .
Each controlled modular multiplication is traditionally implemented separately. When dealing with reversible logic and quantum circuits, we note that the coprimality of C and M makes x → Cx%M a reversible transformation. The number of coprime C values is ϕ(M ) = (p − 1)(q − 1), where ϕ(M ) is the Euler's totient function and gives the size of (Z/M Z) × -the multiplicative group of integers mod-M . For M = 15, modular multiplication circuits for the eight C coprime values are illustrated in Figure 2 . Figure 3 shows circuits for f (x) = b x %15, gcd(b, 15) = 1. When not knowing p and q, one should also not assume any knowledge that would make it easy to find them. For example, one should not choose C that satisfies C 2π = 1%M with a known (small) π because such solutions would allow one to factorize M via gcd(C π ± 1, M ). Also recall that (Z/M Z) × is a product of two cyclic groups Z/pZ and Z/qZ, and thus (Z/M Z)
× admits a generating set with only two elements. However, knowing such generators is tantamount to knowing p and q. When working with specific small M = pq, it is sometimes difficult to avoid using the knowledge of p and q, but results obtained this way do not necessarily scale to large values. The same can be said about results produced through exhaustive search.
Known circuits for modular multiplication by a constant
We now outline several approaches to modular multiplication by a constant and point out their potential inefficiencies. It is commonly agreed that techniques that give asymptotically the smallest gate counts (based on Fast Fourier Transforms) are not practical for up to several hundred bits, and we do not discuss them here. Karatsuba multiplication also does not appear competitive, as far as we can tell.
Multiplication by a known constant C (not modular) can be implemented as a sequence of alternating shifts and additions, where one of the addends is always x [6] . When C is even, we factor out a power of two and accumulate a multiplication by a power of two, leaving a smaller odd constant C ′ . For an odd constant C > 1, we subtract one and accumulate a +x operator, leaving a smaller even constant C ′′ . This process stops at 1 and essentially traverses the binary expansion of C from the least significant bit to the most significant bit, resulting in n 1 − 1 additions when the binary expansion of C includes n 1 non-zero bits. On average, an n-bit number has n/2 non-zero bits. An improvement is possible by also using −x operators. When C is odd, we consider C%4. When C%4 = 1, we proceed as above. When C%4 = 3, we add one and accumulate a −x operator. This step may temporarily increase an odd constant by one, but always results in constants divisible by four, so the next step will decrease it by more. This algorithm essentially constructs the so-called Canonical Signed Digit (CSD) representation [3, 6] that prohibits adjacent non-zero bits. Thus, the number of additions and subtractions cannot exceed n/2 and averages n/3. For example, consider 39=0b100111. Rather than expand 39x = 2(2(8x + x) + x) + x with three additions, we can expand 39x = 8(4x + x) − x with only two addition/subtraction operations.
Computing Cx%M by reversible circuits through binary or CSD expansion of C poses several challenges. This technique is based on the operation 2 k x + x and thus requires a modular adder circuit with two (unknown) arguments, and must also copy the x value to a separate register. A single (ancilla) register suffices, but clearing it (reinitializing to 0) after the additions requires effort. As we explain below, clearing the ancillae requires another modular-multiplication circuit. For constants C with sparse CSD expansion, the ancillae-clearing circuit can be much larger than the multiplication itself, as its CSD expansion is unlikely to be sparse. In general, the second circuit requires on the order of n 2 gates for n-bit arguments, and is the same size as the first circuit, on average.
Computing Cx%M using binary expansion of x, rather than C, 3 entails chaining x i -controlled mod-M additions of constants (2 i C)%M . As shown in Section 2, such reversible modular addition-of-aconstant circuits can be simplified for each particular constant and require a single (register) argument rather than two (cf. previous paragraph). Even for the simplest constants, quadratically many gates are required. Moreover, x cannot be modified while modular additions are controlled by the bits of x. Therefore, the additions must be accumulated in a separate register, which again requires clearing the garbage ancillae.
Clearing garbage ancillae. In reversible circuits, 0-initialized ancillae must be cleared by each circuit block (except, possibly, the last), but some blocks produce garbage bits. For example, using traditional implementations of constant-multiplication as a sequence of shifts and adds requires creating a copy 3 Given that x is not a constant, its CSD expansion is not easily available and cannot be used in combinational multiplication circuits.
⊕ gates indicate inverters, while the two-bit gates are bit-swaps. All gates are linear and can thus be simulated on an initial full-superposition state using the stabilizer formalism [15] .
Figure 3: Circuits for f (x) = b x %15, gcd(b, 15) = 1. These circuits compare favorably with better-known circuits in terms of controlled-SWAP gates, because each controlled-SWAP is worth three Toffoli gates.
of the input, e.g., to compute 3x = 2x + x. However, clearing this copy (using the result of multiplication) essentially requires a division operation. In the context of modular multiplication Cx%M with gcd(C, M ) = 1, division can be performed as multiplication by the modular inverse C −1 %M precomputed by the extended Euclidean GCD algorithm [6] . We will now show how this approach was developed by Bennett to construct reversible modular multiplication circuits that clear their ancillae [22, Section II], [5, . Assume a reversible circuit computing g(x) = Cx%M using a copy-register:
When gcd(C, M ) = 1, the function g(x) is reversible, and the same construction can be applied to g −1 (z) = C −1 z%M , where C −1 is the modular inverse of C.
when z = g(x), we get
A reversible circuit can be reversed -by reversing the order of the gates and replacing each gate with its inverse, keeping in mind that the gates NOT, CNOT and Toffoli are self-inverse.
Applying U −1 g −1 after U × g replaces x with g(x) and leaves the copy-register initialized to 0.
Unfortunately, when C has sparse binary or CSD expansion, it is unlikely, in general, that so will C −1 %M . Thus, for constants like 2, 8, 17 and 63, not only we have to implement two multiplications rather than one, but the second one may require a much larger circuit, and the number of ancillae can be significant.
Modular exponentiation circuits
A number of mod-exp circuits have been proposed in the literature. The traditional approach is to implement each controlled modular multiplication separately and chain these operations. Circuits used in laboratory experiments with several qubits typically use the following shortcut. Since modular multiplications in mod-exp start with the value 1, the number of possible outcomes after the first k multiplications is at most 2 k . Therefore, for k = 1, 2, one can conditionally produce these outcomes without performing multiplication. This observation is also useful when many qubits are available, but one seeks to decrease the depth of the circuit rather than gate counts. In this case, one can establish one register for each conditional multiplication Cx%M and use CNOT gates in each register to conditionally replace the initial value 1 with C. All these operations are done in parallel and followed by a tree of multipliers. At the cost of a several-fold increase in gate counts and an asymptotic increase in bitlines (from linear to quadratic), circuit depth reduces from linear to logarithmic. As long as bitlines are the most valuable and limited resource of quantum computers, this parallel approach remains impractical.
Paper outline
Basic circuit blocks for addition, comparison and modular reduction are introduced in Section 2. Based on these blocks, we develop multiplicative blocks in Section 3, such as inversion, division with remainder, and multiplication by constants. In several important cases, we develop linear-sized modular multiplication circuits which were not known before. Whereas traditional circuit-synthesis algorithms [17] operate at the bit level, we introduce word-level algorithms that perform dramatically better. Section 4 proposes a new approach for building Cx%M circuits based on modular decomposition of x that can be implemented by compact circuits in some cases. Section 5 defines several circuit operators for producing additional circuits. Examples are given in Section 6. Section 7 proposes circuits for modular exponentiation, based on techniques from earlier sections. Section 8 shows examples.
Additive circuit blocks
Key arithmetic blocks used by modular multiplication are adders, subtractors and comparators, along with their controlled variants. Such circuit blocks are well-known for conventional digital logic, but must be adapted to the reversible context so as to avoid explicit fanout and minimize the number of ancillae. We introduce such reversible blocks below and illustrate several possible circuit optimizations. One such optimization deals with the insertion of control (enable) signals.
Addition and subtraction. A number of adder circuits developed in the literature can be used in our constructions. To this end, Takahashi [19] describes several other adders with different trade-offs between circuit size, circuit depth and the required number of ancillae. To be specific, we are using linear-sized adders by Cuccaro et al [9] , illustrated in Figure 4b , which are the smallest known. They are built using MAJ and UMA blocks shown in Figure 4a . An n-bit Cuccaro adder requires 2n Toffoli gates and 4n + 1 CNOT gates. Subtraction can be evaluated using bitwise negation as (x − y) = (x ′ + y)
The latter formula becomes competitive when the minuend y is known and M − y contains more 0 bits than does y.
Controlled addition. The structure of Cuccaro adders facilitates controlled addition with a smaller overhead. The straightforward solution is to enable such an adder by adding a control to every gate, requiring Toffoli gates with three controls that need to be broken down into smaller gates. A more economical solution is to disable a Cuccaro adder by (1) disabling the middle CNOT gate by adding a control, (2) ensuring that the matching MAJ and UMA gates cancel out. A close inspection of MAJ and UMA gates suggests that their Toffoli gates and their middle CNOT gates cancel out. The outer CNOT gates can be disabled by adding controls, turning them into Toffoli gates, as illustrated by CMAJ and CUMA blocks in Figure 4a . Thus, an n-bit controlled addition is possible with 4n + 1 Toffoli gates and 2n CNOTs.
Controlled addition of a constant (not modular).
A known n-bit value with n 1 non-zero bits can be set on zero-initialized ancillae using n 1 inverters. The adder may modify those values temporarily, but restores them at the end, which allows one to restore the ancillae lines to zeros for use in subsequent circuit blocks. Some of these inverters cancel out in the final circuit. When a control input of a gate is known to be 0 or 1, the gate can be simplified or removed entirely, as shown in Figure 4c . Such optimizations can be performed by a straightforward circuit traversal. Not counting some of the above simplifications, such a circuit requires no more than 3n − 5 Toffoli gates, n 1 + 2 CNOT gates, 2n 1 inverters. Given a constant C 0 , one can compute M − C 0 and compare possible circuits for adding C 0 and subtracting M − C 0 .
Comparators a < b are similar to subtractors -one subtracts a − b and checks a − b < 0. Cuccaro adders can be modified to perform comparison, leaving their data inputs unchanged and producing a one-bit result as the most significant carry-bit of subtraction. Therefore, after the MAJ blocks, one uses inverse MAJ blocks instead of UMA blocks used in adders. When comparing to a known constant, simplifications are possible as in Figure 4d . Comparing to a known n-bit constant with n 1 non-zero bits, such circuits require no more than 2n − 2 Toffoli gates, 3 CNOT gates, and 4n 1 inverters.
Modular reduction x%M for x ≤ 2M can be performed with one comparator and one conditional
(a) MAJ and UMA blocks used in addition, subtraction and comparator circuits.
based on MAJ and UMA blocks. subtraction, connected serially with at most 5n − 7 Toffoli gates, n 1 + 5 CNOT gates, and 8n 1 inverters (2n 1 inverters to set and reset ancillae before and after the computation). Figure 5a shows such a circuit and exhibits additional gate optimizations at the interface between the comparator and the subtractor. The inverter on x 3 is the result of simplifying a CNOT gate in a Cuccaro adder. Figure 5b shows further optimizations using Toffoli gates with negative controls. 4 Figure 6 illustrates controlled modular reduction, where the comparator and the subtractor remain intact, but the result of comparison is conditioned on the new control using a new ancilla. This ancilla is cleared at the end, but the garbage output γ inherited from uncontrolled modular reduction remains. Modular reduction for x ≪ M is discussed in Section 3 under division with remainder.
Conditional modular addition of a constant. The straightforward implementation y = (x + a)%M by adding a constant and then performing mod-M reduction clears the added carry bit, but leaves a garbage bit. This bit can be cleared by (y < a). To avoid the carry, precompute
5 and clear the ancilla via (y < a). Comparators optimized for x < α M may be smaller than those for x < M . 4 In practice, CNOTs and Toffoli gates with negative controls may be as easy to implement as the gates with positive controls. Otherwise, additional inverters around the controls suffice. Negative controls not only result in more compact circuit diagrams, but can also help reading such circuits. Recall that positively-controlled T gates with targets on 0-initialized ancillae compute the AND function ab ⊕ 0 = ab. Using negative controls and a 1-initialized ancilla computes the OR function:
5 The ternary conditional a ? b : c is similar to if(a) then b else c, but is more flexible. It returns the value of b or the value of c (it can be an l-value).
Figure 5: (a) A circuit for modular reduction with M = 21 (after optimization) consisting of a comparator and subtractor based on the Cuccaro construction [9] . For x > 21, γ = 1 which activates the subtractor block. Using this circuit as a building block may require clearing or avoiding the garbage output bit γ as in Figure 8 . (b) Further optimization using negative controls, shown with hollow circles. 
Multiplicative circuit blocks
We now develop several circuits for Cx%M and related operations, using additive building blocks from Section 2.
Circuits for (2 k + 1)x (not modular) can be constructed by shifts and adds, but the challenge is to avoid unnecessary garbage ancillae. Our circuits are structured as follows. For bit values x i (i < n) of x, the bit values of (2 k + 1)x, S i , can be constructed by a k-bit shift of x followed by an n + k bit add (i.e., 2 k x + x). The addition can be performed by a generic Cuccaro adder -2 k x on main qubits, x on ancillae, -but clearing these ancillae is difficult. Another approach is to construct logical sub-expressions for output bit i based on the bit values of x 1 · · · x i . Formula 6 gives sub-expressions for each S i bit. To calculate each S i , we precompute the incoming carry c i in Formula 6 and store it on an ancilla. For n such ancillae, we need at most 3n Toffoli gates. To construct S i values, at most 3n CNOT gates suffice. To clear the c i ancillae after use, the Toffoli gates that computed them are performed in reverse (their inputs did not change). With the additional 3n Toffoli gates to clear ancillae, a circuit for (2 k + 1)x needs up to 6n Toffoli gates and 3n CNOTs. Figure 9b illustrates a 4-bit 3x circuit after two optimizations: (i) literal reduction in S i and c i sub-expressions, and (ii) absorbing inverters in Toffoli gates with negative controls.
Circuits for −x%M . For bitwise negation
′ using one Cuccaro adder, as illustrated in Figure 7 for M = 21 and M ′ = 31 − 21 = 10. The proposed circuit maps x = 0, x = M , and x > M into M , 0, and 2 n + (M − x), respectively. Note that inverting any circuit for −x%M will produce a circuit computing −x%M
A circuit for conditional −x%M can be constructed by converting each inverter to a CNOT gate and applying the conditional modular reduction discussed in Section 2 and illustrated in Figure 6 .
Circuits for 2
k x%M for odd M > 2. We start with a linear-sized circuit for 2x%M that clears its ancillae. 6 The bulk of our 2x%M circuit computes x%⌈M/2⌉ using a modular-reduction circuit we described earlier, which evaluates x ≥ ⌈M/2⌉ on a 0-initialized ancilla, but also zeros out the most significant bit. To multiply x%⌈M/2⌉ by two, it suffices to rotate the bits, which moves the most significant zero into the least significant position. One also needs to (a) change the LSB to 1 conditional on the ancilla -this can be done with a CNOT, (b) clear the ancilla conditional on the LSB -this can be done with another CNOT. This circuit is illustrated in Figure 8a . Further circuit optimization uses three tricks. One is the merger of inverters into negative controls (shown with hollow circles) of Toffoli gates, this may benefit from creating pairs of canceling inverters and/or moving inverters through targets of CNOT/Toffoli gates. The second optimization deals with the two CNOTs at the end of the circuit. It creates a pair of canceling CNOTs prior to them, so that three CNOTs can be combined into a SWAP. The remaining CNOT gate is controlled by a 0 value created by doubling, thus can be removed. We are left with a chain of SWAP gates that rotate the significant bits and onto an ancilla, in particular, the 0 value is rotated onto the ancilla (at which point all ancillae are cleared). The third optimization interprets the bit rotation at the end of the circuit as a relabeling of outputs. The resulting circuit in Figure 8b works correctly only for x ≤ M , but x ≤ M can be guaranteed in Shor's algorithm.
Since output relabeling cannot be used in a controlled 2x%M circuit, controlled rotation can be implemented with controlled-SWAP gates. However, when multiple 2x%M circuits are concatenated to 6 Circuits in the literature may exhibit quadratic size because, to clear ancillae, they implement x/2 %M by finding the modular inverse of 2 (see Section 2) and decomposing it in binary.
Figure 8: Circuits for f (x) = 2x%21, (a) module-based design, (b) after three optimizations. The first and second sub-circuits in these figures are for (x > 10) and (x > 10 ? x − 11 : x), respectively. In (a), SWAP gates are used to compute 2x, the second-to-last CNOT adds 1 (i.e., 2x + 1), and the last CNOT clears the bottom ancilla. These gates can be removed by reordering output labels as shown in (b). implement 2 k %M , all controlled rotations can be merged into one such rotation at the end of the circuit.
Modular reduction 2 i x%M for M = 2 k ± 1 can be performed using a well-known algorithm. To compute x%(2 k − 1), add 2 k -ary digits of n-bit x modulo 2 k − 1. To compute x%(2 k + 1), alternate addition and subtraction of 2 k -ary digits of n-bit x modulo 2 k + 1. When k > n, no gates are needed. Otherwise, one can use ⌈n/k⌉ Cuccaro adders on log 2 M bits. In the case M = 2 k − 1, the output carry of each adder can be ignored. Hence, Toffoli and CNOT gate counts are ⌈n/k⌉ · 2k and ⌈n/k⌉(4k + 1), respectively. For M = 2 k + 1, the output carry of each Cuccaro adder should be considered. In this case, at most ⌈n/2k⌉ mod-M reduction modules on log 2 M bits are sufficient. Therefore, the numbers of Toffoli and CNOT gates are ⌈n/k⌉(2k+2)+⌈n/2k⌉(5k−2) and ⌈n/k⌉(4k+5)+⌈n/2k⌉(n 1 +5) (n 1 ≤ k+1 represents the number of non-zero bits in M ), respectively. Another approach to implement the required additions and subtractions is to implement the counters (0, 1, ..., 2 k ±1−1) and (2 k ±1−1, 2 k ±1−2, ..., 0) conditional on bit values of n-bit x as illustrated in Figure 10 for x%3. Clearly, no %M modules will be required in this approach. A factor of 2 i only changes the indices of the bits read by the baseline algorithm. All circuits constructed here exhibit linear number of CNOT and Toffoli gates in terms of n.
Special case 2
k x%M where M = 2 n −1±d and both k and d are very small. Breaking down x into k more significant bits and n − k less significant bits, we write
n−k and then
where rot k (x) is a cyclic shift (rotation) of x by k bits (rot k (x) may exceed M ). Note that when d = 0 or d = 2, we get a well-known special case described above. When |d| < 2 n−k modular reduction can be computed by subtracting M if the number exceeds M , which allows one to compute the product dx Division with remainder circuits convert x into x/ρ, x%ρ without loss or gain of information. A simple example is given by ρ = 2 k , where the quotient and the remainder are simply the n − k high and the k low bits of x. Previously, we have also shown linear-sized remainder circuits for 2 n ± 1. In general, division can be performed by a series of subtractive modular reductions, whose ancillae accumulate the bits of the quotient. When x < 2 k ρ, the most significant bit is computed by a mod-2 k−1 ρ reduction, the next bit by a mod-2 k−2 ρ reduction, etc for a total of k + 1 reductions. The last reduction produces the remainder.
Cx%M using division with remainder
We propose the following computation of Cx%M .
so that a single subtractive mod-M reduction suffices.
Proof. Clearly, x = ρ⌊x/ρ⌋ + x%ρ. Then Cx = Cρ⌊x/ρ⌋ + C(x%ρ). We leave the latter term as is because C(x%ρ) ≤ C(ρ − 1) = C⌊M/C⌋ < M . To reduce the former term, note that M < Cρ < 2M ,
Since δ < C and ⌊x/ρ⌋ < Cx/M < C, we have δ⌊x/ρ⌋ ≤ C 2 which proves Formula 9. To construct reversible circuits using this result, use the circuits for division with remainder from Section 3 to represent x by the pair (⌊x/ρ⌋, x%ρ) without a loss or gain of information. This will require x0 y0 x0
(a) (b) Figure 9 : Circuits for (a) modular reduction with M = 12 (after optimization), and (b) computing f (x) = 3x for x < 16. In (b), gates in the dashed box clear ancillae.
⌈log 2 C⌉ subtractive mod-2 i ρ reductions, with the ⌈log 2 C⌉-bit remainder stored in ancilla (for C = 3, two mod-ρ reductions are performed). A challenging part is to implement multiplications by constants δ⌊x/ρ⌋ and Cx%ρ with reversible circuits, so that ancillae are cleared. This is illustrated in Section 3 for C = 2 n ± 1. After modular addition, ancillae can be cleared by computing (Cx%M )%C, which takes linear time for C = 2 n ± 1 as explained in Section 2. Figure 10 for 3x%35 where ρ = 12, δ = 1. We implement ⌈log 2 3⌉ = 2 subtractive mod-12 and mod-24 reductions by two successive %12 modules. Accordingly, the second-to-last and the last ancillae evaluate to 1 when 12 ≤ x and 24 ≤ x, respectively. After the first CNOT, the ancillae will be 1 for 12 ≤ x < 24 and 24 ≤ x. Therefore, the values of the lowest-placed ancillae are 0, 1 or 2 based on the value of x. Computing 3x (not modular) and adding 1 for 12 ≤ x < 24 or 2 for 24 ≤ x (Formula 8) implement 3x%35. To clear the ancillae used, one needs to implement %3 on two new ancillae (two highest-placed ancilla bits in Figure 10 ) and uses the bits to control two CNOT gates. Since 2%3 = 1, we can rewrite x = 2
Example 4.1 This approach is illustrated in
which can be implemented by three up-counters conditioned on even bits and three down-counters conditioned on odd bits. The %3 computation can be undone by applying %3 in reverse (indicated by the (%3) −1 block in Figure 10 ).
Values C = O(1) facilitate linear-time mod-ρ decomposition by subtractive reductions and also imply δ = O(1), x/ρ = O(1). Therefore the first multiplication can be performed through controlled additions of constants. Given our circuits for (not modular) multiplication by (2 k + 1) in Section 3, Formula 8 can be used with C = 5, 9, 17, 33, . . .. Circuits for multiplication by δ = C − M %C are available for δ of the form 2 k and 2 k + 1. Working with C > √ M directly can be difficult because many modular reductions may be required in Formula 8, and their ancillae must be cleared. It helps to postpone, until the end, clearing the ancilla that contain x%ρ, and use them to clear the ancillae for modular reductions. Another trick is to avoid unnecessary modular reductions by interpreting each multiplication %M . In particular, large C values can be replaced by M − C if the addition is replaced by subtraction. In this context, some large C values may also be convenient when δ = O(1) and ρ = 2 O(1) , and thus the second multiplication can be performed through controlled additions of constants.
To count the number of Toffoli and CNOT gates for x → (x/ρ, x%ρ), we use ⌈log 2 C⌉ subtractive mod-2 i ρ reductions and ⌈log 2 C⌉ ancillae. The reductions go from larger numbers to smaller numbers, ending with ρ. A 2 i ρ-reduction module operates on ⌈log 2 ρ⌉ + i bits. Hence, the number of Toffoli gates will be Σ
. To compute (2 k +1)x%M by division with remainder, one additionally uses a multiplicative module Cx (not modular) to compute C(x%ρ). Consequently, one Cuccaro adder and one %M module are employed to add ⌊x/ρ⌋ to the result and apply the mod-M reduction. To clear ancillae by computing (Cx%M )%C, two %C modules and O(1) gates are necessary. Additionally, ⌈log 2 C⌉ and ⌈log 2 M ⌉ + 1 ancillae are required for the first modular reductions and other blocks, respectively.
Circuit operators and decompositions
Given small circuits proposed earlier, we find additional C values for which small Cx%M circuits exist.
Figure 10: A circuit for 3x%35 in Example 4.1. Circuits for 3x and modular reduction with M = 12 = ⌈35/3⌉ are illustrated in Figure 9 . The two mod-12 reductions compute (x/12, x%12), the latter is unconditionally multiplied by 3 (the 3x box) and the result is added to the former (the adder box). The %3 module modifies two highest-placed ancilla bits, counting up (0,1,2) and down (0,2,1). It consists of three count-up (↑) and three count-down (↓) blocks -the first block of each kind is shown in detail. The two CNOTs that target the last two lines clear ancillae set by %12 modules. The (%3) −1 module clears ancillae set by the %3 module. Table 1 : Number of gates in circuit blocks. In this table, n 1 and n ′ 1 represent the number of non-zero bits in M and C and n = ⌈log 2 M ⌉, and n ′ = ⌈log 2 C⌉. T , C, and A are the number of Toffoli, CNOT and ancillae, respectively. More accurate gate count for division with remainder is discussed in text. Further constant-specific optimizations may be possible.
Gates and ancillae Circuit block
Formula
Multiplicative decompositions
We employ two circuit operators -inversion and negation -that convert a reversible circuit for Cx%M into a circuit for τ (C)x%M , where the function τ (·) characterizes the transform.
Inversion reverses the order of the gates and replaces each gate with its inverse (inverters, CNOT gates, swaps and Toffoli gates are self-inverse). Circuit size is preserved. The generated circuit computes τ (C)x%M , where τ (C) is the mod-M inverse of C, i.e., τ (C)C%M = 1. When gcd(C, M ) = 1, modular inverse exists, is unique and can be computed by the extended Euclidean algorithm [6] . When applied to small-power-of-two circuits (C = 2 k ), inversion produces negative-power-of-two circuits (C = 2 −k %M ) and generates new convenient C values unless M = 2 n − 1.
, where ′ performs bitwise negation. Therefore, the circuit operators adds an inverter on every wire and performs one modular addition/subtraction with d, either before or after modular multiplication by C. Circuit size increases. When M is odd, so are M − 2 k , producing new convenient values. Combining negation with inversion may produce additional convenient values. Given that the two transforms commute, applying inversion and negation to small powers of two produces at most 4⌈log 2 M ⌉ convenient values (including small powers of two), which can be a lot smaller than ϕ(M ).
Modular products. Composing compact circuits for convenient constants in series, one can often obtain additional convenient constants C = C 1 C 2 %M . However, when multiplying small positive and negative powers of two, no new values can be obtained. Multiplying positive powers of two (or negative powers of two) does not help when M = 2 n − 1, e.g., for M = 15. Products with negated powers of two do not give new convenient values when M = 2 n + 1, e.g., for M = 33. In general, since (Z/M Z) × is a product of two cyclic groups, it suffices to build compact reversible circuits for its two generators and compose them in various ways to produce reversible circuits for all other group elements. This strategy is impractical because (i) the composed circuits will often be larger than necessary, (ii) it is not clear how to identify a pair of generators without knowing p and q.
Additive decompositions and a shortest-path formalism
For large C, the multiplicative operators described above may be insufficient. To also consider additive operators, we introduce a zero-initialized ancilla register which is cleared after Cx%M is computed in the primary register. A value is copied into this register from the primary register using a parallel chain of CNOT gates. Multiplicative operators can be applied to individual registers, and additive operators replace the contents of one of the register with the modular sum or difference of two values (note that these operations are reversible). The operators we consider are listed in Table 2 , along with their costs, measured as the number of T gates (which dominate quantum cost). We use the following circuit descriptions.
• Every step/operator takes exactly two characters For example, the literal c2 represents a bit-wise CNOT operation with Register 2 as its target. It is meant to copy the contents of Register 1 to a zero-initialized Register 2 (or clear Register 2, when it duplicates Register 1). The same can be accomplished using the modular addition operator +2 (the modular subtraction operator -2, respectively), but at a higher cost. The circuits c2c2, +2-2 and r1t1 do nothing, and the circuit c2c1c2 swaps the contents of the two registers. As a more complex example, to compute (x, 0) → (3x%65, 0) without the multiplicative 3x%M operator we introduced earlier, one might use the circuit c2+1+1+2+2d2+2d2d2c2. It uses 154 T gates, and is smaller than our generic 3x%M circuit. However, such compact circuits need to be discovered for each M . We reduce this task to finding a shortest path in a graph where the vertices represent possible two-register states relative to the initial state (x, 0). For 0 ≤ a, b < M , vertex (a, b) represents (ax, bx). The source vertex is (1, 0). The weighted edges represent operators from Table 2 with respective costs. When traversing this graph, vertices and edges can be generated on the fly. Theorem 5.1 For an n-bit value M and any 0 < C < M coprime with M , the worst-case gate count of optimal two-register circuits mapping (x, 0) → (Cx%M, x) and
Proof. Once the statement for (x, 0) → (Cx%M, x) is proven, the statement for (x, 0) → (Cx%M, 0) follows by Bennett's construction for clearing ancillae (Section 1.2) which produces a circuit of the second kind by composing two circuits of the first kind. Consider the binary decomposition of x = Σ i 2 i b i and traverse it from the most significant bit. Before considering a new bit, apply the d2 operator, except when the second register holds value 0. Upon seeing bit 1, apply the +2 operator. For example, x = 13 = 0b1101 leads to operators +2d2+2d2d2+2, which produce
To swap the register values, one can apply c2c1c2 or c1c2c1, but this may be unnecessary within Bennett's construction. Each operator uses O(n) gates. The circuits use n − 1 d2 operators and up to n +2 operators, thus O(n 2 ) gates total. The upper bound on the T -cost of (x, 0) → (Cx%M, x) circuits implied by our proof is n(5n − 7) + 2n 2 = 8n 2 − 7n, with the average-case estimate n(5n − 7) + n 2 = 7n 2 − 7n because half of the bits are 0 on average. These bounds can be improved by considering the canonical signed digit (CSD) decomposition, which uses not only additions but also subtractions, and ensures that at least one of each two neighboring bits is a 0. Thus 7n 2 + O(n) becomes a worst-case bound, and the average case improves to 6 2 3 n 2 + O(n). For (x, 0) → (Cx%M, 0) circuits, doubling the above estimates due to the use of Bennett's construction produces 14n 2 + O(n) in the worst case and 13 1 3 n 2 + O(n) on average. The smallest-cost circuits we report in Section 6.2 improve upon these bounds by factors 2-4, but not asymptotically. We also note that our shortest-path construction produces O(n)-sized circuits in some basic cases, such as C = 2, 3, 4, ... In contrast, resorting to Bennett's construction with binary or CSD expansion involves the modular inverse of C and typically leads to n 2 -sized circuits.
Examples of modular multiplication
Here we study M = pq for small prime p and q. One can argue that large classes of such M values should be excluded from consideration in the context of Shor's algorithm because they offer no value for number-factoring. For example, numbers of the form M = b 2π −1 can be factorized quickly by computing gcd(M, b π ± 1), and this class includes the number 15, commonly used in experimental demonstrations of Shor's algorithm. The same argument applies to some numbers that satisfy mM = b 2π − 1, where m has very few factors. This class includes the number 21, considered as the next example after 15 for quantum number-factoring. Indeed, 3 · 21 = 8 2 − 1 = 7 · 9 leads to gcd(7, 21) = 7. Nevertheless, we consider these cases for completeness and use them to illustrate general circuit constructions. Table 3 describes small circuits for Cx%M functions with coprime C and M with 6 bits or less. Each circuit is described by a parenthesized triplet consisting of the Toffoli gate count, the CNOT gate count and the number of ancillae. An expression indicating circuit structure follows after a colon. C values where gcd(C, M ) > 1 and C ≥ M are marked by × and −, respectively. For each M , the last row reports circuits constructed by binary expansion of x (Section 1.2) with the smallest gate counts among different C values.
Very small moduli
In each case, we report the best circuit structure we could find. For example, 3x%35 can be implemented as −2 5 . At most one inverter may be used on each circuit line. Of the techniques we presented, the most economical one is the use of 2x%M circuits, their repetitions, inverses and negations. In some cases (M = 21, 39, 55), it suffices for all C values. When additional circuit constructions are needed, we start with circuits for 3x%M or 5x%M , except when the modulus is divisible by 3 or 5. By means of circuit operators, these additional primitive circuits generate a large number of composite circuits, especially that they can be composed with powers of two, etc. In Table 3 , the first grayed cell of a column represents a primitive circuit that is not a power of two. The smallest circuits constructed by binary expansion of x for each M (shown in the bottom row) are typically larger than the largest circuits proposed. The data suggest that divisibility of M by 3 can lead to relatively large Cx%M circuits compared to other moduli M with the same number of bits. This is because C = 3 is the smallest C value unrelated to powers of two, for which we can build compact multiplication circuits. Among M values divisible by 3, circuits for M = 39 tend to be smaller because all C values coprime with M can be obtained through positive and negative powers of two, and their inverses.
Larger moduli
We now illustrate the use of our shortest-path reduction to find two-register mod-mult circuits. Our C++ implementation of Dijkstra's algorithm operates on an M × M vertex array, but generates edges on the fly. In one pass, it finds all single-source shortest paths starting at (1, 0) and produces Cx%M circuits for all C coprime with M (this is convenient, but not necessary when working with Shor's algorithm). The modular multiplication circuits with 7-14 bits produced by our techniques are available online at http://www.eecs.umich.edu/~imarkov/MME/. In Table 4 , we show circuits for Cx%65 with all coprime C. Figure 11 shows circuit-cost distributions (for T gate counts) for several M values in terms of cumulative distribution functions (CDF). Maximum and average costs for all 6-14 bit semiprime M values not divisible by 2 and 3 are reported in Table 5 . On a fast Linux workstation (3.0GHz Intel CPU with 8GB RAM), processing one 14-bit M value takes one to six minutes, and less than three days for all 14-bit M values.
8 Many 15-bit values require over 8GB memory, and runtime increases four-to eight-fold. Our implementation of Dijkstra's algorithm based on an explicit 2 n × 2 n vertex array does not scale beyond 15-bit M . However, the shortest-path formalism can be applied in different ways to find optimal circuits for larger M values, and also to perform heuristic optimization for much larger M values.
The sizes of n-bit modular multiplication circuits in Table 5 fit very well (R 2 > 0.999) to quadratic functions, producing the worst-case bound 6n 2 + O(n) and the average-case estimate 3.3n 2 + O(n). Thus, our circuits are 4 times smaller on average than CSD-based circuits produced using Bennett's construction discussed after Theorem 5.1.
Circuits for modular exponentiation
When implementing f (y) = b y %M , one deals with modular multiplications C k x%M, C k = b 2 k %M conditional on bits y k , as outlined in Section 1.
Reordering and factoring of modular multiplications
The order of conditional modular multiplications does not affect the result, and this becomes useful after some of them are factored. As we have shown earlier, sometimes C k x%M is easiest to implement as Table 3 : The structure and gate count of the proposed circuits for Cx%M with coprime C and M with 6 bits or less. Each circuit is described by a parenthesized triplet consisting of the Toffoli gate count, the CNOT gate count and the number of ancillae. DR indicates direct use of division with remainder (Theorem 4.1). Circuits for M = 15 are illustrated in Figure 2 . Gray cells indicate circuits that do not only use powers of two, negative powers of two, or their inverses. All ancillae are cleared after the computation. Additional optimizations are possible. For each M , the last row reports circuits, constructed by binary expansion of x, with the smallest gate counts among different C values. Table 4 : Two-register circuits for Cx%65. Cost is reported as the number of T gates. Circuit  2  28  d1  33  28  h1  3  154  c2+2+2+1+1d1+1d1d1c2  34  140  c2+2+2+1+1d1+1d1-2c2  4  56  d1d1  36  126  c2+2+2+1+2+1+1d1-2c2  6  140  c2h1h1-1-2-1-1-2-1c2  37  140  c2+1d1+2+1+2d2+1+2  7  140  c2+2h1+1+2+1h2+2+2  38  126  c2+1+2+1+1+2+1d1-2c2  8  84  d1d1d1  41  126  c2+1+1+2+2+1+1+2+1+2  9  140  c2+2+1h1+1+2+1+2+2+2  42  140  c2+1+1+2h1+1+1+2+2+2  11  140  c2+1+2+1+1+2+1d1d1c2  43  168 Table 3 (right). The correspondence between blocks in the two circuits can be established by matching control lines.
In this case, we can factor out −x%M conditional on y k . Any number of conditional −x%M operations can be consolidated into one such operation, conditional on the XOR of relevant control bits. This XOR value can be computed using a chain of CNOTs without ancillae, and uncomputed by the same chain after use. Reordering also allows one to move a small set of the most difficult multiplications to the front of the circuit, where the initial value is 1 and generic multiplication circuits can be avoided, as shown below.
k-bit look-up tables
A (k, m) look-up table (LUT) takes k read-only input bits and m > log 2 k zero-initialized ancillae. For each 2 k input combination, a LUT produces a pre-determined m-bit value. For example, a (2,4)-LUT may be defined by values (1,2,4,8) or (1,4,1,4) .
Look-up tables arise in implementations of Shor's algorithm (with initialized bits) where the first conditional modular multiplication is applied to the constant 1, and can therefore produce only two possible values -1 and the multiplier. Such a circuit can be implemented with at most m CNOT gates (m/2 on average). When two conditional multiplications are considered, four output combinations are possible. For every bit of the result, this defines a two-input Boolean function, which can be implemented with at most two reversible gates (possibly with negative controls) and no ancillae. All these gates operate g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g1, g2 g3 · · · g6 g7, g8 g9, g10g11, g12 g13, g14 g in parallel, although most existing technologies are not able to use such amount of parallelism. When two output bits implement the same function using Toffoli gates, one of them can be replaced by a CNOT that copies the computed value.
Reconsider conditional mod-mults by 16, 36 and 31 required in modular exponentiation for b = 2, M = 55 as shown in Figure 12 . Depending on the three input bits, the output may be 1, 16, 36, 31, 26 = 16 · 36%55, 1 = 16 · 31%55, 16 = 31 · 36%55, 36 = 16 · 31 · 36%55. Figure 13a illustrates a simple realization based on the following Boolean expressions for output variables where {i, j} is used to denote m i ⊕ m j for minterms 9 m i and m j .
Since some Boolean functions with > 2 gates repeat, they can be computed once and then copied. Some Boolean functions can be used to compute other Boolean functions too. Following these optimizations, an improved circuit for circuit in Figure 13 (a) is shown in Figure 13 (b) which is smaller than three conditional modular multiplications by 16, 36 and 31 as reported in Table 3 . Figure 14 illustrates the LUT-implementation of modular exponentiation for M = 21 with different coprime base values. Predictably, the cases with b 2 %M = 1 result in the most compact circuits.
10
Systematic synthesis. We now construct circuits to implement each output of a reversible LUT. Viewing each output as a Boolean function of read-only inputs, one can write the Shannon decomposition F = xF x ⊕ x ′ F x ′ where F x and F x ′ are positive and negative cofactors of F . This equation can be written as Formula 10, the positive Davio decomposition, or as Formula 11, the negative Davio decomposition. Table 6 shows that each 2-input Boolean function can be implemented by a reversible circuit with read-only inputs using at most three gates, of which at most one is a Toffoli gate. To implement a threeinput function, cofactor it with respect to one of its inputs. Implement the first cofactor without controls and then implement a controlled version of the XOR of the two cofactors. This approach leads to at most one 4-input Toffoli gate and at most 6 smaller gates. Circuit costs can be minimized by choosing the cofactoring variable (pivot) so as to minimize the total costs of cofactors based on Table 6 . Working with four-input functions, one can implement four modular-multiplication modules by one (4,n)-LUT by implementing cofactors as three-input functions. However, using two separate cofactoring steps may require five-input Toffoli gates. An alternative approach is to consider the four double-cofactors (each a two-input function) with respect to two variables as shown in Formula 12, and introduce an ancilla to enable the fourth cofactor. This ancilla will be set and unset by a Toffoli gate and will enable the 9 For a Boolean function of n variables, a minterm is a product term of all n variables (either complemented or uncomplemented). Each minterm can be labeled by an integer by interpreting negated literals as 0 bits in the label. For example, expanding minterms for y 0 leads to Table 6 : Circuits for all 16 two-input functions. N , C, and T are used for NOT, CNOT and Toffoli gates. Variables a and b are inputs and z is the output. No ancillae are used.
Function
Minterms Circuit
cofactor using a single control. One of the following formulas can be selected based on the costs of double-cofactors obtained from Table 6 .
Of the four cofactors in each formula, one can be implemented without control, two with a single control without ancillae, and one with a single control with an ancilla. This approach leads to at most 12 gates, of which at most three are four-input Toffoli gates. The depth and gate count of a (4, n)-LUT are O(n). Figure 15 illustrates the result of applying the systematic synthesis to 2 x %87. Selecting the cofactoring variable carefully, implementing the appropriate cofactor without control, and sharing cofactors among different functions can reduce the number of gates. Davio decompositions were used in [25] to synthesize a given reversible function. However, the technique in [25] implements the Davio decompositions by assuming that the factors have already been computed on dedicated ancillae. Therefore, the resulting circuits require numerous ancillae. The work in [25] does not clear these ancillae. Our (4,n)-LUT circuits use at most one ancilla, and we clear it. 
Control optimization using 2-to-2 multiplexors
A large fraction of quantum-gate costs can be attributed to controls (read-only bits) [4] , and this is particularly true for mod-exp circuits, where each C i x%M -block is enabled (controlled) by one bit of Register 1.
11 To avoid propagating these controls to each gate of the C i x%M -block, we observe that the binary 000...0 is a fixed point of every such block. Control can be implemented indirectly by conditionally swapping a constant zero into the register before the block and swapping the result out after the block (Figure 16 ). For n qubits, this technique requires an additional n-qubit zero-initialized swap register and 2n Fredkin (controlled-SWAP) gates. We merge pairs of adjacent Fredkin gates with controls from Register 1 and common target bits in Register 2. Indeed, Register 2 must be swapped with the swap register only when the two control bits carry mutually exclusive values. Therefore, we first apply a CNOT gate to the two controls from Register 1, then (optimized) Fredkin gates (for each qubit of Register 2) controlled by the target bit of the CNOT, and then we repeat the same CNOT gate to restore the modified control bit. This is illustrated in Figure 17 . Each Fredkin gate can be broken down into a single-controlled Toffoli surrounded by two CNOT gates. However, when one of the swapped inputs always carries a zero, the first CNOT gate can be removed. Given that C i x%M -blocks in the literature contain Θ(n 2 ) gates, our two optimizations bring substantial savings and simplify the structure of mod-exp circuits.
Ancillae sharing. Our proposed optimizations trade off the overhead of control logic for a number of additional ancillae. In addition to the control register (where the Hadamard gates are applied) and the results register (Figure 16 ), multiplexing requires a swap register of size n. This is separate from the ancillae required by our mod-mult circuits shown in Table 1 . Fortunately, many ancillae already used by the Cx%M circuits can be reused for multiplexing under some conditions. To this end, our multiplexing construction guarantees that either the results register or the swap register is holding all zeros. In the latter case, the swap register bits can clearly be used as zero-initialized ancillae in mod-mult circuits, as long as we restore them before the next multiplexing which we do (at least for x < C, as discussed for additive and multiplicative circuit blocks in Sections 2 and 3). In the former case, we need to make sure that when the Cx%M computation is performed with x = 0, the ancillae are restored to their (possibly non-zero) initial values. Consider the modular reduction in Section 2 with one comparator and one conditional subtraction where the comparison and conditional subtraction are performed on the value stored in ancillae. Consider the zero-initialized ancilla ς that carries the condition bit. For any value A = 0 in the ancillae, x = 0 < A and we have ς = 0. Hence, the conditional subtraction is not applied. Since the Cuccaro adder recovers the values in the second register (and changes the first register to the result), the possibly non-zero initial values in the ancillae will be recovered. Therefore, we need to add one ancilla to save n − 1 ancillae. The added ancilla will be cleared as before. Now consider the following individual blocks used in our circuits.
• −x%M . Our construction maps x = 0 into M .
• 2 k x%M for odd M > 2. This block contains one modular reduction but has a fixed point at x = 0. •
. . . • 2 i x%M for M = 2 k ± 1. Our construction contains several modular reductions and additions based on Cuccaro adder. x = 0 is a fixed point for Cuccaro adder. However, a non-zero value in the ancillae changes x = 0 after addition.
• Division with remainder circuits. Our construction includes a set of modular reductions followed by a circuit for (2 k + 1)x (not modular). Assigning x = 0 in Formula 6 reveals that the (2 k + 1)x circuit does not change x = 0 as far as the ancillae used for c i carry zero. Next, we use a set of Cuccaro-based modular reductions and additions. Overall, x = 0 is a fixed point for division with remainder circuits with zero-initialized ancillae.
This analysis indicates that for 2 k x%M for odd M > 2, ancillae can be shared. However, the −x%M block complicates the proposed sharing of ancillae with 2-to-2 multiplexors. Therefore, we factor out such blocks, aggregate them into one as described in Section 7.1, and implement one conditional −x%M as described in Section 3 directly. Without multiplexing, the swap register must hold all zeros and can thus hold the ancillae of the −x%M block. For other cases with 2 i x%M for M = 2 k ± 1 and divisionwith-remainder circuits, ancilla sharing cannot be applied and separate ancillae are needed for mod-mult and multiplexer modules.
Circuit structure for modular exponentiation
Overall structure. Summarizing the content of the above subsections, we propose mod-exp circuits consisting of three modules: (i) an initial LUT, (ii) an XOR-controlled negation, and (iii) remaining conditional modular multiplications. The first two modules will use a linear number of gates, while the bulk of the circuit will be in modular multiplications. To simplify the implementation of control, we use uncontrolled modular multiplications with multiplexors. The size of mod-mult circuits is moderated by factoring out negations and by implementing the most difficult multiplications in the LUT module. For an n-bit modulus M , our circuits use an n-qubit results register, a 2n-qubit control register, an n-qubit swap register and an n-qubit ancilla register for each modular multiplication. In addition to these 5n qubits, less than n ancillae may be needed for arithmetic operations (such as doubling and trippling), but these ancillae can also be shared with the swap register. In toto, 5n to 6n qubits are used.
Base selection. Recall that in Shor's algorithm not all values b > 1 for the base of exponentiation succeed -for some values the period 2π of f (x) = b
x %M is even and b π %M = −1 and for others it is either odd or b π %M = −1. It is proven [15] that the successful case occurs with probability at least 50% (also check Table 7 ). Therefore, common descriptions of Shor's algorithm make a random choice of 1 < b < M , invoke period-finding, and repeat the entire process for another b if the period is either odd or b π %M = −1. Obviously, when gcd(b, M ) > 1, there is no need for quantum circuits, but this occurs increasingly rarely for large semiprime M . Therefore, when illustrating mod-exp circuits in our work, we observe gcd(b, M ) = 1. The set of reasonable b values can be further restricted as follows. • For an integer k,
In particular, if b results in an even period, so does b 2k+1 for k > 0.
• p %M = 1. Since p * = p 0 p 1 /gcd(p 0 , p 1 ) is a multiple of both p 0 and p 1 , it must satisfy the latter equation. Therefore, p * = pm for some integer m > 0 (or else p * %p < p would satisfy the equation, since we can factor p out). If p 0 and p 1 are odd, then so is p * , and thus p cannot be even.
,2k+1) = −1 which proves the third case.
Theorem 7.1 suggests using odd powers of primes for b. 12 Straightforward computational experiments show that small primes have much greater probability of success than 50%. Assuming that success for different primes is not strongly correlated, trying only b = 2, b = 3 and b = 5 can be expected to work in a majority of the cases. To illustrate this, Table 7 Adding primes < 43 as a base for the first set and < 61 for the second and third sets is sufficient to ensure that a useful period can be observed in all cases. The last row (#Total) shows the total number of M for each n. In addition to the results reported in Table 7 , we discovered that choosing larger primes than b = 2, 3, 5 as bases leads to more failed M values. Therefore, the smallest bases are the most promising and can be tried first.
Selecting the number of controls. If we find that b k %M = b m %M for some k = m, that allows us to upper-bound the period and then find it by binary search. When factoring large integers M using Shor's algorithm, we can pursue different strategies for selecting the number of control qubits. Most of the literature shows that selecting twice as many qubits as bits in M is sufficient. However, fewer bits suffice in many cases. Given that physical implementations of Shor's algorithm are typically limited by the number of qubits, a more practical strategy is to start with a small number of qubits, perform number-factoring, and increase the number of qubits in case of failure. This adds at most a poly-time factor to runtime complexity, but also reduces circuit sizes. Assuming that modular exponentiation circuits generally have size on the order of n 3 , the difference in sizes n 3 − (n − 1) 3 is on the order of n 2 , which can be significant.
Examples of modular exponentiation
Our first series of experiments illustrates the proposed construction of mod-exp circuits but uses only multiplicative circuit decompositions for individual mod-mult blocks. Multiplicative decompositions do 6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  2  100 100 100 100 95  90  92  85  85  87  90  89  91  91  91  89  89  90  90  90  90  3  100 100  67  89 100 91  92  89  89  86  86  84  86  87  88  87  87  87  88  87  86  5  -100  83  89  94  90  100  90  83  81  80  84  87  87  90  89  88  87  87  86  88  2|3|5 100 100 100 100 100 96. not require an ancilla register used by two-register circuits, but in some cases generate larger circuits, and for larger M values may not be able to generate some Cx%M circuits at all. Therefore, this approach is more relevant for small M values and an environment with a very limited number of qubits. Gate counts and the structure of the proposed circuits for b x %M for M with 9 functional qubits or less are reported in Table 8 . In this table, the notation x(y) represents modular multiplication by x controlled on the line y. Each circuit is described by a parenthesized triplet consisting of the Toffoli gate count, the CNOT gate count and the number of ancillae. For each M , we initially selected b = 2 in modular exponentiation. If this triggered a restart in Shor's algorithm, we tried b = 3, and if that failed we selected b = 5. For each M value, we calculated all parameters C i = b 2 i %M and found the least costly multiplicative decomposition for each C i according to Table 1 . The four costliest modular multiplications were selected for each M value, and a (4, n)-LUT was synthesized for these multiplications using our systematic synthesis procedure. The remaining controlled modular multiplications were implemented directly and connected through 2-to-2 multiplexers. The number of Toffoli and CNOT gates required for each mod-mult sub-circuit is reported in Table 1 . Each controlled SWAP in a 2-to-2 multiplexer can be implemented by one Toffoli and one CNOT gates (Figure 17 ). For the first and last controlled SWAP gates, n = ⌈log 2 M ⌉ Toffoli and the same number of CNOT gates are applied. For intermediate controlled SWAPs, two additional CNOT gates are essential. Gate counts for Cx%M modules used in circuits of Table 8 are computed by adding up gate counts from Table 1 . To simplify circuits for 4-LUTs, we applied the rule-based optimization method in [2] which optimizes sub-circuits with common-target gates and uses both negative and positive control Toffoli gates during the optimization. For each M in Table 8 , another b value may admit a smaller circuit, but finding the best b (for a given large M ) that is useful in number-factoring M is, in general, no easier than number-factoring.
Our further experiments focused on scalable minimization of gate counts, but were allowed to use an additional n-qubit ancilla register to facilitate two-register mod-mult circuits. Figure 18 shows the distributions of mod-exp circuit sizes for n = 7..14. Each line represents the cumulative density function for T -cost of mod-exp circuits constructed for all n-bit semiprime M not divisible by 2 or 3. We note that for a given n the median cost is about 2/3 of the maximal cost, but the smallest cost is only a fraction of the median cost. Table 9 reports min, max and average costs numerically, as well as the M values for which extreme circuit costs were observed. The data for average and max costs are amendable to polynomial extrapolation (R 2 > 0.999), allowing us to estimate achievable circuit costs for much greater values of n without necessarily having practical synthesis algorithms. However, the costs of smallestseen circuits are too erratic for reliable extrapolation. Notably, our experiments optimize the number of control qubits, typically assumed to be 2n. For each M , we use the smallest number that does not lead to failures in Shor's algorithm and report it in Table 9 , along with the period found by Shor's algorithm.
13
Comparing to mod-exp circuits in [20] that use 100n ancillae, our circuits use only 5n to 6n ancillae 13 For each semiprime M , there are two non-trivial b values, such that b 2 %M = 1. While these bases lead to the most compact mod-exp circuits, finding them is as hard as number-factoring. To this end, the data Table 9 suggest bases b = 2, 3, 5 sometimes lead to unusually small circuits and short periods. and are several orders of magnitude smaller in terms of gate counts. Circuit depth seems comparable for n = 14. However, considering circuit depth as a measure of circuit speed assumes that any number of gates can be implemented in parallel, which does not hold for many existing physical implementations. In an environment with a limited supply of qubits and limited parallelism, our circuits appear far superior to those proposed earlier. Whether or not many gates can be applied in parallel, larger circuits may require heavier quantum error correction, and this trend favors circuits with fewer gates.
Comparison with prior art
Prior work on circuits for modular multiplication and modular exponentiation typically describes circuit sizes by a closed-form expression in terms of the number of input qubits. Those circuits typically take on the order of n 2 gates for modular multiplication and n 3 for modular exponentiation. 14 The best cases almost always exhibit the same asymptotic growth. In contrast, our circuits for modular multiplication by 2, 3 and 5 (as well as their inverses) require only a linear number of gates. In the more general case, our optimization is algorithmic in nature, therefore a closed-form expression cannot be given a priori and comparisons require software implementations of our proposed algorithms. To compare the asymptotic number of gates in the proposed mod-mult and mod-exp circuits, we use the trend lines for maximum and average gate counts.
Modular multiplication
In [5] , circuits for n-qubit modular multiplication uses n conditional mod M additions. The addition mod M is constructed by a multiplexed adder and a comparison operator where the former is based on multiplexed full and half adders. Considering one enable bit in [ 
Following [5, Formula 6.4] leads to Formula 14 for the average gate count in mod-mult. Similar to the worst case, 2n + 1 ancillae are used and cleared at the end of computation. 14 QFT-based circuits exhibit slower asymptotic growth, but are viewed impractical for < 1000 qubits or less. 15 Following [5] , [c 0 , c 1 , c 2 , c 3 ] indicates a circuit with c 0 NOT, c 1 CNOT, c 2 C 2 NOT, and c 3 C 3 NOT gates. x %M for M with 9 bits or less. The notation x(y) represents modular multiplication by x controlled on the line y. Each circuit is described by a parenthesized triplet consisting of the T gate count, the C gate count and the number of ancillae. Circuits for M = 15 and M = 21 are illustrated in Figure 3 and Figure 14, Table 9 : T -Costs for modular exponentiation circuits for n-bit M values not divisible by 2 and 3. All M values divisible by 5 were factored with b = 2 or b = 3. Trend lines were extrapolated using b = 2 and n = 9..14. Parameters l and π represent the number of controls and the period, respectively. Max and min l may not correspond to max-cost and min-cost circuits for a given n. Numbers in [ ] are C in LUT.
In [20] , depth-optimized circuits for modular exponentiation were constructed by parallelizing modular multiplications and using depth-optimized adders. With arbitrary-distance interaction between qubits, the authors reduced the asymptotic depth of modular exponentiation to O(n log 2 n). However, their circuits need ∼ 100n ancillae, use a large number of gates, and assume unbounded gate parallelism, which can make them impractical with current technologies. For n = 128, the latency (circuit depth) of the best technique in [20, Algorithm E, Table II ] is 1.96 × 10 4 CNOT, and 1.71 × 10 5 Toffoli gates with 12657 ancillae. For n = 128 our mod-exp circuits need 1.1 × 10 7 CNOT, and 7.0 × 10 6 Toffoli gates with 641 ancillae. If [20, algorithm G, Table II ] with 660 ancillae is used for comparison, the latency is 2.48 × 10 5 C, and 1.50 × 10 7 T gates. Even though our circuits are not optimized for depth, the actual number of gates seems comparable to the depth of depth-optimized circuits in [20] .
Conclusions and future research
In this paper, we proposed linear-size circuits for several special cases in modular multiplication and used them to develop a shortest-path formalism for finding compact generic mod-mult circuits. Our results can be viewed as the first illustration of automated logic synthesis and optimization for modular multiplication circuits with superior results compared to mathematical circuit constructions. Our circuits are also the first not to require Bennett's technique, and this produces significant savings. The above results are directly applicable to modular exponentiation circuits, for which we propose several additional improvements.
Another first in our research is the use of register-transfer level (RTL) primitives to optimize reversible circuits, where previous techniques for reversible logic optimization operate at the bit level [17] . This higher-level perspective facilitates much greater scalability than for previous algorithms. Additionally, the RTL primitives we proposed in Table 2 are good candidates for direct implementations in terms of specific quantum technologies. Such implementations may be faster and less error-prone than the decompositions into elementary gates that we have shown. They can also support a higher level of programming of quantum computers, where sequences of operators demonstrated in Tables 4 and 5 can be issued directly to the quantum computer without intermediate levels of software translation.
Despite concrete evidence of smaller circuits for mod-exp, our research leaves a number of open challenges. In particular, the algorithms for synthesizing mod-exp circuits that we have implemented do not currently scale beyond 15-bit M values. More scalable techniques must be developed, perhaps, by giving up the optimality of results. Departing from register-level structure of our current mod-mult circuits, bit-level local optimization [17] may reduce gate counts. Further reductions may be achievable by leaving the Boolean domain [14] .
