Optimal synthesis of reversible functions is a non-trivial problem. One of the major limiting factors in computing such circuits is the sheer number of reversible functions. Even restricting synthesis to 4-bit reversible functions results in a complexity explosion (16! ≈ 2^{44} functions). The output of such a search alone, counting only the space required to list Toffoli gates for every function, would require over 100 terabytes of storage.
INTRODUCTION
To the best of our knowledge, at present, physically reversible technologies are found only in the quantum domain [10]. However, "quantum" unites several technological approaches to information processing, including ion traps, optics, superconducting, spin-based and cavity-based technologies [10]. Of those, trapped ions [5] and liquid state NMR (Nuclear Magnetic Resonance) [9] are two of the most developed quantum technologies targeted for computation in the circuit model (as opposed to communication or adiabatic computing). These technologies allow computations over a set of 8 qubits and 12 qubits, respectively.
Reversible circuits are an important class of computations that need to be performed efficiently for the purpose of efficient quantum computation. Multiple quantum algorithms contain arithmetic units such as adders, multipliers, exponentiation, comparators, quantum register shifts and permutations, that are best viewed as reversible circuits. Moreover, reversible circuits are indispensable in quantum error correction [10]. Often, the efficiency of the reversible implementation is the bottleneck of a quantum algorithm (e.g., integer factoring and discrete logarithm [15]) or even a class of quantum circuits (e.g., stabilizer circuits [1]).
In this paper, we present an algorithm that finds optimal circuit implementations for 4-bit reversible functions. The algorithm has a number of potential uses and implications.
One major implication of this work is that it will help physicists with experimental design, since fore-knowledge of the optimal circuit implementation aids in the control over quantum mechanical systems. The control of quantum mechanical systems is very difficult, and as a result experimentalists are always looking for the best possible implementation. Having an optimal implementation helps to improve experiments or show that more control over a physical system needs to be established before a certain experiment could be performed.
A second important contribution is the efficiency of our implementation: about 0.01 seconds per synthesis of an optimal 4-bit reversible circuit. The algorithm could easily be integrated into peephole optimization, such as the one presented in [12].
Furthermore, our implementation allows us to propose a subset of optimal implementations that may be used to test heuristic synthesis algorithms. Currently, similar tests are performed by comparison to optimal 3-bit implementations. The best heuristic solutions have very small overhead on that test, leaving little room to measure further improvement. As such, it would help to replace this test with a more difficult one that allows more room for improvement.
Finally, due to the effectiveness of our approach, we are able to report new optimal implementations for small benchmark functions, approximate L(4), the number of reversible gates required to implement a reversible 4-bit function, approximate the average number of gates required to implement a 4-bit permutation, and show the distribution of the number of permutations that may be implemented with 0..9 gates.
PRELIMINARIES
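In this paper, we consider circuits with NOT, CNOT, Toffoli (TOF), and Toffoli-4 (TOF4) gates defined as follows:
NOT(a): a ↦ a ⊕ 1;
CNOT(a, b): a, b ↦ a, b ⊕ a;
TOF(a, b, c): a, b, c ↦ a, b, c ⊕ ab;
TOF4(a, b, c, d): a, b, c, d ↦ a, b, c, d ⊕ abc;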
where ⊕ denotes an EXOR operation and concatenation is Boolean AND; see Figure 1 for illustration. These gates are used widely in quantum circuit construction, and have been demonstrated experimentally in multiple quantum information processing proposals [10]. In particular, CNOT is a very popular gate among experimentalists, frequently used to demonstrate control over a multiple-qubit quantum mechanical system. Since quantum circuits describe time evolution of a quantum mechanical system where individual "wires" represent physical instances, and time propagates from left to right, this imposes restrictions on the circuit topology. In particular, quantum and reversible circuits are strings of gates. As a result, feed-back (time wrap) is not allowed and there may be no fan-out (mass/energy conservation).
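As an illustration of how such gate strings compute permutations, the following minimal Python sketch models each gate as a permutation of the 16 input patterns. It assumes the standard definitions (the target bit is flipped exactly when all control bits are 1) and an arbitrary convention that wires a, b, c, d are bit positions 0..3; the function names are hypothetical and not taken from the authors' implementation.

```python
from functools import reduce

N_BITS = 4  # wires a, b, c, d mapped to bit positions 0..3 (assumed convention)


def gate(target, controls=()):
    """Permutation of 0..15 computed by one gate.

    No controls gives NOT, one gives CNOT, two give TOF, three give TOF4:
    the target bit is flipped exactly when all control bits are 1.
    """
    mask = sum(1 << c for c in controls)
    return tuple(x ^ (1 << target) if x & mask == mask else x
                 for x in range(1 << N_BITS))


def compose(first, second):
    """Permutation obtained by applying `first`, then `second`."""
    return tuple(second[x] for x in first)


def circuit(gates):
    """Permutation computed by a left-to-right string of gates."""
    identity = tuple(range(1 << N_BITS))
    return reduce(compose, gates, identity)


# NOT(a) followed by CNOT(a, b): input 0000 becomes a = 1, b = 1.
example = circuit([gate(0), gate(1, (0,))])
```

Because a circuit is just a composition of such permutations, two circuits implement the same function exactly when their tuples are equal, which is what makes hash-table lookup of functions straightforward.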
In this paper, we are concerned with searching for circuits requiring a minimal number of gates. Other circuit cost metrics may be considered, as discussed in Section 5.
In related work, there have been a few attempts to synthesize optimal reversible circuits with more than three inputs. Große et al. [3] employ a SAT-based technique to synthesize provably optimal circuits for some small parameters. However, their implementation quickly runs out of resources. The longest optimal circuit they report contains 11 gates and took 21,897.3 seconds to synthesize; the implementation we report in this paper synthesizes the same function in 0.000106 seconds, see Table 4. Prasad et al. [12] used breadth-first search to synthesize 26,000,000 optimal 4-bit reversible circuits with up to 6 gates in 152 seconds. We extend this search to find 117,798,040,190 optimal circuits with up to 9 gates in 10,549 seconds. This is over 65 times faster per circuit and about 4,500 times more circuits than reported in [12]. Yang et al. [16] considered short optimal reversible 4-bit circuits composed of NOT, CNOT, and Peres gates.
ALGORITHM
We outline our algorithm and refer the interested reader to the complete description available at http://arxiv.org/abs/1003.1914.
There are N = (2^n)! reversible n-variable functions. The most obvious approach to the synthesis of all optimal implementations is to compute all optimal circuits and store them for later lookup. However, this is extremely inefficient, because such an approach requires Ω(N) space and, as a result, at least Ω(N) time. These space and time estimates are lower bounds, because, for instance, storing an optimal circuit requires more than a constant number of bits, but for simplicity, let us assume these figures are exact. Although we consider both the space and the time figures impractical, we use this simple idea as our starting point.
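The size of the search space can be checked with a few lines of Python. This is only a worked arithmetic example; the function name is hypothetical.

```python
import math


def num_reversible_functions(n):
    """Number of reversible n-variable functions: permutations of 2**n patterns."""
    return math.factorial(2 ** n)


n3 = num_reversible_functions(3)   # 8!
n4 = num_reversible_functions(4)   # 16!
print(n3, n4, math.log2(n4))       # log2(16!) lies between 44 and 45
```

The jump from the 3-bit case to the 4-bit case is what rules out the plain store-everything approach.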
We first improve the space requirement by observing that if one synthesized all halves of all optimal circuits, then it is possible to search through this set to find both halves of any optimal circuit. It can be shown that the space requirement for storing halves has a lower bound of Ω(√N). However, searching for two halves potentially requires a runtime on the order of the square of the search space, Ω((√N)^2) = Ω(N), a figure for runtime that we deemed inefficient. Our second improvement is thus to use a hash table to store the optimal halves. This reduces the runtime to soft Ω(√N). While this lower bound does not necessarily imply that the actual complexity is lower than O(N), this turns out to be the case, because the set of optimal halves is indeed much smaller than the set of all optimal circuits (an analytic estimate for the relative size of the former set is hard to obtain, though). Cumulatively, these two improvements reduce the Ω(N) space and Ω(N) time requirements to O(#halves(N)) space and soft O(#halves(N)) time. These reductions almost suffice to make the search possible using modern computers.
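The two-halves idea with a hash table can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on 3 bits (NOT, CNOT, TOF) so that it runs in seconds, uses a Python dict as the hash table, and all names are hypothetical. It relies on the fact that every prefix of an optimal circuit is itself optimal.

```python
from itertools import combinations

N_BITS = 3                      # 3 bits keep the sketch fast; the paper targets 4
SIZE = 1 << N_BITS
IDENTITY = tuple(range(SIZE))


def all_gates():
    """All NOT, CNOT and TOF gates on N_BITS wires, as permutations."""
    gates = []
    for t in range(N_BITS):
        others = [b for b in range(N_BITS) if b != t]
        for r in range(len(others) + 1):
            for controls in combinations(others, r):
                mask = sum(1 << c for c in controls)
                gates.append(tuple(x ^ (1 << t) if x & mask == mask else x
                                   for x in range(SIZE)))
    return gates


def compose(first, second):
    """Apply `first`, then `second`."""
    return tuple(second[x] for x in first)


def inverse(p):
    inv = [0] * SIZE
    for x, y in enumerate(p):
        inv[y] = x
    return tuple(inv)


def build_halves(k, gates):
    """Breadth-first search: hash table {function: optimal size} for sizes 0..k."""
    table = {IDENTITY: 0}
    by_size = [[IDENTITY]]
    for size in range(1, k + 1):
        layer = []
        for f in by_size[-1]:
            for g in gates:
                h = compose(f, g)
                if h not in table:
                    table[h] = size
                    layer.append(h)
        by_size.append(layer)
    return table, by_size


def optimal_size(f, table, by_size, k):
    """Optimal gate count of f if it is at most 2k, else None."""
    if f in table:
        return table[f]
    for i in range(1, k + 1):
        for g in by_size[i]:
            # If f = rest followed by g, then rest = f followed by g^{-1}.
            rest = compose(f, inverse(g))
            if table.get(rest) == k:      # one hash lookup per candidate half
                return k + i
    return None
```

Each candidate second half costs one composition and one hash lookup, so the work per query is proportional to the number of stored halves scanned rather than to the number of pairs of halves.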
Our last step, apart from careful coding, that made the search possible is the reduction of the space requirement (with a consequent improvement in runtime) by a constant factor of almost 48, obtained by exploiting the following two features. First, simultaneous input/output relabeling, of which there are at most 24 (= 4!) different ones, does not change the optimality of a circuit. Second, if an optimal circuit is found for a function f, an optimal circuit for the inverse function, f^{-1}, can be obtained by reversing the optimal circuit for f. This allows us to additionally "pack" up to twice as many functions into one circuit. The cumulative improvement resulting from these two observations is by a factor of almost 2 × 24 = 48.
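One way to picture this reduction is to store a single canonical representative per equivalence class. The sketch below is an assumption about how such a representative could be chosen (the lexicographically smallest member); the paper's actual encoding may differ, and all names are hypothetical.

```python
from itertools import permutations


def bit_permutation(sigma):
    """Permutation of 0..15 that moves bit i of the pattern to bit sigma[i]."""
    return tuple(sum(((x >> i) & 1) << sigma[i] for i in range(4))
                 for x in range(16))


def compose(first, second):
    """Apply `first`, then `second`."""
    return tuple(second[x] for x in first)


def inverse(p):
    inv = [0] * len(p)
    for x, y in enumerate(p):
        inv[y] = x
    return tuple(inv)


# The 24 (= 4!) wire relabelings, each as a permutation of 0..15.
RELABELINGS = [bit_permutation(s) for s in permutations(range(4))]


def equivalence_class(f):
    """All functions obtained from f by simultaneous relabeling and inversion."""
    members = set()
    for p in RELABELINGS:
        g = compose(compose(inverse(p), f), p)  # relabel inputs and outputs together
        members.add(g)
        members.add(inverse(g))                 # reversed circuit implements the inverse
    return members


def canonical(f):
    """Assumed choice of representative: the smallest tuple in the class."""
    return min(equivalence_class(f))
```

A class has at most 48 members, and symmetric functions have fewer (for example, the class of a NOT gate contains only the four NOT gates), which is why the saving is "almost" rather than exactly 48.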
PERFORMANCE AND RESULTS
We performed several tests using two computer systems, CS1 and CS2. CS1 is a high performance server with 16 AMD Opteron 2300 MHz processors, 64 GB RAM, and Seagate Barracuda ES2 SCSI 7200 RPM HDD running Linux. CS2 is a laptop Sony VGN-NS190D with Intel Core Duo 2000 MHz processor, 4 GB RAM, and a 5400 RPM SATA HDD running Linux. The following subsections summarize the tests and results.
Synthesis of Random Permutations
In this test, we generated 10,000,000 random uniformly distributed permutations using the Mersenne twister random number generator [7]. The test was executed on CS1. It took 104,616.716 seconds (about 29 hours) of user time and the peak RAM usage was 43.04 GB. Note that 1111 seconds (approximately 18 minutes) were spent loading previously computed optimal circuits with up to 9 gates (see Subsection 4.2 for details) into RAM. Excluding this loading time, it took on average only 0.01035 seconds to synthesize an optimal circuit for a permutation. The distribution of the circuit sizes is shown in Table 1.
Note that the ratio of the number of random permutations requiring 9 gates to the number of all random permutations, 50,861/10,000,000 ≈ 0.005086, is close to the ratio of the number of all permutations requiring 9 gates to the number of all permutations, 105,984,823,653/16! ≈ 0.005066. This suggests that the weighted average over the random sample, equal to 11.94 gates per circuit, is close to the actual weighted average. We further use this random sample and the results of the optimal 3-bit circuit synthesis [14] to approximate the number of permutations requiring 10 through 17 gates, see Table 2.
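The two ratios can be reproduced directly. The standard-error comparison at the end is an added sanity check that is not part of the paper; it assumes the 10,000,000 samples are independent and uniform, so that the count of 9-gate permutations is binomially distributed.

```python
import math

SAMPLES = 10_000_000
total = math.factorial(16)                    # number of 4-bit permutations

p_sample = 50_861 / SAMPLES                   # 9-gate fraction in the random sample
p_exact = 105_984_823_653 / total             # 9-gate fraction over all permutations

# Assumed binomial model for the sample count (not from the paper).
std_err = math.sqrt(p_exact * (1 - p_exact) / SAMPLES)
z = (p_sample - p_exact) / std_err

print(round(p_sample, 6), round(p_exact, 6), round(z, 2))
```

Under this model the sample fraction differs from the exact fraction by less than one standard error, which is consistent with the sample being representative.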
We conjecture that there are no permutations requiring 17 gates, and few, if any, that require 16 gates. This implies that our search may be performed on a machine capable of storing reduced optimal implementations with up to 8 gates, i.e., a machine with 4 GB RAM. Further analysis suggests that the search for an optimal circuit will complete in the majority of cases (99.999% assuming uniform distribution) if one uses optimal circuits with at most 7 gates and stores only the hash table. Such a search requires slightly more than 256 MB of available RAM, and could be executed on an older machine.
Distribution of Optimal Implementations
Table 2 lists the distribution of the number of permutations that can be realized with optimal circuits requiring no more than 9 gates. We estimate the number of functions requiring 10..17 gates using the random function size distribution, see Table 1, and optimal synthesis of all 3-bit reversible functions. We used CS1 to run this test, and it took 10,549 seconds (under 3 hours) to complete using 43.04 GB of RAM. CS2 used 2.74 GB RAM and took 743.401 seconds (under 13 minutes) to synthesize optimal implementations with up to 8 gates.
Optimal linear circuits
Linear reversible circuits are the most complex part of error correcting circuits [1]. The efficiency of these circuits determines the efficiency of quantum encoding and decoding error correction operations. Linear reversible functions are those whose positive polarity Reed-Muller polynomial has only linear terms. More simply, linear reversible functions are those computable by circuits with NOT and CNOT gates.
For example, the reversible mapping a, b, c, d
, a is a linear reversible function. Interestingly, this linear function is one of the 138 most complex linear reversible functions: it requires 10 gates in an optimal implementation. The optimal implementation of this function is given by the cir-
We synthesized optimal circuits for all 322,560 4-bit linear reversible functions. This process took under two seconds on CS2. The distribution of the number of functions requiring a given number of gates is shown in Table 3.
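The count 322,560 agrees with the number of invertible affine maps on 4 bits, and the layered distribution can be obtained by breadth-first search over NOT/CNOT circuits. The sketch below is an illustration under assumptions, not the authors' code: each function is encoded as n rows of (n+1)-bit masks (an encoding chosen here for compactness), and the names are hypothetical.

```python
def count_affine(n):
    """2^n constants times |GL(n,2)| = prod_{i<n} (2^n - 2^i) invertible linear maps."""
    total = 2 ** n
    for i in range(n):
        total *= 2 ** n - 2 ** i
    return total


def linear_size_distribution(n):
    """counts[s] = number of functions whose optimal NOT/CNOT circuit has s gates.

    Row t of a state encodes output t as an XOR of inputs (bits 0..n-1)
    and the constant 1 (bit n).
    """
    start = tuple(1 << i for i in range(n))
    seen = {start}
    frontier = [start]
    counts = [1]
    while frontier:
        next_frontier = []
        for state in frontier:
            for t in range(n):
                # Append NOT on wire t, or CNOT with control c and target t.
                rows = [state[t] ^ (1 << n)]
                rows += [state[t] ^ state[c] for c in range(n) if c != t]
                for row in rows:
                    new = state[:t] + (row,) + state[t + 1:]
                    if new not in seen:
                        seen.add(new)
                        next_frontier.append(new)
        if next_frontier:
            counts.append(len(next_frontier))
        frontier = next_frontier
    return counts


print(count_affine(4))                 # 16 * 20160
print(linear_size_distribution(3))     # 3-bit case; use n = 4 for the paper's setting
```

Calling `linear_size_distribution(4)` explores all 4-bit linear reversible functions; in plain Python this takes noticeably longer than the under-two-second figure reported for the authors' implementation.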
Synthesis of Benchmarks
In this subsection, we present optimal circuits for benchmark functions that have been previously reported in the literature. Table 4 summarizes the results. The table lists the Name of the benchmark function, its complete Specification, Size of the Best Known Circuit (SBKC), the Source of this circuit, an indicator of whether this circuit has been Proved Optimal (PO?), Size of an Optimal Circuit (SOC), the optimal implementation that our program found, and the runtime our program takes to find this optimal implementation. We used CS1 for this test, and report the runtime measured after the hash table with all optimal implementations with up to 9 gates is loaded into RAM. Shorter runtimes were measured over multiple runs of the search to achieve sufficient accuracy. Please note that we introduce the function primes4, which cannot be found in previous literature. Also, the 9-gate circuit for function mperk requires some extra SWAP gates to properly map inputs into their respective outputs, indicated by an asterisk.
Searching for a Hard Permutation
We executed a 12-hour search using CS1 to find a permutation requiring more than 14 gates in an optimal implementation. To run the search, we used 14- and 13-gate optimal implementations and tried to extend them by assigning gates to the beginning and the end of those optimal implementations, computing the resulting function, and verifying how many gates it requires. After the 12-hour search, we were not able to find a permutation requiring more than 14 gates, indicating further that there are not many such permutations.
CONCLUSIONS AND FUTURE WORK
In this paper, we described an algorithm that finds an optimal circuit for any 4-bit reversible function. Our goal was to minimize the number of gates required for function implementation. Our program implementation takes approximately 3 hours to calculate all optimal implementations requiring up to 9 gates, and then an average of about 0.01 seconds to search for an optimal circuit of any 4-bit reversible function. Both calculations are surprisingly fast given the size of the search space.
We demonstrated the synthesis of 117,798,040,190 optimal circuits in 10,549 seconds, amounting to an average speed of 11,166,749 circuits per second. This is over 65 times faster per circuit and some 4,500 times more circuits than the best previously reported result (26 million circuits in 152 seconds) [12]. We synthesized optimal implementations for all linear reversible functions.
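The quoted throughput and the two comparison factors follow from the reported counts and runtimes; the short calculation below only restates that arithmetic.

```python
ours_circuits, ours_seconds = 117_798_040_190, 10_549
prior_circuits, prior_seconds = 26_000_000, 152   # result reported in [12]

ours_rate = ours_circuits / ours_seconds          # circuits per second
prior_rate = prior_circuits / prior_seconds

speedup = ours_rate / prior_rate                  # "over 65 times faster"
volume = ours_circuits / prior_circuits           # "some 4,500 times more"

print(int(ours_rate), round(speedup, 1), round(volume))
```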
We also demonstrated that the search for an optimal circuit can be done very quickly. For example, if all optimal circuits are written to a hypothetical 100+TB 5400 RPM hard drive, the average time to extract a random circuit from the drive would be expected to take on the order of 0.01−0.02 seconds (typical access time for 5400 RPM hard drives). In other words, it would likely take longer to read the answer from a hypothetical hard drive than to compute it with our implementation. Furthermore, the 3-hour calculation of all optimal circuits with up to 9 gates could be reduced by storing its result (computed once for the entirety of the described search and its follow up executions) on the hard drive, as was done in Subsection 4.1. It took 1111 seconds, i.e., under 18 minutes, to load optimal circuits with up to 9 gates into RAM using CS1. Given that the media transfer rate of modern hard drives is 1Gbit/s (=1GB in 8 seconds) and higher, it may take no longer than 5 minutes (= 300s > 296 = 37 * 8s) to load optimal implementations into RAM to initiate the search on a different machine.
Minor modifications to the algorithm could be explored to address other optimization issues. For example, for practicality, one may be interested in minimizing depth. This may be important if a faster circuit is preferred, and/or if quantum noise depends more strongly on elapsed time than on the disturbance introduced by multiple gate applications. It may also be important to account for the different implementation costs of the gates used (generally, NOT is much simpler than CNOT, which, in turn, is simpler than Toffoli). Both modifications are possible by making changes only to the first part of the search.
To optimize depth, one needs to consider a different family of gates, where, for instance, the sequence NOT(a) CNOT(b, c) is counted as a single gate. To account for different gate costs, one needs to search for small circuits via increasing cost by one (assuming costs are given as natural numbers), as opposed to adding a gate to all maximal size optimal circuits.
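The cost-based variant can be pictured as a uniform-cost (Dijkstra-style) search that expands functions in order of increasing total cost. The sketch below is an illustration on 3 bits only; the gate costs are hypothetical placeholders chosen for the example, not values from the paper, and the names are not from the authors' code.

```python
import heapq
from itertools import combinations

N_BITS = 3
SIZE = 1 << N_BITS

# Hypothetical costs indexed by number of controls: NOT = 1, CNOT = 2, TOF = 5.
HYPOTHETICAL_COSTS = {0: 1, 1: 2, 2: 5}


def weighted_gates(cost_by_controls):
    """List of (permutation, cost) pairs for all NOT, CNOT and TOF gates."""
    gates = []
    for t in range(N_BITS):
        others = [b for b in range(N_BITS) if b != t]
        for r in range(len(others) + 1):
            for controls in combinations(others, r):
                mask = sum(1 << c for c in controls)
                perm = tuple(x ^ (1 << t) if x & mask == mask else x
                             for x in range(SIZE))
                gates.append((perm, cost_by_controls[r]))
    return gates


def min_costs(gates):
    """Minimal circuit cost of every reachable function (uniform-cost search)."""
    start = tuple(range(SIZE))
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, f = heapq.heappop(heap)
        if cost > best[f]:
            continue                      # stale heap entry
        for g, w in gates:
            h = tuple(g[x] for x in f)    # append gate g after f
            if cost + w < best.get(h, float("inf")):
                best[h] = cost + w
                heapq.heappush(heap, (cost + w, h))
    return best


best = min_costs(weighted_gates(HYPOTHETICAL_COSTS))
```

With all costs set to 1 the same routine reduces to the gate-count search, so only the table-building phase changes, in line with the remark that the modification affects the first part of the search.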
It is also possible to extend the search to find optimal im-
