Abstract: Currently, there is large research interest and significant economic effort in building the first practical quantum computer. Such quantum computers promise to exceed the capabilities of conventional computers in fields such as computational chemistry, machine learning, and cryptanalysis. Automated methods to map logic designs to quantum networks are crucial to fully realizing this dream; however, existing methods can be expensive both in computational time and in the size of the resulting quantum networks. This work introduces an efficient method to map reversible single-target gates into a universal set of quantum gates (Clifford+T). This mapping method, called best-fit mapping, aims at reducing the cost of the resulting quantum network. It exploits k-LUT mapping and the existence of clean ancilla qubits to decompose a large single-target gate into a set of smaller single-target gates. In addition, this work proposes a post-synthesis optimization method, based on two cost-minimization properties, to reduce the cost of the final quantum network. Results show a reduction of up to 53% in the number of T gates for the synthesized EPFL benchmarks.
I. INTRODUCTION
The recent prospect of practical quantum computers [1], [2], [3] is pushing the design automation community to develop suitable tools that address the peculiarities of quantum circuits. In fact, quantum computing differs from standard computing in many respects. First, quantum computers process qubits instead of bits. Qubits do not only take the classical values 0 and 1, but can also represent a superposition of these. The state of a qubit cannot be copied, so a quantum gate cannot have a fanout greater than one. All quantum circuits are reversible. During design, it is possible to treat all inputs as Boolean values, even though entangled states in superposition are applied when the circuit is embedded in a quantum algorithm. A quantum circuit performing a Boolean function is called a reversible quantum network. This network is composed of reversible gates. In this paper, we consider the frequently used single-target gates and multiple-controlled Toffoli gates. These are high-level abstractions of the real operations that can be performed on qubits. During design, it is necessary to lower the level of abstraction and to map these reversible gates efficiently into a universal set of quantum gates, the Clifford+T set [4].
In this paper, we present a method that maps single-target gates into Clifford+T networks by using k-LUT mapping. By making use of clean ancilla qubits, i.e., qubits initialized to a constant value, it is possible to map a large single-target gate into a sequence of smaller single-target gates. The small single-target gates can then be mapped into Clifford+T networks by applying ESOP (exclusive sum-of-products) decomposition [5]. To further reduce the cost of the Clifford+T networks, we propose a post-synthesis optimization method based on graph matching that optimizes the Toffoli networks resulting from ESOP decomposition.
We evaluate our mapping algorithm within the LUT-based hierarchical reversible logic synthesis (LHRS) algorithm proposed in [6]. LHRS is a reversible synthesis algorithm that maps classical (irreversible) logic networks into reversible networks composed of single-target gates. With our method embedded into LHRS, experimental results show a reduction of up to 53% in the number of T gates when synthesizing the EPFL arithmetic benchmarks. Experiments also show that the proposed mapping approach can significantly speed up the execution time of LHRS.
II. PRELIMINARIES

A. Reversible Network
A reversible network is a set of reversible gates realizing a reversible function. A multi-output Boolean function is reversible if each input pattern uniquely maps to an output pattern. To satisfy this requirement, a reversible function has the same number of inputs and outputs. In addition, a reversible network that corresponds to a quantum network is required to be garbage free: intermediate values must not appear at any output terminal of the network. This is because quantum networks typically need to be run on a superposition of different inputs, and measuring and resetting garbage bits can reveal the path that the quantum computer took, which can collapse the quantum state that encodes the data. To describe reversible gates, we use the following notation. A reversible gate performs an n-variable reversible function that is applied to qubits (lines) $X = \{1, \dots, n\}$. We further consider literals based on the numbers in X, i.e., given $x \in X$, a literal l is either the positive literal x or the negative literal $\bar{x}$. Note that $\bar{\bar{l}} = l$, and we define $|l| = |\bar{l}| = x$. Finally, let $l \oplus 0 = l$ and $l \oplus 1 = \bar{l}$. Also, for a given set of literals L, we use $|L| = \{|l| \mid l \in L\}$ to refer to all variables of L.
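To make the reversibility condition and the literal notation concrete, here is a minimal Python sketch of our own (not code from LHRS or RevKit). It checks bijectivity by brute-force enumeration, which is only practical for small n, and encodes a literal as a hypothetical `(variable, negated)` pair.

```python
from itertools import product


def is_reversible(n_in, n_out, func):
    """Return True if func maps {0,1}^n_in bijectively onto {0,1}^n_out.

    func takes and returns tuples of 0/1 values.
    """
    if n_in != n_out:
        return False  # a reversible function needs as many outputs as inputs
    images = {func(bits) for bits in product((0, 1), repeat=n_in)}
    # with equal input and output widths, injective means bijective
    return len(images) == 2 ** n_in


# Hypothetical encoding: a literal is a pair (x, negated).
def complement(lit):
    """Return the complemented literal; complementing twice gives lit back."""
    x, negated = lit
    return (x, not negated)


def variable(lit):
    """Return |l|, the variable x of the literal (same for l and its complement)."""
    return lit[0]


def xor_polarity(lit, p):
    """l XOR 0 = l and l XOR 1 = complement of l."""
    return complement(lit) if p else lit


def variables(literals):
    """|L|: the set of variables of a set of literals."""
    return {variable(l) for l in literals}
```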
978-1-5090-0602-1/18/$31.00 ©2018 IEEE 7D-1
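Fig. 1. (a) And-inverter graph; (b) 3-LUT mapping; (c) 4-LUT mapping.
Definition 1 (Single-target gate): Given a set of control lines $C \subset X$, a target line $t \in X \setminus C$, and a control function c over the lines in C, the single-target gate $T_c(C, t)$ replaces the value of t by $t \oplus c(C)$ and leaves all other lines unchanged. In other words, it inverts the target if and only if the control function evaluates to true.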
Definition 2 (Multiple-controlled Toffoli gate):
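If c can be expressed as a single product term $c = x_{i_1}^{p_1} \wedge \cdots \wedge x_{i_k}^{p_k}$ over the control lines $C = \{x_{i_1}, \dots, x_{i_k}\}$ (with $x^1 = x$ and $x^0 = \bar{x}$), where the $p_j$ are the polarities of the controls, then we call the gate a multiple-controlled Toffoli gate. Since we consider these gates as special cases, we introduce a special notation T(C, t), in which C denotes the set of control literals, each control appearing as a positive or a negative literal according to its polarity.
An example of this special notation is shown in Fig. 2. For Toffoli gates, we use the set of literals C and the product term c interchangeably.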
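The semantics of Definitions 1 and 2 can be simulated on classical basis states. The following sketch is our own illustration: lines are indexed from 0 (the text numbers them from 1), and a Toffoli gate is described by a hypothetical list of control polarities, 1 for a positive and 0 for a negative literal.

```python
from itertools import product


def apply_single_target(state, controls, target, c):
    """Apply T_c(C, t) to a classical basis state given as a list of 0/1 values.

    controls: the control lines C; target: the line t (not in C);
    c: the control function, called with the values of the control lines.
    """
    if target in controls:
        raise ValueError("the target must not be a control line")
    out = list(state)
    out[target] ^= int(c(*(state[i] for i in controls)))  # invert t iff c is true
    return out


def toffoli_control(polarities):
    """Control function of a multiple-controlled Toffoli gate.

    polarities[j] = 1 makes control j a positive literal, 0 a negative one,
    so the gate fires iff every control value equals its polarity.
    """
    def c(*values):
        return all(v == p for v, p in zip(values, polarities))
    return c


def is_self_inverse(n, controls, target, c):
    """Exhaustively check that applying the gate twice restores every state."""
    for bits in product((0, 1), repeat=n):
        once = apply_single_target(list(bits), controls, target, c)
        if apply_single_target(once, controls, target, c) != list(bits):
            return False
    return True
```

Because the target is XORed with a function of lines the gate never modifies, every single-target gate is its own inverse, which `is_self_inverse` confirms for small cases.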
B. LUT mapping
In algorithms such as LHRS, the control function of a single-target gate is represented symbolically, e.g., using an and-inverter graph (AIG). An AIG is a logic network composed of AND gates and inverters (see Fig. 1(a) for an AIG representing the function $\mathrm{prime}_6(x_1, \dots, x_6) = [(x_6 \dots x_1)_2 \text{ is prime}]$). Each non-terminal node in an AIG represents a two-input AND gate, and an edge between two nodes is complemented when it is drawn dashed.
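As a small illustration of AIG semantics, the sketch below evaluates an AIG in which every edge carries a complement flag (the dashed edges of Fig. 1(a)). The dictionary encoding is our own assumption, and the example builds XOR from three AND nodes rather than the larger $\mathrm{prime}_6$ network.

```python
def evaluate_aig(ands, output, assignment):
    """Evaluate an AIG on one input assignment.

    ands: dict node -> ((fanin, comp), (fanin, comp)), listed in topological order;
          comp = 1 marks a complemented (dashed) edge.
    output: (node, comp) pair describing the output edge.
    assignment: dict primary input -> 0/1.
    """
    value = dict(assignment)
    for node, ((a, ca), (b, cb)) in ands.items():
        value[node] = (value[a] ^ ca) & (value[b] ^ cb)  # two-input AND node
    node, comp = output
    return value[node] ^ comp


# XOR(x, y) built from three AND nodes and complemented edges
xor_aig = {
    "n1": (("x", 0), ("y", 1)),    # x AND NOT y
    "n2": (("x", 1), ("y", 0)),    # NOT x AND y
    "n3": (("n1", 1), ("n2", 1)),  # NOT n1 AND NOT n2 = XNOR(x, y)
}
xor_output = ("n3", 1)             # complemented output edge yields XOR
```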
The k-LUT mapping problem consists of mapping an AIG into k-LUTs, which are gates with k inputs that can represent any k-input Boolean function. Several algorithms have been presented to obtain a k-LUT mapping (see, e.g., [7], [8], [9]). Figs. 1(b) and 1(c) show a 3-LUT and a 4-LUT mapping of the AIG. Vertices with the same color belong to the same LUT. In some cases, nodes of the initial AIG are duplicated so that they can belong to two different LUTs. The 3-LUT and 4-LUT mappings contain 12 and 4 LUTs, respectively.
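Many k-LUT mappers build on k-feasible cut enumeration: every node collects the sets of nodes (cuts) of size at most k that fully determine its value, and each such cut can be implemented by one k-LUT. The minimal Python sketch below shows only this enumeration step; choosing a cover of cuts (the actual mapping) is omitted, and we do not claim this is the exact algorithm of [7], [8], [9] or LHRS.

```python
from itertools import product


def enumerate_cuts(pis, ands, k):
    """Enumerate the k-feasible cuts of every node of an AIG.

    pis: primary input names; ands: dict node -> (fanin0, fanin1) in topological
    order (edge complementation does not matter for cuts). A cut of a node is a
    set of nodes whose values determine it; a k-LUT can implement any cut of size <= k.
    """
    cuts = {pi: {frozenset([pi])} for pi in pis}
    for node, (a, b) in ands.items():
        node_cuts = {frozenset([node])}  # the trivial cut
        for cut_a, cut_b in product(cuts[a], cuts[b]):
            merged = cut_a | cut_b
            if len(merged) <= k:  # keep only k-feasible cuts
                node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts
```

For a two-node chain n2 = AND(AND(a, b), c), a 3-LUT can absorb the whole chain through the cut {a, b, c}, whereas with k = 2 only {n1, c} remains, so two LUTs are needed.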
III. MAPPING OF SINGLE-TARGET GATES
Fig. 2. Notations for the multiple-controlled Toffoli gate: the left gate is $T_{1 \wedge 2}(\{1, 2\}, 3)$ in the complete notation and $T(\{1, 2\}, 3)$ in the special notation; for the second gate $T_{1 \wedge 2}(\{1, 2\}, 3)$ and $T(\{1, 2\}, 3)$.
Fig. 3. Example of mapping a single-target gate into Toffoli gates using ESOP decomposition.
We target the synthesis of Clifford+T circuits, a gate library consisting of the 2-qubit CNOT (controlled NOT) gate and the single-qubit Hadamard (H) and T gates. The T gate is by far the most expensive gate in fault-tolerant quantum computing, so that it is often the only one that determines the cost of a quantum algorithm [10]. The number of T gates in a quantum network is called its T-count, and we are interested in minimizing this quantity. In this section, we propose mapping techniques that solve the following problem in reversible logic and quantum circuit synthesis.
Problem 1:
Given a single-target gate $T_c(C, t)$ and a set of clean ancilla lines $X_{\mathrm{clean}}$, find a Clifford+T network that realizes the function c on line t and restores the initial values on all other lines.
Ancillae are helper lines that can be used to map reversible gates into quantum gates more efficiently. In large reversible circuits, many ancilla lines are available, because each reversible gate locally interacts with only a portion of all the qubits. If the value of an ancilla is known to be zero when it is used in the realization of a gate, the ancilla is called clean. We first introduce existing mapping methods. The first two are direct methods that do not make use of any ancilla line: one is based on ESOP decomposition, and one on precomputed optimal networks. We then propose a novel method, called best-fit mapping of single-target gates, which exploits ancillae by means of k-LUT networks, selecting the most suitable LUT size.
A. Direct Mapping Methods
Fig. 4. Example of mapping a single-target gate into Toffoli gates using LUT mapping.
ESOP-decomposition based mapping. One can always decompose a single-target gate $T_c(C, t)$ into a cascade of multiple-controlled Toffoli gates $T(C_1, t), \dots, T(C_m, t)$ that all act on the target t, such that $c = C_1 \oplus \cdots \oplus C_m$, where each $C_i$ is read as a product term. This decomposition of c is also referred to as ESOP decomposition [11], [12], [13]. ESOP minimization techniques can be applied to reduce the size of the ESOP expression. An example is given in Fig. 3. Finally, each of the multiple-controlled Toffoli gates is mapped into a Clifford+T network using the mapping described in [10].
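One simple, always-available ESOP is the positive-polarity Reed-Muller (PPRM) expansion, obtained from a truth table by a GF(2) Moebius transform. The sketch below uses it to turn a control function into a cascade of Toffoli gates on one target. Note that PPRM is our swapped-in example of an ESOP; it is generally larger than the minimized ESOPs of [11], [12], [13].

```python
def pprm_terms(tt, n):
    """Positive-polarity Reed-Muller expansion of a truth table.

    tt: list of 2**n output bits, tt[x] = f(x), bit i of x is variable x_{i+1}.
    Returns bitmasks m; each m is the product of the variables whose bits are set
    (m == 0 is the constant-1 term).
    """
    coeffs = list(tt)
    for i in range(n):  # in-place GF(2) Moebius transform
        step = 1 << i
        for x in range(1 << n):
            if x & step:
                coeffs[x] ^= coeffs[x ^ step]
    return [m for m in range(1 << n) if coeffs[m]]


def toffoli_cascade(terms):
    """One Toffoli gate T(C_i, t) per product term; C_i lists 0-based control lines."""
    return [[i for i in range(m.bit_length()) if (m >> i) & 1] for m in terms]


def simulate_cascade(gates, x):
    """Value left on the (initially 0) target after all gates, for input x."""
    t = 0
    for controls in gates:
        t ^= int(all((x >> i) & 1 for i in controls))  # no controls = NOT gate
    return t
```

For the 3-input majority function this yields the three terms $x_1x_2 \oplus x_1x_3 \oplus x_2x_3$, i.e., three 2-controlled Toffoli gates on the target.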
Near-optimum mapping. For small functions, it is practical to find optimal Clifford+T realizations with optimization algorithms such as [14], [15], [16], [17], and to store them in a database. In order to reduce the number of optimal results to be stored, the affine classification of Boolean functions is exploited [18], [19], [20]. There are $2^{2^n}$ Boolean functions on n variables, and they can be partitioned into equivalence classes by allowing transformations on the inputs that only require CNOT gates and therefore do not contribute to the number of T gates in the quantum circuit. Two functions that are equivalent under affine transformations and negation are called AN-equivalent. This classification makes it possible to use the database for LUTs with up to 5 inputs. Indeed, for n = 1, 2, 3, 4, and 5 there are only 2, 3, 6, 18, and 206 classes of AN-equivalent functions, respectively [21].
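The class counts can be reproduced for small n by brute force. The sketch below assumes that AN-equivalence means invertible affine transformations of the inputs (generated by NOT gates and CNOT-like updates $x_i \leftarrow x_i \oplus x_j$) together with negation of the output; under that assumption it flood-fills the orbits of all $2^{2^n}$ truth tables. We only checked it for n up to 3, where it gives 2, 3, and 6 classes; for larger n it becomes slow.

```python
def permute_inputs(tt, n, g):
    """Truth table (as an int) of f'(x) = f(g(x)) for an input map g."""
    out = 0
    for x in range(1 << n):
        if (tt >> g(x)) & 1:
            out |= 1 << x
    return out


def count_an_classes(n):
    """Count classes of n-input functions under input affine maps and output negation."""
    size = 1 << n
    all_ones = (1 << size) - 1
    gens = []
    for i in range(n):
        gens.append(lambda x, i=i: x ^ (1 << i))  # NOT on input i
        for j in range(n):
            if i != j:
                # CNOT-like transvection: x_i <- x_i XOR x_j
                gens.append(lambda x, i=i, j=j: x ^ (((x >> j) & 1) << i))
    seen = set()
    classes = 0
    for f in range(1 << size):
        if f in seen:
            continue
        classes += 1  # f starts a new equivalence class
        seen.add(f)
        stack = [f]
        while stack:  # flood-fill the orbit of f
            g = stack.pop()
            neighbours = [permute_inputs(g, n, gen) for gen in gens]
            neighbours.append(g ^ all_ones)  # negate the output
            for nb in neighbours:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return classes
```

Each generator is an involution, so following generator edges from a function reaches exactly its equivalence class.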
B. LUT-based Mapping Methods
We describe techniques that exploit LUT mapping to translate a single-target gate into a network of multiple-controlled Toffoli gates. A LUT-based mapping method performs k-LUT mapping, which consists of dividing the network into subnetworks, each having at most k inputs. If the mapped LUT network is composed of $\ell$ LUTs, then it is possible to map the single-target gate into a network of single-target gates with at most k inputs, by using $\ell - 1$ clean ancilla lines. Fig. 4 shows how the LUT mapping in Fig. 1(c), where k = 4, can be mapped into a reversible network of single-target gates. Note that for the LUT mapping in Fig. 1(b), where k = 3, 11 clean ancillae are needed.
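The $\ell - 1$ ancilla count follows from a compute/uncompute pattern: each intermediate LUT is computed onto its own clean ancilla, the output LUT is computed onto t, and the intermediate LUTs are then uncomputed in reverse order so that all ancillae return to 0. The following sketch produces such a schedule; the function and line names are hypothetical, and the gate order of Fig. 4 may differ.

```python
def lut_to_single_target_gates(luts, target, ancillae):
    """Schedule one single-target gate per LUT computation.

    luts: LUT names in topological order; the last one computes the output.
    Returns (gates, line_of), where each gate is (lut, line it is written to).
    """
    *inner, out = luts
    if len(inner) > len(ancillae):
        raise ValueError(f"need {len(inner)} clean ancillae, have {len(ancillae)}")
    line_of = dict(zip(inner, ancillae))        # one clean ancilla per intermediate LUT
    compute = [(lut, line_of[lut]) for lut in inner]
    # compute, write the output onto t, then uncompute in reverse order
    return compute + [(out, target)] + compute[::-1], line_of
```

For the four LUTs of Fig. 1(c) this gives 3 ancillae and 7 single-target gates; the 12 LUTs of Fig. 1(b) would require 11 ancillae.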
Hybrid LUT-based mapping. The hybrid method is a previously proposed LUT-based mapping that aims at exploiting the near-optimal precomputed networks [6]. Given the input AIG, it performs 4-LUT mapping so that each LUT matches one of the affine equivalence classes of 4-input functions. However, there may not be enough clean ancillae to store all the intermediate values of the mapping. In that case, the hybrid method merges two LUTs into a larger one, thereby requiring one fewer clean ancilla. This procedure is repeated until the number of available clean ancillae suffices. This procedure can lead to one very large LUT; in fact, this LUT can have more inputs than there are primary inputs.
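The merging loop of the hybrid method can be sketched as follows. The text does not say which pair of LUTs is merged, so the selection rule here (collapse the intermediate LUT that keeps the largest resulting LUT smallest) is a hypothetical choice; the sketch only illustrates how each merge saves one ancilla while fan-ins grow.

```python
def hybrid_merge(luts, output, ancillae):
    """Collapse intermediate LUTs into their consumers until few enough remain.

    luts: dict LUT name -> set of fanins (primary inputs or other LUT names);
    output: name of the LUT that drives the target; ancillae: clean lines available.
    A network of l LUTs needs l - 1 clean ancillae.
    """
    luts = {name: set(fanins) for name, fanins in luts.items()}  # do not mutate input
    while len(luts) - 1 > ancillae:
        best = None
        for u in luts:
            if u == output:
                continue
            # fan-in of the largest LUT that would absorb u (hypothetical choice rule)
            growth = max(
                (len((luts[v] - {u}) | luts[u]) for v in luts if u in luts[v]),
                default=0,
            )
            if best is None or growth < best[0]:
                best = (growth, u)
        u = best[1]
        for v in luts:
            if u in luts[v]:
                luts[v] = (luts[v] - {u}) | luts[u]  # v now computes u's logic itself
        del luts[u]
    return luts
```

With no ancillae at all, every intermediate LUT is absorbed and a single LUT over all primary inputs remains, which is the large-LUT effect described above.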
Best-fit mapping. The best-fit mapping method addresses the limitations of both the direct and the hybrid methods. It exploits extra ancillae, applies near-optimal precomputed networks, and reduces the size of the functions that are synthesized directly. The idea is to find a suitable value for k: the smallest value for which the resulting k-LUT mapping fits the available number of clean ancillae, i.e., for which the number of LUTs minus one does not exceed it. Starting from k = 4, k is incremented until this condition is satisfied. Once the mapping is obtained, each k-LUT is mapped into Clifford+T gates. If a LUT has few enough inputs, it is replaced by the precomputed optimal Clifford+T network; otherwise, ESOP-based decomposition is applied.
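The selection of k and the per-LUT choice can be sketched as below. `lut_count(k)` is a hypothetical stand-in for running a k-LUT mapper and counting LUTs, and the 5-input database threshold is our assumption based on the AN-classification limit mentioned above; the upper bound `k_max` is likewise illustrative.

```python
def best_fit_k(lut_count, ancillae, k_start=4, k_max=32):
    """Return the smallest k >= k_start whose k-LUT mapping fits the ancillae.

    lut_count(k) stands in for running a k-LUT mapper and counting the LUTs;
    a mapping with l LUTs needs l - 1 clean ancillae. Returns None if no k fits.
    """
    for k in range(k_start, k_max + 1):
        if lut_count(k) - 1 <= ancillae:
            return k
    return None


def lut_strategy(num_inputs, db_max_inputs=5):
    """Per-LUT choice: precomputed (AN-class) database if small enough, else ESOP."""
    return "database" if num_inputs <= db_max_inputs else "esop"
```

With more ancillae available, the loop stops at a smaller k, so more LUTs are small enough to be served from the database instead of being decomposed directly.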
IV. POST-SYNTHESIS OPTIMIZATION
In addition, we propose a technique to reduce the T-count of reversible networks composed of multiple-controlled Toffoli gates. It is particularly useful for our best-fit technique, because the optimization is effective for reversible circuits obtained with ESOP decomposition, which consist of a set of multiple-controlled Toffoli gates all acting on the same target line. Networks with these characteristics can be optimized by exploiting the gate pair combinations described in the next section. It is also important to note that the position of a gate is irrelevant in these networks: all gates have the same target and never modify their control lines, so each gate XORs a product term onto the target, and since XOR is commutative, the gates can be reordered freely.
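The reordering argument can be checked exhaustively on a small example. In this sketch of our own, each gate is a list of `(line, polarity)` control literals and all gates share one target line; the check compares the function computed by every permutation of the gates.

```python
from itertools import permutations, product


def run(gates, bits, target):
    """Apply Toffoli gates that all act on `target` to one input pattern.

    Each gate is a list of (line, polarity) control literals; no control may be
    the target line, so the gates never change each other's control values.
    """
    state = list(bits)
    for literals in gates:
        if all(state[line] == pol for line, pol in literals):
            state[target] ^= 1
    return state


def order_independent(gates, n, target):
    """Exhaustively check that every ordering of the gates gives the same function."""
    inputs = list(product((0, 1), repeat=n))
    reference = [run(gates, bits, target) for bits in inputs]
    return all(
        [run(order, bits, target) for bits in inputs] == reference
        for order in permutations(gates)
    )
```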
A. Exploited Properties
The rules described here are ways to combine two gates whose control lines have specific characteristics. All rules apply to two gates that share the same target line.
Rule 1: If two gates are identical, they can both be removed from the network. This property is called the deletion rule in [22].
Rule 2: If the following conditions hold:
where #B is the cardinality of the set of controls B, then
Rule 2 is a generalization of some rules presented in [23]. An example is shown in Fig. 5, which explains the rule using identities from [22]. Given two Toffoli gates, the rule applies if one gate has a single control on a qubit that is not a control of the other gate; in the example, this is $x_5$. The second gate can then be replaced by two identical gates, applied before and after the remaining one, that are controlled by those controls of the second gate that are not shared with the first gate. This rule leads to a cost reduction only if the two initial gates have some identical controls in common: $x_1$ and $x_2$ in the example.
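The following sketch checks a Rule 2 rewrite by exhaustive simulation. The text does not state which line the two new gates target; here we assume they target the line of the single control ($x_5$ in the example), which is what makes the identity hold: the first new gate flips that line exactly when the second gate would have fired. Line indices are 0-based and the encoding of gates is our own.

```python
from itertools import product


def simulate(circuit, bits):
    """circuit: list of (literals, target); literals are (line, polarity) pairs."""
    state = list(bits)
    for literals, target in circuit:
        if all(state[line] == pol for line, pol in literals):
            state[target] ^= 1
    return state


def equivalent(c1, c2, n):
    """Exhaustive functional equivalence check on n lines."""
    return all(simulate(c1, b) == simulate(c2, b) for b in product((0, 1), repeat=n))


def rule2_rewrite(common, single, rest, target):
    """Rewrite T(common + [single], t) T(common + rest, t).

    Assumption (our reading of Fig. 5): the two copies of the new gate are
    controlled by `rest` and target the line of `single`, which must not be a
    control line of the second gate.
    """
    first = (common + [single], target)
    flip = (rest, single[0])
    return [flip, first, flip]
```

In the example below (shared controls $x_1, x_2$, single control $x_5$, remaining controls $x_3, \bar{x}_4$), a 4-control gate is replaced by two 2-control gates, while the 3-control gate is kept.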
Rule 3 ([24]):
Let $g_1 = T(C_1, t)$ and $g_2 = T(C_2, t)$ with $|C_1| = |C_2|$, i.e., $g_1$ and $g_2$ have controls on the same set of lines. Let $D = C_1 \cap \bar{C}_2 = \{c_1, \dots, c_k\}$, where $\bar{C}_2 = \{\bar{l} \mid l \in C_2\}$, be the set of controls that occur with different polarities in $g_1$ and $g_2$, and let $\#D > 0$. Then
An example is shown in Fig. 6. This rule applies when the first and the second gate have controls on the same lines, some of them with different polarities. It uses two identical CNOT gates before and after the initial gates to complement the polarity of one control (see rule D2 in [22]). This is repeated until only one control with a different polarity remains. The pair is then equivalent to a single gate with this control removed and all the identical controls kept (see rule D3 in [22]).
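The sketch below shows one concrete way to realize the CNOT conjugation of Rule 3; it does not reproduce the exact gate placement of Fig. 6. CNOTs from the first differing control line onto each other differing line make the pair equivalent to a single Toffoli gate without that first control. The polarity of each new literal, $p_1 \oplus p_j$, is our own derivation (the two gates fire together exactly when all differing literals agree), and we checked it exhaustively only on small examples.

```python
from itertools import product


def simulate(circuit, bits):
    """circuit: list of (literals, target); literals are (line, polarity) pairs."""
    state = list(bits)
    for literals, target in circuit:
        if all(state[line] == pol for line, pol in literals):
            state[target] ^= 1
    return state


def equivalent(c1, c2, n):
    """Exhaustive functional equivalence check on n lines."""
    return all(simulate(c1, b) == simulate(c2, b) for b in product((0, 1), repeat=n))


def rule3_rewrite(c1, c2, target):
    """c1, c2: dicts line -> polarity over the same control lines (|C1| = |C2|)."""
    assert c1.keys() == c2.keys()
    diff = [line for line in c1 if c1[line] != c2[line]]  # the lines of D
    assert diff, "Rule 3 needs at least one control with different polarity"
    pivot, others = diff[0], diff[1:]
    cnots = [([(pivot, 1)], line) for line in others]  # x_j <- x_j XOR x_pivot
    controls = [(line, pol) for line, pol in c1.items() if line not in diff]
    # after the CNOT, line j holds x_pivot XOR x_j; its literal has polarity p_pivot XOR p_j
    controls += [(line, c1[pivot] ^ c1[line]) for line in others]
    return cnots + [(controls, target)] + cnots
```

When only one control differs, no CNOTs are needed and the pair collapses directly to a single gate with that control removed, as stated in the text.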
B. Graph Matching Problem
Direct single-target gate mapping using ESOP decomposition leads to reversible networks of multiple-controlled Toffoli gates that all have the same target line. Many pairs of gates could be combined, and this section describes the algorithm used to select which pairs to combine. The method derives an optimization graph from the circuit and performs graph matching, similarly to [25], [26].
Definition 3 (Optimization graph):
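Given a set of generalized Toffoli gates $g_1 = T(C_1, t), g_2 = T(C_2, t), \dots, g_m = T(C_m, t)$, we define the undirected graph G = (V, E) with edge weights $q : E \to \mathbb{N}_0$ as follows:
$V = \{g_1, \dots, g_m\}$, $E = \{\{v, w\} \subseteq V \mid \mathrm{cost}(v, w) < \mathrm{cost}(v) + \mathrm{cost}(w)\}$, and $q(e) = \mathrm{cost}(v) + \mathrm{cost}(w) - \mathrm{cost}(v, w)$,
where e = {v, w}, cost(g) denotes the cost of gate g (e.g., its T-count), and cost(v, w) denotes the cost of realizing v and w together after combining them with one of the rules above.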
In other words, the vertices of G are the gates, and two gates are connected by an edge if their cost when combined is smaller than the sum of their individual costs. The weight of an edge e = {v, w} is the cost saving achieved by combining gates v and w. We refer to this graph as the optimization graph. We use graph matching to find the set of edges, i.e., the set of gate pairs to combine, that leads to the largest cost reduction.
The following theorem follows trivially.
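Theorem 1: Let G = (V, E) be an optimization graph as defined above, and let $M = \{e_1, \dots, e_j\}$ be a graph matching for G. Then it is possible to realize all generalized Toffoli gates in a circuit with cost $\sum_{v \in V} \mathrm{cost}(v) - \sum_{e \in M} q(e)$, where cost(v) is the cost of gate v.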
A greedy algorithm is used to compute a maximal matching with a large total weight. The matching corresponds to a set of edges in the optimization graph, and each edge refers to a pair of Toffoli gates in the initial circuit that are combined using Rule 2 or 3. The final circuit is obtained by replacing each matched pair with its combined, lower-cost realization.
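A common greedy strategy, and the one assumed in this sketch, scans the edges by decreasing weight and keeps an edge whenever both of its gates are still unmatched; the resulting cost follows Theorem 1. The gate names and costs in the usage example are made up for illustration and are not taken from the experiments.

```python
def greedy_matching(edges):
    """edges: dict frozenset({v, w}) -> saving q(e) > 0.

    Scans edges by decreasing weight and keeps an edge if both endpoints are
    still free. The result is a maximal matching, not necessarily a
    maximum-weight one.
    """
    matched, matching = set(), []
    for e, q in sorted(edges.items(), key=lambda item: -item[1]):
        if not (e & matched):
            matching.append(e)
            matched |= e
    return matching


def optimized_cost(gate_cost, edges):
    """Theorem 1: total gate cost minus the savings of the matched pairs."""
    matching = greedy_matching(edges)
    return sum(gate_cost.values()) - sum(edges[e] for e in matching), matching
```

For hypothetical costs g1 = g2 = 7 and g3 = g4 = 16 with savings {g3, g4}: 8, {g2, g3}: 6, {g1, g2}: 5, {g1, g4}: 2, the greedy pass keeps {g3, g4} and {g1, g2} and reports a total of 46 − 13 = 33. This sketch sorts the edges first and therefore runs in O(E log E).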
V. RESULTS
We implemented the algorithm in C++ on top of RevKit [27]. Experiments were run on an Intel Core i3 with 3.06 GHz and 4 GB RAM running Mac OS X Yosemite.
A. Best-Fit Mapping Method
The efficiency of the best-fit mapping method is compared to that of hybrid mapping, a method already implemented in LHRS [6]. The EPFL arithmetic benchmarks are used for the evaluation, and the results are shown in Table I. For each benchmark, the table shows the results for three different methods: HY, all single-target gates mapped with hybrid mapping; BF, all single-target gates mapped with best-fit mapping; and PB, the pick-best method. For every single-target gate, pick-best selects whichever of the hybrid and best-fit techniques results in the lower T-count. All final networks are post-optimized with the method proposed in Section IV.
In order to synthesize a complex circuit, the LHRS framework performs an initial LUT mapping and then associates each LUT with a single-target gate to be synthesized. This part of the procedure is independent of the mapping method under evaluation, but the choice of the parameter $k_1$ for this initial LUT mapping affects the complexity of the single-target gates and the number of qubits. Table I shows different initial choices of $k_1$: 16-LUT, 22-LUT, and 28-LUT. This parameter $k_1$, which is used to map the initial benchmark AIG into single-target gates, must not be confused with the parameter k used to map each single-target gate into a Clifford+T network.
The results show that the best-fit method always outperforms the hybrid method in terms of the number of T gates in the final network. While BF applies best-fit to all the single-target gates that compose the circuit, PB selects for each single-target gate whichever of hybrid and best-fit results in the smaller cost. Most of the time, pick-best yields a smaller T-count, although the gain is always small. This indicates that, for most isolated single-target gates, best-fit mapping is superior to hybrid mapping.
All simulations were performed with a timeout of 30 minutes. The LUT-based mapping techniques are particularly slow whenever direct mapping has to be performed, which happens whenever the number of inputs of a LUT exceeds the capabilities of the AN classification. Although there is no runtime difference between the three methods for a fixed number of available qubits, the situation changes when extra qubits are available for the synthesis. In the LHRS framework, the number of ancillae available for mapping each single-target gate is fixed by the initial LUT mapping. Nevertheless, because the mapping algorithm can also be used outside this framework, it allows a different number of ancillae to be specified. In this experiment, this option is used to show that the best-fit algorithm is better at exploiting additional ancillae. Indeed, by varying its parameter k, best-fit mapping is able to take advantage of extra qubits, reducing the number and the size of the LUTs that are mapped with the direct method. Some of the results for hyp, log2, multiplier, and square are marked in Table I with two different symbols, indicating that they were obtained, after running into a timeout, by adding 20 or 40 qubits, respectively. A timeout entry in the table means that no mapping could be found within 30 minutes even with 40 additional qubits. This is often the case for the hybrid method.
B. Post-Synthesis Optimization
The post-synthesis optimization technique proposed in Section IV is evaluated on the EPFL arithmetic benchmarks, synthesized with the LHRS framework using the direct mapping method. Results are shown in Table II and Table III. The first table compares the greedy algorithm with an exact matching algorithm, e.g., the blossom algorithm [28], for $k_1 = 6$. A small $k_1$ is chosen so that the exact matching can be computed in a reasonable time. The comparison shows that the greedy algorithm, whose complexity is Θ(E), where E is the number of edges in the graph, obtains satisfactory maximal matchings: the exact approach leads to only very small improvements, always less than 1% for $k_1 = 6$. We are therefore confident that, for the purpose of reversible circuit optimization, the greedy approach is accurate enough and the exact approach is not worth its higher complexity. The second table shows the case in which the initial LUT mapping has been performed with $k_1 = 16$. With a larger $k_1$, each single-target gate has a more complex control function in terms of number of gates and inputs, and hence more room for improvement. In this case, the optimization leads to a 53% reduction in T-count.
VI. CONCLUSION
We propose an efficient method to map single-target gates into Clifford+T logic networks. This method has been integrated into the open-source LHRS framework for hierarchical reversible synthesis, and its performance has been compared with other mapping methods, showing its superiority. We also provide a post-synthesis optimization method to reduce the cost of reversible quantum networks, reaching up to 53% reduction in T-count. Both proposed methods have been evaluated on the EPFL arithmetic benchmarks.
