Logic Synthesis for Quantum Computing
Mathias Soeken, Martin Roetteler, Nathan Wiebe, and Giovanni De Micheli
Abstract: Today's rapid advances in the physical implementation of quantum computers call for scalable synthesis methods to map practical logic designs to quantum architectures. We present a synthesis framework to map logic networks into quantum circuits for quantum computing. The synthesis framework is based on LUT networks (lookup-table networks), which play a key role in state-of-the-art conventional logic synthesis. Establishing a connection between LUTs in a LUT network and reversible single-target gates in a reversible network allows us to bridge conventional logic synthesis with logic synthesis for quantum computing, despite several fundamental differences. As a result, our proposed synthesis framework directly benefits from the scientific achievements that were made in logic synthesis during the past decades.
We call our synthesis framework LUT-based Hierarchical Reversible Logic Synthesis (LHRS). Input to LHRS is a classical logic network, e.g., represented as a Verilog description; output is a quantum network (realized in terms of Clifford+T gates, the most frequently used gate library in quantum computing). The framework allows trading off the number of qubits against the number of quantum gates. In a first step, an initial network is derived that consists only of single-target gates and already completely determines the number of qubits in the final quantum network. Different methods are then used to map each single-target gate into Clifford+T gates, while aiming at optimally using available resources.
We demonstrate the effectiveness of our method in automatically synthesizing IEEE-compliant floating point networks up to double precision. As many quantum algorithms target scientific simulation applications, they can make rich use of floating point arithmetic components. But due to the lack of quantum circuit descriptions for those components, it can be difficult to find a realistic cost estimation for the algorithms. Our synthesized benchmarks allow quantum algorithm designers to provide the first complete cost estimates for a host of quantum algorithms. Thus, the benchmarks and, more generally, the LHRS framework are an essential step towards the goal of understanding which quantum algorithms will be practical in the first generations of quantum computers.
I. INTRODUCTION
Recent progress in fabrication makes the practical application of quantum computers a tangible prospect [2], [3], [4], [5]. However, as quantum computers scale up to tackle problems in computational chemistry, machine learning, and cryptanalysis, design automation will be necessary to fully leverage the power of this emerging computational model.
Quantum circuits differ significantly from classical circuits, and design automation tools need to address the following differences:
1) Quantum computers process qubits instead of classical bits. A qubit can be in superposition and several qubits can be entangled. We target purely Boolean functions as input to our synthesis algorithms. At design time, it is sufficient to assume that all input values are Boolean, even though entangled qubits in superposition are eventually acted upon by the quantum hardware.
2) All operations on qubits besides measurement, called quantum gates, must be reversible. Gates with multiple fanout known from classical circuits are therefore not possible. Temporarily computed values must be stored on additional helper qubits, called ancillae. An intensive use of intermediate results therefore increases the qubit requirements of the resulting quantum circuit. Since qubits are a limited resource, the aim is to find circuits with as few ancillae as possible. Quantum circuits that compute a purely Boolean function are often referred to as reversible networks.
3) The quantum gates that can be implemented by current quantum computers act on a single qubit or on at most two qubits [3]. Something as simple as an AND operation can therefore not be expressed by a single quantum gate. A universal fault-tolerant quantum gate library is the Clifford+T gate set [3]. In this gate set, the T gate is sufficiently expensive in most approaches to fault-tolerant quantum computing that it is customary to neglect all other gates when costing a quantum circuit [6]. Mapping reversible functions into networks that minimize T gates is therefore a central challenge in quantum computing [7].
4) When executing a quantum circuit on a quantum computer, all qubits must eventually hold either a primary input value, a primary output value, or a constant. A circuit should not expose intermediate results to output lines as this can potentially destroy wanted interference effects, in particular if the circuit is used as a subroutine in a larger quantum computation. Qubits that nevertheless expose intermediate results are sometimes referred to as garbage outputs.
It has recently been shown [8], [9], [10] that hierarchical reversible logic synthesis methods based on logic network representations are able to synthesize large arithmetic designs. The underlying idea is to map subnetworks into reversible networks. Hierarchical refers to the fact that intermediate results computed by the subnetworks must be stored on additional ancilla qubits. If the subnetworks are small enough, one can locally apply less efficient reversible synthesis methods that do not require ancilla qubits and are based on Boolean satisfiability [11], truth tables [12], or decision diagrams [13]. However, state-of-the-art hierarchical synthesis methods mainly suffer from two disadvantages. First, they do not explicitly uncompute the temporary values from the subnetworks and leave garbage outputs. In order to use the network in a quantum computer, one can apply a technique called the "Bennett trick" [14], which requires doubling the number of gates and adding one further ancilla for each primary output. Second, current algorithms do not offer satisfying solutions to trade the number of qubits for the number of T gates. Instead, many algorithms optimize towards one extreme [10], i.e., the number of qubits is very small at the cost of a very high number of T gates, or vice versa.
This paper presents a hierarchical synthesis framework based on k-feasible Boolean logic networks, which find use in conventional logic synthesis. These are logic networks in which every gate has at most k inputs. They are often referred to as k-LUT (lookup table) networks. We show that there is a one-to-one correspondence between a k-input LUT in a logic network and a reversible single-target gate with k control lines in a reversible network. A single-target gate has a k-input control function and a single target line that is inverted if and only if the control function evaluates to 1. The initial reversible network with single-target gates can be derived quickly and provides a skeleton for subsequent synthesis that already fixes the number of qubits in the final quantum network. As a second step, each single-target gate is mapped into a Clifford+T network. We propose different methods for the mapping. A direct method makes use of the exclusive-sum-of-products (ESOP) representation of the control function, which can be directly translated into multiple-controlled Toffoli gates [15]. Multiple-controlled Toffoli gates are a specialization of single-target gates for which automated translations into Clifford+T networks exist. Another method tries to remap a single-target gate into a LUT network with fewer inputs per LUT, by making use of temporarily unused qubits in the overall quantum network. We show that near-optimal Clifford+T networks can be precomputed and stored in a database if such LUT networks require sufficiently few gates.
The presented LHRS algorithm is evaluated on both academic and industrial benchmarks. On the academic EPFL arithmetic benchmarks, we show how the various parameters affect the number of qubits and the number of T gates in the final quantum network as well as the algorithm's runtime. We also used the algorithm to find quantum networks for several industrial floating point arithmetic networks up to double precision. From these networks we can derive cost estimates for their use in quantum algorithms. This information has been missing in many proposed algorithms, and arithmetic computation has often not been explicitly taken into account. Our cost estimates show that this is misleading, as for some algorithms the arithmetic computation accounts for the dominant cost.
Quantum programming frameworks such as LIQUi|⟩ [16] or ProjectQ [17] can link in the Clifford+T circuits that are automatically generated by LHRS.
The paper is structured as follows. The next section introduces definitions and notations. Section III provides the problem definition and gives a coarse outline of the algorithm, separating it into two steps: synthesizing the mapping, described in Sect. IV and mapping single-target gates, described in Sect. V. Section VI discusses the results of the experimental evaluation and Sect. VII concludes.
II. PRELIMINARIES
A. Some Notation
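A digraph G = (V, A) is called simple if $A \subseteq V \times V$, i.e., there can be at most one arc between two vertices for each direction. An acyclic digraph is called a dag. We refer to $d^-(v) = \#\{w \mid (w, v) \in A\}$ and $d^+(v) = \#\{w \mid (v, w) \in A\}$ as in-degree and out-degree of v, respectively.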
B. Boolean Logic Networks
A Boolean logic network is a simple dag whose vertices are primary inputs, primary outputs, and gates and whose arcs connect gates to inputs, outputs, and other gates. Formally, a Boolean logic network N = (V, A, F) consists of a simple dag (V, A) and a function mapping F. It has vertices V = X ∪ Y ∪ G for primary inputs X, primary outputs Y, and gates G. We have $d^-(x) = 0$ for all x ∈ X and $d^+(y) = 0$ for all y ∈ Y. Arcs $A \subseteq (X \cup G) \times (G \cup Y)$ connect primary inputs and gates to other gates and primary outputs. Each gate g ∈ G realizes a Boolean function $F(g) : \mathbb{B}^{d^-(g)} \to \mathbb{B}$, i.e., the number of inputs in F(g) coincides with the number of ingoing arcs of g.
Example 1: Fig. 1 shows a logic network of the benchmark cm85a obtained using ABC [18] . It has 11 inputs, 3 outputs, and 13 gates. The gate functions are not shown but it can easily be checked that each gate has at most 4 inputs.
The fanin of a gate or output v ∈ G ∪ Y, denoted fanin(v), is the set of source vertices of ingoing arcs:

$$\operatorname{fanin}(v) = \{w \mid (w, v) \in A\}$$

For a gate g ∈ G, this set is ordered according to the position of variables in F(g). For a primary output y ∈ Y, we have $d^-(y) = 1$, i.e., fanin(y) = {v} for some v ∈ X ∪ G. The vertex v is called the driver of y and we introduce the notation driver(y) = v. The transitive fan-in of a vertex v ∈ V, denoted tfi(v), is the set containing v itself, all primary inputs that can be reached from v, and all gates which are on any path from v to the primary inputs. The transitive fan-in can be constructed using the following recursive definition:

$$\operatorname{tfi}(v) = \begin{cases} \{v\} & \text{if } v \in X, \\ \{v\} \cup \bigcup_{w \in \operatorname{fanin}(v)} \operatorname{tfi}(w) & \text{otherwise.} \end{cases}$$

Example 2: The transitive fan-in of output $y_3$ in the logic network in Fig. 1 is $\{y_3, 1, 2, 4, 5, 9, 13, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}, x_{11}\}$. The driver of $y_3$ is gate 13.
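The fanin and transitive fan-in definitions can be illustrated with a minimal Python sketch. The toy network below is hypothetical (it is not the network of Fig. 1), and the names `FANIN`, `fanin`, and `tfi` are chosen here for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical toy network: fanin lists keyed by vertex name.
# Primary inputs have no entry; each primary output has exactly one driver.
FANIN: Dict[str, List[str]] = {
    "g1": ["x1", "x2"],
    "g2": ["x2", "x3"],
    "g3": ["g1", "g2"],
    "y1": ["g3"],
    "y2": ["g2"],
}


def fanin(v: str) -> List[str]:
    """Ordered source vertices of the arcs entering v (empty for inputs)."""
    return FANIN.get(v, [])


def tfi(v: str) -> Set[str]:
    """Transitive fan-in: v itself plus the tfi of every fanin vertex."""
    cone = {v}
    for w in fanin(v):
        cone |= tfi(w)
    return cone
```

For this toy network, `tfi("y2")` contains the output itself, its driver `g2`, and the inputs `x2` and `x3`.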
We call a network k-feasible if $d^-(g) \le k$ for all g ∈ G. Sometimes k-feasible networks are referred to as k-LUT networks (LUT is a shorthand for lookup-table) and LUT mapping (see, e.g., [19], [20], [21], [22], [23]) refers to a family of algorithms that obtain k-feasible networks, e.g., from homogeneous logic representations such as And-inverter graphs (AIGs, [24]) or Majority-inverter graphs (MIGs, [25]).
Example 3: The logic network in Fig. 1 is 4-feasible.
C. Reversible Logic Networks
A reversible logic network realizes a reversible function, which makes it very different from conventional logic networks. The number of lines, which correspond to logical qubits, remains the same for the whole network, such that reversible networks are cascades of reversible gates and each gate is applied to the current qubit assignment. The most general reversible gate we consider in this paper is a single-target gate. A single-target gate $T_c(\{x_1, \dots, x_k\}, x_{k+1})$ has an ordered set of control lines $\{x_1, \dots, x_k\}$, a target line $x_{k+1}$, and a control function $c : \mathbb{B}^k \to \mathbb{B}$. It realizes the reversible function $f : \mathbb{B}^{k+1} \to \mathbb{B}^{k+1}$ with

$$f : x_i \mapsto x_i \text{ for } i \le k \quad \text{and} \quad f : x_{k+1} \mapsto x_{k+1} \oplus c(x_1, \dots, x_k).$$

It is known that all reversible functions can be realized by cascades of single-target gates [26]. We use the '•' operator for concatenation of gates.
Example 4: Fig. 2(a) shows a reversible circuit that realizes a full adder using two single-target gates, one for each output. Two additional lines, called ancilla and initialized with 0, are added to the network to store the result of the outputs. All inputs are kept as output.
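The full-adder construction of Example 4 can be simulated classically. The following Python sketch assumes a particular line ordering (three inputs followed by two 0-initialized ancillae), which is an illustrative choice and not taken from Fig. 2(a); the helper names are hypothetical.

```python
from typing import Callable, List, Sequence


def apply_single_target(state: List[int], controls: Sequence[int],
                        target: int, c: Callable[..., int]) -> List[int]:
    """Return a copy of state whose target bit is XORed with c(controls)."""
    out = list(state)
    out[target] ^= c(*(state[i] for i in controls))
    return out


def full_adder(a: int, b: int, cin: int) -> List[int]:
    # Lines 0..2 hold the inputs; lines 3 and 4 are 0-initialized ancillae
    # that receive the sum and the carry, respectively (assumed ordering).
    state = [a, b, cin, 0, 0]
    state = apply_single_target(state, (0, 1, 2), 3,
                                lambda x, y, z: x ^ y ^ z)
    state = apply_single_target(state, (0, 1, 2), 4,
                                lambda x, y, z: (x & y) | (x & z) | (y & z))
    return state
```

Because a single-target gate only XORs a value onto its target, applying the same gate twice restores the original state; this property is what later makes uncomputation possible.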
A multiple-controlled Toffoli gate is a single-target gate in which the control function is 1 (tautology) or can be expressed in terms of a single product term. One can always decompose a single-target gate $T_c(\{x_1, \dots, x_k\}, x_{k+1})$ into a cascade of Toffoli gates

$$T_{c_1}(X_1, x_{k+1}) \bullet T_{c_2}(X_2, x_{k+1}) \bullet \cdots \bullet T_{c_l}(X_l, x_{k+1}),$$

where $c = c_1 \oplus c_2 \oplus \cdots \oplus c_l$, each $c_i$ is a product term or 1, and $X_i \subseteq \{x_1, \dots, x_k\}$ is the support of $c_i$. This decomposition of c is also referred to as ESOP decomposition [27], [28], [29]. ESOP minimization algorithms try to reduce l, i.e., the number of product terms in the ESOP expression. Smaller ESOP expressions lead to fewer Toffoli gates in the decomposition.
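To make the ESOP decomposition concrete, the sketch below applies one multiple-controlled Toffoli gate per product term and checks nothing beyond classical bit values. The cube encoding (a dictionary from control line to required value) and the function names are assumptions made for this illustration.

```python
from typing import Dict, List

# A cube maps a control line index to the value it must have:
# 1 for a positive literal, 0 for a negative literal. {} is the constant 1.
Cube = Dict[int, int]


def apply_toffoli(state: List[int], cube: Cube, target: int) -> List[int]:
    """Flip the target bit iff every literal of the product term holds."""
    out = list(state)
    if all(state[i] == v for i, v in cube.items()):
        out[target] ^= 1
    return out


def apply_esop(state: List[int], esop: List[Cube], target: int) -> List[int]:
    """Cascade of Toffoli gates, one per product term of the ESOP."""
    for cube in esop:
        state = apply_toffoli(state, cube, target)
    return state


# Majority of three inputs as an ESOP: x0 x1 XOR x0 x2 XOR x1 x2.
MAJORITY: List[Cube] = [{0: 1, 1: 1}, {0: 1, 2: 1}, {1: 1, 2: 1}]
```

Calling `apply_esop([a, b, c, 0], MAJORITY, 3)` leaves the majority of `a`, `b`, and `c` on line 3, using three Toffoli gates for the three product terms.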
D. Mapping to Quantum Networks
Quantum networks are described in terms of a small library of gates that interact with one or two qubits. One of the most frequently considered libraries is the so-called Clifford+T gate library that consists of the reversible CNOT gate, the Hadamard gate, abbreviated H, as well as the T gate and its inverse T†. Quantum gates on n qubits are represented as $2^n \times 2^n$ unitary matrices. We write T† to mean the conjugate transpose of T, and use the symbol '†' also for other quantum gates. The T gate is sufficiently expensive in most approaches to fault-tolerant quantum computing [6] that it is customary to neglect all other gates when costing a quantum algorithm. For more details on quantum gates we refer the reader to [30].
Fig. 3(a) shows one of the many realizations of the 2-control Toffoli gate, which can be found in [7]. It requires 7 T gates, which is optimum [6], [31]. Several works from the literature describe how to map larger multiple-controlled Toffoli gates into Clifford+T gates (see, e.g., [6], [32], [33], [7]). Fig. 3(b) shows one way to map the 3-control Toffoli gate using a direct method as proposed by Barenco et al. [34]. Given a free ancilla line (that does not need to be initialized to 0), this method maps any multiple-controlled Toffoli gate into a sequence of 2-control Toffoli gates, each of which can then be mapped into the optimum network with T-count 7. However, the number of T gates can be reduced by modifying the Toffoli gates slightly. It can easily be seen that the network in Fig. 3(c) is the same as in Fig. 3(b), since the controlled S† gate cancels the controlled S gate and the V† gate cancels the V gate.
However, the Toffoli gate combined with a controlled S gate can be realized using only 4 T gates [32], and applying the V to the Clifford+T realization cancels another 3 T gates (see Fig. 3(a) and [35], [7]). In general, a k-controlled Toffoli gate can be realized with at most 16(k − 1) T gates. If sufficiently many ancilla lines are available, then 8(k − 1) T gates suffice [34], [7]. Future improvements to the decomposition of multiple-controlled Toffoli gates into Clifford+T networks will have an immediate positive effect on our proposed synthesis method.

Algorithm 1: Overview of the LHRS algorithm.
Input: Logic network N, parameters p_Q, parameters p_T
Output: Clifford+T network R
1 set N ← lut_mapping(N, p_Q);
2 set R ← synthesize_mapping(N, p_Q);
3 set R ← synthesize_gates(R, p_T);
4 return R;
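The T-count figures quoted in this section can be turned into a rough cost estimate for an ESOP-based realization. The sketch below is only a back-of-the-envelope helper: the function names are hypothetical, the flag `many_ancillae` stands in for the ancilla condition under which the 8(k − 1) bound applies, and treating gates with at most one control as free relies on NOT and CNOT being Clifford gates.

```python
from typing import Iterable


def toffoli_t_count_bound(k: int, many_ancillae: bool = False) -> int:
    """Upper bound on the T-count of a Toffoli gate with k controls."""
    if k <= 1:
        return 0  # NOT and CNOT are Clifford gates and need no T gate
    if k == 2:
        return 7  # optimum T-count of the 2-control Toffoli gate
    return (8 if many_ancillae else 16) * (k - 1)


def esop_t_count_bound(cube_sizes: Iterable[int],
                       many_ancillae: bool = False) -> int:
    """Sum of the per-gate bounds, one Toffoli gate per product term."""
    return sum(toffoli_t_count_bound(k, many_ancillae) for k in cube_sizes)
```

For example, an ESOP with product terms of sizes 2, 3, and 4 is bounded by 7 + 32 + 48 = 87 T gates without the ancilla condition.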
III. MOTIVATION AND PROBLEM DEFINITION
A major problem facing quantum computing is the inability of existing hand-crafted approaches to generate networks for scientific operations that require a reasonable number of quantum bits and gates. As an example, the quantum linear systems algorithm requires only 100 (logical) quantum bits to encode a $2^{100} \times 2^{100}$ matrix inversion problem [36], [37], clearly demonstrating the advantage that can be gained by using a quantum computer. However, in prior approaches, the reciprocal step (1/x) that is part of the calculation can require in excess of 500 quantum bits. This means that arithmetic may dominate the number of qubits of that algorithm [38], diminishing the potential improvements of a quantum algorithmic implementation. Similarly, recent quantum chemistry simulation algorithms can provide improved scaling over the best known methods but at the price of requiring the molecular integrals that define the problem to be computed using floating point arithmetic [39]. While floating point addition was studied before [40], [41], networks do not currently exist for more complex floating point operations such as exponential, reciprocal square root, multiplication, and squaring. Without the ability to automatically generate circuits for these operations it will be a difficult task to implement such algorithms on a quantum computer, to estimate their full costs, or to verify that the underlying circuitry is correct.
In this paper we tackle this challenge and address the following problem: Given a conventional combinational logic network that represents a desired target functionality, find a quantum circuit with a reasonable number of qubits and number of T gates. The algorithm should be highly configurable such that instead of a single quantum circuit a whole design space of circuits with several Pareto-optimal solutions can be explored.
Algorithm outline: Alg. 1 illustrates the general outline of the algorithm. The following subsections provide further details. Input to the algorithm is a logic network N and it outputs a Clifford+T quantum network R. In addition to N, two sets of parameters p_Q and p_T are provided that control the detailed behavior of the algorithm. The parameters will be introduced in the following sections and are summarized in Sect. VI-A; for now it is sufficient to emphasize their role. Parameters in p_Q can influence both the number of qubits and T gates in R; however, their main purpose is to control the number of qubits. Parameters in p_T only affect the number of T gates. The first step in Alg. 1 is to derive a LUT mapping from the input logic network. As we will see in Sect. IV, one parameter in p_Q is the LUT size for the mapping, which has the strongest influence on the number of qubits in R. Given the LUT mapping, one can derive a reversible logic network in which each LUT is translated into one or two single-target gates. In the last step, each of the gates is mapped into Clifford+T gates (Sect. V).
It is important to know that most of the runtime is consumed by the last step in Alg. 1, and that after the first two steps the number of qubits for the final Clifford+T network is already known. This allows us to use the algorithm in an incremental way as follows. First, one explores assignments to parameters in p_Q that lead to a desired number of qubits, particularly by evaluating different LUT sizes. This can be done by calling the first two steps of the algorithm with different values for the parameters in p_Q. Afterwards, one can optimize the number of T gates by calling the last step repeatedly while sampling the parameters in p_T.
IV. SYNTHESIZING THE MAPPING
This section describes how a LUT mapping can be translated into a reversible network. This is the second step of Alg. 1. The first step in Alg. 1 applies conventional LUT mapping algorithms and is not further explained in this paper. The interested reader is referred to the literature [19], [20], [21], [22], [23].
A. Mapping k-LUTs into Single-target Gates
Fig. 4 illustrates the general idea of how k-LUT networks are mapped into reversible logic networks composed of single-target gates with control functions with up to k variables. Fig. 4(a) shows a 2-LUT network with 5 inputs $x_1, \dots, x_5$ and 5 LUTs with names 1 to 5. It has two outputs, $y_1$ and $y_2$, whose functions are computed by LUT 3 and LUT 5, respectively.
A straightforward way to translate the LUT network is by using one single-target gate for each LUT in topological order. The target of each single-target gate is a 0-initialized new ancilla line. The reversible circuit in Fig. 4(b) results when applying such a procedure. With these five gates, the outputs $y_1$ and $y_2$ are realized at lines 8 and 10 of the reversible circuit. But after these first five gates, the reversible circuit has garbage outputs on lines 6, 7, and 9, indicated by '-', which compute the functions of the inner LUTs of the network. The circuit must be free of garbage outputs in order to be implemented on a quantum computer. This is because the result of the calculation is entangled with the intermediate results and so they cannot be discarded and recycled without damaging the results they are entangled with [30]. To avoid the garbage outputs, we can uncompute the intermediate results by re-applying the single-target gates for the LUTs in reverse topological order. This disentangles the qubits, reverting them all to constant 0s. Fig. 4(c) shows the complete reversible circuit; the last 3 gates uncompute intermediate results at lines 6, 7, and 9.
But we can do better! Once we have computed the LUT for a primary output that does not fan in to another LUT, we can uncompute LUTs that are not used any longer by other outputs. The uncomputed lines restore a 0 that can be used to store the intermediate results of other LUTs instead of creating a new ancilla. For the running example, as shown in Fig. 4(d), we can first compute output $y_2$ and then uncompute LUTs 4 and 2, as they are not in the logic cone of output $y_1$. The freed ancilla can be used for the single-target gate realizing LUT 3. Compared to the reversible network in Fig. 4(c), this network requires one qubit less while using exactly the same gates.
B. Bounds on the Number of Ancillae
As we have seen in the previous section, the order in which LUTs are traversed in the LUT network and translated into single-target gates affects the number of qubits. Two questions arise: (i) how many ancillae do we need at least and at most, and (ii) what is a good strategy? We will answer the first question, and then discuss the second one.
The order that was used in the previous example, leading to the network in Fig. 4(c), illustrates an upper bound. We can always use one ancilla for each LUT in the LUT network, as stated in the following lemma.
Lemma 1: When realizing a LUT network N = (X ∪ Y ∪ G, A, F ) by a reversible circuit that uses single-target gates for each LUT, one needs at most |G| ancilla lines.
The optimized order in Fig. 4(d) uses the fact that one can uncompute gates in the transitive fan-in cone of an output once the output has been computed.
This observation leads to a lemma providing a lower bound.
[Algorithm 2: Synthesizing a LUT mapping into a reversible network with single-target gates. Of the listing, only lines 9 and 10 ("for x ∈ X do", "add input line with name x to R;") and the visited check "if g ∈ S then return;" are legible.]

Lemma 2: Given a LUT network N = (X ∪ Y ∪ G, A, F), let

$$l = \max_{y \in Y} |\operatorname{tfi}(y) \cap G|$$

be the maximum cone size over all outputs. When realizing the LUT network by a reversible circuit that uses single-target gates for each LUT, we need at least l ancilla lines.

The lower bound inspires the following synthesis strategy that minimizes the number of additional lines. One starts by synthesizing a circuit for the output with the maximum cone. Let us assume that this cone contains l LUTs. These LUTs can be synthesized using l single-target gates. Note that all of these are in fact needed, because in order to uncompute a gate, the intermediate values of its children need to be available. From these l gates, l − 1 gates can be uncomputed (all except the LUT computing the output), which restores l − 1 lines that hold a constant 0 value. We can easily see that the exact number of required lines may be a bit larger, since all output values need to be kept. Note that this strategy uncomputes all LUTs in the transitive fan-in cone of an output, even if a LUT is part of the fan-in cone of another output. Therefore, some LUTs will lead to more than two single-target gates in the reversible network.
For a good tradeoff between the number of qubits and T-count, one is interested in the minimum number of qubits such that each LUT is translated into at most two single-target gates in the reversible network. Finding the minimum number of ancillae under such constraints relates to playing the reversible pebble game [42] in minimum time using a minimum number of pebbles. More details can be found in [43], [44], [45], [46].
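Both bounds are easy to evaluate for a given LUT network. The sketch below does so for a hypothetical 2-LUT network with five LUTs and two outputs; its connectivity is invented for illustration and is not claimed to match Fig. 4(a).

```python
from typing import Dict, Set, Tuple

# Hypothetical LUT network: LUT name -> fanin (strings are primary inputs).
FANIN: Dict[int, tuple] = {
    1: ("x1", "x2"),
    2: ("x3", "x4"),
    3: (1, "x3"),
    4: (2, "x5"),
    5: (1, 4),
}
DRIVER: Dict[str, int] = {"y1": 3, "y2": 5}  # output -> driving LUT


def lut_cone(g: int) -> Set[int]:
    """LUTs in the transitive fan-in of LUT g, including g itself."""
    cone = {g}
    for child in FANIN[g]:
        if child in FANIN:  # skip primary inputs
            cone |= lut_cone(child)
    return cone


def ancilla_bounds() -> Tuple[int, int]:
    """(lower bound: largest output cone, upper bound: number of LUTs)."""
    lower = max(len(lut_cone(g)) for g in DRIVER.values())
    upper = len(FANIN)
    return lower, upper
```

For this network the cone of `y2` contains LUTs 1, 2, 4, and 5, so the bounds are 4 and 5 ancilla lines.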
C. Synthesizing a LUT Network
Alg. 2 describes in detail how a k-LUT network N = (X ∪ Y ∪ G, A, F) is mapped into a reversible network R that consists of single-target gates with at most k controls. The main entry point is the function 'synthesize_mapping' (line 1). This function keeps track of the current number of lines l, available ancillae in a stack C, a LUT-to-line mapping m : G → ℕ that stores which LUT gates are computed on which lines in R, and a visited list S (lines 3-6). The reference counter r(g) records for each LUT g how often it is required as input to other LUTs. It is initialized with the fan-out size; for driving LUTs, stored in D (line 7), it is decreased by 1. The counter allows us to check whether g is no longer needed, such that it can be uncomputed (line 8). This is the case whenever the reference counter is 0, and therefore the process of uncomputing is triggered by driving LUTs.
Input lines are added to R in lines 9-13. Input vertices are mapped to their line in R using m. In lines 14-22, single-target gates to compute and uncompute LUTs are added to R. Each gate g is visited in topological order (details on 'topo_order' follow later). First, a 0-initialized line t is requested (line 15). Either there is one in C or we get a new line by incrementing l. Given t, a single-target gate with control function F(g), controls

$$m(\operatorname{fanin}(g)) = \{m(g') \mid g' \in \operatorname{fanin}(g)\},$$

and target line t is added to R (line 16). The LUT-to-line map is updated according to the newly added gate (line 17). Then, if g is driving an output, i.e., r(g) = 0 (line 18), we try to uncompute the children recursively by calling uncompute_children (line 20). In that function, first the reference counter is decremented for each child g' that is not a primary input (line 35). Then, each child g' that afterwards has a reference count of 0 is uncomputed using uncompute_gate (line 37).
In there, first it is checked whether the child has already been visited (line 40) [...] [23], [24], [25]. This procedure is simplified: two or more outputs may share the same driving LUT. In this case, one needs additional lines and copies the output result using a CNOT gate.

[Figure panel (b): LUT-based mapping, with branches labeled "enough ancilla" and "not enough ancilla".]
With a given topological order of LUTs, the time complexity of Alg. 2 is linear in the number of LUTs. As seen in the beginning of this section, the order in which LUTs are visited has an effect on the number of qubits. Therefore, there can be several strategies to compute a topological order on the gates. This is handled by the function 'topo_order' that is configured by a parameter in p Q . Besides the default topological order implied by the implementation of N (referred to as topo_def in the following), we also implemented the order topo_tfi_sort, which is inspired by Lemma 2: compute the transitive fanin cone for each primary output and order them by size in descending order. The topological order is obtained using a depth-first search for each cone by not including duplicates when traversing a cone.
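The effect of the visiting order can be reproduced with a simplified Python sketch of reference-counted uncomputation. This is not a transcription of Alg. 2: it omits the visited list and shared output drivers, it releases the children of every driving LUT right after computing it, and the network and LUT functions are hypothetical.

```python
from itertools import product

# Hypothetical 2-LUT network (illustrative only, not the one in Fig. 4).
INPUTS = ["x1", "x2", "x3", "x4", "x5"]
FANIN = {1: ("x1", "x2"), 2: ("x3", "x4"), 3: (1, "x3"),
         4: (2, "x5"), 5: (1, 4)}
FUNC = {1: lambda a, b: a & b, 2: lambda a, b: a ^ b,
        3: lambda a, b: a | b, 4: lambda a, b: a & b,
        5: lambda a, b: a ^ b}
DRIVERS = {3, 5}  # LUTs that drive primary outputs


def synthesize_mapping(order):
    """Return (gates, number of lines, output lines) for a LUT order."""
    refs = {g: 0 for g in FANIN}  # number of LUTs that still need g
    for children in FANIN.values():
        for c in children:
            if c in refs:
                refs[c] += 1
    line = {x: i for i, x in enumerate(INPUTS)}
    num_lines = len(INPUTS)
    free = []   # stack of ancillae that have been restored to 0
    gates = []  # (lut, control lines, target line)

    def add_gate(g):
        gates.append((g, tuple(line[c] for c in FANIN[g]), line[g]))

    def release_children(g):
        for c in FANIN[g]:
            if c in refs:
                refs[c] -= 1
                if refs[c] == 0 and c not in DRIVERS:
                    add_gate(c)  # re-applying the gate restores a 0
                    free.append(line[c])
                    release_children(c)

    for g in order:
        if free:
            line[g] = free.pop()
        else:
            line[g] = num_lines
            num_lines += 1
        add_gate(g)
        if g in DRIVERS:  # output values are never uncomputed
            release_children(g)
    return gates, num_lines, {g: line[g] for g in DRIVERS}


def simulate(gates, num_lines, bits):
    """Classically evaluate the gate cascade on the given input bits."""
    state = list(bits) + [0] * (num_lines - len(bits))
    for g, controls, target in gates:
        state[target] ^= FUNC[g](*(state[c] for c in controls))
    return state


def garbage_free(order):
    """True if all non-output ancillae end at 0 for every input."""
    gates, n, outs = synthesize_mapping(order)
    return all(
        all(s == 0 for i, s in enumerate(simulate(gates, n, bits))
            if i >= len(INPUTS) and i not in outs.values())
        for bits in product((0, 1), repeat=len(INPUTS)))
```

On this toy network, the order `[1, 2, 3, 4, 5]` yields 10 lines while `[1, 2, 4, 5, 3]` yields 9 lines with the same 8 gates, mirroring the one-qubit saving discussed for Fig. 4(d).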
D. The Role of the LUT Size
As can be seen from previous discussions, the number of additional lines roughly corresponds to the number of LUTs. Hence, we are interested in logic synthesis algorithms that minimize the number of LUTs. In classical logic synthesis the number of LUT inputs k needs to be selected according to some target architecture. For example, in FPGA mapping its value is typically 6 or 7. But for our algorithm, we can use k as a parameter that trades off the number of qubits against the number of T gates: If k is small, one needs many LUTs to realize the function, but the small number of inputs also limits the number of control lines when mapping the single-target gates into multiple-controlled Toffoli gates. In contrast, when k is large, one needs fewer LUTs but the resulting Toffoli gates are larger and therefore require more T gates. Further, since the LUT functions become more complex for larger k, the runtime to map a single-target gate into multiple-controlled Toffoli gates increases.
To illustrate the influence of the LUT size we performed the following experiment, illustrated in Fig. 5(a). For four benchmarks and for LUT sizes k from 3 to 32, we computed a LUT mapping using ABC's [18] command 'if -K k -a'. The resulting network was used to compute both the upper and lower bound on the number of additional lines according to Lemmas 1 and 2, and to compute the actual number of lines according to Alg. 2 with 'topo_def' as topological order. It can be noted that the actual number often matches either the upper bound or the lower bound. In some cases the bounds are very close to each other, leaving not much flexibility to improve on the number of additional lines. Further, for larger LUT sizes, increasing the LUT size yields diminishing gains in the number of lines. It should be pointed out that for benchmark 'Log2' an optimum number of additional lines can be achieved for k = 32, because in this case k matches the number of inputs of the function. Consequently, the LUT mapping has as many gates as the number of outputs.
V. MAPPING SINGLE-TARGET GATES
For the following discussion it is important to understand the representation of the logic network that is given as input to Alg. 1 and the k-LUT network that results from the first step. The input network is given as a gate-level logic network, i.e., all gates are simple logic gates. In our experimental evaluation and current implementation the logic network is given as an AIG, i.e., a logic network composed of AND gates and inverters. The LUT network is represented by annotating in the gate-level netlist (i) which nodes are LUT outputs and (ii) which nodes are LUT inputs for each LUT. As a result, the function of a LUT is implicitly represented as a subnetwork in the gate-level logic network.
A. Direct Mapping
The idea of direct mapping is to represent the LUT function as an ESOP expression, optimize it, and then translate the optimized ESOP into a Clifford+T network. Fig. 6(a) illustrates the complete direct mapping flow. As described above, a LUT function is represented in terms of a multi-level AIG. In order to obtain a 2-level ESOP expression for the LUT function, one needs to collapse the network. This process is called cover extraction and two techniques called AIG extract and BDD extract will be described in this section. The number of product terms in the resulting ESOP expression is typically far from optimal and is reduced using ESOP minimization. The optimized ESOP expression is first translated into a reversible network with multiple-controlled Toffoli gates as described in Section II-D and then each multiple-controlled Toffoli gate is mapped into a Clifford+T network with the mapping described in [7].
ESOP cover extraction: There are several ways to extract an ESOP expression from an AIG that represents the same function. Our implementation uses two different methods. The choice of the method has an influence on the initial ESOP expression and therefore affects both the runtime of the algorithm and the number of T gates in the final network.
The method AIG extract computes an ESOP for each node in the AIG in topological order. The final ESOP can then be read from the output node. First, each primary input $x_i$ is assigned the ESOP expression $x_i$. The ESOP expression of an AND gate is computed by conjoining the ESOP expressions of its two children, taking possible complementation into account. Therefore, the number of product terms for the AND gate can be as large as the product of the numbers of terms of the children. The final ESOP can be preoptimized by removing terms that occur twice, e.g., $x_1 \bar{x}_2 x_3 \oplus x_1 \bar{x}_2 x_3 = 0$, or by merging terms that differ in a single literal [29]. AIG extract is implemented similarly to the command '&esop' in ABC [18]. We were able to increase the performance of our implementation by limiting the number of inputs to 32, which is sufficient in our application, and by using cube hashing [47].
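The node-by-node computation can be illustrated with a minimal Python sketch. It is not the implementation used in LHRS or in ABC: the cube encoding as a pair of bit masks, the literal numbering (2 * node + complement flag), and all function names are assumptions made for this illustration. Duplicate product terms cancel automatically because the ESOP is stored as a set that is updated by symmetric difference.

```python
from itertools import product

# A cube is a pair (pos, neg) of bit masks: bit i of pos means literal x_i,
# bit i of neg means literal NOT x_i. An ESOP is a set of cubes.
ONE = (0, 0)  # the empty cube, i.e. the constant 1


def esop_and(a, b):
    """Conjoin two ESOPs; equal product terms cancel because t XOR t = 0."""
    result = set()
    for (p1, n1), (p2, n2) in product(a, b):
        pos, neg = p1 | p2, n1 | n2
        if pos & neg:           # contains x_i and NOT x_i: the product is 0
            continue
        result ^= {(pos, neg)}  # toggle the cube (XOR semantics)
    return result


def aig_extract(num_inputs, and_gates, output_literal):
    """Nodes 0..num_inputs-1 are inputs, AND gates follow in topological order.
    A literal is 2 * node + c, where c = 1 marks a complemented edge
    (hypothetical encoding chosen for this sketch)."""
    esops = [{(1 << i, 0)} for i in range(num_inputs)]

    def esop_of(literal):
        node, complemented = literal >> 1, literal & 1
        if not complemented:
            return esops[node]
        if node < num_inputs:        # complemented input: single cube NOT x_i
            return {(0, 1 << node)}
        return esops[node] ^ {ONE}   # NOT f = 1 XOR f

    for lit_a, lit_b in and_gates:
        esops.append(esop_and(esop_of(lit_a), esop_of(lit_b)))
    return esop_of(output_literal)


def evaluate(esop, x):
    """Evaluate the ESOP on the input assignment given as bit mask x."""
    return sum((x & pos) == pos and (x & neg) == 0 for pos, neg in esop) % 2


# XOR of x_0 and x_1 as an AIG:
# NOT(NOT(x_0 AND NOT x_1) AND NOT(NOT x_0 AND x_1))
xor_esop = aig_extract(2, [(0, 3), (1, 2), (5, 7)], 9)
```

For this small AIG the extracted cover consists of the two cubes for x_0 AND NOT x_1 and NOT x_0 AND x_1; the constant-1 cubes introduced by the complemented edges cancel along the way.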
The method BDD extract first expresses the LUT function in terms of a BDD, again by translating each node into a BDD in topological order. From the BDD a Pseudo-Kronecker expression [48], [49] is extracted using the algorithm presented in [50]. A Pseudo-Kronecker expression is a special case of an ESOP expression. For the extracted expression, it can be shown that it is minimum in the number of product terms with respect to a chosen variable order. Therefore, it provides a good starting point for ESOP minimization. In our experiments, AIG extract was always superior and BDD extract did not show any advantage. BDD extract may be advantageous for larger LUT sizes.
ESOP minimization: In our implementation we use exorcism [29] to minimize the number of terms in the ESOP expression. Exorcism is a heuristic minimization algorithm that applies local rewriting of product terms using the EXORLINK operation [51]. In order to introduce this operation, we need to define the notion of distance. In each product term, a variable can appear as a positive literal, as a negative literal, or not at all. The distance of two product terms is the number of variables whose appearance differs between the two terms. For example, the two product terms $x_1 x_3$ and $x_1 \bar{x}_2 x_3$ have distance 1, since $x_2$ does not appear in the first product term and appears as a negative literal in the second one. It can be shown that two product terms with distance k can be rewritten as an equivalent ESOP expression with k product terms in k! different ways. The EXORLINK-k operation is a procedure to enumerate all k! replacements for a product term pair with distance k. Applying the EXORLINK operation to product term pairs with a distance of less than 2 immediately leads to a reduction of the number of product terms in an ESOP expression. In fact, as described above, such checks are already applied when creating the initial cover. Applying the EXORLINK-2 operation does not increase the number of product terms but can decrease it, if product terms in the replacement can be combined with other terms. The same applies for distances larger than 2, but there the operation can also increase the number of product terms. This can sometimes be helpful to escape local minima. Exorcism implements a default minimization heuristic, referred to as def in the following, that applies different combinations and sequences of EXORLINK-k operations for 2 ≤ k ≤ 4. We have modified the heuristic by simply omitting the EXORLINK-4 operations; the modified heuristic is referred to as def_wo4 in the following.
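The distance definition and the distance-1 case can be made concrete with a short Python sketch. It is only an illustration and not exorcism's code: cubes are encoded as hypothetical pairs (pos, neg) of bit masks, with bit i-1 standing for variable $x_i$. For a pair with distance 1, the differing variable takes the one appearance that occurs in neither term, since $x \oplus \bar{x} = 1$, $x \oplus 1 = \bar{x}$, and $\bar{x} \oplus 1 = x$.

```python
def popcount(mask):
    """Number of set bits in a non-negative integer."""
    return bin(mask).count("1")


def distance(cube_a, cube_b):
    """Number of variables whose appearance (positive, negative, absent)
    differs between the two product terms."""
    (p1, n1), (p2, n2) = cube_a, cube_b
    return popcount((p1 ^ p2) | (n1 ^ n2))


def merge_distance_one(cube_a, cube_b):
    """Replace two product terms of distance 1 by one equivalent term."""
    (p1, n1), (p2, n2) = cube_a, cube_b
    diff = (p1 ^ p2) | (n1 ^ n2)
    if popcount(diff) != 1:
        raise ValueError("product terms do not have distance 1")
    # Keep all agreeing variables; the differing variable gets the third
    # appearance, i.e. the one used by neither of the two terms.
    pos = (p1 & ~diff) | (diff & ~(p1 | p2))
    neg = (n1 & ~diff) | (diff & ~(n1 | n2))
    return pos, neg


# x1 x3 and x1 (NOT x2) x3, with bit i-1 representing variable x_i
term_a = (0b101, 0b000)
term_b = (0b101, 0b010)
merged = merge_distance_one(term_a, term_b)  # x1 x2 x3
```

Here the two terms from the example in the text have distance 1 and merge into $x_1 x_2 x_3$, because $x_1 x_3 (1 \oplus \bar{x}_2) = x_1 x_2 x_3$.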
B. LUT-based Mapping
This section describes a mapping technique that exploits two observations: (i) when mapping a single-target gate there may be additional lines available with a constant 0 value; and (ii) for single-target gates with few control lines (e.g., up to 4) one can precompute near-optimal Clifford+T networks and store them in a database.
Available 0-ancilla resources: Fig. 7 shows the same reversible network as in Fig. 4(d) that is obtained from the initial k-LUT mapping. However, additional lines that are 0-valued are drawn thicker. We can see that for the realization of the first single-target gate, there are three 0-valued lines, but for the last single-target gate there is only one 0-valued line. This information can easily be obtained from the reversible network resulting from Alg. 2.
The following holds in general. Let $g = T_c(X, t)$ be a single-target gate with $|X| = k$ and let there be $l$ available 0-valued lines, besides a 0-valued line for target line $t$. If $c$ can be realized as a 4-LUT network with at most $l + 1$ LUTs, then we can realize it as a reversible network with $2l - 1$ single-target gates that realizes function $c$ on target line $t$ such that each single-target gate has at most 4 control lines. For this number of control lines, it is possible to precompute optimum or near-optimal Clifford+T networks for each of the single-target gates using exact or heuristic optimization methods for Clifford+T gates (see, e.g., [6], [52], [53], [54]), and store them in a database. Mapping single-target gates resulting from the LUT-based mapping technique described in this section is therefore very efficient. Moreover, the number of functions that need to be stored can be reduced significantly when using affine function classification.
Next, we review affine function classification and show that two optimum Clifford+T networks for two single-target gates with affine equivalent functions have the same T-count. For a Boolean function $f(x_1, \dots, x_n)$, let us use the notation $f(\vec{x})$, where $\vec{x} = (x_1, \dots, x_n)^T$. We say that two functions $f$ and $g$ are affine equivalent [55], if there exists an $n \times n$ invertible matrix $A \in GL_n(\mathbb{B})$ and a vector $\vec{b} \in \mathbb{B}^n$ such that

$$g(\vec{x}) = f(A\vec{x} + \vec{b}) \quad \text{for all } \vec{x} \in \mathbb{B}^n. \tag{5}$$
We say that $f$ and $g$ are affine equivalent under negation [56], if either (5) holds or $g(\vec{x}) = \bar{f}(A\vec{x} + \vec{b})$ for all $\vec{x}$. For the sake of brevity, we say that $f$ is AN-equivalent to $g$ in the remainder. AN-equivalence is an equivalence relation and allows us to partition the set of $2^{2^n}$ Boolean functions over n variables into a much smaller number of classes. For n = 1, 2, 3, and 4, there are only 2, 3, 6, and 18 classes of AN-equivalent functions (see, e.g., [56], [55], [57]). All 4 294 967 296 5-input Boolean functions fall into only 206 classes of AN-equivalent functions. The database solution proposed in this mapping can therefore scale to 5 variables, given a fast way to classify functions. Before we discuss classification algorithms, the following lemma shows that AN-equivalent functions have the same T-count.
Lemma 3: Let f and g be two n-variable AN-equivalent functions. Then the T -count in Clifford+T networks realizing T f and T g is the same.
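Proof: Since $f$ and $g$ are AN-equivalent, there exist $A \in GL_n(\mathbb{B})$, $\vec{b} \in \mathbb{B}^n$, and $p \in \mathbb{B}$ such that $g(\vec{x}) = p \oplus f(A\vec{x} + \vec{b})$ for all $\vec{x}$. It is possible to create a reversible circuit that takes $\vec{x} \mapsto A\vec{x} + \vec{b}$ using only CNOT gates for $A$ and NOT gates for $\vec{b}$ (see, e.g., [58]). The output can be inverted using a NOT gate. In Fig. 8, $U_A$ realizes the linear transformation using CNOT gates, $U_b$ realizes the input inversions using NOT gates, and $X^p$ represents a NOT gate if p = 1, and the identity otherwise. Since CNOT and NOT gates are Clifford gates, they do not contribute to the T-count. Hence $T_g$ can be realized with the same number of T gates as $T_f$, and because AN-equivalence is symmetric, the same argument applies with the roles of $f$ and $g$ exchanged.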
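The claim that $\vec{x} \mapsto A\vec{x} + \vec{b}$ needs only CNOT and NOT gates can be checked with a small Python sketch. It uses plain Gauss-Jordan elimination over GF(2) and is not the synthesis method of [58]; the row encoding as bit masks and all function names are assumptions made for this illustration. Each row addition performed while reducing $A$ to the identity corresponds to one CNOT gate, and the gates are applied in reverse order of the reduction.

```python
def synthesize_cnots(rows):
    """Return CNOT gates (control, target) that map line values x to A x.
    rows[i] is a bit mask; bit j is set iff A[i][j] = 1."""
    n = len(rows)
    rows = list(rows)
    ops = []
    for col in range(n):
        if not (rows[col] >> col) & 1:
            # Pivot is 0: add a lower row with a 1 in this column.
            for r in range(col + 1, n):
                if (rows[r] >> col) & 1:
                    rows[col] ^= rows[r]
                    ops.append((r, col))
                    break
            else:
                raise ValueError("matrix is not invertible")
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]      # row r += row col
                ops.append((col, r))
    # The row operations reduce A to the identity; reversing them builds A.
    return ops[::-1]


def apply_affine_circuit(cnots, b, x):
    """Simulate the CNOT gates followed by NOT gates on the lines in b."""
    for control, target in cnots:
        if (x >> control) & 1:
            x ^= 1 << target
    return x ^ b


def affine_reference(rows, b, x):
    """Directly compute A x + b over GF(2)."""
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y ^ b


A = [0b110, 0b011, 0b100]   # line 0: x1 + x2, line 1: x0 + x1, line 2: x2
b = 0b101
cnots = synthesize_cnots(A)
```

For this matrix the sketch produces four CNOT gates. Elimination-based synthesis is not gate-count optimal in general, but for the argument in the proof only the fact that no T gates are needed matters.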
In order to make use of the algorithm we need to compute an optimum or near-optimal circuit for one representative in each equivalence class for up to 4 variables. To match an arbitrary single-target gate with a control function of up to 4 variables in the database, one needs to derive its representative. Algorithms as presented in [59] can be used for this purpose. Fig. 9 lists all the AN-equivalent classes for 2-4 variables; the class containing the constant functions is omitted. It shows how often each class occurs as a 4-LUT function across all our experimental evaluations. It also shows the number of T gates in the best-known Clifford+T networks of the corresponding single-target gate. The classes # 1, # 01, and # 0001 occur most frequently. These classes contain, among others, $x_1 x_2$, $x_1 x_2 x_3$, and $x_1 x_2 x_3 x_4$, and are therefore the control functions of the multiple-controlled Toffoli gates. This overview indicates which classes would benefit most from further optimization of their T-count. Reversible and Clifford+T networks of the best-known realizations can be found at quantumlib.stationq.com.
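The class counts quoted above for small n can be reproduced by brute force. The following Python sketch is not one of the classification algorithms of [59]; it simply explores the orbit of every truth table under a set of generators of the AN-equivalence transformations (input negation $x_i \leftarrow x_i \oplus 1$, the elementary linear operation $x_i \leftarrow x_i \oplus x_j$, and output negation). The truth-table encoding as an integer and the function name are assumptions made for this illustration.

```python
def count_an_classes(n):
    """Count AN-equivalence classes of n-variable Boolean functions.
    A function is a truth table stored as an integer with 2**n bits."""
    size = 1 << n
    full = (1 << size) - 1

    # Generators acting on the inputs, stored as permutations of the
    # assignments x; each one is its own inverse.
    perms = []
    for i in range(n):
        perms.append([x ^ (1 << i) for x in range(size)])  # x_i <- x_i + 1
        for j in range(n):
            if i != j:                                     # x_i <- x_i + x_j
                perms.append([x ^ (((x >> j) & 1) << i) for x in range(size)])

    def neighbors(tt):
        yield tt ^ full                                    # output negation
        for perm in perms:
            g = 0
            for x in range(size):
                if (tt >> perm[x]) & 1:                    # g(x) = f(perm(x))
                    g |= 1 << x
            yield g

    seen = set()
    classes = 0
    for tt in range(1 << size):
        if tt in seen:
            continue
        classes += 1                  # tt starts a new equivalence class
        seen.add(tt)
        stack = [tt]
        while stack:                  # explore the whole orbit of tt
            current = stack.pop()
            for other in neighbors(current):
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return classes


class_counts = [count_an_classes(n) for n in (1, 2, 3)]
```

The sketch is restricted to n ≤ 3 here because pure-Python enumeration becomes slow beyond that; for n = 4 the same procedure visits all 65 536 truth tables, and a database lookup in practice would instead compute a canonical representative per function.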
C. Hybrid Mapping
LUT-based mapping cannot be applied if the number of available 0-valued lines is insufficient. As fallback, the 4-LUT network is omitted and direct mapping is applied on the k-LUT network, therefore not using any of the 0-valued lines at all. The idea of hybrid synthesis is to merge 4-LUTs into larger LUTs. By merging 2 LUTs in the network, the number of LUTs is decreased by 1 and therefore one fewer 0-valued line is required. Our algorithm for hybrid synthesis merges the output LUT with one of its children, thereby generating a larger output LUT. This procedure is repeated until the LUT network is small enough to match the number of 0-valued lines. The topmost LUT is then mapped using direct mapping, while the remaining LUTs are translated using the LUT-mapping technique.
[Table I excerpt: a Boolean parameter, default false, enables post-optimization of 4-LUT networks in the hybrid mapping using the SAT-based technique described in [63].]
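The merging loop can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the RevKit implementation: the LUT network is a hypothetical dictionary from LUT names to fanin lists, only children that feed no other LUT are merged, and the child that yields the smallest merged output LUT is chosen. The text does not specify which child is selected, so that rule is an assumption of this sketch.

```python
def hybrid_merge(luts, output, zero_lines):
    """Merge the output LUT with children until at most zero_lines + 1 LUTs
    remain. luts maps a LUT name to its fanins; names that are not keys of
    luts are primary inputs."""
    luts = {name: list(fanins) for name, fanins in luts.items()}
    while len(luts) > zero_lines + 1:
        candidates = []
        for child in luts[output]:
            if child not in luts:
                continue                      # primary input, nothing to merge
            if any(child in fanins
                   for name, fanins in luts.items() if name != output):
                continue                      # child also feeds another LUT
            merged = [f for f in luts[output] if f != child]
            merged += [f for f in luts[child] if f not in merged]
            candidates.append((len(merged), child, merged))
        if not candidates:
            raise ValueError("no child can be merged into the output LUT")
        # Assumption: prefer the merge that gives the smallest output LUT.
        _, child, merged = min(candidates)
        luts[output] = merged
        del luts[child]
    return luts


network = {
    "a": ["x1", "x2", "x3", "x4"],
    "b": ["x5", "x6", "x7", "x8"],
    "out": ["a", "b", "x9"],
}
one_line = hybrid_merge(network, "out", zero_lines=1)
no_line = hybrid_merge(network, "out", zero_lines=0)
```

With one available 0-valued line, the three 4-LUTs are reduced to two LUTs and the output LUT grows to 6 inputs; with no available line, a single 9-input LUT remains, which corresponds to the direct-mapping fallback.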
VI. EXPERIMENTAL EVALUATION
We have implemented LHRS as command 'lhrs' in the open source reversible logic synthesis framework RevKit [64] (the source code can be found at github.com/msoeken/cirkit). Table I gives an overview of all parameters that can be given as input to LHRS. The parameters are split into two groups. Parameters in the upper half mainly influence the number of lines and are used to synthesize the initial reversible network that is composed of single-target gates (Sect. IV). Parameters in the lower half only influence the number of T gates by changing how single-target gates are mapped into Clifford+T networks. Each parameter is shown with possible values and some explanatory text.
A. LHRS Configuration

B. Benchmarks
We used both academic and industrial benchmarks for our evaluation. As academic benchmarks we used the 10 arithmetic instances of the EPFL combinational logic synthesis benchmarks [65] , which are commonly used to evaluate logic synthesis algorithms. In order to investigate the influence of the initial logic representation, we used different realizations of the benchmarks: (i) the original benchmark description in terms of an AIG, called Original, and (ii) the best known 6-LUT network wrt. the number of lines, called Best-LUT. Further statistics about the benchmarks are given in Table II. All experimental results and synthesis outcomes for the academic benchmarks can be viewed and downloaded from quantumlib.stationq.com.
As commercial benchmarks we used Verilog netlists of several arithmetic floating point designs in half (16-bit), single (32-bit), and double (64-bit) precision. For synthesis all Verilog files were translated into AIGs and optimized for size using ABC's 'resyn2' script.
C. Experiments for EPFL Benchmarks
We evaluated LHRS for both realizations (Original and Best-LUT) of all 10 arithmetic benchmarks with LUT sizes of 6, 10, and 16. Table III lists all results. We chose LUT size 6 because it is a typical choice for FPGA mapping, and therefore we expect that LUT mapping algorithms perform well for this size. We chose 16 since we noticed in our experiments that it is the largest LUT size for which LHRS performs reasonably fast for most of the benchmarks. LUT size 10 has been chosen since it is roughly in between the other two. These configurations allow us to synthesize 6 different initial single-target gate networks for each benchmark, therefore spanning 6 optimization points in a Pareto set. For each of these configurations, we chose 4 different configurations of parameters in p T by varying the mapping method and the ESOP heuristic ({direct, hybrid} × {def, def_wo4}).
The experiments confirm the observation of Sect. IV-D: A larger LUT size leads to a smaller number of qubits. In some cases, e.g., Log2, the reduction is quite significant. A larger LUT size also leads to a larger T-count, and this increase can again be considerable, e.g., for Log2.
Slightly changing the ESOP minimization heuristic (note that def and def_wo4 are very similar) has no strong influence on the number of T gates. There are examples both for cases in which the T-count increases and for cases in which it decreases. However, the runtime can be significantly reduced. The choice of the mapping method has a stronger influence. For example, in case of the Adder, the hybrid mapping method is far superior to the direct method. In many cases, the hybrid method is advantageous only for a LUT size of 16, but not for the smaller ones. Also, the initial representation of the benchmark has a considerable influence. In many cases, the Best-LUT realizations of the benchmarks require fewer qubits, which, as expected, results in a higher T-count. The effect is particularly noticeable for the Divisor and Square-root. In some cases, better results in both qubits and T-count can be achieved, e.g., for Log2 as Best-LUT with a LUT size of 16, and for the Barrel shifter as Original with a LUT size of 16.
D. Experiments for Floating point Library
We reevaluated the LHRS algorithm in its improved implementation and new parameters on the commercial floating point designs, which were already used for the evaluation in [1] . The numbers are listed in Table IV for LUT sizes 6, 10, and 16, mapping method hybrid, and ESOP heuristic def_wo4. Due to the changes and improvements in the implementation and different parameters, the numbers vary. In the vast majority of the cases the numbers improve for qubits, T -count, and runtime. Consequently, we did not repeat the comparison to the hierarchical reversible logic synthesis algorithm presented in [9] , as the previous numbers already have shown an improvement.
Note that for all benchmarks with 16 inputs (exp-16, invsqrt-16, ln-16, log-16, recip-16, sincos-16, square-16, and sqrt-16), a LUT size 16 leads to qubit-optimum quantum networks, since every output can be represented by a single LUT. Note that LHRS is not aware of this situation, and will realize every output separately. A better runtime, and potentially a better T -count, can be achieved by generating the ESOP cover and optimizing it for all outputs at once. 
E. Compositional Functions
The synthesis results of the floating point benchmarks can be used to cost quantum algorithms. This alone provides a useful tool to quantum algorithm designers. However, we show below that using these results to synthesize a composed function can be sub-optimal. This is significant because several quantum algorithms require compositions of several arithmetic functions [36], [39], [66], [67], [68], [69]. We show that by using automatic synthesis with LHRS better quantum networks can be found relative to naïvely summing the costs of the constituent functions in Table IV. To demonstrate this effect, we use a 32-bit implementation of the Gaussian

$$f(x) = a e^{-\frac{(x-b)^2}{2c^2}} \tag{6}$$

where besides x also a, b, and c are 32-bit inputs to the quantum circuit. We focus on this function because of its importance to quantum chemistry simulation algorithms, wherein the best known simulation methods for solving electronic structure problems within a Gaussian basis need to be able to reversibly compute (6) [39]. The overheads from compiling Gaussians using conventional techniques have, in particular, rendered these simulation methods impractical. This makes improved synthesis methods critical for such applications. We can implement the Gaussian by combining the 32-bit versions of the floating point components (synthesized using k = 16) in Table IV. This leads to a quantum network as shown in Fig. 10(a). Note that we use the multiplication component with a constant input to realize the denominator in the argument of the exponent. Also, we make sure to uncompute all helper lines. This leads to a quantum network with 6,355 qubits and 8,960,228 T gates.
We also implemented the Gaussian directly in Verilog, optimized the resulting design with logic synthesis (as we did for the individual components in Table IV), and synthesized it using LHRS. To match the quality of the design in Fig. 10(a), we used 'synthesize_mapping' (see Alg. 1) to find the smallest k that leads to a number of qubits smaller than 6,355. In this case, k is 23. With this k, we synthesized the composed Gaussian design using the same parameters p T as used for synthesizing the components. The result is a quantum network with 6,283 qubits and 1,850,001 T gates, which can be synthesized within 11.28 seconds (see also Fig. 10(b), which summarizes all results of this experiment).
The experiment demonstrated that by resynthesizing composed functions, better networks and better cost estimates can be achieved. The approach also easily enables design space exploration. For example, if one is interested in a quantum network with less than 1,000,000 T gates, one can find a realization with 8,124 qubits and 982,417 T gates, after 9.67 seconds by setting k to 9.
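The comparison of the three design points reported in this subsection can be expressed as a simple Pareto filter over (qubits, T-count) pairs. The following Python sketch only restates the numbers given in the text; the function name and the labels of the design points are chosen for this illustration.

```python
def pareto_front(points):
    """Keep the (qubits, t_count) points that no other point dominates,
    i.e. no other point is at least as good in both costs."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


design_points = {
    "components, k = 16": (6355, 8960228),
    "LHRS, k = 23": (6283, 1850001),
    "LHRS, k = 9": (8124, 982417),
}
front = pareto_front(list(design_points.values()))
```

The composition of individually synthesized components is dominated by the LHRS result for k = 23, which needs fewer qubits and fewer T gates, while the two LHRS results are incomparable and represent different qubit/T-count trade-offs.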
It can be seen that LHRS finds quantum circuits with a much better trade-off between qubits and T gates. Further, LHRS allows for a better selection of results by using the LUT size as a parameter. One strong advantage is that in LHRS one can quickly obtain a skeleton for the final circuit in terms of single-target gates that already has the final number of qubits. If this number matches the design constraints, one can start the computationally more challenging task of finding good quantum circuits for each LUT function. Here, several synthesis passes trying different parameter configurations are possible in order to optimize the result. Post-synthesis optimization techniques are also likely to help reduce the number of T gates significantly.
VII. CONCLUSION
We presented LHRS, a LUT-based approach to hierarchical reversible circuit synthesis that outperforms existing state-of-the-art hierarchical methods. It allows for much higher flexibility in addressing the need to trade off qubits against T-count when generating high-quality quantum networks. The benchmarks that we provide give what is at present the most complete list of costs for elementary functions for scientific computing. Apart from simply showing improvements, these benchmarks allow quantum algorithm designers to provide the first complete cost estimates for a host of quantum algorithms. This is an essential step towards the goal of understanding which quantum algorithms will be practical in the first generations of quantum computers.
LHRS can be regarded as a synthesis framework since it consists of several parts that can be optimized separately. As one example, we are currently investigating more advanced mapping strategies that map single-target gates into Clifford+T networks. Also, most of the conventional synthesis approaches that are used in the LHRS flow, e.g., the mapping algorithms to derive the k-LUT network, are not quantum-aware, i.e., they do not explicitly optimize with respect to the quality of the resulting quantum network. We expect further improvements, particularly in the number of T gates, by modifying the synthesis algorithms in that direction.
