Quantum algorithm design usually assumes access to a perfect quantum computer with ideal properties like full connectivity, noise-freedom and arbitrarily long coherence time. In Noisy Intermediate-Scale Quantum (NISQ) devices, however, the number of qubits is highly limited and quantum operation error and qubit coherence are not negligible. Besides, the connectivity of physical qubits in a quantum processing unit (QPU) is also strictly constrained. Thereby, additional operations like SWAP gates have to be inserted to satisfy this constraint while preserving the functionality of the original circuit. This process is known as quantum circuit transformation. Adding additional gates will increase both the size and depth of a quantum circuit and therefore cause further decay of the performance of a quantum circuit. Thus it is crucial to minimize the number of added gates. In this paper, we propose an efficient method to solve this problem. We first choose by using simulated annealing an initial mapping which fits well with the input circuit and then, with the help of a heuristic cost function, stepwise apply the best selected SWAP gates until all quantum gates in the circuit can be executed. Our algorithm runs in time polynomial in all parameters including the size and the qubit number of the input circuit, and the qubit number in the QPU. Its space complexity is quadratic to the number of edges in the QPU. Experimental results on extensive realistic circuits confirm that the proposed method is efficient and can reduce by 57% on average the size of the output circuits when compared with the state-of-the-art algorithm on the most recent IBM quantum device viz. IBM Q20 (Tokyo).
Introduction
In the Noisy Intermediate-Scale Quantum (NISQ) era, it is unrealistic to implement quantum error correction due to the strictly limited number of qubits [20]. This drawback poses a huge challenge to quantum program compilation because noise has a large impact on the final circuits and may often make the results meaningless. Besides, the connectivity of qubits in a NISQ device is also limited: only neighbouring qubits can be coupled, and two-qubit operations can be implemented only between them [24]. As a result, a large number of modifications must be made to adapt a quantum circuit to real quantum devices. This process is termed quantum circuit transformation [6], qubit mapping [13], qubit allocation [21], qubit routing [19] or qubit movement [15] in the literature. We call it quantum circuit transformation in this paper.
Quantum circuit transformation is an essential part for quantum circuit compilation. The main idea behind is to convert an ideal quantum circuit, in which full connectivity among qubits is assumed and noise is ignored, to a quantum circuit respecting constraints imposed by the NISQ devices [6] . Usually this process will bring in a large number of auxiliary gates like SWAP gates and Hadamard gates which will in turn increase both the size and depth of the generated quantum circuit and sometimes make the error of the whole circuit unacceptable [24] . Hence, it is vital for the success of quantum computation to find an automated approach that can efficiently transform any input quantum circuit into one that respects the physical constraints imposed by the NISQ devices with a small overhead in terms of the size, depth or error.
The quantum circuit transformation problem can be reduced to token swapping or template matching in graph theory [14, 2]. Unfortunately, both of these problems are NP-hard [6]. Hence, designing algorithms that solve the quantum circuit transformation problem while trading off running time against the quality of results has attracted much interest in both the quantum computing community and the integrated circuits community [24].
There are currently three major approaches to the quantum circuit transformation problem. The first one is to use heuristic search to construct the output quantum circuit step by step from the original input quantum circuit [13, 19, 25, 21, 8] . Usually, these search algorithms need an initial mapping as the input, and it can be set arbitrarily or via some greedy methods [25, 19, 7] . Recently, a novel reverse traversal technique is proposed in [13] to choose the initial mapping with the consideration of the whole circuit. The second approach is to utilize unitary matrix decomposition algorithms to reconstruct a quantum circuit from scratch while preserving the functionality of the input circuit [16, 12] . The third one is to convert the quantum circuit transformation problem to some existing problems like AI planning and constraint programming and use ready-made tools for these problems to find acceptable results [4, 23] .
In this paper, we follow the first approach. Our main contributions are listed as follows. First, we propose a simulated annealing based algorithm to find a near-optimal initial mapping for the input circuit. Second, we design a flexible heuristic cost function to evaluate the possible operations that may be applied to transform the current circuit. The heuristic function supports weight parameters to reflect the variable influence of gates in different layers. Third, a heuristic search algorithm with a novel selection mechanism is designed, where in each step of the search process, instead of selecting the operation with minimum cost to apply, we look one step ahead and select the operation which has the best consecutive operation to apply. In this way, the algorithm is able to avoid local minima effectively. Fourth, a pruning mechanism is introduced to reduce the size of the search space and ensure the program terminates in reasonable time.
Note that the look-ahead mechanism has already been introduced in the heuristic cost function during the search process in existing works like [25, 13] . However, we adopt in this paper a double look-ahead mechanism: in addition to looking ahead at subsequent layers when defining the cost function, we also look ahead (at grandchild states) in finding the state with minimal cost in order to make the best transformation. Thanks to this novel idea, the proposed algorithm is able to find a better solution with less circuit size within acceptable running time. Experimental results on extensive realistic circuits show that our algorithm is efficient and, when compared with the state-of-the-art algorithms [25, 13] , can reduce on average the size of the output circuits by above 10% on IBM QX5 and above 55% on IBM Q20.
The remainder of this paper is organized as follows: some background knowledge about quantum computation is given in Section 2, and the quantum circuit transformation problem is formally defined in Section 3. Section 4 is devoted to the detailed description of our proposed algorithm. We report experimental results in Section 5 and conclude the paper in Section 6.
Background
In classical computation, information is stored in memory in the form of binary digits, i.e., bits. The quantum counterpart of the bit, called the qubit, has two basis states denoted by |0⟩ and |1⟩, respectively. Different from a classical bit, a qubit |ψ⟩ can be in a linear combination of basis states [17], i.e.,

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \tag{1}$$

where |α|² + |β|² = 1. Information processing or computation is realized by applying quantum gates on qubits. Typical gates which we are concerned with in this paper are the Hadamard gate H, the CNOT gate and the SWAP gate, depicted in Fig. 1. H is a single-qubit gate which can evenly mix the basis states to produce a superposed one. CNOT and SWAP are both two-qubit gates, i.e., they operate on two qubits. A CNOT gate flips the target qubit (indicated graphically with ⊕) if and only if the control qubit (indicated graphically with a black dot •) is in state |1⟩, while a SWAP gate exchanges the states of the two qubits it operates on.
Quantum circuits are the most commonly used model to describe quantum algorithms, which consist of input qubits, quantum gates, measurements and classical registers [22] . However, as far as quantum circuit transformation is concerned, only input qubits and quantum gates are relevant. Thus in this paper, a quantum circuit is simply represented as a pair (Q, C), where Q is the set of involved qubits and C a sequence of quantum gates. For a generic quantum circuit to be executed in a real quantum processing unit (QPU), two more steps have to be taken:
• Compilation process. As only limited quantum operations are available in a QPU, quantum gates in the circuit must be decomposed into elementary gates first [3, 9] . In this paper, we take single-qubit and CNOT gates as elementary gates as they are universal to implement any quantum circuit and supported by, say, IBM QX architectures.
• Transformation process. Qubits in a real QPU are typically laid out in a fixed topology and CNOT gates can only be applied on neighbouring qubits. Such a connectivity topology can be described by an architecture graph or coupling graph [6] which is a directed graph with each node representing a qubit in the QPU. A quantum circuit consisting of only single-qubit and CNOT gates is said to respect the QPU constraint if for every CNOT gate in the circuit, there is a directed edge in the architecture graph from the control qubit to the target qubit. The transformation process is then to convert a quantum circuit (say, those obtained from the above compilation process) into one that respects the QPU constraint so that it can be executed on the QPU.
In this paper, we only focus on the transformation process. The QPU topologies we are concerned with are the IBM QX architectures QX5 and Q20 shown in Fig. 2, but our approach is applicable to any architecture graph, including for example Rigetti 16Q Aspen-4. Notice that edges in IBM Q20 are bidirectional (or, undirected) and thus either node of each edge can be the control qubit of a CNOT gate. Depicted in Fig. 3 are several gate transformation rules which are quite useful in gate decomposition and circuit transformation. The top equivalence shows that we can exchange the control and target qubits of a CNOT by adding two Hadamard gates before and after it, while the bottom ones show different ways of implementing a SWAP gate in QX structures.
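The gate identities described above can be checked numerically. The following is a minimal sketch using NumPy; the matrix names (`CNOT_01`, `swap_one_way`, and so on) are illustrative choices and not taken from the paper, and the 7-gate SWAP is written in the form the text suggests (three CNOTs, with the middle one reversed by Hadamard gates), which is assumed to correspond to one of the decompositions in Fig. 3.

```python
import numpy as np

# Basis order is |q0 q1> with q0 the most significant bit.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)   # control q0, target q1
CNOT_10 = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]], dtype=float)   # control q1, target q0
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

HH = np.kron(H, H)  # one Hadamard on each of the two qubits

# Reversing a CNOT: Hadamards on both qubits before and after it
# (4 H gates + 1 CNOT = 5 elementary gates).
reversed_cnot = HH @ CNOT_01 @ HH

# SWAP on a bidirectional edge: 3 CNOTs with alternating direction.
swap_bidirectional = CNOT_01 @ CNOT_10 @ CNOT_01

# SWAP on a one-way edge q0 -> q1: the middle CNOT is replaced by its
# Hadamard-conjugated form, giving 3 CNOTs + 4 H = 7 elementary gates.
swap_one_way = CNOT_01 @ HH @ CNOT_01 @ HH @ CNOT_01

print(np.allclose(reversed_cnot, CNOT_10))
print(np.allclose(swap_bidirectional, SWAP))
print(np.allclose(swap_one_way, SWAP))
```

The gate counts 3, 5 and 7 that appear here are the ones used later when costing SWAP insertions and CNOT direction changes.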
To simplify the presentation, we distinguish between two kinds of quantum circuits in this paper. Logical circuits are ideal and high-level gate descriptions of quantum algorithms without considering any physical constraints imposed by QPUs. In contrast, physical quantum circuits are low-level gate-model implementation which respect the QPU concerned. The purpose of the circuit transformation process mentioned above is then to convert a logical circuit to a physical one. Accordingly, qubits appearing in logical circuits are called logical qubits while those appearing in physical circuits are called physical ones.
Quantum Circuit Transformation
The main objective of quantum circuit transformation is to transform an input logical circuit to a physical one so that the constraints imposed by the QPU are satisfied. To simplify the problem, we only consider the connectivity constraints for CNOT gates as specified by the architecture graph (see Section 2). This means that single-qubit gates have no effect in the circuit transformation process, and we assume without loss of generality that the input logical circuit consists only of CNOT gates. Furthermore, a CNOT gate is simply denoted as a pair ⟨q, q′⟩, where q is the control qubit and q′ is the target qubit. We call the CNOT gate ⟨q′, q⟩ the inverse of ⟨q, q′⟩.
Let AG = (V, E) be the architecture graph of a QPU, where V is the set of physical qubits and E the set of directed edges along which CNOT gates can be performed. Given a logical circuit LC = (Q, C l ) with |Q| ≤ |V |, we need to construct a physical circuit P C = (V, C p ) such that
• LC and P C are equivalent in functionality.
• C p only contains CNOT gates and single qubit gates.
• For any CNOT gate ⟨q, q′⟩ in C_p, (q, q′) is a directed edge in E, i.e., C_p respects the QPU constraint.
It is easy to find a physical circuit that satisfies the above conditions, but the real challenge is to find one with minimal size or depth, which is NP-hard in general [5] . In this paper, we modify the input logical circuit stepwise by inserting auxiliary gates like CNOT and H, as shown in Fig. 1 , until the logical circuit is transformed into a physical circuit that can be executed on the QPU. To evaluate the effectiveness of quantum circuit transformation algorithms, we use the sizes of the output circuits, i.e., the total number of its elementary gates.
Dependency Graph
CNOT gates in a logical circuit LC = (Q, C_l) are not independent. We say a CNOT gate ⟨q, q′⟩ depends on another ⟨p, p′⟩ if the latter must be executed before the former. This happens when ⟨p, p′⟩ is in front of ⟨q, q′⟩ in C_l and they share a common qubit (i.e., either p ∈ {q, q′} or p′ ∈ {q, q′}), or when ⟨q, q′⟩ depends on a CNOT gate which depends on ⟨p, p′⟩.
In general, we can construct a directed acyclic graph (DAG), called the dependency graph [10] , to characterize the dependency between gates in a logical circuit LC [13] . Each node of the dependency graph represents a gate and each directed edge the dependency relationship from one gate to another. The front layer of LC, denoted F(LC) or L 0 (LC), consists of all gates in LC which have no parents in the dependency graph. The second layer L 1 (LC) is then the front layer of the circuit obtained from LC by deleting all gates in F(LC). Analogously, we can define the k-th layer L k (LC) of LC for all k ≥ 0. Consider the circuit shown in Fig. 4 as an example. Initially, gates g 0 and g 1 can be applied in parallel because there are no gates before them and they are independent from each other. Thus F(LC) = {g 0 , g 1 }. Then, gate g 2 can be executed after g 0 , g 3 after g 2 and g 0 , and g 4 after g 3 and g 1 . Thus L 1 (LC) = {g 2 }, L 2 (LC) = {g 3 }, and L 3 (LC) = {g 4 }.
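The layer decomposition can be computed in a single pass over the gate list. The sketch below is one way to do it, assuming gates are given as `(control, target)` pairs of logical qubit indices; the function name `layers` and the five-gate circuit in the usage note are hypothetical and only chosen to mirror the dependency pattern described for Fig. 4 (the actual gates of that figure are not reproduced here).

```python
from typing import Dict, List, Tuple

Gate = Tuple[int, int]  # (control, target) logical qubit indices


def layers(gates: List[Gate]) -> List[List[int]]:
    """Partition gate indices into the layers L_0, L_1, ... of the dependency graph.

    A gate lands one layer after the latest earlier gate that shares a qubit
    with it, which coincides with repeatedly peeling off the front layer.
    """
    last_layer_on_qubit: Dict[int, int] = {}  # qubit -> layer of the latest gate on it
    result: List[List[int]] = []
    for idx, (c, t) in enumerate(gates):
        k = 1 + max(last_layer_on_qubit.get(c, -1),
                    last_layer_on_qubit.get(t, -1))
        if k == len(result):
            result.append([])
        result[k].append(idx)
        last_layer_on_qubit[c] = last_layer_on_qubit[t] = k
    return result


# Hypothetical circuit with the same dependency pattern as in the text:
# g2 after g0, g3 after g2 and g0, g4 after g3 and g1.
example = [(0, 1), (2, 3), (1, 4), (0, 4), (4, 2)]
print(layers(example))
```

For this example the front layer holds gates 0 and 1, and each subsequent layer holds a single gate, matching F(LC) = {g0, g1}, L1 = {g2}, L2 = {g3}, L3 = {g4}.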
Qubit Mapping
At each step of the circuit transformation, qubits in the logical circuit are mapped or allocated to physical qubits in the QPU [21]. Mathematically, a qubit mapping is a function τ from Q to V such that τ(q) = τ(q′) if and only if q = q′ for any q, q′ ∈ Q [6]. The mapping may change at consecutive steps of the transformation, as determined by the inserted auxiliary gates.
Given a logical circuit LC and a mapping τ, a CNOT gate g = ⟨q, q′⟩ in LC is said to be satisfied by τ, or τ satisfies g, if (τ(q), τ(q′)) is a directed edge in AG. Furthermore, g is executable by τ if it appears in the front layer of LC and is satisfied by τ. In this case, we remove it from LC and append a CNOT gate τ(g) := ⟨τ(q), τ(q′)⟩ to the end of the physical circuit. This process is called the execution of g.
The Proposed Algorithm
In this section, details of the proposed algorithm will be explained step by step. Let AG = (V, E) be the architecture graph and LC = (Q, C_l) the input logic circuit consisting only of CNOT gates. The goal of the algorithm is to minimize the size, i.e., the total number of elementary gates, of the output physical circuit.
We first generate an initial mapping τ ini by using simulated annealing (Algorithm 1), and then stepwise construct the output physical circuit by adding auxiliary CNOT or Hadamard gates while processing gates in the input logic circuit. The state of each step is described by a mapping τ from logic qubits in Q to physical qubits in V , the currently constructed physical circuit P C which obeys the constraints imposed by AG, and the logic circuit LC with gates that have not been processed. A cost function which assigns decreasing weights to gates in later layers is used to select the state of the next step. Note that the above procedure is standard for circuit transformation, and has been adopted in [25] . Our algorithm distinguishes itself from the previous ones in the ways of choosing the initial mapping (Sec 4.2), the definition of the cost function, and the strategy of updating step states (Sec 4.3).
CNOT Distance
In graph theory, the distance from a source node v to a destination node v′ in a directed graph G, written dist_G(v, v′), is the minimal number of edges needed to traverse from v to v′. Suppose AG is the architecture graph of the QPU we consider. We define the CNOT distance from v to v′ in AG, written dist_cnot(v, v′), as the minimal number of auxiliary CNOT and Hadamard gates required to execute the CNOT gate ⟨v, v′⟩ in the QPU. Here 'execute' is in the same sense as we have described in Section 3.2. To execute ⟨v, v′⟩, we need to bring the two qubits v and v′ close to each other by swapping and then, when they are neighbours in AG, we further check if the direction is from v to v′ or vice versa.
For a bi-directed (or, undirected) architecture graph such as that of Q20, we need only bring v close to v′ or vice versa, and the CNOT distance is simply computed as

$$\mathrm{dist}_{cnot}(v, v') = 3 \times (\mathrm{dist}_{AG}(v, v') - 1).$$

This is because only dist_AG(v, v′) − 1 swaps are required and each SWAP requires only 3 CNOT gates to implement (see Fig. 3 (top)). For a directed architecture graph such as that of QX5, the situation is a little more complicated, as we also need to consider the direction of the CNOT gates. We first compute all shortest paths from v to v′ (ignoring the directions), and let d be their length. If, along one of these paths, the two qubits can be made neighbours so that the connecting edge points from the control to the target, then

$$\mathrm{dist}_{cnot}(v, v') = 7 \times (d - 1),$$

because a SWAP gate is decomposed into 7 elementary gates (see Fig. 3 (bottom)). Otherwise, we have dist_cnot(v, v′) = 7 × (d − 1) + 4, as we need to add 2 Hadamard gates before and 2 after the target CNOT to change its direction [25].
Take QX5 as an example. Suppose the logic qubits q and q′ are mapped to v3 and v1, which correspond to nodes 3 and 1 in Fig. 2, respectively, and we want to implement the CNOT gate ⟨q, q′⟩, with q the control qubit and q′ the target qubit. One solution is to add a SWAP gate between qubits v1 and v2 to bring q′ one step closer to q, and a CNOT gate between v2 and v3 together with 4 additional Hadamard gates (cf. Figure 3) to change the direction of the CNOT gate. Because a SWAP gate can be decomposed into 7 elementary gates complying with the directions in QX5, the CNOT distance from v3 to v1 in QX5 is 7 + 4 = 11.
For simplicity, in our algorithm we precompute the CNOT distance for all node pairs in AG by using, say, a breadth-first search algorithm.
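A breadth-first precomputation of the CNOT distance table might look as follows. This is a sketch under stated assumptions, not the authors' implementation: the function name `cnot_distances` is hypothetical, a graph is treated as either fully undirected (SWAP cost 3) or fully one-way (SWAP cost 7), and in the directed case the extra 4 Hadamard gates are assumed to be avoidable exactly when some shortest path contains an edge oriented from the control side towards the target side. The four-node graph in the usage note is a toy fragment, not a real device.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int]


def cnot_distances(nodes: Iterable[int], edges: Set[Edge],
                   directed: bool) -> Dict[Edge, int]:
    """Return {(v, v'): dist_cnot(v, v')} for all connected pairs v != v'."""
    nodes = list(nodes)
    nbrs = {v: set() for v in nodes}
    for a, b in edges:                 # adjacency ignores edge directions
        nbrs[a].add(b)
        nbrs[b].add(a)
    swap_cost = 7 if directed else 3
    table: Dict[Edge, int] = {}
    for src in nodes:
        dist = {src: 0}
        # fwd[u]: some shortest path src..u contains an edge oriented
        # in the travelling direction (assumption, see lead-in).
        fwd = {src: False}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in nbrs[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    fwd[w] = False
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    fwd[w] = fwd[w] or fwd[u] or (u, w) in edges
        for dst, d in dist.items():
            if dst == src:
                continue
            cost = swap_cost * (d - 1)          # d - 1 SWAPs
            if directed and not fwd[dst]:
                cost += 4                       # 4 H gates to flip the CNOT
            table[(src, dst)] = cost
    return table


toy_edges = {(1, 0), (1, 2), (2, 3)}            # hypothetical one-way edges
table = cnot_distances(range(4), toy_edges, directed=True)
print(table[(3, 1)], table[(1, 3)], table[(2, 1)])
```

On the toy graph, the pair (3, 1) reproduces the pattern of the QX5 example above: one SWAP plus a direction change, i.e. 7 + 4 = 11.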
Initial Mapping
The selection of a good initial mapping has a significant impact on the quality of the final physical circuit [19, 13, 25] . Intuitively, we would like to find an initial mapping that 'fits' most gates such that fewer SWAP gates are required in the circuit transformation process.
To this end, we define the gate cost of a CNOT gate g = ⟨q, q′⟩ under a mapping τ : Q → V as

$$\mathrm{cost}(g, \tau) = \mathrm{dist}_{cnot}(\tau(q), \tau(q')). \tag{2}$$

Our ideal initial mapping τ*_ini is then given by

$$\tau^{*}_{ini} = \arg\min_{\tau} \sum_{g \in C^{*}} \mathrm{cost}(g, \tau), \tag{3}$$

where C* is a selected subset of the logical circuit LC. Here we use C* instead of C_l to calculate the initial mapping. This is because taking into account all gates in C_l would bring further overhead and is unnecessary, as gates in the tail of the circuit have little impact on the initial mapping.
Simulated annealing (SA), inspired by the annealing process in metallurgy [11], is designed for approximating the global optimum of a given cost function. The algorithm tries to find the best state in the search space. In each trial for searching a better state, the algorithm generates a new state based on the previous one, calculates its cost, compares it with the previous one and decides whether this new state should be accepted. To escape from local optima, the algorithm accepts the newly generated state with a certain probability even if its cost is worse than the previous one. The acceptance probability is decided by the current temperature, which declines during the search process until the minimum value is reached.

Figure 5: Convergence of the simulated annealing algorithm on circuit adr4-197 and IBM Q20, where the blue and orange lines represent the cost of accepted states and existing best states, respectively. We set empirically T_max = 100, T_min = 1, ∆ = 0.98, and R = 100.
We propose an efficient simulated annealing based algorithm (Algorithm 1) to find a good approximation of τ * ini , where T max , T min , ∆ and R are, respectively, the starting temperature, the minimum temperature, the decline coefficient for the temperature and the repeated times for one temperature. Fig. 5 shows convergence of the simulated annealing process on a real quantum circuit adr4-197. Note that the cost of states converges after sufficient iterations, showing that the temperature is low enough. The fluctuation of the cost of accepted states is caused by the above mentioned acceptance probability for worse states.
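A compact sketch of such an annealing loop is given below. It is not a transcription of Algorithm 1: the neighbour move (exchanging the physical qubits of two slots) and the Metropolis acceptance rule exp((cost − ncost)/T) are assumptions made for illustration, as are the names `total_cost` and `anneal_initial_mapping`. The cost being minimised is the sum of CNOT distances of the considered gates under the candidate mapping, and `dist` is assumed to be a precomputed table keyed by pairs of physical qubits.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Gate = Tuple[int, int]


def total_cost(tau: Sequence[int], gates: Sequence[Gate],
               dist: Dict[Tuple[int, int], int]) -> int:
    """Sum of CNOT distances of all considered gates under mapping tau."""
    return sum(dist[(tau[q], tau[p])] for q, p in gates)


def anneal_initial_mapping(gates: Sequence[Gate], n_logical: int,
                           n_physical: int, dist: Dict[Tuple[int, int], int],
                           t_max: float = 100.0, t_min: float = 1.0,
                           delta: float = 0.98, repeats: int = 100,
                           seed: int = 0) -> Tuple[List[int], int]:
    rng = random.Random(seed)
    # perm[i] is the physical qubit in slot i; slots 0..n_logical-1 are the
    # images of the logical qubits, the remaining slots are unused qubits.
    perm = list(range(n_physical))
    cost = total_cost(perm, gates, dist)
    best, best_cost = perm[:], cost
    t = t_max
    while t > t_min:
        for _ in range(repeats):
            i = rng.randrange(n_logical)
            j = rng.randrange(n_physical)
            if i == j:
                continue
            perm[i], perm[j] = perm[j], perm[i]          # propose a neighbour
            new_cost = total_cost(perm, gates, dist)
            # Assumed Metropolis rule: always accept improvements, accept
            # a worse state with probability exp((cost - new_cost) / t).
            if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = perm[:], cost
            else:
                perm[i], perm[j] = perm[j], perm[i]      # reject: undo the move
        t *= delta                                       # cool down
    return best[:n_logical], best_cost
```

The default arguments mirror the empirical settings quoted for Fig. 5 (T_max = 100, T_min = 1, ∆ = 0.98, R = 100). Tracking the best state separately from the current one corresponds to the two curves described there.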
In the following two subsections, we describe in detail how to generate all possible child and grandchild states and how the cost of a child or grandchild state is calculated. We will illustrate the construction by using the quantum circuit shown in Fig. 6 and the test architecture graph AG test in Fig. 7. 
Heuristic search with look-ahead
We have shown in the previous section how to construct the initial mapping τ ini for our circuit transformation algorithm, thus obtaining the state s 0 := (τ ini , P C 0 , LC 0 ) for the first step, where P C 0 and LC 0 are respectively the physical and logic circuits after executing all gates in LC which are executable in τ ini . Suppose we are in state s i := (τ i , P C i , LC i ) at the i-th step for i ≥ 0. This section is devoted to the strategy of choosing s i+1 for the i + 1-th step. Obviously, depending on the different ways of adding auxiliary CNOT and H gates, there are multiple child states of s i to choose from. One natural way is to select the one with the minimal cost. This surely gives a fine method for extending s i , but (as shown in Fig. 10 ) the sizes of the output physical circuits are not always desirable. In this paper, we propose a novel way to select the next state: we look one level ahead to calculate the costs of all grandchild states of s i , and choose the child of s i which has a child (thus a grandchild of s i ) with the minimum cost among all grandchildren of s i . To this end, we have to specify for a given state s i := (τ i , P C i , LC i ), (1) how to extend s i to get all its children and grandchildren, and (2) how to define the costs of its grandchildren. We are going to elaborate these two points one by one in the following.
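The difference between the greedy choice and the grandchild-based choice can be illustrated independently of circuits. The following sketch is generic: `children` and `cost` are caller-supplied functions, the name `select_next_state` is hypothetical, and falling back to the child's own cost when it has no children is an assumption made so that the function is total.

```python
from typing import Callable, Iterable, Optional, TypeVar

S = TypeVar("S")


def select_next_state(state: S,
                      children: Callable[[S], Iterable[S]],
                      cost: Callable[[S], float]) -> Optional[S]:
    """Pick the child of `state` that owns the cheapest grandchild."""
    best_child, best_cost = None, float("inf")
    for child in children(state):
        grandchild_costs = [cost(g) for g in children(child)]
        # Assumption: a child without children is judged by its own cost.
        c = min(grandchild_costs) if grandchild_costs else cost(child)
        if c < best_cost:
            best_child, best_cost = child, c
    return best_child


# Toy search tree: greedy selection would take "a" (cost 1), whereas
# looking one level further prefers "b" because of grandchild "b1".
tree = {"s": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
costs = {"a": 1, "b": 5, "a1": 10, "a2": 9, "b1": 2}
chosen = select_next_state("s", lambda n: tree.get(n, []), costs.__getitem__)
print(chosen)
```

In the toy tree the locally cheapest child leads only to expensive grandchildren, which is the kind of local minimum the look-ahead is meant to avoid.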
Extend s i . There are two natural ways to extend s i .
• Way 1: Apply on τ_i a swap operation, represented as an edge in AG one of whose end nodes is the image under τ_i of some qubit appearing in a gate in the front layer of LC_i, and obtain a new mapping τ′_i. Accordingly, we extend PC_i with the CNOT + H implementation of the SWAP gate corresponding to this swap operation. Then we execute recursively all gates in LC_i which are executable in τ′_i (not only those in the front layer, but also those that become executable once their predecessors have been executed by τ′_i). The resultant state is then a child of s_i.
• Way 2 only applies when AG is directed and there is a CNOT gate ⟨q, q′⟩ in the front layer of LC_i which is inversely executable, i.e., its inverse gate ⟨q′, q⟩ is executable, in τ_i. In this case, we add 4 Hadamard gates to change the direction of ⟨q, q′⟩ (cf. Figure 3 (top)), extend PC_i with all these 5 gates, and delete ⟨q, q′⟩ from LC_i. Again, we execute recursively all gates in LC_i which are executable in τ_i to get a child of s_i.

Algorithm 1: Simulated annealing for computing the initial mapping
input : A set C* of considered gates in a logical quantum circuit.
output: An approximation of the optimal initial mapping given in Eq.(3).
begin
    Initialize parameters T_max, T_min, ∆, R, and an arbitrary mapping τ;
    cost ← ncost and τ ← τ_new with prob. exp(
Finally, for each child of s_i, we extend one level further to get its grandchildren. We denote by {s^i_j : j ∈ J} and {s^i_{j,k} : j ∈ J, k ∈ K} the sets of children and grandchildren of s_i, respectively.

Example 1. We consider the quantum circuit shown in Fig. 6 and the test architecture graph AG_test in Fig. 7. Applying Alg. 1 we get the initial mapping τ_ini : Q → V which maps q_i to v_i for each 0 ≤ i ≤ 4. For convenience, we write such a mapping as a list of length 5. For example, τ_ini = [0, 1, 2, 3, 4]. Note that the front layer contains two gates, viz., ⟨q2, q1⟩ and ⟨q3, q4⟩. As the latter is directly executable by τ_ini, the initial state is s = (τ_ini, PC_0 := {⟨q3, q4⟩}, LC_0 := LC \ {⟨q3, q4⟩}), where LC is the circuit shown in Fig. 6. We next show how to construct the child states of s. Note that there is only one gate, viz. ⟨q2, q1⟩, in the first layer of LC_0, and τ_ini maps q1 and q2 to, respectively, v1 and v2. Only 4 edges, (1, 0), (1, 2), (2, 3) and (5, 2), in AG_test (see Fig. 7) are relevant. For each of them, we obtain a corresponding swap operation and a corresponding child state. Since AG_test is directed and τ_ini can execute ⟨q1, q2⟩, another child state can be obtained by using Way 2. Therefore, s has in total 5 child states, and Table 1 gives the mappings and physical circuits of these child states as well as the corresponding operations. Similarly, we also construct the grandchild states of s, also shown in Table 1. Here in the 'Operation' column, we use an edge in AG_test to denote the corresponding swap operation on mappings, and a CNOT gate to denote the operation of changing its direction.

From Table 1 we can see that each grandchild state of s has cost_g 14 or 11.
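For the second part of the cost, we employ a look-ahead mechanism first demonstrated in [13]. Given a generic state s = (τ_s, PC_s, LC_s), we partition the gates in LC_s into different layers according to its dependency graph. Denote by L_k, k ≥ 0, these layers such that L_0 is the front layer. Then the heuristically estimated cost of s is defined as

$$\mathrm{cost}_h(s) = \sum_{k=0}^{\ell} w_k \sum_{\langle q, q' \rangle \in L_k} \mathrm{dist}_{cnot}(\tau_s(q), \tau_s(q')) + w_s \times (d - 1) \times N_{swap} \times N_s, \tag{4}$$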
where d is the diameter of the architecture graph, N_swap the number of elementary gates needed to compose a SWAP gate, and N_s the number of gates in LC_s. The parameters ℓ > 0, w_k (0 ≤ k ≤ ℓ) and w_s are taken empirically, but normally we assume 1 = w_0 ≥ w_1 ≥ · · · ≥ w_ℓ ≥ w_s ≥ 0. This reflects the intuition that the closer a gate is to the front layer of the circuit, the more it contributes to the total cost of executing the whole circuit, as subsequent dependent gates will not be executable unless it has been processed. Table 1 gives the heuristic costs for all child and grandchild states of s in Example 1, where we take ℓ = 3, w_1 = 1, w_2 = 0.8, w_3 = 0.6, w_s = 0.4. Note that the diameter of QX5 is 8 and each SWAP gate is composed of 7 elementary gates in QX5. Thus d = 8 and N_swap = 7. Finally, the total cost of a grandchild s^i_{j,k} of s_i is computed as

$$\mathrm{cost}(s^{i}_{j,k}) = \mathrm{cost}_g(s^{i}_{j,k}) + \mathrm{cost}_h(s^{i}_{j,k}). \tag{5}$$
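To make the roles of the weights concrete, here is a small sketch of a heuristic cost of this shape. It rests on one reading of the definition, which should be treated as an assumption: the look-ahead part is a weighted sum, over the first layers, of the CNOT distances of their gates under the current mapping, and the remaining part is w_s × (d − 1) × N_swap × N_s with N_s the number of gates left in the logic circuit. The function name `heuristic_cost` and all numbers in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Gate = Tuple[int, int]


def heuristic_cost(layers: Sequence[Sequence[Gate]],
                   tau: Dict[int, int],
                   dist: Dict[Tuple[int, int], int],
                   weights: Sequence[float],
                   w_s: float, d: int, n_swap: int) -> float:
    """Weighted look-ahead cost plus a coarse estimate for the whole circuit.

    `layers[k]` holds the gates of layer L_k; only the first len(weights)
    layers enter the weighted sum (zip stops at the shorter sequence).
    """
    look_ahead = sum(
        w * sum(dist[(tau[q], tau[p])] for q, p in layer)
        for w, layer in zip(weights, layers)
    )
    n_s = sum(len(layer) for layer in layers)   # gates still to be processed
    return look_ahead + w_s * (d - 1) * n_swap * n_s


# Hypothetical three-layer circuit on a toy device.
toy_layers: List[List[Gate]] = [[(0, 1)], [(1, 2)], [(0, 2)]]
toy_tau = {0: 0, 1: 1, 2: 3}
toy_dist = {(0, 1): 0, (1, 3): 3, (0, 3): 6}
value = heuristic_cost(toy_layers, toy_tau, toy_dist,
                       weights=[1.0, 0.8], w_s=0.4, d=3, n_swap=3)
print(value)
```

With two weights, only the first two layers contribute individually (0 and 0.8 × 3), while all three gates enter the coarse term 0.4 × 2 × 3 × 3.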
Fallback via remote CNOT
During the search process, there is a small possibility that our algorithm does not halt. This happens when a child state with better cost may be good for gates in look-ahead layers but increases the distances of gates in the front layer. To address this problem, a fallback mechanism is introduced to ensure that the program terminates in reasonable time.
A direct way for fallback is to select a gate ⟨q, q′⟩ in the front layer and then choose a SWAP operation that will reduce the shortest path between the two corresponding nodes v, v′ with τ_s(q) = v and τ_s(q′) = v′ in the architecture graph [6], where s denotes the current state. However, this method may change the mapping that the algorithm may want to keep as it is preferred by look-ahead layers. To protect the preferred mapping, remote CNOT operations [18], which are depicted in Fig. 8, are introduced in the fallback. After imposing remote CNOT gates, the circuit has the same functionality while preserving the current mapping. The fallback is activated when no gates are removed from LC_s after a certain prefixed number of rounds.
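The mapping-preserving property of a remote CNOT can be checked on a three-qubit chain. The sketch below uses one standard construction with four nearest-neighbour CNOTs through a middle qubit; whether this is exactly the circuit drawn in Fig. 8 is an assumption. Since CNOT circuits permute computational basis states, comparing the action on all basis states is enough to establish that the two circuits are equal.

```python
from itertools import product
from typing import List, Tuple

Bits = Tuple[int, ...]


def cnot(bits: Bits, control: int, target: int) -> Bits:
    """Apply a CNOT to a computational basis state given as a bit tuple."""
    out: List[int] = list(bits)
    out[target] ^= out[control]
    return tuple(out)


def remote_cnot(bits: Bits, a: int, b: int, c: int) -> Bits:
    """CNOT with control a and target c, using only the pairs (a, b) and (b, c)."""
    for ctrl, tgt in [(a, b), (b, c), (a, b), (b, c)]:
        bits = cnot(bits, ctrl, tgt)
    return bits


# The middle qubit b is restored and no qubit changes position.
all_equal = all(remote_cnot(bits, 0, 1, 2) == cnot(bits, 0, 2)
                for bits in product((0, 1), repeat=3))
print(all_equal)
```

No SWAP is involved, so the current mapping is left untouched, at the price of four CNOTs for a pair at distance 2.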
Algorithm 2: Circuit transformation with look-ahead
input : A logic circuit LC = (Q, C_l), an initial mapping τ constructed by Algorithm 1 and an architecture graph AG = (V, E) with |Q| ≤ |V|.
output: A physical circuit (V, C_p) which satisfies AG and is equivalent to LC.
begin
    (PC, LC) ← Execute(τ, PC, LC);
    while LC ≠ ∅ do
        L ← F(LC), the first layer of LC;
        Cld ← ∅;
        for e ∈ E which touches some gate in L under τ do
            τ′ ← swap_e ∘ τ;
            PC′ ← PC by adding (the CNOT + H implementation of) a SWAP gate corresponding to swap_e;
            (PC′, LC′) ← Execute(τ′, PC′, LC);
            gcost ← 3 if e⁻¹ ∈ E and 7 otherwise;
            Cld ← Cld ∪ {(τ′, PC′, LC′, gcost)};
        end
        for g ∈ L which is inversely executable by τ do
            PC′ ← PC by adding τ(g) complemented by four H gates before and after it;
            LC′ ← LC by deleting g;
            (PC′, LC′) ← Execute(τ, PC′, LC′);
            Cld ← Cld ∪ {(τ, PC′, LC′, 4)};
        end
        mCost ← ∞;
        for (τ′, PC′, LC′, gcost) ∈ Cld do
            cost ← minChildHcost(τ′, LC′);
            if cost + gcost < mCost then
                mCost ← cost + gcost;
                (τ, PC, LC) ← (τ′, PC′, LC′);
            end
        end
    end
    return PC
end

Procedure Execute(τ, PC, LC)
input : A mapping τ : Q → V, a physical circuit PC, and a logic circuit LC.
output: A pair (PC′, LC′) obtained by executing as many as possible gates which satisfy τ.
begin
    PC′ ← PC; LC′ ← LC;
    do
        EL ← {g ∈ F(LC′) : g is executable by τ};
        for g ∈ EL do
            PC′ ← PC′ by adding τ(g);
            LC′ ← LC′ by deleting g;
        end
    while EL ≠ ∅;
    return (PC′, LC′)
end

Procedure minChildHcost(τ, LC)
input : A mapping τ : Q → V and a logic circuit LC.
output: The minimal cost of all children of τ.
begin
    L ← F(LC); mCost ← ∞;
    for e ∈ E which touches some gate in L under τ do
        τ′ ← swap_e ∘ τ;
        (PC′, LC′) ← Execute(τ′, ∅, LC);
        gcost ← 3 if e⁻¹ ∈ E and 7 otherwise;
        hcost ← hcost(τ′, LC′) according to Eq. (4);
        if hcost + gcost < mCost then
            mCost ← hcost + gcost;
        end
    end
    for g ∈ L which is inversely executable by τ do
        LC′ ← LC by deleting g;
        (PC′, LC′) ← Execute(τ, ∅, LC′);
        hcost ← hcost(τ, LC′) according to Eq. (4);
        if hcost + 4 < mCost then
            mCost ← hcost + 4;
        end
    end
    return mCost;
end
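The Execute procedure is the part of Algorithm 2 that is easiest to isolate. A minimal Python rendering is sketched below; the data representation (gates as `(control, target)` tuples, the mapping as a dictionary, the architecture graph as a set of directed edges) and the helper name `front_layer` are illustrative assumptions, and the edge set in the usage note is hypothetical rather than the actual AG_test.

```python
from typing import Dict, List, Sequence, Set, Tuple

Gate = Tuple[int, int]


def front_layer(gates: Sequence[Gate]) -> List[int]:
    """Indices of gates that share no qubit with any earlier gate."""
    busy: Set[int] = set()
    front: List[int] = []
    for idx, (c, t) in enumerate(gates):
        if c not in busy and t not in busy:
            front.append(idx)
        busy.update((c, t))
    return front


def execute(tau: Dict[int, int], physical: Sequence[Gate],
            logical: Sequence[Gate], edges: Set[Gate]
            ) -> Tuple[List[Gate], List[Gate]]:
    """Move every executable gate from the logic to the physical circuit."""
    physical, logical = list(physical), list(logical)
    while True:
        ready = [i for i in front_layer(logical)
                 if (tau[logical[i][0]], tau[logical[i][1]]) in edges]
        if not ready:                       # EL is empty: nothing left to run
            break
        physical.extend((tau[logical[i][0]], tau[logical[i][1]]) for i in ready)
        done = set(ready)
        logical = [g for i, g in enumerate(logical) if i not in done]
    return physical, logical


# Hypothetical directed edges and the identity mapping on five qubits.
toy_edges = {(1, 0), (1, 2), (2, 3), (3, 4), (5, 2)}
identity = {q: q for q in range(5)}
pc, lc = execute(identity, [], [(2, 1), (3, 4), (4, 2)], toy_edges)
print(pc, lc)
```

In the usage example only the gate on qubits 3 and 4 can run: the gate (2, 1) is in the front layer but points against the edge (1, 2), and (4, 2) is blocked behind it.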
Complexity of the Search Process
In each layer, there are at most |Q|/2 gates, where Q is the set of qubits in the input logic circuit. Thus, the time complexity of computing the cost (cf. Eq.(5)) of any state is O(ℓ · |Q|), where ℓ is the prefixed small number of layers we select for Eq.(4). For our evaluation (see Section 5), we take ℓ = 3 for all circuits.
By construction, each state s has at most |E| + |Q|/2 child states, where E is the set of edges in the architecture graph, or, equivalently, the number of possible SWAP operations that can be added to the circuit and |Q|/2 is the number of CNOT gates in the front layer of the current logic circuit that can be applied by adding four extra Hadamard gates to change the direction.
Suppose the input circuit contains m CNOT gates. If we activate the fallback when no gates are removed from LC_s after K rounds, then the search procedure has at most K × m states. This is because each activation of the fallback will execute a selected gate due to the use of remote CNOT. Therefore, since each state has at most (|E| + |Q|/2)² grandchildren and the cost of each of them is computed in O(ℓ · |Q|) time, the overall time complexity of the search is O(K · m · ℓ · |Q| · (|E| + |Q|/2)²).
For the space complexity, in each state s, we maintain a depth-2 search tree rooted at s. Thus the space complexity of the algorithm is bounded by O((|E| + |Q|/2)²), i.e., O(|V|⁴).
Optimization
In Algorithm 2, the search space grows exponentially if the depth of look-ahead is increased. Therefore, a pruning mechanism is introduced to reduce the size of the search space while preserving the quality of the output physical circuit. More specifically, a child state s i j of s i will be removed if both cost h (s
, where cost h (s) = w_s × (d − 1) × N_swap × N_s as defined in Eq.(4). In Example 1, states s 1,0 will be pruned. This is because cost h (s) = 145.6, cost h (s) = 167.2, cost h (s 0 ) = 145.6, cost h (s 0 ) = 177.6 and cost

Figure 9: Effectiveness of the pruning mechanism obtained by running exemplary circuits on IBM Q20, where the gray and blue bars are ratios of the number of added gates to that of original gates for the proposed algorithm with and without pruning, while lines correspond to the time consumption (seconds) for the search process.
From Fig. 9 we can see that the pruning mechanism has limited influence on the sizes of the output circuits while the time consumption is reduced by a large amount.
Programming and Benchmarks
To evaluate our approach, we compare it with previous algorithms proposed for the same purpose in the literature [25, 13, 7]. We use Python as our programming language and IBM Qiskit [1] as an auxiliary environment. The code can be found on GitHub (see footnote 2). All experiments are conducted on a laptop with an i7-8750H CPU and 16 GB of memory. The results are reported in Tables 2, 3 and 4, in which the 'Comparison' column shows the improvement of our algorithm over previous ones in terms of the number of auxiliary gates added. Specifically, let $n_{comp}$ and $n_{ours}$ be the numbers of gates added by the compared algorithm and by ours, respectively. Then the improvement ratio is defined as $(n_{comp} - n_{ours})/n_{comp}$. Table 2 demonstrates the superiority of the initial mapping output by our simulated annealing algorithm (Alg. 1) over the naive initial mapping, which maps $q_i$ to $v_i$ for all i in Q20. The improvement is consistent and often above 30%.
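The improvement ratio defined above can be computed as in the following minimal sketch. The function name and the gate counts in the example are made up for illustration and are not taken from the tables.

```python
def improvement_ratio(n_comp, n_ours):
    """Return (n_comp - n_ours) / n_comp for two added-gate counts."""
    if n_comp <= 0:
        raise ValueError("the compared algorithm must add at least one gate")
    return (n_comp - n_ours) / n_comp


# Hypothetical counts: the compared algorithm adds 100 gates, ours adds 40.
ratio = improvement_ratio(100, 40)
print(f"{ratio:.0%}")
```

A positive ratio means that our algorithm adds fewer gates; a negative ratio means that it adds more. The ratio is undefined when the compared algorithm adds no gates, which the sketch reports as an error.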
Table 2: The performance improvement brought by simulated annealing on IBM Q20. Here we compare with the naive mapping $\tau_{nv} : Q \to V$, which maps $q_i$ to $v_i$ for all i in Q20.
For the heuristic search, we compare our algorithm to the ones introduced in [25] and [13], which are, respectively, the state-of-the-art algorithms for IBM QX5 and Q20. We set the number of look-ahead layers to 3 and the weight parameters to $w_1 = 1$, $w_2 = 0.8$, $w_3 = 0.6$. Here $D_{AG}$ is the diameter of the architecture graph and $N_{swap}$ is the number of elementary gates needed to compose a SWAP gate. The threshold number K for activating the fallback is set to $0.5 \times D_{AG}$. The algorithm proposed in [25] utilizes A* to find the best solution of each layer. It has exponential time complexity and only considers one layer for look-ahead when designing the heuristic cost function. Like ours, their A*-based algorithm works for both directed and undirected architecture graphs. As confirmed in [13], it is comparable with the algorithm in [13] when Q20 is used as the QPU, so we compare with it only on QX5. From the experimental results reported in Table 3, we can see that our algorithm has a conspicuous improvement over the algorithm in [25]. Moreover, it is very efficient: for input circuits with up to 10,000 elementary gates, our algorithm finds the solution within one minute.
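As an illustration of how the fallback threshold depends on the architecture graph, the sketch below computes the diameter by breadth-first search and derives K as half of it. The 2 × 3 grid, the function name and the treatment of every edge as undirected are assumptions made here; how K is rounded when the diameter is odd is not specified in this section, so the value is left unrounded.

```python
from collections import deque


def diameter(num_nodes, edges):
    """Largest shortest-path distance in a connected graph (edges undirected)."""
    adjacency = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    best = 0
    for source in range(num_nodes):
        # Breadth-first search gives shortest distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) != num_nodes:
            raise ValueError("architecture graph must be connected")
        best = max(best, max(dist.values()))
    return best


# Hypothetical 2 x 3 grid, nodes numbered 0 1 2 / 3 4 5.
toy_edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
d_ag = diameter(6, toy_edges)
k_threshold = 0.5 * d_ag   # fallback threshold K = 0.5 x D_AG, unrounded
print(d_ag, k_threshold)
```

Running one breadth-first search per node costs O(|V| · (|V| + |E|)), which is negligible for devices with a few dozen qubits.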
The algorithm proposed in [13] uses a reverse traversal technique to search for a good initial mapping and has polynomial complexity. Although it considers multiple levels in its heuristic function, it does not assign weights to gates in different layers. Unlike our algorithm, the algorithm in [13] can only be applied to undirected architecture graphs. Therefore, we only compare it with ours on Q20. From the experimental results reported in Table 4, we see that, for small circuits, both algorithms find the optimal output circuits, but for large circuits our algorithm again has a conspicuous improvement. As with QX5, our algorithm finds within two minutes a solution for input circuits with up to 30,000 elementary gates.
We also compare our algorithm with the algorithm proposed in [7], which also works for both directed and undirected architecture graphs and whose performance is comparable with that of the algorithm in [25]. From the experimental results reported in Tables 5 and 6 in the Appendix, we can see that our algorithm also performs better.
It is worth mentioning that, if the depth of look-ahead in the selection process is increased, the quality of the output circuits can be further improved. However, the time consumption increases dramatically. See Fig. 10 for the experiment on a few examples, which indicates that 1-depth look-ahead reaches the best trade-off between time and performance. Although the time overhead of increasing the depth is large, it may still be acceptable in some application scenarios, and the depth can easily be changed by adjusting the relevant parameters of our algorithm. Besides, the weight parameters in the heuristic function can also be adjusted for different architecture graphs and circuits.
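The trade-off can be seen in a generic look-ahead selection, sketched here under stated assumptions: `children` and `cost` are hypothetical callbacks, `depth` counts the levels inspected below each child (so `depth=1` scores a child by its best grandchild of the current state), and the toy tree and its costs are made up. This is not the authors' Algorithm 2, only an illustration of why the work grows exponentially with the depth.

```python
def lookahead_cost(state, children, cost, depth):
    """Minimal cost reachable from `state` after `depth` further steps."""
    kids = children(state)
    if depth == 0 or not kids:
        return cost(state)
    return min(lookahead_cost(k, children, cost, depth - 1) for k in kids)


def select_next(state, children, cost, depth):
    """Pick the child of `state` whose look-ahead cost is minimal."""
    kids = children(state)
    if not kids:
        raise ValueError("state has no child states")
    return min(kids, key=lambda k: lookahead_cost(k, children, cost, depth))


# Toy search tree with made-up costs.
TREE = {"s": ["a", "b"], "a": ["a0", "a1"], "b": ["b0", "b1"]}
COST = {"s": 10, "a": 5, "b": 6, "a0": 9, "a1": 8, "b0": 3, "b1": 7}

greedy = select_next("s", lambda x: TREE.get(x, []), COST.get, depth=0)
deeper = select_next("s", lambda x: TREE.get(x, []), COST.get, depth=1)
print(greedy, deeper)
```

In the toy tree the greedy choice is `a`, while one extra level reveals that `b` leads to a cheaper state. With branching factor b, a look-ahead of depth d evaluates on the order of b^(d+1) states per selection step, which matches the sharp increase in running time described above.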
Conclusion
In this paper, we propose an algorithm to solve the quantum circuit transformation problem by using simulated annealing and heuristic search. The algorithm adopts a novel double look-ahead mechanism. We look ahead at subsequent layers when defining a flexible heuristic cost function, which also supports weight parameters to reflect the variable influence of gates in different layers. Moreover, we look ahead at grandchild states with minimal cost when selecting the best state for the next step of the circuit transformation. Detailed evaluation on extensive realistic circuits shows that our algorithm achieves a consistent and significant improvement over the two state-of-the-art algorithms proposed in the literature for IBM QX5 and Q20. For future studies, we propose the following problems. First, our program still runs slowly for large circuits, so the code needs to be optimized to reduce the running time. Second, the quality of the initial mappings obtained from the simulated annealing algorithm (Alg. 1) is not stable, which is not acceptable for commercial use. Third, we only consider connectivity in the architecture graphs; other constraints such as crosstalk, gate error and qubit decoherence should be included to make the algorithm more practical. Fourth, using only the sizes of circuits as the evaluation criterion is not enough; criteria such as circuit error and running time should also be considered in future work.
A More experimental results

Circuit Name | Original Gates | Algorithm in [7] | Proposed Algorithm
