Rapid progress in the physical implementation of quantum computers has paved the way for tools that help users write quantum programs for any given quantum device. The physical constraints inherent to current NISQ architectures prevent most quantum algorithms from being executed directly on quantum devices. To enable the two-qubit gates in an algorithm, existing works focus on inserting SWAP gates to dynamically remap logical qubits to physical qubits. However, these schemes do not consider the depth of the generated quantum circuits. In this work, we propose a depth-aware SWAP insertion scheme for the qubit mapping problem in the NISQ era.
I. INTRODUCTION
Quantum computing has exhibited a theoretical advantage over classical computing by showing impressive speedups on applications including large integer factoring [1], database search [2], and quantum simulation [3]. It is considered a new computational model that may have a disruptive impact in the future, and it has attracted considerable interest from a large number of researchers and companies.
With the advent of advanced manufacturing technology, the industry is able to build small-scale quantum computers, known as Noisy Intermediate-Scale Quantum (NISQ) [4] devices. A NISQ device is equipped with dozens to hundreds of qubits. IBM [5] released its 53-qubit quantum computer in October 2019 and has made it available for commercial use. Google [6] released the 72-qubit Bristlecone quantum computer in March 2018. Other companies, including Intel [7], Rigetti [8], and IonQ, have released quantum computing devices with dozens of qubits. The current NISQ technology may not be perfect, but it is a good first step towards the more powerful quantum devices of the future.
In order to map high-level quantum programs to NISQ devices, it is important to overcome two obstacles. First, to be able to execute a quantum circuit, it is necessary to map logical qubits to physical qubits with respect to architecture and program coupling constraints. Any quantum program can be implemented using a universal gate set [9] consisting of a small number of elementary gates. For instance, the {H, CNOT, S, T} set is a universal gate set, in which H, S, and T are single-qubit gates and CNOT is a two-qubit gate.
A two-qubit gate must be mapped to two qubits that are physically connected. However, in real quantum architectures, connectivity is limited and not every pair of qubits is connected, as shown in the IBM QX2 architecture in Fig. 1 (a). For this reason, a quantum circuit is not directly executable on a NISQ device unless circuit transformation is performed. The common practice is to insert SWAP operations to dynamically remap the logical qubits such that the transformed circuit is hardware-compliant for each (set of) two-qubit gate(s).
Second, it is critical that the depth of a quantum circuit be minimized for the NISQ device. A qubit is volatile and error-prone. It gradually decays over time and may suffer phase-flip and bit-flip errors. It may completely lose its state after a certain period of time, called the coherence time. Quantum error correction (QEC) codes can detect error syndromes and fix them. However, QEC needs a large number of redundant physical qubits. A realistic QEC circuit may need more than 10,000 physical qubits, which is not possible on today's NISQ devices. Without QEC, a program must terminate within a threshold amount of time. The depth of the circuit, which is the number of steps the circuit executes, must therefore be optimized. IBM proposed the metric of quantum volume [10] for evaluating the effectiveness of quantum computers, which accounts not only for the width of the circuit (the number of qubits) but also for the depth (the number of steps the circuit can execute).
Transforming the logical circuit into a hardware-compliant one inevitably increases the gate count and the circuit depth. Most previous work on qubit mapping [11] - [16] focuses on minimizing the number of inserted gates, but not the depth of the transformed circuit. However, a small gate count does not necessarily mean a small circuit depth, because of the dependences between gates. We find that previous work that aims to minimize the number of inserted gates may significantly increase the depth of the circuit (see Section IV). For instance, the Sabre approach by Li et al. [11] reduces the gate count by 1.1%, but increases the depth of the 10-qubit QFT circuit by over 44.5%. The two studies [17], [18] stress the importance of taking into account the variability in the qubit (link) error rates, but they do not directly address the issue of increased circuit depth. The depth of the circuit, as mentioned above, is critical and determines whether a quantum program is executable on a NISQ device given its physical limits. In this paper, we propose the first depth-aware qubit mapping scheme for quantum circuits running on hardware with arbitrary qubit connectivity. Our depth-aware qubit mapper searches for the mapping that minimizes the transformed circuit depth while keeping the gate count within a reasonable range. Our results show that we can reduce the depth of the transformed circuit by up to 30% compared with the two best known qubit mappers [11], [12], while adding on average less than 3% additional gates over a large set of representative benchmarks.
II. BACKGROUND AND MOTIVATION
A. Quantum Computing Basics
1) Qubit: A quantum bit, or qubit, is the counterpart of the classical bit in the realm of quantum computing. Unlike a classical bit, which represents either '1' or '0', a qubit can be in a coherent superposition of both states. It is a two-state quantum system that exhibits the peculiarities of quantum mechanics [9]. An example is the spin of an electron, whose two states are spin up and spin down.
2) Quantum Gates: There are two types of basic quantum gates. The first type is the single-qubit gate, a unitary quantum operation that can be abstracted as a rotation about an axis of the Bloch sphere [9], which represents the state space of one qubit. A single-qubit gate can be parameterized using two rotation angles around the axes. There are several elementary single-qubit gates, including the Hadamard (H) gate, the phase (S) gate, and the π/8 (T) gate [9]. The second type is the multi-qubit gate. However, all complex quantum gates can be decomposed into a sequence of the single-qubit gates H, S, T and the two-qubit CNOT gate, so we focus only on the two-qubit CNOT gate. The CNOT gate operates on two qubits, which are distinguished as a control qubit and a target qubit. If the control qubit is 1, the CNOT gate flips the state of the target qubit; otherwise, the target qubit remains unchanged.
3) Quantum Circuit: A quantum circuit is composed of a set of qubits and a sequence of quantum operations on these qubits. There are various ways to describe a quantum circuit. One way is to use the quantum assembly language OpenQASM [19] released by IBM. Another way is to use a circuit diagram, in which qubits are represented as horizontal lines and quantum operations are blocks on those lines. Fig. 2 (a) shows a simple example of a quantum circuit diagram. A single-qubit gate is denoted as a square on the line, and a CNOT gate is represented by a line connecting two qubits and a circle enclosing a plus sign.
B. Qubit Mapping and Depth-Awareness
To enable the execution of a quantum circuit, the logical qubits in the circuit must be mapped to the physical qubits on the target hardware. When applying a CNOT gate, the two qubits it acts on need to be physically connected to each other. Due to the irregular physical qubit layout of existing devices, it is generally impossible to find an initial mapping that makes the entire circuit CNOT-compliant. The common practice is to insert SWAP operations to remap the logical qubits. A SWAP operation exchanges the states of its two input qubits. As shown in Fig. 3, a SWAP operation is implemented using 3 CNOT gates on an architecture with bi-directional links, or 3 CNOT gates plus 4 Hadamard gates on an architecture with single-direction links. A bi-directional link means that either end of the link can be the control or the target qubit, while a single-direction link means that only one end can be the control qubit. IBM's Qiskit uses a stochastic method to insert SWAP operations [15], but it often results in a significant increase in the number of inserted gates and in depth. Existing works [11], [14], [16] are more efficient than IBM's Qiskit mapper. They use efficient heuristics rather than a stochastic method to find the mapping. However, the main objective of these methods is to reduce the gate count. Minimizing the gate count makes sense, but it is more important to focus on the depth of the circuit, since in the NISQ era the depth is equivalent to the estimated execution time. Reducing the depth of the circuit reduces the likelihood of the circuit failing at an early stage.
We show a motivating example in Fig. 2. The hardware model is shown in Fig. 1 (a). Its connectivity is the same as that of the IBM QX2 architecture, except that all links are bi-directional. There are 5 physical qubits, $Q_1$ to $Q_5$, and six bi-directional edges. A CNOT gate can only be applied on one of these edges.
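The two SWAP decompositions mentioned above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code; the matrix helpers and the basis ordering (index = 2·q0 + q1) are assumptions made for illustration. It verifies that three alternating CNOTs implement a SWAP, and that a CNOT can be reversed with four Hadamard gates, which is why a single-direction link needs 3 CNOT gates plus 4 Hadamard gates.

```python
from math import isclose, sqrt

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(a, b):
    """Kronecker (tensor) product of two matrices."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def close(a, b):
    """Element-wise comparison with a small absolute tolerance."""
    return all(isclose(x, y, abs_tol=1e-12)
               for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# Basis ordering assumed here: |q0 q1> has index 2*q0 + q1.
CNOT_01 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control q0
CNOT_10 = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control q1
SWAP = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

s = 1 / sqrt(2)
H = [[s, s], [s, -s]]
HH = kron(H, H)  # a Hadamard on each of the two qubits

# Bi-directional link: SWAP = CNOT_01 . CNOT_10 . CNOT_01 (3 CNOTs).
swap_from_cnots = matmul(CNOT_01, matmul(CNOT_10, CNOT_01))

# Single-direction link: the reversed CNOT is built from the allowed one
# by surrounding it with Hadamards (2 before + 2 after = 4 H gates).
cnot_10_from_h = matmul(HH, matmul(CNOT_01, HH))

print(close(swap_from_cnots, SWAP))
print(close(cnot_10_from_h, CNOT_10))
```

Only the middle CNOT of the three has to be reversed, so the single-direction variant costs 7 elementary gates instead of 3.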
In the example, the initial mapping between logical qubits (denoted by lower case q) and physical qubits (denoted by the upper case Q) is shown next to each qubit (line), which is
With this initial mapping, the mapper schedules gates one by one until it encounters a (set of) CNOT gate(s) that cannot be scheduled due to physical constraints. Fig. 1 (b) shows the interaction of the logical qubits: two logical qubits are connected if there is a CNOT operation between them. When we encounter the gate "CNOT $q_2$, $q_5$" (marked red in the circuit diagram in Fig. 2 and shown as the dotted line in the logical coupling graph in Fig. 1), scheduling has to stop, since this gate translates into "CNOT $Q_2$, $Q_5$" on the hardware, while no physical link exists between $Q_2$ and $Q_5$. SWAP operations are therefore needed. When a SWAP operation is applied, its two input physical qubits exchange their states. Fig. 2 (b) and (c) provide two options for transforming the circuit. Fig. 2 (b) inserts 2 SWAPs (SWAP $Q_3$, $Q_4$ and SWAP $Q_3$, $Q_5$) such that "CNOT $q_2$, $q_5$" becomes "CNOT $Q_2$, $Q_3$". Although this option uses more gates, the two SWAPs can run in parallel with existing single-qubit gates in the circuit and therefore do not increase the depth of the circuit. Fig. 2 (c) inserts only 1 SWAP (SWAP $Q_2$, $Q_3$) such that "CNOT $q_2$, $q_5$" becomes "CNOT $Q_3$, $Q_5$", but this SWAP cannot overlap with existing single-qubit gates and increases the depth of the circuit by 3 (assuming a SWAP is implemented with 3 gates and each elementary gate takes 1 cycle in this example). In this example, the two best known approaches, by Zulehner et al. [14] and Li et al. [11], will both choose to insert 1 SWAP, since they only optimize the number of gates inserted into the circuit (or the depth of the inserted gates), but not the depth of the entire transformed circuit. This example stresses the importance of depth-awareness in SWAP insertion schemes and motivates our work.
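The compliance check and the effect of the two SWAP options can be sketched as follows. This is an illustrative sketch only: the edge list is an assumed bow-tie layout that is consistent with the SWAPs and CNOTs named in the text (the actual layout is given in Fig. 1 (a)), and the identity initial mapping is also an assumption beyond the pairs $q_2 \to Q_2$ and $q_5 \to Q_5$ implied by the example. The function names are hypothetical.

```python
# Assumed coupling graph: six bi-directional edges on physical qubits Q1..Q5.
EDGES = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]}

def is_compliant(mapping, cnot):
    """True if the CNOT's two logical qubits sit on physically linked qubits."""
    a, b = cnot
    return frozenset((mapping[a], mapping[b])) in EDGES

def apply_swap(mapping, phys_a, phys_b):
    """Return the logical-to-physical mapping after SWAP phys_a, phys_b."""
    if frozenset((phys_a, phys_b)) not in EDGES:
        raise ValueError("SWAP must act on a physical link")
    relabel = {phys_a: phys_b, phys_b: phys_a}
    return {q: relabel.get(p, p) for q, p in mapping.items()}

# Assumed initial mapping: logical q_i on physical Q_i.
mapping = {i: i for i in range(1, 6)}
gate = (2, 5)  # CNOT q2, q5

print(is_compliant(mapping, gate))  # Q2 and Q5 are not linked

# Option of Fig. 2 (b): SWAP Q3,Q4 followed by SWAP Q3,Q5.
option_b = apply_swap(apply_swap(mapping, 3, 4), 3, 5)
# Option of Fig. 2 (c): SWAP Q2,Q3.
option_c = apply_swap(mapping, 2, 3)

print(option_b[2], option_b[5], is_compliant(option_b, gate))
print(option_c[2], option_c[5], is_compliant(option_c, gate))
```

Both options make the gate executable; they differ only in how many SWAPs they use and in how those SWAPs interact with the rest of the circuit, which is what the depth metric of Section III captures.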
III. PROPOSED SOLUTION
A. Metric
As our work is a depth-aware SWAP insertion scheme, we first precisely define the metric for characterizing the depth of a circuit. In order to fully explain the metric, we need to introduce the concepts of dependency graph and critical path.
The dependency graph represents the precedence relation between quantum gates in a logical quantum circuit. It is defined as follows.
Definition 1. Dependency Graph: The dependency graph of a quantum circuit $C$ with a set of gates $\Psi$ is a directed acyclic graph $G_\psi = (\Psi, E_\psi)$, $E_\psi \subseteq \Psi \times \Psi$. A directed edge from node $\psi_1$ to node $\psi_2$ exists if and only if the output of gate $\psi_1$ is (part of) the input of gate $\psi_2$ in the quantum circuit $C$.
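As an illustration of Definition 1, the sketch below builds the edge set of a dependency graph from an ordered gate list. The gate representation (a name plus a tuple of qubits) and the function name are assumptions made for this example; an edge is added from the most recent gate on a qubit to the next gate that uses the same qubit.

```python
def dependency_graph(gates):
    """Return the set of edges (i, j) meaning gate i must precede gate j.

    `gates` is a list of (name, qubits) tuples in program order.
    """
    last_on_qubit = {}  # qubit -> index of the latest gate acting on it
    edges = set()
    for idx, (_name, qubits) in enumerate(gates):
        for q in qubits:
            if q in last_on_qubit:
                edges.add((last_on_qubit[q], idx))
            last_on_qubit[q] = idx
    return edges

# A small hypothetical circuit on three qubits.
circuit = [
    ("h", (1,)),      # gate 0
    ("cx", (1, 2)),   # gate 1
    ("t", (2,)),      # gate 2
    ("cx", (2, 3)),   # gate 3
    ("h", (1,)),      # gate 4
]
edges = dependency_graph(circuit)
print(sorted(edges))
```

Because edges always point from an earlier gate to a later one in program order, the resulting graph is acyclic by construction.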
The critical path is the longest path in the dependency graph. It is defined as follows.
Definition 2. Critical Path: Given a dependency graph $G_\psi = (\Psi, E_\psi)$ of a quantum circuit, the critical path is $CP = \max\big(\mathrm{Path}(\psi_1, \psi_2)\big)$ s.t. $\psi_1, \psi_2 \in \Psi$ and $\psi_1 \neq \psi_2$.
The depth characterizes the number of execution steps of a quantum circuit and is equal to the critical path length of the circuit. The longest path in the dependency graph gives the minimal number of steps the circuit needs for every gate's data dependences to be resolved. Algorithm 1 shows how we calculate the critical path. We first sort the nodes of the directed acyclic graph in topological order and then process the nodes in that order. For each node, we take the earliest start time of each of its predecessors, add the latency of that predecessor, and use the maximum of these values as the earliest start time of the node. The maximum over all nodes of the earliest start time plus the node's latency is the critical path length.
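The procedure described above can be sketched as follows. This is a minimal illustration of the description, not a reproduction of Algorithm 1; the function name, the input format (a latency list indexed by gate and a set of edges), and the use of Kahn's algorithm for the topological sort are assumptions.

```python
from collections import defaultdict, deque

def critical_path_length(latency, edges):
    """Return (critical path length, earliest start time per node).

    `latency[i]` is the latency of gate i; `edges` holds pairs (u, v)
    meaning gate u must finish before gate v starts.
    """
    n = len(latency)
    succs, preds = defaultdict(list), defaultdict(list)
    indegree = [0] * n
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)
        indegree[v] += 1

    # Topological order (Kahn's algorithm).
    order = []
    ready = deque(i for i in range(n) if indegree[i] == 0)
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != n:
        raise ValueError("dependency graph contains a cycle")

    # Earliest start = max over predecessors of (their start + their latency).
    start = [0] * n
    for v in order:
        start[v] = max((start[u] + latency[u] for u in preds[v]), default=0)

    length = max((start[v] + latency[v] for v in range(n)), default=0)
    return length, start

# Five unit-latency gates with a hypothetical dependency structure.
length, start = critical_path_length(
    [1, 1, 1, 1, 1], {(0, 1), (1, 2), (2, 3), (1, 4)}
)
print(length, start)
```

Gate latencies are kept as an explicit input so that a SWAP can be modelled either as a single 3-cycle node or as three unit-latency CNOT nodes.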
We use the critical path length as the metric for ranking different swap insertion options.
B. Framework Design
With the metric defined in the previous section, we now explain the workflow of our framework and the intuition behind it.
Before delving into the details of this framework, we need to define the layer and the coupling graph. We can divide the set of quantum gates in a circuit into layers, so that all gates in the same layer can be executed concurrently. The formal definition of a layer is:
The set of gates at layer $l_i$ can run concurrently and act on distinct sets of qubits.
[Fig. 4: framework overview. Recoverable labels: Qubit Connectivity; Org. Circuit & Its Initial Layer; Process.]
To divide a circuit into layers, we group the gates that have the same earliest start time (defined in Algorithm 1) into the same layer. The order of the layers is thus determined by the order of the earliest start times.
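The grouping rule above can be sketched in a few lines. The function name is hypothetical, and the earliest start times are taken as given (for instance, as computed by a critical path routine such as the one described for Algorithm 1).

```python
from collections import defaultdict

def layers_from_start_times(start):
    """Group gate indices by earliest start time, ordered by that time."""
    groups = defaultdict(list)
    for gate, t in enumerate(start):
        groups[t].append(gate)
    return [groups[t] for t in sorted(groups)]

# Earliest start times of five hypothetical unit-latency gates.
layers = layers_from_start_times([0, 1, 2, 3, 2])
print(layers)
```

Two gates that share a qubit are linked by a path in the dependency graph, so with positive latencies they cannot have the same earliest start time; gates in the same layer therefore act on distinct qubits.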
We use an iterative process to find the mapping. Our framework is depicted in Fig. 4, and the iterative process is explained below. The framework takes as input the coupling graph (also denoted as Qubit Connectivity) and the original circuit's initial layer.
We process the circuit layer by layer. Given a layer, we perform the following steps.
• We check whether the layer is hardware-compliant, based on the coupling graph and the qubit mapping in effect before the current layer is scheduled.
• If YES, we move on to the next layer.
• If NO, we invoke our mapping searcher to search for the (set of) SWAPs needed to resolve the current layer. We consider depth-awareness when selecting the set of SWAP gates: we pick the set whose resulting mapping yields the smallest critical path length (described in Section III-C). After we find a hardware-compliant mapping, we move on to the next layer.
After all layers are processed, the mapping terminates.
C. Circuit Mapping Searcher
Here we describe the specific mapping searcher we use to overcome the coupling constraint for a given layer.
We build our method upon the A-star algorithm of [14], which finds valid mappings that minimize only the number of inserted SWAP gates. We extend it by changing the ranking metric and by allowing it to search for feasible mappings that do not necessarily have the smallest SWAP gate count. This lets us search in a way that minimizes the depth without significantly increasing the gate count.
We rank the SWAP options by the increase in the critical path length. Since the process is iterative and handles the gates layer by layer, it is tempting to minimize only the depth of the already processed circuit when deciding which SWAPs to use. Fig. 5 shows that not only the processed circuit but also the remaining circuit can help overlap the SWAPs with existing gates without affecting the critical path. As shown in Fig. 5, for the CNOT gate (in red), there is no way to overlap the necessary SWAPs with the processed circuit (the circuit before the dashed line). But when we look after the dashed line, the three single-qubit gates can overlap with the inserted SWAP. This has less impact on the depth of the resulting circuit than inserting the SWAP on $Q_1$ and $Q_2$.
Based on this intuition, we design our scheme for choosing the SWAP candidate as shown in Fig. 6. For each of the hardware-compliant remapping candidates that we acquire from the A-star searcher, we calculate the critical path after merging the candidate (set of) SWAP(s) with both the processed circuit and the not-yet-processed circuit. We choose the mapping that yields the shortest critical path.
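The selection step can be illustrated with a small sketch. It is a simplified stand-in, not the authors' implementation: the candidate list is supplied by hand rather than by an A-star searcher, a SWAP is modelled as a single 3-cycle gate, depth is computed by as-soon-as-possible scheduling per qubit (which equals the critical path length for a gate list in program order), and the circuit, the mapping, and all function names are hypothetical.

```python
def depth(gates):
    """Critical path length of (name, physical_qubits, latency) gates."""
    free_at = {}  # physical qubit -> time at which it becomes free
    total = 0
    for _name, qubits, latency in gates:
        begin = max((free_at.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            free_at[q] = begin + latency
        total = max(total, begin + latency)
    return total

def merged_depth(processed, swaps, remaining, mapping):
    """Depth of processed gates + candidate SWAPs + remapped remaining gates."""
    new_map = dict(mapping)
    swap_gates = []
    for a, b in swaps:
        relabel = {a: b, b: a}
        new_map = {q: relabel.get(p, p) for q, p in new_map.items()}
        swap_gates.append(("swap", (a, b), 3))  # assumed 3-cycle SWAP
    rest = [(name, tuple(new_map[q] for q in qs), lat)
            for name, qs, lat in remaining]
    return depth(processed + swap_gates + rest)

def pick_candidate(processed, candidates, remaining, mapping):
    """Choose the shortest critical path; break ties by fewer SWAPs."""
    return min(
        candidates,
        key=lambda s: (merged_depth(processed, s, remaining, mapping), len(s)),
    )

# Hypothetical situation: six single-qubit gates already scheduled on Q2,
# then CNOT q2, q5 under an identity logical-to-physical mapping.
mapping = {i: i for i in range(1, 6)}
processed = [("t", (2,), 1)] * 6            # gates on physical qubits
remaining = [("cx", (2, 5), 1)]             # gates on logical qubits
one_swap = [(2, 3)]
two_swaps = [(3, 4), (3, 5)]

d_one = merged_depth(processed, one_swap, remaining, mapping)
d_two = merged_depth(processed, two_swaps, remaining, mapping)
best = pick_candidate(processed, [one_swap, two_swaps], remaining, mapping)
print(d_one, d_two, best)
```

In this constructed case the two-SWAP candidate hides entirely behind the gates on Q2, while the single SWAP has to wait for them, so the candidate with more gates gives the shallower circuit. The sketch does not re-check compliance of later gates; it only estimates depth for ranking.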
D. Optimizations
We optimize our proposed solution in two ways. One is to expand more nodes during the A-star search, and the other is to search deeper levels.
1) Expand More Nodes: In the A-star search process, the normal routine is to expand the single node of least cost at each step. Here, we can expand more than one node at each step and thereby enlarge the search space. The number of nodes expanded at a time is a tunable parameter, starting from 1.
2) Deeper Search: We increase the depth of the A-star search tree. Normally, the search ends when it finds the first node that minimizes the number of SWAPs, which corresponds to a certain level of the A-star tree. Our second optimization is to continue the search into deeper levels of the tree. The extent of the deeper search is also a tunable parameter.
By tuning these parameters, more candidate nodes are added to our search space. With a larger search space, we have a better chance of escaping a local optimum and reaching the global optimum.
IV. EVALUATION
In this section, we evaluate our depth-aware SWAP insertion scheme (denoted as DPS) and compare it with the two state-of-the-art qubit mappers. The experiment setup is listed below:
• Benchmarks: We use the quantum circuits from RevLib [20], IBM Qiskit [15], and ScaffCC [21].
• Hardware Model: We use IBM's 20-qubit Q20 Tokyo architecture, which was also used in [11]. The qubit connectivity graph is shown in Fig. 7.
• Evaluation Platform: The mapping experiments are conducted on an Intel 2.4 GHz Core i5 machine with 8 GB of 1600 MHz DDR3 memory. The operating system is macOS Mojave. We use IBM's Qiskit [15] to evaluate the depth of the transformed circuit.
• Baselines: We compare our work with the two best known qubit mapping solutions, the work by Zulehner et al. [14] (denoted as Zulehner) and the Sabre qubit mapper from [11] (denoted as Sabre), as well as with IBM's stochastic mapper in Qiskit. Since IBM's Qiskit mapper is significantly worse in terms of gate count and depth than all other mappers we evaluate, as also evidenced in the work by Zulehner et al. [14], we do not present Qiskit results.
• Metrics: We compare the depth and gate count of the transformed circuits for all strategies.
Fig. 7. IBM Q20 Tokyo Physical Layout [11]
Table I shows a summary of the experimental results. For gate count, we compare the total gate count of the transformed circuit. For depth, we compare the increase in depth for each benchmark, denoted as "Depth-delta" in Table I. The improvement columns give the ratio between the depth-delta of one of the two baselines and our depth-delta. We use the term minimum improvement to denote the improvement over the better of the two baselines, and the term maximum improvement to denote the improvement over the worse of the two baselines.
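The two improvement figures defined above can be computed as in the following sketch. The function name is hypothetical, and the numbers in the usage line are made up for illustration; they are not taken from Table I.

```python
def improvements(zulehner_delta, sabre_delta, ours_delta):
    """Return (minimum improvement, maximum improvement) as ratios.

    Each ratio is a baseline's depth-delta divided by our depth-delta.
    The minimum is measured against the better (smaller-delta) baseline,
    the maximum against the worse one.
    """
    if ours_delta <= 0:
        raise ValueError("ratio is undefined when our depth-delta is not positive")
    ratios = (zulehner_delta / ours_delta, sabre_delta / ours_delta)
    return min(ratios), max(ratios)

# Hypothetical depth-deltas: baselines add 30 and 45 levels, ours adds 15.
min_impr, max_impr = improvements(30, 45, 15)
print(min_impr, max_impr)
```

A minimum improvement above 1 means the approach beats both baselines on depth-delta for that benchmark.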
We discuss our findings from the following three aspects: depth reduction, gate count change, and the trade-off between gate count and depth.
A. Depth Reduction
For depth reduction, as shown in Table I, our proposed solution outperforms the two baselines, Zulehner and Sabre. Comparing depth-delta (the added depth of the circuit), our approach outperforms the better of the two baselines by more than 20% and by up to 3×. For five of the twenty-three benchmarks, our improvement in depth-delta is less than 20% compared with the better of the two baselines. However, in these cases our approach still achieves a considerable improvement over the worse of the two baselines. It is possible that in these cases one of the two baselines happens to achieve a very good depth in the transformed circuit, leaving little room for improvement. Our approach is still able to find a good mapping for these benchmarks, and its performance is on par with the better of the two baselines.
B. Gate Count Changes
The primary goal of our depth-aware qubit mapper is to minimize the depth of the circuit. However, we find that our qubit mapper can sometimes also reduce the gate count. For four of the twenty-three benchmarks (17%), our qubit mapper yields the smallest number of gates among the three qubit mappers. For 57% of the benchmarks, our method ranks among the top two of the three qubit mappers in terms of gate count. For the benchmarks where our method yields the largest gate count, the percentage increase in gate count is negligible. On average, our depth-aware qubit mapper adds 3% to the gate count. The experimental results show that our solution reduces the depth of the circuit without greatly increasing the number of gates.
C. Trade-off between Gate Count and Depth
While all previous works focus on reducing the total gate count (and the depth among the inserted gates themselves) after the qubit mapping transformation, it is crucial to consider the trade-off between the resulting gate count and depth. A choice made during the search process that favors a reduced gate count may adversely affect the critical path. In Table I, the Sabre mapper reduces the number of gates for the 10-qubit QFT by 1.1% compared with Zulehner's mapper, but increases the depth by 44.5%. For the sym 9 246 benchmark, Sabre reduces the gate count by 3.8% compared with our approach, but increases the depth by 25.5%. Therefore, a small reduction in the gate count may not be worthwhile if it increases the circuit depth significantly.
