Quantum computing (QC) technologies have reached a second renaissance in the last decade. Some fully programmable QC devices have been built based on superconducting or ion trap technologies. Although different quantum technologies have their own parameter indicators, QC devices in the NISQ era share common features and challenges such as limited qubits and connectivity, short coherence time and high gate error rates. Quantum programs written by programmers can hardly run on real hardware directly, since two-qubit gates are usually allowed on only a few pairs of qubits. Therefore, quantum computing compilers must resolve the mapping problem and transform original programs to fit the hardware limitations.
INTRODUCTION
Quantum Computing (QC) has attracted huge attention in the last decade due to its ability to exponentially accelerate several important algorithms [12, 15, 27, 33]. Both QC algorithm designers and programmers work at a very high level, and need to know little about the (future) Noisy Intermediate-Scale Quantum (NISQ) devices that (will) execute quantum programs. There exists a gap, however, between NISQ devices and the hardware requirements (e.g., size and reliability) of QC algorithms. To bridge this gap, QC requires abstraction layers and toolchains to translate and optimize applications [9]. QC compilers typically translate high-level QC code into (optimized) circuit-level assembly code in multiple stages. To use NISQ hardware, quantum circuit programs have to be compiled to the target device, which includes mapping logical qubits to the physical qubits of the device. The mapping step, which we focus on in this abstract, is challenging because further physical constraints have to be considered: 2-qubit gates can only be applied to certain physical qubit pairs. Therefore, additional SWAP operations have to be inserted in order to "move" the logical qubits to positions where they can interact with each other. This qubit mapping problem has been proved to be NP-complete [35].
Previous solutions to this problem can be classified into two types. One type formulates the problem as an equivalent mathematical problem and applies a solver [6-8, 22, 25, 28, 30, 31, 38, 39, 41, 43], and the other type uses heuristic search to obtain approximate results [5, 17-20, 29, 34, 42, 45]. The former suffers from very long runtime and can only be applied to small cases. The latter has better runtime, especially when the circuit is large. All of them assume that different gates have the same execution duration. On NISQ hardware, however, different gates have different durations (see Table 1). Ignoring the gate duration difference may cause these algorithms to find the shortest depth but not the shortest execution time. The real execution time of the circuit is associated with the weighted depth, in which different gates have different duration weights. Considering the gate duration difference helps the compiler make better use of the parallelism of the quantum circuit and generate a circuit with shorter execution time. In this abstract, we focus on solving the qubit mapping problem by heuristic search, taking the gate duration difference and the program context into account to exploit more of the program's parallelism. To address the challenges of the qubit mapping problem and adapt to different quantum technologies, we first give several examples to explain our motivation, then propose a quantum abstract machine (QAM) for studying the qubit mapping problem. The QAM is modelled as a 2D coupling graph with limited connectivity and configurable durations of different kinds of quantum gates. Based on the QAM, we further propose two mechanisms that enable the COntext-sensitive and Duration-Aware Remapping algorithm (CODAR) to solve the qubit mapping problem with awareness of the gate duration difference and the program context. Experimental results show that, compared to the best known remapping algorithm, CODAR reduces the weighted depth by 17.5% ∼ 19.4% on average.
PROBLEM ANALYSIS

Recent Work on Qubit Mapping
There is a lot of research on the qubit mapping problem. Here we focus on analyzing some valuable solutions from the last two years [4, 19, 26, 35, 37, 41, 45]. All of them are proposed for IBM QX architectures, and none of them considers the gate duration difference.
Solutions only considering qubit coupling. The works in [35, 41] provide solutions for 5-qubit IBM QX architectures with directed coupling. Siraichi et al. [35] propose an optimal algorithm based on dynamic programming, which only fits small circuits; they then propose a heuristic one which is fast but oversimplified, with results worse than IBM's solution. Wille et al. [41] present a solution with a minimal number of additional SWAP and H operations, in which the qubit mapping problem is formulated as a symbolic optimization problem with high complexity. They utilize powerful reasoning engines to solve this computationally expensive task. The works in [19, 45] use heuristic search to provide good solutions in acceptable time for large-scale circuits. Zulehner et al. divide the two-qubit gates into independent layers, then use A* search plus a heuristic cost function to determine compliant mappings for each layer [45]. Li et al. propose a SWAP-based bidirectional heuristic search algorithm, named SABRE [19], which can produce comparable results with exponential speedup against previous solutions such as [45].
Solutions further considering error rates. The works in [4, 26, 37] take another perspective on the qubit mapping problem. They consider the variation in the error rates of different qubits and connections to generate directly executable circuits that improve reliability rather than minimize circuit depth and the number of gates. Based on the error rate data from real IBM Q16 and Q20 devices respectively, [26, 37] use an SMT solver to schedule gate operations to qubits with lower error probabilities. Ash-Saki et al. propose two approaches, Sub-graph Search and a Greedy approach, to optimize gate errors [4]. Circuits generated by these methods may suffer from long execution time because they do not consider minimizing the circuit depth.
What we consider in the qubit mapping. We want to produce solutions for the qubit mapping problem that speed up execution relative to previous works while maintaining fidelity. Besides the coupling map, we further consider the program context and the gate duration difference, both of which affect the design of qubit mapping. Considering these factors helps to find remapping solutions with near-optimal execution time that is close to reality.
Motivating Examples
We use several examples written in OpenQASM [11] to explain our motivation for considering the program context and the gate duration difference in the qubit remapping process. The examples are based on the coupling map of four physical qubits $Q_0$ ∼ $Q_3$ and the assumed gate durations defined in Fig. 1 (a) and (b). For ease of explanation, we initially map the logical qubits q[0]∼q[3] directly to the physical qubits $Q_0$ ∼ $Q_3$.
Impact of program context. Consider the OpenQASM code fragment shown in Fig. 1 (a), where CX means CNOT in OpenQASM. Since qubits $Q_0$ and $Q_3$ are non-adjacent, the instruction "CX q[0],q[3]" at line 2 cannot be applied. To solve the problem, a SWAP operation is required before performing the CX operation. In this case, there are four candidate SWAP pairs, i.e., ($Q_0$, $Q_1$), ($Q_0$, $Q_2$), ($Q_3$, $Q_1$) and ($Q_3$, $Q_2$). If the program context, i.e., the predecessor instruction "T q[2];", is not considered, the four candidates are indistinguishable when selecting. However, a SWAP on pair ($Q_3$, $Q_2$) or ($Q_0$, $Q_2$) conflicts with the context instruction "T q[2];" because both operate on $Q_2$, so it has to be executed serially after the T operation, as shown in Fig. 1 (c). A SWAP on pair ($Q_0$, $Q_1$) or ($Q_3$, $Q_1$) does not conflict with "T q[2];" and can be executed in parallel with it, as shown in Fig. 1 (d). With awareness of the context information, the SWAP operations that improve parallelism can be identified.
Impact of gate duration difference. We use a 4-qubit QFT (quantum Fourier transform) circuit to explain the limitation of ignoring the duration of quantum gates. Fig. 2 shows a 4-qubit QFT OpenQASM program, which is generated by the ScaffCC compiler [2]. Similar to the first example, a SWAP operation is required before performing the CX operation, and there are the same four candidate SWAP pairs. Instructions "T q[1]" and "CX q[0],q[2]" can be executed in parallel, and we assume both of them start at cycle 0. If the difference of gate durations is ignored, the two gates "T q[1]" and "CX q[0],q[2]" are assumed to finish at the same time t, and the four candidate SWAP operations have to start after t. But if the duration of CX is twice that of T, "T q[1]" finishes at cycle 1 while "CX q[0],q[2]" finishes at cycle 2. As a result, the SWAP between q[3] and q[1] can start at cycle 1, as shown in Fig. 2 (d), while the other three candidate SWAP operations have to start at cycle 2 because one of their operands, $Q_0$ or $Q_2$, is occupied, as shown in Fig. 2 (c). Fig. 2 (d) has better parallelism, which can be deduced only with awareness of the different quantum gate durations.

Fig. 2. A 4-qubit QFT example reflecting the impact of gate duration difference: "SWAP q[3],q[1]" is the best candidate since it can start immediately after "T q[1]" while "CX q[0],q[2]" has not finished yet, increasing the parallelism of the circuit.
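The effect of gate durations on the two routing choices can be checked with a small as-soon-as-possible scheduler. This is only a sketch under stated assumptions: gates are given directly on physical qubits, the durations are T = 1, CX = 2 and SWAP = 6 cycles (the SWAP value is borrowed from the worked example later in the text), and the function name `weighted_depth` is hypothetical rather than part of the authors' implementation.

```python
# Assumed durations in clock cycles: CX is twice as long as T; SWAP = 6.
DURATION = {"t": 1, "cx": 2, "swap": 6}

def weighted_depth(gates, duration=DURATION):
    """ASAP-schedule gates in program order.

    Returns (finish time of the whole circuit, list of start cycles).
    Each gate is (name, tuple of physical qubit indices).
    """
    free_at = {}   # physical qubit -> cycle at which it becomes free
    starts = []
    for name, qubits in gates:
        # A gate can start only when all of its qubits are free.
        start = max((free_at.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            free_at[q] = start + duration[name]
        starts.append(start)
    return max(free_at.values(), default=0), starts

prefix = [("t", (1,)), ("cx", (0, 2))]
# Route "CX q[0],q[3]" through Q1 (free at cycle 1) or through Q2 (free at cycle 2).
via_q1 = prefix + [("swap", (3, 1)), ("cx", (0, 1))]
via_q2 = prefix + [("swap", (3, 2)), ("cx", (0, 2))]

depth_q1, starts_q1 = weighted_depth(via_q1)
depth_q2, starts_q2 = weighted_depth(via_q2)
print(depth_q1, starts_q1)  # 9 [0, 0, 1, 7]
print(depth_q2, starts_q2)  # 10 [0, 0, 2, 8]
```

Both routes have the same number of gates, yet under these assumed durations the route through $Q_1$ finishes one cycle earlier because its SWAP can start at cycle 1.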
Quantum Architecture Abstraction
Table 2 (excerpt). $I$: quantum operations in the circuit program. A sequence of quantum operations, $I = [g_1, g_2, \ldots, g_k]$ with $k = |I|$; the length of $I$ is written as $I.len$.
Since the qubit mapping problem is affected by the constraints of the underlying QC devices, which are based on various and evolving quantum technologies, it is essential to design quantum mapping algorithms that are compatible with different quantum technologies.
In view of the above, we consider the qubit connectivity of various NISQ devices, and take each gate duration as a multiple of a quantum clock cycle $\tau_u$, which is analogous to the classical clock cycle. We then introduce a Multi-architecture Adaptive Quantum Abstract Machine (maQAM), which consists of static and dynamic structures, denoted as $A = (A_s, A_d)$. Table 2 shows the definitions for maQAM, where $A_s = (Q_H, G, M, \tau, D)$ and $A_d = (\pi, CF)$. We assume the device provides enough physical qubits (denoted $N$) for the program's execution (with $n$ logical qubits in the program), i.e., $N \geq n$.
For a QC device, we abstract its qubit layout as a graph $M$ in which qubits are vertices and an edge connects each pair of qubits on which a two-qubit gate is allowed. We introduce the Gate Duration Map $\tau$ into $A_s$, which maps each kind of quantum gate to its duration, depending on the information from the quantum architecture. We assume that quantum gates of the same kind have the same duration and fidelity. We also introduce the shortest distance matrix $D$ between each pair of physical qubits for quick selection of exchangeable qubits in our CODAR scheduling algorithm.
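The static part $A_s$ can be pictured as a small data structure. The sketch below is an illustration only: the class name `StaticQAM` and its fields are hypothetical, the gate set $G$ is represented implicitly by the keys of the duration map, and the 4-qubit coupling edges are an assumption consistent with the candidate SWAP pairs of the motivating example. The distance matrix $D$ is computed by breadth-first search over the coupling graph $M$.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class StaticQAM:
    """Hypothetical container for the static structure A_s of the maQAM."""
    num_qubits: int                  # N, the number of physical qubits in Q_H
    edges: list                      # coupling graph M as undirected qubit pairs
    duration: dict                   # gate duration map tau, in clock cycles
    dist: list = field(init=False)   # shortest distance matrix D

    def __post_init__(self):
        adj = {q: set() for q in range(self.num_qubits)}
        for a, b in self.edges:
            adj[a].add(b)
            adj[b].add(a)
        self.dist = [self._bfs(src, adj) for src in range(self.num_qubits)]

    def _bfs(self, src, adj):
        # Unweighted shortest paths from src; None marks unreachable qubits.
        dist = [None] * self.num_qubits
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    def adjacent(self, a, b):
        """True when a two-qubit gate may be applied directly on (a, b)."""
        return self.dist[a][b] == 1

# Assumed 4-qubit coupling map: Q0 and Q3 are not adjacent.
qam = StaticQAM(4, [(0, 1), (0, 2), (1, 3), (2, 3)], {"t": 1, "cx": 2, "swap": 6})
print(qam.dist[0][3], qam.adjacent(0, 3))  # 2 False
```

Precomputing the distances once keeps later lookups constant-time, which matches the stated purpose of $D$ as a quick-selection aid.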
DESIGN
In this section, we discuss our COntext-sensitive and Duration-Aware Remapping algorithm (CODAR). We first give an overview of the idea of CODAR, then introduce the two key mechanisms that give CODAR context sensitivity and duration awareness. The main idea of CODAR is to generate an executable gate sequence for a given input OpenQASM program by adjusting the gate order and inserting SWAP operations while leaving the program semantics unchanged. The generated gate sequence fits the quantum hardware limitations on the one hand, and has better parallelism on the other hand, which reduces the circuit's weighted depth, i.e., its simulated execution time. We propose the qubit lock mechanism in Section 3.1 for quickly finding available qubits, and we adjust the gate order based on the quantum gate commutativity described in Section 3.2.
Qubit Lock
CODAR is based on a reasonable assumption: two or more gates cannot be applied to the same qubit at the same time. If a qubit is occupied by a gate, it is called a busy (not free) qubit and no other gate can be applied to it. As in the example shown in Fig. 1, when inserting a SWAP for a specific two-qubit gate CX q[0],q[3], the neighbour qubit q[2] of the target qubits may be occupied by a contextual gate that started earlier. Using occupied qubits to route the two-qubit gate reduces the parallelism of the program, because the routing process has to wait until the occupied qubits become free.
To make CODAR aware of the qubit occupation by past contextual gates, we introduce a qubit lock $t_{end}$ for each physical qubit in $Q_H$. When a quantum gate $g \in G$ with duration $\tau_g$ starts at time $t$ on a physical qubit in $Q_H$, CODAR updates this qubit's $t_{end}$ to $t + \tau_g$, which means that the qubit is occupied until $t + \tau_g$. A qubit is free only when its lock satisfies $t_{end} \leq$ current time. When trying to find a routing path for a specific two-qubit gate, CODAR compares the $t_{end}$ of each qubit with the current time to learn which qubits are occupied by past contextual gates. Fig. 3 shows an example. Gates can only be applied to physical qubits in the free state. Gates whose associated physical qubits are all free are called lock-free gates.
The qubit lock also makes CODAR aware of the gate duration difference. Different kinds of gates have different durations, so CODAR updates the lock $t_{end}$ of the operated qubits with different values. As a result, qubits operated on by gates with shorter durations get a smaller $t_{end}$ and become free earlier. CODAR can thus use those earlier-freed qubits to route two-qubit gates and improve the parallelism of the program. As in the example shown in Fig. 2 (d), suppose the program starts at time 0 and $\tau_T = 1$, $\tau_{CNOT} = 2$. Then $t_{end}$ of $Q_1$ is set to 1, while $t_{end}$ of $Q_0$ and $Q_2$ are set to 2. $Q_1$ becomes free at time 1 while $Q_2$ is still busy. CODAR can use $Q_1$ to route the third gate and need not wait for $Q_2$ to become free.
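The lock bookkeeping described above is simple enough to sketch in a few lines. The class `QubitLocks` and its method names are hypothetical and only illustrate the rule "a qubit is free when $t_{end} \leq$ current time"; they are not taken from the authors' code.

```python
class QubitLocks:
    """Per-qubit lock t_end: the cycle at which each physical qubit becomes free."""

    def __init__(self, num_qubits):
        self.t_end = [0] * num_qubits

    def is_free(self, qubit, now):
        return self.t_end[qubit] <= now

    def lock_free(self, qubits, now):
        """A gate is lock free when all of its physical qubits are free."""
        return all(self.is_free(q, now) for q in qubits)

    def launch(self, qubits, now, duration):
        """Start a gate at cycle `now`; its qubits stay busy until now + duration."""
        if not self.lock_free(qubits, now):
            raise ValueError("gate touches a busy qubit")
        for q in qubits:
            self.t_end[q] = now + duration

# The Fig. 2 (d) situation with tau_T = 1 and tau_CNOT = 2, starting at time 0.
locks = QubitLocks(4)
locks.launch((1,), now=0, duration=1)     # T on Q1
locks.launch((0, 2), now=0, duration=2)   # CX on Q0, Q2
print(locks.t_end)                        # [2, 1, 2, 0]
print(locks.lock_free((3, 1), now=1))     # True: SWAP on (Q3, Q1) may start
print(locks.lock_free((3, 2), now=1))     # False: Q2 is still busy
```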
Commutativity Detection
The qubit lock gives CODAR awareness of past contextual gates. Considering the gate commutation relation exposes more future contextual gates to CODAR when it decides the routing path.

Definition 3.1 (Commutative Forward Gate, CF gate). Given a gate sequence $I = [g_1, g_2, \ldots, g_k, \ldots]$, $\forall g_k \in I$, $g_k$ is a commutative forward gate iff $\forall j$, $0 < j < k$, $g_j$ and $g_k$ are commutative.
The commutation relation between two-qubit gates $A$, $B$ that share qubits with each other can be determined by checking whether the relevant unitary operators satisfy $\hat{A}\hat{B} = \hat{B}\hat{A}$. Gates applied to disjoint qubits obviously commute with each other. Since a commutative forward gate commutes with all the gates before it in sequence $I$, it can be moved to the head of $I$.
All CF gates in sequence $I$ are denoted as $CF(I)$; they can be executed instantly from the software perspective. Compared to the method that fetches gates with no predecessor as instantly executable gates, choosing CF gates as instantly executable gates can expose more contextual gates for heuristic search to determine better remapping solutions.
Suppose a sequence $I$ contains two gates: CX q1,q3 and CX q2,q3, in that order. The second gate shares q3 with the first and might not be regarded as instantly executable due to qubit dependence. However, because the second gate commutes with the first and is a CF gate in $I$, it is in fact instantly executable. Commutativity detection exposes both CX gates to the heuristic search, which improves its contextual look-ahead ability.
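A brute-force check of Definition 3.1 can be sketched by building the unitaries and comparing the two products. The sketch below is an assumption-laden illustration, not the authors' detection routine: it supports only CX and T, uses dense pure-Python matrices (so it is practical only for a handful of qubits), and the function names `gate_matrix`, `commute` and `cf_gates` are hypothetical.

```python
import cmath

def gate_matrix(name, qubits, n):
    """Dense 2^n x 2^n unitary; bit q of a basis index is the state of qubit q."""
    dim = 1 << n
    mat = [[0j] * dim for _ in range(dim)]
    for basis in range(dim):
        if name == "cx":
            control, target = qubits
            # Flip the target bit when the control bit is 1.
            out = basis ^ (1 << target) if (basis >> control) & 1 else basis
            mat[out][basis] = 1 + 0j
        elif name == "t":
            (q,) = qubits
            # T = diag(1, e^{i*pi/4}) on qubit q.
            mat[basis][basis] = cmath.exp(1j * cmath.pi / 4) if (basis >> q) & 1 else 1 + 0j
        else:
            raise ValueError(f"unsupported gate: {name}")
    return mat

def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def commute(g1, g2, n, tol=1e-9):
    """Check AB = BA; gates on disjoint qubits commute trivially."""
    if not set(g1[1]) & set(g2[1]):
        return True
    a, b = gate_matrix(*g1, n), gate_matrix(*g2, n)
    ab, ba = matmul(a, b), matmul(b, a)
    return all(abs(x - y) <= tol for ra, rb in zip(ab, ba) for x, y in zip(ra, rb))

def cf_gates(seq, n):
    """0-based indices of gates that commute with every earlier gate in seq."""
    return [k for k, g in enumerate(seq)
            if all(commute(seq[j], g, n) for j in range(k))]

print(cf_gates([("cx", (1, 3)), ("cx", (2, 3))], n=4))  # [0, 1]
print(cf_gates([("cx", (0, 1)), ("cx", (1, 2))], n=3))  # [0]
```

The first call reproduces the example from the text: two CX gates sharing the target q3 commute, so both are CF gates. In the second call the target of the first CX is the control of the second, so the gates do not commute and only the first is a CF gate.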
Example
We now use the example shown in Fig. 4 to explain our algorithm. Suppose there is a 6-qubit device and we are given a gate sequence $I$ that contains a CX on {q0,q2}, a T on {q1} and a CX on {q0,q3}. The number near each qubit node represents the value of its $t_{end}$. All three gates are CF gates. Due to the coupling limitation, the CX on {q0,q3} is not directly executable and a SWAP is needed. The algorithm simulates the execution timeline and starts at cycle 0. At cycle 0, the first gate "CX q0,q2" and the second gate "T q1" are directly executable, so both of them are launched and the $t_{end}$ locks of qubits {q0,q1,q2} are updated with the gate durations (T = 1 cycle, CX = 2 cycles). At cycle 0, each of {q0,q1,q2} has a $t_{end}$ greater than the current time and thus they are locked. Therefore the SWAPs between {q1,q3} and between {q2,q3} are blocked. The SWAP between {q3,q5}, with $H_{basic} < 0$ (which means that the SWAP won't shorten the total distance of CF gates according to our heuristic cost function), moves q3 away from q0 and will not be inserted. As a result, no SWAP is inserted in cycle 0 and the mapping $\pi$ stays unchanged. At cycle 1, qubit q1 becomes free while q2 stays busy. The SWAP between {q1,q3} becomes free while the SWAP between {q3,q2} is still blocked. The algorithm therefore knows that the SWAP between {q1,q3} can start earlier than the SWAP between {q3,q2} and chooses SWAP q3,q1 to solve the remapping problem. After launching the SWAP between {q1,q3}, the qubit locks of {q1,q3} are updated to the sum of its start time (cycle 1) and the duration of SWAP (6 cycles), i.e., 7.
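The timeline above can be replayed with a short simulation. Everything structural here is an assumption made for illustration: the coupling edges are a guess that is merely consistent with the narrative (the real graph is the one in Fig. 4), the initial mapping is the identity, and the score `gain` (reduction in distance between the blocked pair) is a simplified stand-in for the authors' heuristic $H_{basic}$. The helper names are hypothetical.

```python
from collections import deque

# Assumed 6-qubit coupling graph: q3 neighbours q1, q2, q5; q0 and q3 not adjacent.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 5), (4, 5)]
TAU = {"t": 1, "cx": 2, "swap": 6}   # durations in cycles, as in the example

def distance_matrix(n, edges):
    adj = {q: [] for q in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = []
    for src in range(n):
        row = [None] * n
        row[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if row[v] is None:
                    row[v] = row[u] + 1
                    queue.append(v)
        dist.append(row)
    return dist

def pick_swaps(now, t_end, dist, blocked=(0, 3)):
    """Lock-free SWAP edges that bring the blocked gate's qubits closer."""
    a, b = blocked
    picks = []
    for u, v in EDGES:
        if a not in (u, v) and b not in (u, v):
            continue
        move = {u: v, v: u}
        na, nb = move.get(a, a), move.get(b, b)   # positions after the SWAP
        gain = dist[a][b] - dist[na][nb]          # stand-in for H_basic
        if gain > 0 and t_end[u] <= now and t_end[v] <= now:
            picks.append((u, v))
    return picks

dist = distance_matrix(6, EDGES)
t_end = [0] * 6
# Cycle 0: launch the directly executable gates CX q0,q2 and T q1.
for name, qubits in [("cx", (0, 2)), ("t", (1,))]:
    for q in qubits:
        t_end[q] = 0 + TAU[name]

picks_at_0 = pick_swaps(0, t_end, dist)
picks_at_1 = pick_swaps(1, t_end, dist)
print(picks_at_0)   # []
print(picks_at_1)   # [(1, 3)]

# Cycle 1: launch the chosen SWAP and update its locks.
u, v = picks_at_1[0]
t_end[u] = t_end[v] = 1 + TAU["swap"]
print(t_end[1], t_end[3])   # 7 7
```

Under these assumptions the simulation reproduces the narrative: nothing can be inserted at cycle 0, the SWAP on {q1,q3} is the only useful lock-free candidate at cycle 1, and its qubits are locked until cycle 7.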
EXPERIMENTAL EVALUATION
In this section, we evaluate CODAR with benchmarks based on the latest reported hardware models.
Comparison with Previous Algorithms. Several recent algorithms proposed by IBM [16], Siraichi et al. [35], Zulehner et al. [45] and Li et al. [19] try to find solutions of the qubit mapping problem with small circuit depth. Among them, Li's SABRE [19] outperforms the other three on the benchmarks, so it is used for comparison in this paper.
Hardware Configuration. We test our algorithm on several recently reported architectures, including IBM Q20 Tokyo [19], IBM Q16 Melbourne [1], the 6 × 6 grid model from Enfield's GitHub [35] and Google Q54 Sycamore [3]. The gate duration configuration is based on the experimental data of symmetric superconducting technology shown in Table 1, where the two-qubit gate duration is generally twice that of the single-qubit gate.
Benchmarks. To evaluate our algorithm, we collect a total of 71 benchmarks selected from previous work, including: 1) programs from IBM Qiskit [10]'s GitHub and RevLib [40]; 2) several quantum algorithms compiled from ScaffCC [2] and Quipper [14]; 3) benchmarks used in the best-known algorithm SABRE [19]. The size of the benchmarks ranges from 3 qubits up to 36 qubits and about 30,000 gates. For the IBM Q16, Q20 and 6 × 6 architectures, 68 of the 71 benchmarks are tested; the three 36-qubit programs are excluded. All 71 benchmarks are tested on Google Q54 Sycamore.

Circuit Execution Speedup. We collect the weighted circuit depth of the circuits produced by CODAR and SABRE for the 71 benchmarks. The initial mapping has been shown to be significant for the qubit mapping problem, so for a fair comparison we use the same method as SABRE to create the initial mapping for the benchmarks. We use the ratio of the depth of the circuit produced by SABRE to that of the circuit produced by CODAR to show the ability of our algorithm to speed up the quantum program. As shown in Fig. 5, the average speedup ratios of CODAR on the four architecture models, IBM Q16 Melbourne, Enfield 6 × 6, IBM Q20 Tokyo and Google Q54, are 1.212, 1.241, 1.214 and 1.258, respectively.
CONCLUSION
In the NISQ era, most quantum programs are not directly executable, because a program may apply two-qubit gates between arbitrary pairs of logical qubits, while hardware constraints allow them only between adjacent physical qubits. To solve this problem, in this paper we propose CODAR, which transforms the original circuit and inserts the necessary SWAP operations so that the circuit complies with the hardware constraints. With the design of the qubit lock and commutativity detection, CODAR is aware of the program context and the gate duration difference, which help the CODAR remapper find a remapping with good parallelism and reduce the circuit's weighted depth. Experimental results show that, compared to the best known remapping algorithm, the CODAR remapper reduces the weighted depth by 17.5% ∼ 19.4% on average.
