A number of quantum processors consisting of a few tens of noisy qubits already exist; they are called Noisy Intermediate-Scale Quantum (NISQ) devices. Their low number of qubits precludes the use of quantum error correction procedures, so only small quantum algorithms can be run successfully. These quantum algorithms need to be compiled to respect the constraints imposed by the quantum processor, a task known as the mapping or routing problem. Mapping increases the number of gates and the circuit depth, which decreases the algorithm's success rate.
In this paper, we present a mapper called Qmap that makes quantum circuits executable on the Surface-17 processor, a scalable processor with a surface code architecture. It takes into account not only the elementary gate set and qubit connectivity constraints but also the restrictions imposed by the use of shared classical control, which have not been considered so far. Qmap is embedded in the OpenQL compiler and uses a configuration file where the processor's characteristics are described and that makes it capable of targeting different quantum processors. To show this flexibility and evaluate its performance, we map 56 quantum benchmarks on two different superconducting quantum processors, the Surface-17 (17 qubits) and the IBM Q Tokyo (20 qubits), while using different routing strategies. We show that the best router can reduce the resulting overhead up to 80% (72%) for the number of gates and up to 71.4% (66.7%) for the circuit latency (depth) on the Surface-17 (IBM Q Tokyo) when compared to the baseline (trivial router). In addition, having a slightly higher qubit connectivity helps to decrease the number of inserted movement operations (up to 82.3%) and the use of MOVE operations instead of SWAPs can reduce the number of gates and the circuit latency up to 38.9% and 29%, respectively. Finally, we analyze how the mapping affects the reliability of some small quantum circuits. Their fidelity shows a decrease that ranges from 1.8% to 13.8%.
I. INTRODUCTION
Quantum computers promise to solve a certain set of complex problems that are intractable for even the most powerful current supercomputers, the most famous example being the factorization of large numbers using Shor's algorithm [1]. However, a fault-tolerant (FT) large-scale quantum computer with thousands or even millions of qubits will be required to solve this kind of problem [2, 3].
Quantum computing is still far from that point, as it is only now entering the Noisy Intermediate-Scale Quantum (NISQ) era [4]. This refers to exploiting quantum processors consisting of only 50 to a few hundred noisy qubits, i.e., qubits with a relatively short coherence time and faulty gates [5, 6]. Due to the limited number of qubits, little or no quantum error correction (QEC) will be used in the coming years, which limits the size of the quantum applications that can be run successfully on NISQ processors. Nevertheless, these processors will still be useful to explore quantum physics and to implement small quantum algorithms that will hopefully demonstrate quantum advantage [4]. For running near-term quantum applications on noisy quantum devices, it is thus crucial to minimize their size in terms of circuit width (number of qubits), number of gates, and circuit depth (number of time-steps) [7, 8].
In addition, these quantum applications have to be adapted to the hardware constraints imposed by current quantum processors. The main constraints include:
• Elementary gate set: Generally, only a limited set of quantum gates that can be realized with relatively high fidelity will be predefined on a quantum device. Each quantum technology may support a specific universal set of single-qubit and two-qubit gates. For instance, some superconducting quantum technologies have CZ as an elementary two-qubit gate [9, 10].
• Qubit connectivity: Quantum technologies such as superconducting qubits [5, 11-13] and quantum dots [14, 15] arrange their qubits in 2D architectures with nearest-neighbour (NN) interactions. This means that only neighbouring qubits can interact; in other words, qubits must be adjacent to perform a two-qubit gate. In other technologies, such as trapped-ion qubits, the qubits are fully connected and allow all-to-all interactions [16].
• Classical control: Classical electronics are required for controlling and operating the qubits. Using a dedicated instrument per qubit is very expensive and does not scale. Therefore, shared control is required, especially when building scalable quantum processors. For instance, a single Arbitrary Waveform Generator (AWG) is used for operating on a group of qubits, and several qubits are measured through the same feedline [17, 18]. This limits the possible parallelism of quantum operations, leading to longer latencies and a larger circuit depth.
All these constraints may vary between different qubit implementations and even within the same quantum technology. In order to meet them, a mapping procedure is required to transform a hardware-agnostic quantum circuit into a hardware-aware version that can be run on a given quantum processor. Mapping will: i) map virtual qubits (qubits in the circuit) to hardware qubits (physical qubits in the processor); ii) route qubits, that is, move non-adjacent qubits to neighbouring positions when they need to interact; and iii) schedule the operations, respecting not only the dependencies between them but also the classical control constraints. For routing, the path that the qubits will follow needs to be determined, and movement operations such as shuttling in trapped-ion and Si-spin quantum processors [15, 19] and SWAPs in superconducting quantum processors are inserted accordingly. Note that routing increases the number of operations as well as the circuit depth. In addition, gates will be decomposed into elementary gates and the circuit will be optimized at different stages of the compilation process. For NISQ processors it is key to minimize the mapping overhead such that the resulting circuit still has a high reliability and success rate: the higher the number of gates and/or the circuit depth, the higher the failure rate of the computation and thus the lower the reliability of the circuit.
Different solutions have been proposed to map quantum circuits onto NISQ processors. The works in [20-28] map quantum algorithms onto processors with a 2D grid structure, while [29-35] and [36, 37] propose mapping algorithms targeting IBM and Rigetti processors, respectively. Most of the work done so far focuses on a specific processor architecture and mainly considers the connectivity constraint and the elementary gate set. However, other constraints of NISQ devices, such as shared classical control, should also be taken into account to make quantum applications executable [18, 38].
These mapping algorithms usually use either the number of inserted movement operations or the circuit depth as the optimization metric; that is, they choose the routing path that inserts the fewest extra gates or the one that produces the minimal circuit depth overhead. The same metrics, together with the execution time (the time it takes to perform the mapping), are used to evaluate the quality of the mapping algorithms. Although both the number of gates and the circuit depth are correlated with the reliability of quantum circuits and should be minimized, as mentioned above, these works do not analyze how they degrade the algorithm's performance.
Recent works [8, 32, 34, 35, 39, 40] propose to use reliability as an optimization metric and analyze how the mapping process affects the success rate (also called execution success probability) of the algorithm. They suggest choosing the routing path based on the fidelity of the two-qubit gates along the path, as these gates are used to implement the movements (noise-aware mapper). Note that the fidelity of two-qubit gates can vary between different pairs of qubits. However, the reliability of a path is calculated by simply multiplying the reliability of each gate, without considering error propagation and decoherence, which makes this metric incomplete and not very accurate; it sometimes fails to predict the most reliable route [35]. Based on the results presented in these papers, it seems that optimizing reliability instead of just the number of gates leads to better success rates, at least for small quantum circuits.
As we showed, most of the works on mapping focus on IBM and Rigetti superconducting processors or on the UMD trapped-ion processor. They only target a particular quantum processor (e.g. IBM Q Yorktown) or a family of processors (e.g. IBM Q Tenerife, IBM Q Melbourne, IBM Q Tokyo). Recently, mappers capable of generating executable circuits for different quantum processors have been presented [40, 41]. However, none of them takes into account the control electronics constraints, which can be very restrictive, especially when scaling up quantum processors. Nor do they consider information such as gate duration (except [41]); when scheduling operations, they therefore assume that all gates take the same number of cycles to execute. They all use SWAP operations for moving qubits when targeting superconducting quantum processors. In addition, so far no mapper has been developed for more scalable quantum processors such as the Surface-17 processor presented in [38, 42]. This processor has been designed with the aim of building a large qubit array capable of performing fault-tolerant quantum computations based on the surface code. However, it can be used not only for performing QEC cycles (memory) but also for running quantum algorithms.
This paper is the first to map several quantum benchmarks to the Surface-17 processor, a scalable processor with a surface code architecture. It presents a mapper called Qmap that takes into account all three types of constraints: elementary gate set, qubit connectivity and control electronics. Qmap is composed of several modules, including initial placement of qubits, routing of qubits, and gate scheduling, together with decomposition and optimization steps. It is embedded in the OpenQL compiler [43], which gives it the flexibility to be applied to different underlying quantum processors.
Hardware information such as the necessary gate decomposition rules, the processor's elementary gate set with gate duration information, processor topology, and classical control constraints are described in a configuration file that is used by different compiler passes. The mapper takes as an input a quantum program written in OpenQL (C++ or Python), maps and optimizes the corresponding quantum circuit for the given quantum platform and generates executable low-level QASM-like code [44] . The mapper is used not only for mapping quantum circuits to the Surface-17 processor but also to the IBM Q Tokyo chip [5] to show its universality. Several benchmarks that differ in number of qubits and gates are evaluated. We analyze the overhead caused by mapping in terms of the number of extra gates and circuit depth/latency when using different routing strategies.
The main contributions of this paper are the following:
• We have developed a mapper (Qmap) for a scalable superconducting quantum processor such as the Surface-17. The mapper considers not only common processor constraints such as the choice of the elementary gate set and the qubit connectivity but also gate execution time (gate duration) and classical control constraints resulting from using control electronics that is shared among qubits.
• With the goal of supporting several quantum processors, our mapper has been embedded in our OpenQL compiler. It can target different quantum chips by using a configuration file in which the constraints of the processor are described. This flexibility allows a comparative analysis between them and gives some directions for building future quantum machines. We compile 56 benchmarks taken from RevLib [45] and QLib [46] onto two quantum processors, the Surface-17 and the IBM Q Tokyo processors.
• The developed mapper supports different routing strategies. Three of them are used and evaluated in this work (trivial, base and minextendrc). After mapping (using the best router), the circuit latency (depth) can increase up to 260% (150%) and the overhead in the number of gates can be as high as 78.1% and 68% for the Surface-17 and IBM Q Tokyo, respectively.
• Our mapper uses not only SWAP operations (3 CNOTs) for moving qubits but also MOVE operations (2 CNOTs) when possible. It reduces the number of gates and the circuit latency up to 38.9% and 29%, respectively.
• An analysis of how the mapping affects the reliability of some small quantum circuits when mapped to the Surface-17 chip is also presented. These circuits show a fidelity decrease that ranges from 1.8% to 13.8%.
The rest of this paper is organized as follows. We first describe all the hardware parameters considered in this work in Section II. Then we introduce the proposed mapping procedure and the corresponding routing algorithms in Section III. The evaluation results are shown in Section IV. Finally, Section V concludes the paper.
II. QUANTUM HARDWARE CONSTRAINTS
In this section, the hardware constraints of the superconducting Surface-17 and the IBM Q Tokyo quantum processors will be briefly introduced, including the primitive gates that can be directly performed, the topology of the processor which limits interactions between qubits, and the constraints caused by the classical control electronics which impose extra limitations on the parallelism of the operations.
A. Elementary gate set
In order to run any quantum circuit, a universal set of operations needs to be implemented. In superconducting quantum processors, these operations commonly are measurement, single-qubit rotations, and multi-qubit gates.
Surface-17 processor: In principle, any kind of single-qubit rotation can be performed on the Surface-17 processor. However, an infinite number of gates cannot be predefined. In this work, we limit single-qubit gates to X and Y rotations (which are easier to implement); more specifically, rotations of ±45, ±90 and ±180 degrees are used in our decomposition. The primitive two-qubit gate in this transmon processor is the conditional-phase (CZ) gate. Table I shows the gate duration (gate execution time) of single-qubit gates, the CZ gate and measurement (in the Z basis) [47]. After mapping, the output circuit will only contain operations that belong to this elementary gate set. The decomposition of the Z, H, S, S†, T, T†, CNOT, SWAP and MOVE gates into the elementary gates shown in Table I can be found in Appendix A.
IBM Q Tokyo processor: [...] and the conditional-NOT (CNOT) gate. This means that the gate set {Pauli, H, S, S†, T, T†, CNOT} can be directly supported without further decomposition. In this work, we do not take the gate duration of the IBM Q Tokyo processor into account, since the authors could not find the duration of the two-qubit gates of this device at the time of writing.
B. Processor topology
[...] the qubits and edges represent the connections (resonators) between them. Two-qubit gates can only be performed between connected qubits, i.e., nearest-neighbouring qubits. This implies that qubits that have to interact but are not placed in neighbouring positions need to be moved until they are adjacent. Quantum states in superconducting technology are usually moved using SWAP gates. A SWAP gate is implemented by three CNOTs, which in the case of the Surface-17 processor need to be further decomposed into CZ and R_Y gates, as shown in Figure 6. In this work, we also consider the use of a MOVE operation, which only requires two CNOTs (see Figure 6). Note that a MOVE operation requires that the destination qubit, to which the quantum state is moved, is in the |0⟩ state. As mentioned, moving qubits results in an overhead in terms of number of operations and circuit depth, which in turn decreases the circuit reliability.
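The relation between SWAP and MOVE described in this section can be checked with a small state-vector sketch. The code below is only an illustration: the dictionary representation of states, the function names, and the particular order of the two CNOTs in `move` are assumptions made here and are not taken from Figure 6.

```python
def cnot(state, control, target):
    """Apply a CNOT to a state stored as {tuple_of_bits: amplitude}."""
    out = {}
    for bits, amp in state.items():
        b = list(bits)
        if b[control]:
            b[target] ^= 1
        out[tuple(b)] = out.get(tuple(b), 0) + amp
    return out


def swap(state, a, b):
    """SWAP built from three CNOTs; works for any input state."""
    for c, t in ((a, b), (b, a), (a, b)):
        state = cnot(state, c, t)
    return state


def move(state, src, dst):
    """MOVE built from two CNOTs; only valid when dst is in |0>."""
    assert all(bits[dst] == 0 for bits in state), "destination must be |0>"
    for c, t in ((src, dst), (dst, src)):
        state = cnot(state, c, t)
    return state


# Qubit 0 holds 0.6|0> + 0.8|1>, qubit 1 is in |0>.
psi = {(0, 0): 0.6, (1, 0): 0.8}
moved = move(psi, 0, 1)
swapped = swap(psi, 0, 1)
```

With the destination in |0⟩, both operations leave the state on qubit 1 and qubit 0 in |0⟩, but MOVE uses one CNOT fewer, which is where the gate-count saving comes from.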

C. Classical control constraints
In principle, any qubit in a processor can be operated individually, so any combination of single-qubit and two-qubit operations can be performed in parallel. However, scalable quantum processors use classical control electronics with channels that are shared among several qubits. Here we describe how the classical control electronics used in the Surface-17 processor affect the parallelism of operations. The authors could not find the classical control constraints of the IBM Q Tokyo processor, so they are not considered in this work.
a. Single-qubit gates: Single-qubit gates on transmons are performed by using microwave pulses. In Surface-17, these pulses are applied at a few fixed frequencies to ensure scalability and precise control. The three frequencies used in Surface-17 are shown in Figure 1a: single-qubit gates on red, blue and pink colored qubits are performed at frequencies $f_1$, $f_2$, and $f_3$, respectively [38]. In this work, we assume that same-frequency qubits are operated by the same microwave source or arbitrary waveform generator, and that a vector switch matrix (VSM) is used for distributing the control pulses to the corresponding qubits [17].
The consequence of this is that one can perform the same single-qubit gate on all or some of the qubits that share a frequency, but one cannot perform different single-qubit gates at the same time on these qubits (as these would require other pulses to be generated). For instance, an X gate can be performed simultaneously on any of the pink qubits (7, 8 and 9) but not an X and a Y operation.
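A minimal sketch of this rule follows. The frequency-group table is partial and only contains the qubits whose group can be read from the text of this section; the function name and the gate encoding are hypothetical.

```python
# Partial frequency-group table, limited to qubits named in this section:
# pink (f3): 7, 8, 9; red (f1): 2, 3; blue (f2): 0, 6.
FREQ_GROUP = {7: "f3", 8: "f3", 9: "f3", 2: "f1", 3: "f1", 0: "f2", 6: "f2"}


def can_run_in_parallel(gates):
    """gates: list of (gate_name, qubit) meant to start in the same cycle.

    Returns True if every frequency group is asked to play at most one
    kind of pulse, i.e. all gates within a group are the same gate.
    """
    pulse_of_group = {}
    for name, qubit in gates:
        group = FREQ_GROUP[qubit]
        if pulse_of_group.setdefault(group, name) != name:
            return False
    return True
```

For example, `can_run_in_parallel([("x", 7), ("x", 8)])` is allowed, while `[("x", 7), ("y", 8)]` is rejected because both qubits share the pink frequency group.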
b. Measurement: Measuring the qubits is done by using feedlines, each of which is coupled to multiple qubits [38]. In Figure 1a, qubits in the same dashed rectangle use the same feedline; e.g., qubits 13 and 16 are measured through the same feedline. Because measurement takes several steps in sequence, the measurement of a qubit cannot start while another qubit coupled to the same feedline is being measured, but any combination of qubits that are coupled to the same feedline can be measured simultaneously at a given time. For instance, qubits 13 and 16 can both be measured at time $t_0$, but it is not possible to start measuring qubit 13 at time $t_0$ and then start measuring qubit 16 at time $t_1$ if the previous measurement has not finished.
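The feedline rule can be expressed as a small validity check. This is a sketch under assumptions: the feedline label "A", the function name and the measurement duration passed in the example are placeholders, and only qubits 13 and 16 are listed because they are the only pair named in the text.

```python
# Hypothetical feedline label; the text only states that 13 and 16 share one.
FEEDLINE = {13: "A", 16: "A"}


def measurements_valid(measurements, duration):
    """measurements: list of (start_cycle, qubit); duration in cycles.

    Measurements on one feedline must either start in the same cycle
    or not overlap in time at all.
    """
    starts_by_line = {}
    for start, qubit in measurements:
        starts_by_line.setdefault(FEEDLINE[qubit], set()).add(start)
    for starts in starts_by_line.values():
        ordered = sorted(starts)
        for earlier, later in zip(ordered, ordered[1:]):
            if later - earlier < duration:
                return False
    return True
```

With a placeholder duration of 30 cycles, starting both measurements at cycle 0 is valid, starting them at cycles 0 and 1 is not, and starting the second at cycle 30 is valid again.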
c. Two-qubit gates: As mentioned, in the processor of Figure 1a each qubit belongs to one of three frequency groups $f_1 > f_2 > f_3$, colored red, blue and pink, respectively. Links between neighbouring qubits are either between qubits from $f_1$ and $f_2$, or between qubits from $f_2$ and $f_3$, i.e., between a higher-frequency qubit and the next lower one. In between, additional frequencies are defined [...] (see the frequency arrangement and the example interactions presented in Figure 5 of [38]); each qubit can be individually driven with one of the frequencies of its group. A CZ gate between two neighbouring qubits is realized by lowering the frequency of the higher-frequency qubit to near the frequency of the lower one. For instance, a CZ gate between qubits 3 and 0 is performed by detuning qubit 3 from $f_1$ to $f_1^{int}$, which is near $f_2$, the frequency of qubit 0. However, CZ gates will occur between any two neighbouring qubits that have close frequencies and share a connection, e.g. between qubits 3 and 6 in the given example. To avoid this, the qubits that are not involved in the CZ gate must be kept out of the way. In this example, qubit 6 is detuned to a lower parking frequency, $f_2^{park}$. Note that qubits at parking frequencies cannot engage in any two-qubit or single-qubit gate. In addition, qubit 2 must stay at $f_1$ while qubits 3 and 0 perform a CZ, to avoid an interaction between qubits 2 and 0. This example shows that the implementation of two-qubit gates poses some limitations on gate parallelism.
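The example with qubits 0, 2, 3 and 6 can be turned into a small sketch. The rule coded here is a generalization of that single example and is an assumption, not a statement of the full Surface-17 constraint set; the edge list only contains the links named in the text, and the function names are hypothetical.

```python
# Only the links and frequency groups named in the text (1 = f1, 2 = f2).
EDGES = {(3, 0), (3, 6), (2, 0)}
GROUP = {2: 1, 3: 1, 0: 2, 6: 2}


def neighbours(q):
    """Qubits directly connected to q."""
    return {b if a == q else a for a, b in EDGES if q in (a, b)}


def cz_side_effects(a, b):
    """Return (parked, pinned) for a CZ between neighbouring qubits a and b.

    parked: neighbours of the detuned higher-frequency qubit that sit in
            the lower qubit's group and must move to a parking frequency.
    pinned: neighbours of the lower qubit in the higher qubit's group,
            which must stay at their own frequency during the CZ.
    """
    high, low = (a, b) if GROUP[a] < GROUP[b] else (b, a)
    parked = {n for n in neighbours(high)
              if n != low and GROUP[n] == GROUP[low]}
    pinned = {n for n in neighbours(low)
              if n != high and GROUP[n] == GROUP[high]}
    return parked, pinned
```

For a CZ between qubits 3 and 0 this reproduces the situation in the text: qubit 6 is parked and qubit 2 is pinned at its own frequency, so neither can take part in another gate that needs detuning during that CZ.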
The hardware characteristics described in this section are included in a configuration file (in json format) that is used by all modules of the mapper.
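To give an idea of what such a file has to express, the sketch below builds a JSON document from a Python dictionary. The key names and all numeric durations are hypothetical placeholders and do not reproduce the actual OpenQL configuration schema or the values of Table I; the CNOT rule uses the standard identity CNOT = R_Y(90) on the target, after CZ, after R_Y(-90) on the target.

```python
import json

# Hypothetical schema: key names and durations are placeholders only.
config = {
    "qubit_number": 17,
    "topology": {"edges": [[3, 0], [3, 6], [2, 0]]},  # links named in the text
    "gates": {
        "y90": {"type": "single_qubit", "duration_cycles": 1},
        "ym90": {"type": "single_qubit", "duration_cycles": 1},
        "cz": {"type": "two_qubit", "duration_cycles": 2},
        "measz": {"type": "measurement", "duration_cycles": 30},
    },
    # Applied left to right: R_Y(-90) on target, CZ, R_Y(+90) on target.
    "decomposition": {"cnot c,t": ["ym90 t", "cz c,t", "y90 t"]},
    "resources": {
        "frequency_groups": {"f3": [7, 8, 9]},
        "feedlines": {"A": [13, 16]},
    },
}

text = json.dumps(config, indent=2)   # what would be written to the file
loaded = json.loads(text)             # what a compiler pass would read back


def gate_duration(cfg, gate):
    """Look up the duration (in cycles) of an elementary gate."""
    return cfg["gates"][gate]["duration_cycles"]
```

Keeping gate set, topology, decomposition rules and shared-control resources in one file is what lets the same compiler passes be retargeted by swapping the file.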
III. MAPPING QUANTUM ALGORITHMS: THE QMAP MAPPER
Mapping means to transform the original quantum circuit that describes the quantum algorithm and is hardware-agnostic to an equivalent one that can be executed on the target quantum processor. To this purpose, the mapping process has to be aware of the constraints imposed by the physical implementation of the quantum processor. These include the set of elementary gates that is supported, the allowed qubit interactions that are determined by the processor topology, and the limited concurrency of multi-gate execution because of classical control constraints. Because of mapping, the number of operations that are required to implement the given algorithm as well as the circuit depth are likely to increase, decreasing the reliability of the algorithm. An efficient mapping is then key, especially in NISQ processors where noise sets a limit on the maximum size of a computation that can be run successfully.
A. Overview of the Qmap mapper
The Qmap mapper developed in this work is embedded in the OpenQL compiler [43] . We show its flow in Figure 2 . Its input is a quantum circuit written in OpenQL (C++ or Python). The OpenQL compiler reads and parses it to a QASM-level intermediate representation. Qmap then performs the mapping and optimization of the quantum circuit based on the information provided in a configuration file that includes the processor topology (connectivity and number of qubits), its elementary gate set, gate decomposition rules, the duration of each gate, and the classical control constraints. After mapping, QASM-like code is generated. Currently, the OpenQL compiler is capable of generating cQASM [44] that can be executed on our QX simulator [48] as well as eQASM [49] , a QASM-like executable code that can target the Surface-17 processor. The generation of other QASM-like languages will be part of future extensions of the OpenQL compiler.
Note that, as the characteristics of the quantum processor are described in a configuration file that is provided to the mapper, Qmap can easily target different quantum devices just by providing it with different configuration files with appropriate parameters.
The modules of Qmap will be discussed in the next sections. We refer to the qubits in the quantum circuit as virtual qubits (others call them program qubits or logical qubits). These need to be mapped to the qubits in the quantum processor called physical, real or hardware qubits or locations. 
B. Initial placement
Ideally, the initial placement puts highly interacting qubits next to each other. Qmap tries to find an initial placement that minimizes the number of qubit movements by using the Integer Linear Programming (ILP) algorithm presented in [50]. Similar to the placement approaches in [24, 51, 52], the initial placement problem is formulated as a quadratic assignment problem (QAP), with the communication overhead between two qubits modeled by their distance minus 1. Such an initial placement implementation can only solve small-scale problems in reasonable time. This largely suffices for near-term implementations; for large-scale circuits, one can either partition a large circuit into several smaller ones or apply heuristic algorithms to solve these mapping models efficiently [20-22, 24, 53-55]. Other works solve the initial placement problem by using a Satisfiability Modulo Theories (SMT) solver [40].
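The cost model of this formulation can be illustrated with a brute-force search over placements. This is not the ILP of [50]; exhaustive enumeration is swapped in here only because it is short, and it is feasible only for a handful of qubits. All names and the three-qubit example are assumptions.

```python
from itertools import permutations


def placement_cost(assign, interactions, dist):
    """Sum over interacting pairs of (#two-qubit gates) * (distance - 1).

    assign[v] is the physical qubit of virtual qubit v.
    """
    return sum(count * (dist[assign[a]][assign[b]] - 1)
               for (a, b), count in interactions.items())


def best_placement(n_virtual, n_physical, interactions, dist):
    """Exhaustively search all injective virtual-to-physical assignments."""
    best = min(permutations(range(n_physical), n_virtual),
               key=lambda p: placement_cost(p, interactions, dist))
    return best, placement_cost(best, interactions, dist)


# Three physical qubits on a line 0 - 1 - 2.
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
# Virtual qubit 1 interacts with both 0 (3 gates) and 2 (2 gates).
interactions = {(0, 1): 3, (1, 2): 2}
placement, cost = best_placement(3, 3, interactions, dist)
```

In this example the search places virtual qubit 1 in the middle of the line, so every two-qubit gate is nearest-neighbour and the cost is zero.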
An example circuit, its virtual to physical qubits mapping found by the initial placement module and the cQASM code before routing and scheduling are shown in Figure 3 .
C. Qubit Router
It is unlikely to find an initial placement in which all interactions between the qubits are satisfied. That is, not all the qubits that will perform a two-qubit gate can be placed in neighboring positions. Therefore, they will have to be moved during computation. For instance, based on the initial placement of qubits proposed in Figure 3a , the first 6 CNOTs of the circuit can be performed directly as qubits are NN, but the last 2 CNOTs will require qubits to be routed to adjacent positions. Routing refers to the task of finding a series of movements that enables the execution of two-qubit gates on a given processor topology with low communication overhead. To do so, several routing paths are explored and one is selected based on various optimization criteria such as the number of added movement operations, increase of circuit depth, or decrease of circuit reliability [29-37, 39, 40] . Then, the corresponding movement operations are inserted.
In this work, after the ILP-based initial placement, a heuristic algorithm is used to perform this routing task. It is a graph-based heuristic whose objective is to achieve the shortest circuit latency and therefore the highest instruction-level parallelism. Algorithm 1 shows the pseudocode of our routing algorithm; it finds all two-qubit gates whose qubits are not nearest neighbours and inserts the movement operations required to make them adjacent. As mentioned in Section II B, we use SWAP as well as MOVE operations for moving qubits. The algorithm works as follows:
1. From the QASM representation of the quantum circuit, a Quantum Operation Dependency Graph (QODG) $G(V_G, E_G)$ is constructed, in which each operation is denoted by a node $v_i \in V_G$, and the data dependency between two operations $v_i$ and $v_j$ is represented by a directed edge $e(v_i, v_j) \in E_G$ with weight $w_i$, which represents the duration of operation $v_i$. Pseudo source and sink nodes are added at the start and end to simplify starting and stopping the iteration over the graph. The QODG of the circuit in Figure 3a is shown in Figure 4a.
2. The router algorithm starts by mapping the pseudo source node and then selecting all available operations from the input QODG, that is, the operations that do not depend on any operation that has not yet been mapped. As long as these include single-qubit gates or two-qubit gates whose qubits are NN, those gates are mapped first and a new set of available operations is computed. Mapping a (NN) gate means replacing its virtual qubit operands by their physical counterparts according to a table similar to the one shown in Figure 3a, and decomposing it into its primitives when the configuration specifies so. After that, only non-NN two-qubit gates remain in the available set. The router, looking ahead to all operations of the circuit that are not yet mapped, selects from this set the ones that are most critical in the remaining dependency graph, since they are the most likely to extend the circuit when mapped inefficiently or when delayed. When several gates are equally critical, it maps the one that comes first in the input circuit. After mapping it, the router recomputes the set of available operations and repeats until there are no available operations left.
3. When mapping a non-NN two-qubit gate, all shortest paths between the qubits involved in this gate are considered. At Qmap initialization time, the distance (i.e. the length of the shortest path) between each pair of qubits is computed using the Floyd-Warshall algorithm. Finding all shortest paths between a pair of qubits at mapping time is done by a breadth-first search (BFS), selecting only path extensions that decrease the distance between the qubits. For each shortest path, several movement sets are computed. Each movement set consists of a sequence of movements that brings the two qubits to adjacent positions; that is, the qubits can meet in any neighbouring position within the path. Note that all movement sets add the same minimum number of movements to the circuit. To choose the best movement set, several strategies can be used that differ in how the movement set is selected and what constraints are considered. In this work, we compare and evaluate three of them:
MinExtendRCRouter: As shown in Algorithm 2, this routing strategy evaluates all movement sets by looking back to the previously mapped operations and interleaving each set of movements with those operations using an as-soon-as-possible (ASAP) scheduling policy. Then, it selects the one(s) which minimally extend(s) the circuit depth or latency. When there are multiple minimal sets, a random one is taken. The scheduling in this strategy takes gate duration and the classical control resource constraints into account, the latter limiting instruction-level parallelism. Its aim is to minimize the extension of the circuit latency caused by the addition of the movements by maximizing instruction-level parallelism within the constraints of the system.
BaseRouter: It randomly selects one of the movement sets generated as described above, i.e., it does not evaluate them for their extension of the circuit latency or depth.
TrivialRouter: The gates in the circuit are mapped in the order in which they appear in the circuit, i.e., bypassing the QODG. When there is a non-NN two-qubit gate, only the first shortest path that is found is taken. In addition, a single movement set is generated for it: the one that moves the control qubit until it is next to the target. From the movement set, only SWAPs are generated, not MOVEs.
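The path and movement-set enumeration of step 3 can be sketched as follows. The function names, the 2x2 example topology and the encoding of a movement as an ordered pair (from, to) are assumptions for illustration; the sketch does not distinguish SWAP from MOVE.

```python
def floyd_warshall(n, edges):
    """All-pairs shortest distances on an undirected, unweighted graph."""
    inf = float("inf")
    d = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for a, b in edges:
        d[a][b] = d[b][a] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def adjacency(n, edges):
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def all_shortest_paths(src, dst, adj, d):
    """BFS that only follows extensions that reduce the distance to dst."""
    paths, frontier = [], [[src]]
    while frontier:
        nxt = []
        for p in frontier:
            if p[-1] == dst:
                paths.append(p)
                continue
            for n in adj[p[-1]]:
                if d[n][dst] < d[p[-1]][dst]:
                    nxt.append(p + [n])
        frontier = nxt
    return paths


def movement_sets(path):
    """One movement set per edge of the path on which the qubits can meet."""
    last = len(path) - 1
    sets = []
    for i in range(last):
        forward = [(path[j], path[j + 1]) for j in range(i)]
        backward = [(path[j], path[j - 1]) for j in range(last, i + 1, -1)]
        sets.append(forward + backward)
    return sets


# 2x2 grid:  0 - 1
#            |   |
#            2 - 3
edges = [(0, 1), (1, 3), (0, 2), (2, 3)]
d = floyd_warshall(4, edges)
paths = all_shortest_paths(0, 3, adjacency(4, edges), d)
```

Here qubits 0 and 3 are at distance 2, there are two shortest paths, and each path yields two movement sets of one movement each, matching the observation that all sets add the same minimum number of movements.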
After the movement set selection, the SWAP/MOVE operations are scheduled into the output circuit, and the set of available gates and the map of virtual to physical qubits are updated.
Algorithm 1 Routing algorithm
  [...]
  while V_av ≠ ∅ do
    V_nn ← All single-qubit and NN two-qubit gates in V_av
    if V_nn ≠ ∅ then
      Select v ∈ V_nn arbitrarily
    else
      V_c ← Most-critical gates ⊂ V_av in G(V_G − V_m, E_G)
      Select v ∈ V_c which is first in the circuit
      Insert movement(s) for v
      Update M
    end if
    Map v according to M
    Add v to V_m
    V_av ← All available gates in G(V_G − V_m, E_G)
  end while
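The loop structure of Algorithm 1 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Qmap implementation: criticality is replaced by "first available gate in the circuit", a single BFS shortest path is used, only SWAPs that move the first operand are inserted, and gate durations and control constraints are ignored. All names are hypothetical.

```python
from collections import deque


def dependencies(gates):
    """Predecessors per gate: the previous gate on each of its qubits."""
    last, preds = {}, []
    for j, (_, qubits) in enumerate(gates):
        preds.append({last[q] for q in qubits if q in last})
        for q in qubits:
            last[q] = j
    return preds


def shortest_path(adj, src, dst):
    """One shortest path from src to dst found by BFS."""
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        for n in adj[u]:
            if n not in prev:
                prev[n] = u
                queue.append(n)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]


def route(gates, adj, v2p):
    """gates: list of (name, virtual_qubits); v2p: virtual -> physical."""
    v2p, preds = dict(v2p), dependencies(gates)
    mapped, out = set(), []

    def is_nn(i):
        qs = gates[i][1]
        return len(qs) == 1 or v2p[qs[1]] in adj[v2p[qs[0]]]

    while len(mapped) < len(gates):
        avail = [i for i in range(len(gates))
                 if i not in mapped and preds[i] <= mapped]
        nn = [i for i in avail if is_nn(i)]
        if nn:
            v = nn[0]
        else:
            v = avail[0]  # stand-in for the criticality heuristic
            a, b = gates[v][1]
            path = shortest_path(adj, v2p[a], v2p[b])
            for p, q in zip(path, path[1:-1]):
                out.append(("swap", (p, q)))
                for virt, phys in list(v2p.items()):
                    if phys == p:
                        v2p[virt] = q
                    elif phys == q:
                        v2p[virt] = p
        out.append((gates[v][0], tuple(v2p[q] for q in gates[v][1])))
        mapped.add(v)
    return out


line = {0: [1], 1: [0, 2], 2: [1]}          # physical qubits 0 - 1 - 2
circuit = [("cnot", (0, 2)), ("h", (1,)), ("cnot", (0, 1))]
routed = route(circuit, line, {0: 0, 1: 1, 2: 2})
```

In the example, the single-qubit gate is mapped first because it is available and needs no routing; the non-NN CNOT then triggers one SWAP, after which the remaining CNOT is nearest-neighbour under the updated map.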
Algorithm 2 Movement insertion algorithm
  Input: QODG G(V_G, E_G), gate v, VP-map M, JSON file
  Output: The set of movements for v
  P ← All shortest paths for v
  MV_P ← All possible sets of movements based on P
  for mv_j in MV_P do
    Interleave mv_j with previous gates (looking back)
    T_mv_j ← circuit's latency extension by mv_j
  end for
  if T_mv_i = min_j(T_mv_j) then
    Select mv_i as the set of movements, picking a random minimum one when there are more
  end if
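The selection step of Algorithm 2 can be illustrated with a toy latency model. The model below is an assumption made for this sketch: each physical qubit has a cycle at which it becomes free, each movement occupies both of its qubits for a fixed placeholder number of cycles, and shared-control constraints are ignored, whereas the MinExtendRC strategy takes them into account.

```python
import random


def latency_extension(movements, free_at, move_cycles):
    """Cycles added to the current circuit latency by an ASAP-scheduled set.

    free_at: physical qubit -> first cycle at which it is idle.
    """
    latency = max(free_at.values())
    t = dict(free_at)
    for a, b in movements:
        start = max(t[a], t[b])
        t[a] = t[b] = start + move_cycles
    return max(0, max(t.values()) - latency)


def select_movement_set(candidates, free_at, move_cycles, rng=random):
    """Pick a set with minimal latency extension; break ties at random."""
    ext = [latency_extension(mv, free_at, move_cycles) for mv in candidates]
    best = min(ext)
    return rng.choice([mv for mv, e in zip(candidates, ext) if e == best])


# Qubit 0 is busy until cycle 10, qubit 1 until cycle 2, qubit 3 is idle.
free_at = {0: 10, 1: 2, 3: 0}
candidates = [[(3, 1)], [(0, 1)]]
chosen = select_movement_set(candidates, free_at, move_cycles=6)
```

Moving the idle qubit 3 fits entirely in the slack before cycle 10 and extends the circuit by zero cycles, whereas moving the busy qubit 0 would add six, so the first set is selected.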
D. RC-scheduler
After routing, the circuit adheres to the processor topology constraint for two-qubit interactions, and has been scheduled in an as-soon-as-possible (ASAP) way, taking the resource constraints into account only in the case of the MinExtendRC router. The RC-scheduler reschedules the routed circuit to achieve the shortest circuit latency and the highest instruction-level parallelism.
It does this in an as-late-as-possible (ALAP) way to minimize the required life-time and thus the decoherence error of each qubit, while taking the resource constraints into account. The resource constraints encode the control concurrency limitations together with the duration of the individual gates. 
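The effect of ALAP scheduling can be seen in a minimal sketch that walks the circuit backwards. It deliberately ignores the resource constraints that the RC-scheduler enforces, and the gate durations used in the example are placeholder cycle counts, not the values of Table I.

```python
def alap_schedule(gates, duration):
    """Return ALAP start cycles for gates given in program order.

    gates: list of (name, qubits); duration: gate name -> cycles.
    Only qubit dependencies are respected.
    """
    next_start = {}              # qubit -> start of the next gate using it
    start = [0] * len(gates)
    for i in range(len(gates) - 1, -1, -1):
        name, qubits = gates[i]
        end = min(next_start.get(q, 0) for q in qubits)
        start[i] = end - duration[name]
        for q in qubits:
            next_start[q] = start[i]
    shift = -min(start)          # make the earliest gate start at cycle 0
    return [s + shift for s in start]


duration = {"x": 1, "cz": 2}     # placeholder durations in cycles
circuit = [("x", (0,)), ("cz", (1, 2)), ("cz", (0, 1))]
starts = alap_schedule(circuit, duration)
```

An ASAP schedule would start the X gate at cycle 0; ALAP delays it to cycle 1, directly before the CZ that consumes qubit 0, which shortens the time that qubit spends idle after its first operation.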
E. Decomposition and optimization
Starting from a quantum circuit described in cQASM format (see Figure 3), the circuit is also decomposed during mapping into one that only contains the elementary gates specified in the configuration (json) file, in addition to adhering to the other constraints. It is also optimized to reduce the number of operations; e.g., two consecutive X gates cancel each other out.
The decomposition and optimization can be done at every step of the mapping procedure, i.e. before, during, and after routing. Qmap reduces sequences of single qubit operations to their minimally required sequence both before and after routing. The implementation of the QODG represents the commutability of all gates with disjoint qubit operands but also of the known two-qubit operations CNOT and CZ with overlapping operands, and optimizes their order, both during routing and during RC-scheduling. These optimizations are not performed in the TrivialRouter. Whether gate decomposition is to be applied at a mapping step is specified in the configuration (json) file.
The cQASM code generated after the Qmap mapper is shown in Figure 4b .
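As an illustration of the kind of optimization mentioned above (two consecutive X gates cancelling out), the following Python sketch removes pairs of identical self-inverse gates that are adjacent on their qubits. It is not the OpenQL optimizer; the gate representation and the `SELF_INVERSE` set are assumptions for the example.

```python
# Gates that are their own inverse (assumed set for this sketch).
SELF_INVERSE = {"x", "y", "z", "h", "cnot", "cz"}


def cancel_adjacent_pairs(gates):
    """gates: list of (name, qubits). Drops a gate together with its
    immediate predecessor on the same qubits when both are the same
    self-inverse gate; gates on disjoint qubits in between are skipped."""
    out = []
    for gate in gates:
        name, qubits = gate
        # Find the most recent kept gate that touches any of these qubits.
        prev = next((i for i in range(len(out) - 1, -1, -1)
                     if set(out[i][1]) & set(qubits)), None)
        if prev is not None and out[prev] == gate and name in SELF_INVERSE:
            del out[prev]        # the pair multiplies to the identity
        else:
            out.append(gate)
    return out


circuit = [
    ("x", (0,)), ("h", (1,)), ("x", (0,)),      # the two X gates cancel
    ("cnot", (0, 1)), ("cnot", (0, 1)),         # the two CNOTs cancel
    ("t", (1,)),
]
print(cancel_adjacent_pairs(circuit))  # [('h', (1,)), ('t', (1,))]
```

The comparison uses the exact operand tuple, so CZ gates written with swapped operands are conservatively left untouched.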
IV. QMAP EVALUATION
In this section, we evaluate Qmap by mapping a set of benchmarks from RevLib [45] and QLib [46] on two superconducting processors, namely, the processor with a distance-3 surface-code topology (Surface-17) [38] and the IBM Q Tokyo (IBM-20) processor [5]. These two processors have different elementary gate sets, processor topologies, and hardware constraints, as described in Section II. Specifically, for the Surface-17 processor, the elementary gates with their real gate durations, the topology, and the electronic control constraints are considered. For the IBM-20 processor, only the elementary gates (without their durations) and the qubit topology are considered. All mapping experiments are executed on a server with 2 Intel Xeon E5-2683 CPUs (56 logical cores) and 377 GB of memory. The operating system is CentOS 7.5 with Linux kernel version 3.10 and GCC version 4.8.5.
A. Benchmarks
The circuit characteristics of the benchmarks used are shown in Table II. All circuits have been decomposed into ones that consist only of gates from the universal set {Pauli, S, S†, T, T†, H, CNOT}. In these benchmarks, the number of qubits varies from 3 to 16, the number of gates goes from 5 to 64283, and the percentage of CNOT gates varies from 2.8% to 100%. Moreover, the minimum circuit depth and the minimum circuit latency are also included, ranging from 2 to 35572 time-steps and from 5 to 12256 cycles (using the gate durations of Surface-17), respectively. Note that these numbers are meant to characterize the algorithms before being mapped to the quantum processor and are therefore obtained without considering any hardware constraint.
The latter two parameters will also be used as metrics to evaluate our Qmap mapper. They are defined as follows:
Circuit depth is the length of the circuit. It is equivalent to the total number of time-steps for executing the circuit assuming each of the gates takes one time-step.
Circuit latency refers to the execution time of the circuit considering the real gate duration. Latency and gate duration are expressed in cycles. In this paper, we assume that a cycle takes 20 nanoseconds.
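The two definitions differ only in the duration assigned to each gate, which the following Python sketch makes explicit. It assumes an as-soon-as-possible schedule in which dependencies arise only from shared qubits; the example circuit and the gate durations are hypothetical and are not the Surface-17 values.

```python
CYCLE_NS = 20  # one cycle takes 20 nanoseconds, as assumed in this paper


def schedule_length(gates, duration_of):
    """Length of an ASAP schedule; gates is a list of (name, qubits)."""
    free_at = {}   # qubit -> first time-step at which it is free again
    total = 0
    for name, qubits in gates:
        start = max(free_at.get(q, 0) for q in qubits)
        end = start + duration_of(name)
        for q in qubits:
            free_at[q] = end
        total = max(total, end)
    return total


def circuit_depth(gates):
    """Every gate counts as one time-step."""
    return schedule_length(gates, lambda name: 1)


def circuit_latency(gates, durations):
    """Every gate counts with its real duration in cycles."""
    return schedule_length(gates, lambda name: durations[name])


# Hypothetical circuit and durations (in cycles).
circuit = [("h", (0,)), ("cnot", (0, 1)), ("t", (2,)), ("cnot", (1, 2))]
durations = {"h": 1, "t": 1, "cnot": 2}

depth = circuit_depth(circuit)
latency = circuit_latency(circuit, durations)
print(depth, latency, latency * CYCLE_NS)  # 3 5 100
```

The circuit has a depth of 3 time-steps but a latency of 5 cycles (100 ns), because the two dependent CNOTs each occupy two cycles.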
In addition, other parameters obtained after mapping the benchmarks to the two quantum processors are provided, such as the total number of gates and two-qubit gates, the number of inserted SWAP and MOVE operations, and the time the mapping process takes. Table IV and Table V show the results of mapping the benchmarks to the superconducting Surface-17 processor and the IBM-20 processor, respectively, using the three different routers: trivial, base, and minextendrc. We take the results of the mapping with the trivial router as a baseline. In this case, a naive initial placement is used in which qubits are simply placed in order, no optimization is made, and only SWAP operations are inserted. The mapping results for the base and minextendrc routers use the ILP-based initial placement. As mentioned, this method can only solve small-scale problems (small circuits) in a reasonable time, since qubit initial placement is an NP-hard problem. In this paper, the mapper is set to find an initial placement only for the first ten two-qubit gates in any given circuit, and the computation time is limited to 10 minutes. In addition, when using the base and minextendrc routers, circuit optimizations are enabled and both SWAP and MOVE gates are inserted. The mapping procedure is executed five times and the run with the minimum overhead is reported.
B. Mapping results
Mapping overhead
In order to get quantum circuits that are executable on real processors, extra movement operations need to be added and gate parallelism will be compromised. We first analyze the impact of the mapping procedure in terms of the number of gates and the circuit latency (for Surface-17) or depth (for IBM-20), compared to the circuit characteristics before mapping in Table II. As shown in Table IV and Table V, no matter which router is applied, the mapping procedure results in a high overhead for most of the benchmarks. The only exceptions are the 'benstein v' and 'graycode6 47' circuits, because some operations in these circuits can be canceled out by the optimization module in the mapper, decreasing their circuit sizes.
Mapping to Surface-17: As shown in Table IV, when the trivial router is used, the mapping leads to an overhead in the circuit latency and the total number of gates ranging from 50% ('graycode6 47') to 1160% ('xor5 254') and from 122.9% ('wim 266') to 800% ('xor5 254'), respectively. The base router results in an increase of the circuit latency and the total number of gates that goes from 38.9% ('alu v0 27') to 260% ('xor5 254') and from 26.0% ('cuccaroAdder 1b') to 373.4% ('rd84 142'), respectively. Finally, the minextendrc router increases the circuit latency and the total number of gates from 32.4% ('miller 11') to 260% ('xor5 254') and from 20.7% ('cuccaroAdder 1b') to 78.1% ('rd32 v0'), respectively.
Mapping to IBM-20: In Table V, it is shown that the overhead in the circuit depth and the total number of gates caused by the trivial router ranges from 80.5% ('decod24 e') to 650% ('xor5 254') and from 56.5% ('cnt3 5') to 257.1% ('xor5 254'), respectively. The circuit depth after mapping with both the base and the minextendrc router has increased from 13.8% ('miller 11') to 150% ('xor5 254'). The total number of gates has increased from 10% ('ham3 102') for both routers to 72% and 68% ('rd84 142') for the base router and the minextendrc router, respectively.
TABLE II: The characteristics of the input benchmarks, including the number of qubits, the total number of gates, the number of two-qubit gates (CNOTs), the circuit depth, and the circuit latency in cycles (20 ns per cycle).
Benchmarks         Qubits  Gates  CNOTs  Depth  Latency    Benchmarks    Qubits  Gates  CNOTs  Depth  Latency
alu bdd 288        7       84     38     48     169        sym9 146      12      328    148    127    450
alu v0 27          5       36     17     21     72         sys6 v0 111   10      215    98     74     266
benstein vazirani  16      35     1      5      40         vbeAdder 2b   7       210    42     52     116
4gt12 v1 89        6       228    100    130    448        wim 266       11      986    427    514    1788
4gt4 v0 72         6       258    113    137    478        xor5 254      6       7      5      2      5
4mod5 bdd 287      7       70     31     40     140        z4 268        11      3073   1343   1643   5688
cm42a 207          14      1776   771    940    3249       adr4 197      13      3439   1498   1839   6377
cnt3 5 180         16      485    215    207    729        9symml 195
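The overhead percentages quoted in this subsection are relative increases with respect to the unmapped circuit. A minimal Python sketch of that arithmetic follows. The original values for 'xor5 254' (7 gates, latency of 5 cycles) are taken from Table II; the mapped values of 63 gates and 63 cycles are not read from Table IV but back-computed here from the quoted 800% and 1160% overheads of the trivial router, so they should be treated as illustrative.

```python
def overhead_percent(original, mapped):
    """Relative increase of a metric caused by mapping, in percent."""
    return 100.0 * (mapped - original) / original


# 'xor5 254' before mapping (Table II): 7 gates, latency of 5 cycles.
# Mapped values are back-computed from the quoted overheads (assumption).
gate_overhead = overhead_percent(original=7, mapped=63)
latency_overhead = overhead_percent(original=5, mapped=63)
print(gate_overhead, latency_overhead)  # 800.0 1160.0
```

A negative result corresponds to a mapped circuit that is smaller than the input, which is the case for the few benchmarks whose operations are canceled out by the optimization module.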
Comparison of different routers
Furthermore, we evaluate the performance of these three different routers. As expected, for both processors, the trivial router leads to the highest mapping overhead, as it is our baseline. It is also observed that, in general, the minextendrc router shows the best performance, as it leads to the lowest increase in circuit depth/latency and number of gates (Table IV and Table V). This is because the base router includes optimizations but randomly selects one movement set, whereas the minextendrc router optimizes circuits and evaluates more shortest movement paths to select one that minimally extends the circuit latency (Section III).
Mapping to Surface-17: As shown in Table IV, the base router always outperforms the trivial router: the latency and the number of gates can be reduced up to 71.4% ('xor5 254') and up to 80% ('benstein vazirani'), respectively. Moreover, the minextendrc router has lower or equal overhead than the base router in terms of both circuit latency and number of gates for 85.7% and 94.6% of the benchmarks, respectively. The minextendrc router can reduce the latency up to 20.5% ('decod24 b') and decrease the number of gates up to 10.61% ('sf 274') compared to the base router.
Mapping to IBM-20: Based on the mapping results in Table V, the base router can reduce the depth for 91.1% of the benchmarks (up to 66.7% for 'xor5 254') and decrease the number of gates for 94.6% of the benchmarks (up to 72% for 'xor5 254') compared to the trivial router. Furthermore, the minextendrc router results in a lower or equal overhead than the base router in both circuit latency and number of gates for 96.4% and 87.5% of the benchmarks, respectively. For example, the minextendrc router leads to a latency reduction of up to 38.2% and a gate reduction of up to 17.4% for the benchmark '4gt12 v1 89' compared to the base router.
Comparison of processor topology
In addition, we investigate how the processor topology affects the mapping overhead in terms of the number of inserted movement operations. For a comparison between the Surface-17 and IBM-20 processors, we transform the number of movements (SWAPs and MOVEs) into the number of elementary two-qubit gates (that is, CZ for Surface-17 and CNOT for IBM-20). Based on the mapping results shown in Table IV and Table V, the IBM-20 processor requires fewer movement operations than the Surface-17 processor because it has higher connectivity. For example, no movement operations are needed at all when mapping some benchmarks ('ham3 102', 'miller 11', and 'xor5 254') to the IBM-20 processor. For other benchmarks, the IBM-20 processor can reduce the number of inserted elementary two-qubit gates up to 82.3% ('alu v0 27') compared to the Surface-17 processor.
Runtime and scalability
We have tested the proposed mapper for different sizes of benchmarks, in which the number of qubits ranges from 3 to 16 and the two-qubit gate number from 5 to 62483. The runtime (in seconds) that Qmap requires for mapping each benchmark can be found in Table IV and Table V; it is measured as the CPU time that the entire mapping procedure takes (excluding the time of the ILP-based initial placement). A router that performs more optimizations and evaluates more movement sets should have a longer runtime, which is consistent with the results shown in Table IV and Table V. The trivial router has the shortest execution time, whereas the minextendrc router has the longest one.
For example, for the largest benchmark 'sym10 262' with 62483 gates, the mapper using the trivial router takes only 72.8 seconds and 5.02 seconds for the Surface-17 processor and the IBM-20 processor, respectively. In comparison, when the minextendrc router is used, it takes 9083.4 seconds and 1698.4 seconds for the Surface-17 processor and the IBM-20 processor, respectively. Based on these observations, we conclude that our mapper scales to circuits with a large number of gates. However, our experiments only use benchmarks that have fewer than 20 qubits. Therefore, its scalability with the number of qubits needs to be further investigated. In addition, it is necessary to analyze the trade-off between mapping optimizations and runtime for large-scale benchmarks.
MOVEs versus SWAPs
As mentioned in Section II, a SWAP gate is implemented by three consecutive CNOT gates, whereas a MOVE operation is implemented by two consecutive CNOT gates but requires an ancilla qubit in the state |0⟩. Therefore, if there are available ancilla qubits (qubits that are not used for computation), it is preferable to use MOVE operations rather than SWAP gates, which helps to reduce the mapping overhead. In the mapping results of Tables IV and V, MOVE operations are allowed for both the base and minextendrc routers. In this section, we evaluate the benefit of using MOVE operations instead of only using SWAPs. We map the benchmarks in Table II onto the Surface-17 processor using the base router. Unlike the setup used for Table IV, and to have a fair comparison between using MOVEs where possible and only using SWAPs, in this case the ILP-based initial placement is not applied and the first movement set is always selected. As shown in Table VI, generating MOVEs instead of SWAPs can reduce both the number of gates up to 38.9% ('benstein vazirani') and the circuit latency up to 29% ('graycode6 47').
TABLE III
                   Before mapping                    After mapping
Benchmarks         Qubits  Latency  Gates  CNOTs     Qubits  Latency  Gates  CZs
graycode6 47       6       20       5      5         6       16       15     5
xor5 254           6       5        7      5         6       18       18     8
ham3 102           3       41       20     11        3       60       62     17
cuccaroAdder 1b    4       58       73     17        5       90       92     23
alu v0 27          5       72       36     17        6       100      116    30
rd32 v0 66         4       66       34     16        6       105      113    32
miller 11          3       105      50     23        4       156      166    46
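The difference between the two movement primitives can be checked with a small Python sketch that applies CNOTs to computational-basis states. Because CNOTs permute basis states and act linearly, agreement on all basis states carries over to superpositions. The qubit names, the helper functions, and the example counts passed to `two_qubit_cost` are hypothetical.

```python
def cnot(bits, control, target):
    """Apply a CNOT to a basis state given as a dict qubit -> 0/1."""
    out = dict(bits)
    out[target] ^= out[control]
    return out


def swap(bits, a, b):
    """SWAP built from three alternating CNOTs."""
    for c, t in ((a, b), (b, a), (a, b)):
        bits = cnot(bits, c, t)
    return bits


def move(bits, src, dst):
    """MOVE built from two CNOTs; only valid if dst starts in |0>."""
    assert bits[dst] == 0, "MOVE needs an ancilla in |0>"
    for c, t in ((src, dst), (dst, src)):
        bits = cnot(bits, c, t)
    return bits


def two_qubit_cost(swaps, moves):
    """Elementary two-qubit gates spent on movement operations."""
    return 3 * swaps + 2 * moves


print(swap({"a": 1, "b": 0}, "a", "b"))      # {'a': 0, 'b': 1}
print(move({"q": 1, "anc": 0}, "q", "anc"))  # {'q': 0, 'anc': 1}
print(two_qubit_cost(swaps=4, moves=0), two_qubit_cost(swaps=1, moves=3))  # 12 9
```

Replacing three of four SWAPs by MOVEs in this hypothetical example saves three two-qubit gates, which is the source of the overhead reduction evaluated in this section.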
Fidelity analysis
Qubits have limited coherence time and quantum operations are faulty; therefore, a higher number of operations and a longer circuit latency/depth will likely lead to lower algorithm reliability, which is measured by fidelity in this paper. We investigate how the mapping affects the circuit fidelity by simulating various small benchmarks on a density-matrix-based simulator called quantumsim [47]. The error models in this simulator are implemented based on experimental parameters for transmon qubits. In this work, we only consider qubit decoherence (relaxation and dephasing) and gate and measurement errors, using the parameters from [47]. More specifically, the qubit relaxation time T1 and dephasing time Tφ are set to 30000 ns and 60000 ns, respectively. The in-plane error and in-axis error for single-qubit rotations are set to 5 × 10⁻⁴ and 10⁻⁴, respectively. The incoherent deviation from the expected phase value for CZ gates is 0.01 2π and the readout error is 0.0015. Figure 5 shows the fidelity before and after mapping several small-scale benchmarks (Table III). The fidelity of the circuits drops after mapping. This decrease goes from 1.8% for the 'graycode6' circuit to 13.8% for 'rd32 v0 6', and it is due to the insertion of more operations and the increase of the circuit latency. Moreover, for most of the benchmarks, if a benchmark has both longer latency and more gates, then it has lower fidelity.
These two observations suggest that circuit fidelity is correlated with the latency and the number of gates. However, other parameters may also affect the fidelity, such as the number of qubits and how errors propagate through two-qubit gates, and it is not clear which one has a higher impact on it. For instance, the mapped benchmark 'miller 11' has longer latency and more gates than 'rd32 v0 6', but it achieves higher fidelity. Another example is the mapped benchmark 'alu v0 27', which has shorter latency but more gates than the mapped 'rd32 v0 6' and achieves higher fidelity. The impact of the mapping on the algorithm fidelity needs further investigation. The next step will be to analyze which circuit characteristics affect the fidelity the most, and then to develop a metric that not only represents the fidelity well but can also be easily formulated so that it can be optimized by the mapping procedure.
V. CONCLUSION AND DISCUSSION
In this work, we have presented a mapper called Qmap to make quantum circuits executable on the Surface-17 chip. It takes into account common processor constraints such as the elementary gate set and qubit connectivity, as well as classical control electronics restrictions. Qmap has been embedded in the OpenQL compiler and consists of several modules, including qubit initial placement and routing, operation scheduling, and gate decomposition and optimization. It can be applied to different processors whose hardware constraints are described in a configuration file.
We mapped 56 quantum benchmarks on two superconducting processors, the Surface-17 processor and the IBM Q Tokyo processor. Three different routers, namely, trivial, base, and minextendrc, were used in this evaluation by the Qmap mapper. For both processors, the mapping using the minextendrc router results in the lowest overhead in terms of both circuit latency/depth and number of gates. Furthermore, as expected, the IBM-20 processor requires fewer movement operations than the Surface-17 processor due to its slightly higher qubit connectivity. We also showed that the use of a cheaper movement operation (MOVE) helps to substantially reduce the resulting overhead in terms of both added gates and latency.
We can then conclude that a flexible mapper is required for making quantum circuits executable on different real quantum processors. It must consider not only processor restrictions but also control electronics constraints, as they may limit the parallelism of the operations. In addition, evaluating all possible shortest paths and the different movement sets within each path, and choosing one based on how well it interleaves with previous operations (look-back feature), leads to a lower number of gates and a lower circuit latency/depth. As shown, these two metrics seem to be correlated with the reliability of the algorithm, but a deeper analysis is required to develop an accurate reliability metric that can be directly used by the mapping procedure. Finally, optimizations for reducing the number of operations at different steps of the mapping process are also necessary.
Although our mapper has shown the capability to map benchmarks with a large number of operations, we need to make it scalable to a larger number of qubits. Future work will also include the improvement of the initial placement and routing by, for instance, finding movement operations for several two-qubit gates simultaneously. Furthermore, more mapping metrics need to be investigated and included in the mapper. Note that which parameter(s) to optimise during the mapping might depend on the characteristics of the target quantum processor. In addition, our mapping approach is based on the compilation of quantum circuits at the gate level, where the generated instructions are further translated by the microarchitecture into appropriate signals that control the qubits [57]. A different approach is to directly compile quantum algorithms to control pulses [58]. Further work will compare both solutions and investigate the trade-off of allocating mapping tasks to a compiler and a microarchitecture.
TABLE IV: The results of mapping quantum benchmarks to the Surface-17 processor, including the total number of gates and the number of two-qubit gates (CZs) in the mapped output circuits, the circuit latency in cycles (20 ns per cycle), the numbers of inserted SWAP (SWs) and MOVE (MVs) operations, and the CPU time that mapping takes in seconds.
TABLE V: The results of mapping quantum benchmarks to the IBM-20 processor, including the total number of gates and the number of two-qubit gates (CNOTs) in the mapped output circuits, the circuit depth, the numbers of inserted SWAP (SWs) and MOVE (MVs) operations, and the CPU time that mapping takes in seconds.
