SlackQ : Approaching the Qubit Mapping Problem with A Slack-aware Swap
  Insertion Scheme by Zhang, Chi et al.
SlackQ : Approaching the Qubit Mapping Problem
with A Slack-aware Swap Insertion Scheme
Chi Zhang∗‡ Yanhao Chen† Yuwei Jin†
Wonsun Ahn∗ Youtao Zhang∗ Eddy Z. Zhang†§
∗University of Pittsburgh
†Rutgers University
‡chz54@pitt.edu
§eddy.zhengzhang@gmail.com
Abstract—The rapid progress of physical implementation of
quantum computers paved the way for the design of tools to help
users write quantum programs for any given quantum device.
The physical constraints inherent in current NISQ architectures
prevent most quantum algorithms from being directly executed
on quantum devices. To enable two-qubit gates in the algorithm,
existing works focus on inserting SWAP gates to dynamically
remap logical qubits to physical qubits. However, their schemes
lack consideration of the execution time of generated quantum
circuits. In this work, we propose a slack-aware SWAP insertion
scheme for the qubit mapping problem in the NISQ era. Our
experiments show a performance improvement of up to 2.36X,
and 1.62X on average, over 106 representative
benchmarks from RevLib [1], IBM Qiskit [2], and ScaffCC [3].
I. INTRODUCTION
Quantum computing has been considered a potentially
disruptive computation model. In 2019, Google [4] demonstrated
“Quantum Supremacy” with its 54-qubit quantum processor, which
performed in 200 seconds a computational task that would have
taken a state-of-the-art classical supercomputer 10,000 years.
In general, quantum computing has a significant advantage over
classical computing for applications including large number
factoring [5], database search [6], and quantum simulation [7].
Labs in academia and industry are now able to build
quantum computers with as many as 49 to 72 qubits. IBM [8] released
its 53-qubit quantum computer in October 2019 and has made
it available for commercial use. Google [9] released the 72-qubit
Bristlecone quantum computer in March 2018. Intel [10]
and Rigetti [11] have each released quantum computing
devices with dozens of qubits. Further, a few small-scale
quantum computers with fewer than 20 qubits are made freely
available to the public, for example, the series of quantum
computers provided by the IBM Q Experience [12].
The physical constraints inherent in quantum architectures
prevent quantum algorithms from being directly executed
on the device. One of the major constraints that must be
accounted for before quantum algorithms can be executed is
the qubit connectivity constraint. In superconducting
quantum computers (the implementation adopted by major
industry players such as IBM and Google), qubits are not
fully connected. They follow the nearest-neighbor (NN) interaction
model, enforced by the connectivity of the physical qubit
array. If an algorithm requires communication between qubits
that are not physically connected, the algorithm cannot be
directly executed on the device.
To solve the qubit connectivity problem, any two logical
qubits that need to communicate according to the algorithm
must be mapped to physical qubits on the device that are
neighboring (connected). This is done through one or a se-
quence of SWAP operation(s). A SWAP operation exchanges
the states of two neighboring qubits, in effect “moving” the
two qubits. An example is shown in Fig. 1. This dynamic
remapping between logical and physical qubits may need to
happen multiple times throughout the algorithm.
Inserting swap operations inevitably results in increased
gate count and execution time. Previous studies [13], [14],
[15], [16], [2], [17] focus on optimizing gate count but not
execution time. Execution time is an important measure of
the performance of a circuit. A minimal gate count does not
necessarily guarantee minimal time. We show an example in
Fig. 1 where two qubit mapping solutions yield the same gate
count but only one of them is optimal in time.
Optimizing the execution time of a circuit is important not
only for optimizing the performance but also for improving
the fidelity of a quantum circuit. Quantum computers are not
perfect. Qubits are fickle and error prone. As time goes by,
a qubit decoheres and error accumulates. The time a qubit
can survive without losing its state information with high
probability is called coherence time. The longer a circuit
has to execute, the more likely it will approach a qubit’s
coherence time. IBM proposes the metric of quantum volume
[18] for evaluating the effectiveness of quantum computers.
One important factor for calculating quantum volume is the
maximum depth of a circuit that can be executed by a
quantum computer before accumulating a certain amount of
error. Here the depth represents the amount of time a circuit
executes. Optimizing the depth of a circuit is important as only
circuits that fit into the quantum volume can run successfully
and generate meaningful computational results. Thus quantum
compilers must take the execution time of the generated circuit
into consideration.
arXiv:2009.02346v1 [cs.ET] 4 Sep 2020
[Figure 1 graphic: (a) IBM QX2 coupling graph over Q1 to Q5; (b), (c), (d) circuits on q1(Q1) to q5(Q5) with gates g1 to g6]
Fig. 1: (a) Physical qubit connectivity; (b) the original logical circuit (logical qubits q1, q2, q3, q4, q5 are mapped to physical
qubits Q1, Q2, Q3, Q4, Q5); The gate marked in red is the CNOT gate that cannot be executed due to a connectivity constraint
(as Q2 and Q5 are not physically connected). (c) uses 1 swap but the execution time of the circuit is not increased; (d) uses
1 swap but the execution time of the circuit is increased. The operations marked in blue are swap operations. We assume a
swap operation can be decomposed into three CNOT gates and each gate takes 1 cycle in this example.
In this paper, we focus on time-aware qubit mapping. A
good time-aware qubit mapper needs to yield a hardware-
compliant circuit while having optimal or near-optimal ex-
ecution time. We discover the key is to find intervals with
slack in the circuit and to use the slack to hide the latency
of inserted swap operations. We present important consid-
erations for detecting and exploiting slack in the circuit.
Our implemented qubit mapper named SlackQ automatically
searches for dynamic qubit mappings given an input program
on a quantum architecture with arbitrary qubit connectivity.
The experiments show that SlackQ improves performance by
up to 2.36X, by 1.62X on average, over 106 representative
benchmarks from RevLib [1], IBM Qiskit [2], and ScaffCC
[3].
II. BACKGROUND AND MOTIVATION
A. Quantum Computing Basics
1) Qubit: A quantum bit, or qubit, is the counterpart to
a classical bit in the realm of quantum computing. Unlike
a classical bit, which represents either ‘1’ or ‘0’, a qubit can be
in a coherent superposition of both states. The state $|\psi\rangle$
associated with a qubit is a unit vector in a two-dimensional
vector space. The state of a qubit can be represented as
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle = \begin{bmatrix} \alpha \\ \beta \end{bmatrix},$$
where $\alpha$ and $\beta$ are two complex numbers such that
$|\alpha|^2 + |\beta|^2 = 1$. $\alpha$ and $\beta$ are called amplitudes. Upon the standard
measurement, the state $|\psi\rangle$ will collapse into the basis state
$|0\rangle$ with probability $|\alpha|^2$ or the basis state $|1\rangle$ with
probability $|\beta|^2$. A system of $n$ qubits encodes a superposition
of $2^n$ basis vectors with $2^n$ amplitudes. A classical $n$-bit
system encodes the information of one basis vector in the
vector space, but an $n$-qubit system encodes the information of
a $2^n$-dimensional vector space. Operating on one $n$-qubit state
is like operating on $2^n$ complex numbers at once. This is
one of the reasons for the potential exponential speedup of
quantum computing.
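As a small illustration of the normalization condition and the measurement rule above, the following sketch computes the outcome probabilities for a single qubit and the number of amplitudes of an n-qubit state. The function names are hypothetical and chosen only for this example.

```python
import math

def measurement_probabilities(alpha, beta):
    """Return (P(0), P(1)) for the state alpha|0> + beta|1>."""
    norm_sq = abs(alpha) ** 2 + abs(beta) ** 2
    # A valid qubit state is a unit vector: |alpha|^2 + |beta|^2 = 1.
    if not math.isclose(norm_sq, 1.0, rel_tol=1e-9):
        raise ValueError("amplitudes must satisfy |alpha|^2 + |beta|^2 = 1")
    return abs(alpha) ** 2, abs(beta) ** 2

def num_amplitudes(n_qubits):
    """An n-qubit state is described by 2^n complex amplitudes."""
    return 2 ** n_qubits

# Equal superposition with a relative phase: both outcomes are equally likely.
p0, p1 = measurement_probabilities(1 / math.sqrt(2), 1j / math.sqrt(2))
```

The relative phase between the two amplitudes does not change the measurement probabilities, which depend only on the squared magnitudes.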
2) Quantum Gates: There are two types of elementary
quantum gates. One is the single-qubit gate, which is a unitary
quantum operation that can be abstracted as the rotation around
the axis of the Bloch sphere [19]. A single-qubit gate can also
be represented using a 2 by 2 unitary matrix. Important single-
qubit gates include the H (Hadamard) gate, and the S (phase
shift by pi/4) gate [20].
The second type of gate is the multi-qubit gate. The
controlled-NOT (CNOT) is a two-qubit gate that arguably plays the
most important role in quantum computation. The
two qubits involved in a CNOT gate are the control qubit
and the target qubit. If the control qubit is 0, the gate leaves the
target qubit unchanged. If it is 1, the gate applies a NOT gate to the
target qubit. The CNOT gate entangles qubits and allows qubits
to communicate. The CNOT gate, H gate, S gate, and T gate
together form a universal set called the Clifford+T library. Any
quantum algorithm can be implemented using a composition
of gates from the universal set.
3) Quantum Circuit: A quantum algorithm can be ex-
pressed as a quantum circuit which is composed of a set of
qubits and a sequence of quantum operations on these qubits.
A quantum circuit can be thought of as a quantum algorithm in
“assembly language”. There are two different ways to describe
a quantum circuit. One way is to use the quantum assembly
language called OpenQASM [21] released by IBM. The other
way is to use a circuit diagram, in which qubits are represented
as horizontal lines. Input is on the left and output is on
the right. Unlike a classical circuit, a quantum circuit must
have the same number of input and output qubits. Fig. 1 (b)
shows an example quantum circuit diagram. Logical qubits are
denoted using lowercase letters (q1, q2, ...) and physical qubits
are denoted using uppercase letters (Q1, Q2, ...). Initially,
logical qubits q1, q2, q3, q4, and q5 are mapped to physical
qubits Q1, Q2, Q3, Q4, and Q5. A single-qubit gate is denoted
as a square on the line. A CNOT gate is represented as a line
connecting two qubits where the control qubit is marked with
a dot and the target qubit with a ⊕ sign. In this paper, we use
the circuit diagram representation to describe examples.
B. Qubit Mapping Problem
To enable the execution of a quantum circuit, the logical
qubits in the circuit must be mapped to the physical qubits on
the target hardware. When applying a CNOT gate, the two
logical qubits involved in the CNOT gate must be mapped
to two physical qubits connected to each other. Due to the
irregular layout and connectivity of the qubits in the target
device, it is sometimes impossible to find an initial mapping
that makes the entire circuit CNOT-compliant. The common
practice is to insert SWAP operations to remap the logical
qubits, whenever a CNOT gate cannot be applied.
A SWAP operation exchanges the states of the two input
qubits of interest. As shown in Fig. 2 (b), a swap operation
is implemented using 3 CNOT gates for architectures with bi-
directional links, where a bi-directional link means both ends
of the link can be the control or target qubit. Alternatively, it can be
implemented using 3 CNOT gates plus 4 Hadamard gates for
architectures with single-direction links as shown in Fig. 2 (c),
where a single-direction link means only one end of the link
can be the control qubit.
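The three-CNOT decomposition for bi-directional links can be checked on the computational basis states with a few lines of code. The sketch below models only the classical action of CNOT on basis states (helper names are hypothetical); because quantum gates are linear, agreement on all basis states carries over to superpositions.

```python
def cnot(state, control, target):
    """Flip the target bit if and only if the control bit is 1 (basis-state action)."""
    state[target] ^= state[control]

def swap_via_cnots(state, a, b):
    """SWAP built from three CNOTs with alternating control qubits."""
    cnot(state, a, b)
    cnot(state, b, a)
    cnot(state, a, b)

# Apply the decomposition to every two-qubit basis state.
results = {}
for m_bit in (0, 1):
    for n_bit in (0, 1):
        state = {"m": m_bit, "n": n_bit}
        swap_via_cnots(state, "m", "n")
        results[(m_bit, n_bit)] = (state["m"], state["n"])
```

With one cycle per CNOT, this decomposition accounts for the three-cycle swap latency assumed in the examples of this paper.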
The qubit mapping problem takes a logical circuit and a
hardware coupling graph as input and outputs a transformed
circuit that fits on the hardware device by inserting SWAP
operations. After the transformation, all CNOT gates must
be performed on qubits that are connected in the physical
architecture. Due to the swaps, a logical qubit may be mapped
to different physical qubits at different points in the circuit
execution. But, at any given point, a logical qubit will be
mapped to exactly one physical qubit since we are only using
swaps to move qubits and are not making any copies.
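The compliance condition can be stated as a simple check against the coupling graph. The sketch below is illustrative only: the edge list is an assumed bowtie layout for the five-qubit device of Fig. 1 (a), and the function names are hypothetical.

```python
# Assumed (not taken from the text) undirected coupling edges for Fig. 1 (a).
COUPLING = {
    frozenset(edge)
    for edge in [("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q3"),
                 ("Q3", "Q4"), ("Q3", "Q5"), ("Q4", "Q5")]
}

def is_compliant(cnot, mapping, coupling=COUPLING):
    """A CNOT is executable if its two logical qubits sit on coupled physical qubits."""
    control, target = cnot
    return frozenset((mapping[control], mapping[target])) in coupling

def violations(cnots, mapping, coupling=COUPLING):
    """Return the CNOTs that cannot run under the given mapping."""
    return [g for g in cnots if not is_compliant(g, mapping, coupling)]

identity = {f"q{i}": f"Q{i}" for i in range(1, 6)}
# Mapping after exchanging logical qubits q3 and q5.
after_swap = dict(identity, q3="Q5", q5="Q3")

blocked_before = violations([("q2", "q5")], identity)
blocked_after = violations([("q2", "q5")], after_swap)
```

Under the assumed layout, the CNOT on q2 and q5 is blocked by the identity mapping because Q2 and Q5 are not coupled, and becomes executable once q5 resides on Q3.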
[Figure 2 graphic: (a) SWAP notation on qubits m and n; (b) three CNOT gates; (c) three CNOT gates with four Hadamard (H) gates]
Fig. 2: Implementation of a SWAP operation: (a) the SWAP
notation, where m and n are two logical qubits that exchange
their states after the SWAP; (b) for bidirectional links, where
the three CNOTs that implement the SWAP do not need to use
the same control qubit; and (c) for single-direction links, where
the three CNOTs must use the same control qubit.
An example of circuit transformation is shown in Fig. 1
where (a) is the physical connectivity, (b) is the original circuit,
(c) and (d) offer two different hardware-compliant circuits
generated from the same original circuit.
C. Parallelism in Quantum Circuit
As in classical computers, parallelism is important
in quantum computers. Parallelism comes from independent
operations on different qubits. Gates on the same qubit have to
run sequentially. For instance, if a and b are two consecutive
gates on the same qubit, and a is before b in the program,
then gate b depends on a. Gates that do not share any qubit
are independent. A two-qubit gate depends on up to two gates
since it involves two qubits. Likewise, a two-qubit gate has up to two
immediate successors in the dependence graph. The dependence
graph is built from the partial order between
gates and is a directed acyclic graph (DAG).
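A dependence graph of this kind can be built in a single pass by remembering the last gate seen on each qubit. The following is a minimal sketch; the circuit encoding as (name, qubits) pairs and the function name are assumptions made for illustration.

```python
def build_dependency_graph(circuit):
    """Return {gate: set of immediate predecessors} for a list of (name, qubits)."""
    last_on_qubit = {}
    parents = {}
    for name, qubits in circuit:
        # A gate depends on the most recent gate on each qubit it touches.
        parents[name] = {last_on_qubit[q] for q in qubits if q in last_on_qubit}
        for q in qubits:
            last_on_qubit[q] = name
    return parents

example = [
    ("a", ("q1",)),
    ("b", ("q1", "q2")),
    ("c", ("q3",)),
    ("d", ("q2", "q3")),
]
deps = build_dependency_graph(example)
```

Here the two-qubit gate d has two immediate predecessors, b and c, one per qubit, while a and c have none and can start in parallel.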
In a transformed circuit, the parallelism could be (1) be-
tween the gates in the original circuit (as g2 and g3 in Fig.
1 (b)), (2) between the SWAP gates that are inserted into the
original circuit, and (3) between a gate in the original circuit
and a newly inserted gate (as swap 3,5 and g1 in Fig. 1(c)).
A good qubit mapping algorithm should consider all types
of parallelism. However, existing studies only consider the
first two types of parallelism. Our work is the first one that
systematically exploits all types of parallelism.
As shown in Fig. 1, the two best known approaches, by
Zulehner et al. [16] and Li et al. [13], do not distinguish
solution (c) from solution (d), as both solutions insert
1 SWAP. Both [16] and [13] optimize only the number of
gates inserted into the circuit (or the parallelism of the inserted
gates), but not the parallelism of the transformed circuit. The
solution in Fig. 1 (c) is better than that in Fig. 1 (d) as the inserted
swap can run in parallel with the gates in the original circuit.
This example stresses the importance of time-awareness in
SWAP insertion schemes and motivates our work.
III. INSIGHT AND DESIGN
To improve the parallelism between the inserted SWAP
operation and the gates in the original circuit, we discover
that it is important to exploit the slack intervals in the circuit.
The slack represents the idle time in the original circuit for a
given set of qubits. The key is to hide the latency of inserted
swap operations by using the qubits that are idle at that time
of the circuit execution. This forms the main idea of this paper
and we insert SWAP operations such that they leverage slack
in the circuit as much as possible.
A. Slack
We define slack as the idle time between two consecutive
gates on the same qubit(s) that can be used to perform SWAP
operations without affecting the total execution time of the
entire circuit. Slack is usually caused by dependences
between gates and/or variation in gate count across individual
qubits.
The slack time due to dependence between gates only
occurs when there are two-qubit gates in the circuit. Recall that
a CNOT gate depends on up to two other gates, since CNOT is
a two-qubit gate. If the qubits are running at different speeds,
one of the two qubits might be ready earlier than the other.
The faster qubit thus needs to wait for the slower qubit to finish
before the CNOT gate can be executed. On the other hand,
if a circuit has a number of qubits, and the number of gates
on each qubit is different (even if they are all independent),
then some qubits will inevitably be idle at some point of the
execution. The slack intervals can be used for inserting swap
operations that resolve qubit mapping constraints. An example
of slack in the circuit is shown in Fig. 3.
[Figure 3 graphic: circuit on q1 to q4 with gates H, S, H, g1, g2, g3 and the slack interval marked]
Fig. 3: Slack in the circuit: g3 depends on both g1 and g2,
and g1 finishes earlier than g2. Therefore, for qubit q3, there is a
slack interval of three cycles (assuming each gate takes one
cycle) between g1 and g3 in the circuit.
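Idle intervals of this kind can be found by scheduling every gate as soon as its qubits are free and then measuring the gaps between consecutive gates on each qubit. The sketch below uses a hypothetical circuit loosely modeled on Fig. 3 (the exact gate placement in the figure is an assumption) and a uniform one-cycle latency.

```python
def asap_schedule(circuit, latency=1):
    """Return {gate: (start, end)}, starting each gate when all its qubits are free."""
    free_at, times = {}, {}
    for name, qubits in circuit:
        start = max(free_at.get(q, 0) for q in qubits)
        times[name] = (start, start + latency)
        for q in qubits:
            free_at[q] = start + latency
    return times

def idle_gaps(circuit, times):
    """Return {(qubit, previous_gate, next_gate): idle_cycles} for every nonzero gap."""
    last, gaps = {}, {}
    for name, qubits in circuit:
        for q in qubits:
            if q in last:
                gap = times[name][0] - times[last[q]][1]
                if gap > 0:
                    gaps[(q, last[q], name)] = gap
            last[q] = name
    return gaps

# Hypothetical circuit: three single-qubit gates on q1 delay g2, and g3 waits for g2.
circuit = [
    ("h1", ("q1",)), ("s", ("q1",)), ("h2", ("q1",)),
    ("g1", ("q3", "q4")),
    ("g2", ("q1", "q2")),
    ("g3", ("q2", "q3")),
]
times = asap_schedule(circuit)
gaps = idle_gaps(circuit, times)
```

In this circuit, g1 ends at cycle 1 while g3 cannot start before cycle 4, so q3 is idle for three cycles between g1 and g3. Note that an idle gap in an as-soon-as-possible schedule is only a candidate for slack; whether it can absorb a swap without lengthening the circuit still has to be checked against the critical path.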
There are two types of slack in the circuit. One type does
not require the rescheduling of gates, and we define it
as fixed slack. The other type may have a variable
number of cycles, and we denote it as flexible slack. A good
qubit mapper needs to search globally and exploit both fixed
and flexible slack.
a) Fixed Slack: An example of fixed slack is shown
in Fig. 4 (b). Assuming each gate takes one cycle, there is a
fixed slack between g1 and g2 on qubit q2. Here g2 cannot be
delayed and g1 cannot start early if the total execution time is to
remain unchanged. If qubit q2 is used to perform another operation,
such as a swap, during the three cycles, it will not
affect the execution time of the entire circuit. In this case, the
number of cycles that can be used on q2 between g1 and g2
is fixed.
b) Flexible Slack: Fixed slack does not always exist.
It is sometimes necessary to move gates in order to create
slack for latency hiding. We show an example in Fig.
4 (c) and (d), where slack can be created by moving g1. Suppose
cnot(q1, q2) and cnot(q3, q4) are scheduled on cycle 1. The
three single-qubit gates on Q1 are scheduled on cycles 2, 3, and 4,
respectively. As a result, g3 can be executed on
cycle 5 at the earliest. g3 depends on g1. g1 can be scheduled
at the second cycle, the third cycle (Fig. 4 (c) ) or the fourth
cycle (Fig. 4 (d)) without delaying g3. Thus, a slack
of zero, one, or two cycles can be created between g2 and
g1, depending on when g1 is scheduled. This type of slack
between g2 and g1 is flexible. On the other hand, since g1 is
not directly executable due to the connectivity constraint in
Fig. 4 (a), the more slack intervals before g1 there are, the
better it is for hiding the swap latency. In Fig. 6, we show
that by moving g1 forward, q2 and q3 can have more slack
intervals before g1, and swap(3,4) is inserted which utilizes
the slack, resulting in a total circuit time of 6 cycles only,
which is optimal in this case.
It is worth mentioning that flexible slack can cascade, as
the rescheduling of one gate might affect its descendants or
predecessors. With fixed slack, the gates involved cannot
be delayed without affecting the circuit time. Flexible slack
allows one or multiple gates to delay their start within reasonable
time window(s). Flexible slack is more complicated than
fixed slack, so it is necessary to analyze and exploit it
in a systematic way.
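One common way to quantify how far a gate can move, borrowed from classical scheduling rather than taken from SlackQ itself, is to compare its earliest (ASAP) and latest (ALAP) start cycles under a fixed total circuit time. The sketch below applies this to a hypothetical circuit modeled on the Fig. 4 (c)/(d) description, with cycles counted from 0 and one cycle per gate; the gate names and the qubits assigned to g1 and g3 are assumptions.

```python
def schedule_windows(circuit, latency=1):
    """Return {gate: (asap_start, alap_start)} for a list of (name, qubits)."""
    # Forward pass: earliest start is when all qubits of the gate are free.
    free_at, asap = {}, {}
    for name, qubits in circuit:
        asap[name] = max(free_at.get(q, 0) for q in qubits)
        for q in qubits:
            free_at[q] = asap[name] + latency
    makespan = max(free_at.values())
    # Backward pass: latest start that does not push any later gate past the makespan.
    needed_by, alap = {}, {}
    for name, qubits in reversed(circuit):
        latest_end = min(needed_by.get(q, makespan) for q in qubits)
        alap[name] = latest_end - latency
        for q in qubits:
            needed_by[q] = alap[name]
    return {name: (asap[name], alap[name]) for name, _ in circuit}

circuit = [
    ("c12", ("q1", "q2")), ("c34", ("q3", "q4")),
    ("h1", ("q1",)), ("s", ("q1",)), ("h2", ("q1",)),
    ("g1", ("q2", "q3")),
    ("g3", ("q1", "q2")),
]
windows = schedule_windows(circuit)
mobility = {name: late - early for name, (early, late) in windows.items()}
```

In this example g1 may start at cycle 1, 2, or 3 (the second, third, or fourth cycle when counting from 1) without delaying g3, so up to two cycles of slack can be opened in front of it, whereas the gates on q1 have no freedom at all.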
B. Dynamic Gate Scheduler
We model the resolution of qubit mapping conflicts as a
dynamic scheduling process. Gates in the circuit are scheduled
as soon as their dependencies are resolved. When a gate
cannot be scheduled due to a connectivity problem, we insert a
(combination of) swap(s) to change the qubit mapping so that
the gate can be executed on the physical device. All the gates
that have already been scheduled at one point of scheduling
are called the Processed Circuit, and the gates that still await
scheduling are called the Remaining Circuit.
Fig. 6 shows an example of how the scheduling works.
With initial mapping of {q1, q2, q3, q4} → {Q1, Q2, Q3, Q4},
the first two CNOTs cnot(q1, q2) and cnot(q3, q4) and the
three single-qubit gates on Q1 can be scheduled without
remapping. At this point, those gates that are scheduled are
part of the Processed Circuit. The remaining two CNOT
gates (g1 and g3) that cannot be scheduled are part of the
Remaining Circuit. Gate g1 cannot be scheduled because Q2
and Q3 are not connected in the device. Gate g3 cannot be
scheduled because g1 must be scheduled before g3 (write-
after-read dependency on Q2). The dashed lines divide the
circuit into processed part and remaining part.
To minimize the circuit time, we search for swap candidates
for g1 that maximally hide swap latencies
using circuit slack. Fig. 5 shows the key idea behind the
search for optimal swap candidates. The search reveals
multiple hardware-compliant candidates that utilize different
sequences of swaps to achieve compliance. We choose the
optimal candidate by calculating the Slack Utilization of each
candidate, and choosing the one with the best utilization. In
Fig. 6, we choose swap candidate swap(q3, q4) since it best
hides the swap latency behind the 2-cycle slack shown in Fig.
4 (d). Now g1 is satisfied and scheduling can proceed.
C. Critical Gates
The gates in the remaining circuit whose dependences have
been resolved but whose connectivity problems have not
can be divided into two groups: those on the
critical path and those that are not. We denote the gates on the
critical path as critical gates, and the others as non-critical
gates. In parallel computing, the critical path length is equal
to the execution time when there is enough parallelism. In
this case, the critical path length equals the execution time, since the
maximum parallelism (the maximum number of gates that can
run concurrently) is at most the number of qubits.
Thus it is important to prioritize the scheduling of critical gates
over non-critical gates.
To prioritize critical gates, we need to
resolve their connectivity problems as early as
possible. Imagine a scenario where two gates have connectivity
problems, one critical and the other non-critical, and
their connectivity issues cannot be resolved at the same time.
In this situation, we should resolve the critical gate first, as
resolving the non-critical gate can likely be delayed without
affecting the overall execution time.
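A simple way to identify critical gates among blocked frontier gates is to compute, for each gate, the length of the longest dependent chain that starts at it. The following sketch makes two simplifying assumptions that are not part of the paper: every gate takes one cycle, and the start offsets inherited from the processed circuit are ignored. Gate and function names are hypothetical.

```python
def remaining_path_lengths(circuit, latency=1):
    """Return {gate: cycles on the longest dependent chain starting at that gate}."""
    tail, length = {}, {}
    # Walk backwards so each gate sees the longest chain that follows it on its qubits.
    for name, qubits in reversed(circuit):
        length[name] = latency + max(tail.get(q, 0) for q in qubits)
        for q in qubits:
            tail[q] = length[name]
    return length

def critical_gates(frontier, circuit):
    """Frontier gates whose remaining chain is the longest are treated as critical."""
    length = remaining_path_lengths(circuit)
    longest = max(length[g] for g in frontier)
    return [g for g in frontier if length[g] == longest]

# Hypothetical remaining circuit: gA heads a chain of four gates, gB stands alone.
remaining = [
    ("gA", ("q1", "q2")),
    ("gB", ("q3", "q4")),
    ("h", ("q1",)), ("s", ("q1",)),
    ("gC", ("q1", "q2")),
]
chosen = critical_gates(["gA", "gB"], remaining)
```

Resolving gA first is the better choice here, because every cycle of delay on gA lengthens the whole circuit, while gB can be postponed by up to three cycles without any effect.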
[Figure 4 graphic: (a) qubit coupling graph over Q1 to Q4; (b) fixed slack between g1 and g2; (c) 1-cycle slack and (d) 2-cycle slack before g1, with gates g1, g2, g3]
Fig. 4: (a) Qubit coupling graph; (b) An example of fixed slack in the quantum circuit; (c) and (d) Examples of flexible slacks,
where g1 can be moved within a time window without affecting the circuit execution time, the slack before g1 can be either
1 cycle or 2 cycles assuming every gate takes one cycle. Note the slack between g1 and g3 may also vary due to scheduling
of g1.
[Figure 5 graphic: the candidate SWAPs S sit between the Processed Circuit (PC) and the Remaining Circuit (RC). For each hardware-compliant candidate S1, S2, S3, ..., Sn, the slack utilization SU(X) = calc_slack_util(PC, SX, RC) is calculated, and the SX with the greatest SU(X) is selected.]
Fig. 5: Choose SWAP Candidates
[Figure 6 graphic: (a) qubit coupling graph over Q1 to Q4; (b) circuit on q1(Q1) to q4(Q4), divided into Processed Circuit and Remaining Circuit, with gates g1 and g3; swap (3,4) absorbs the slack]
Fig. 6: Scheduling gates to create more slacks: gate g1 can be
moved forward such that swap Q3, Q4 can absorb the longer
slack on q3
We use an example from Fig. 7 to show how criticality can
play an important role in determining the overall circuit time.
We use a five-qubit quantum machine, whose connectivity is
shown in Fig. 7 (a). This example circuit consists of 4 CNOTs
and 3 single-qubit gates, with g1, g2 scheduled, and g3, g4 not
yet scheduled due to connectivity issues. It’s crucial to note
that g3 is on the critical path, while g4 is not. The two gates
g3 and g4 cannot be resolved at the same time if both of them
want to use only one swap, since qubit C is on the path
from T1 to T2, and from X1 to X2. Whether to prioritize g3
over g4 when using the hub qubit C for swap, makes a big
difference in terms of circuit time. We show this discrepancy
by illustrating two strategies and their resulting circuits.
• Strategy One - Prioritizing critical gates: Resolve g3
first. As shown in Fig. 7 (c), it is necessary to insert swap(q2,
q3) before g3. After g3 is resolved and scheduled,
swap(q2, q4) is inserted such that g4 can be resolved.
swap(q2, q4) can take advantage of the slack on logical
qubits q2 and q4, as logical qubit q1 is processing three
single-qubit gates. This strategy results in total circuit
time of 10 cycles, assuming CNOT and single-qubit gate
both have latencies of one cycle.
• Strategy Two - Not distinguishing critical gates from
non-critical gates: Resolve g4 first. As shown in Fig. 7 (d),
it is necessary to insert swap(q3, q4) before g4, and let
g4 be scheduled. In the meantime, g3 has to wait, which
elongates the critical path by the execution time
of g4 and swap(q3, q4). This is because,
while g4 is being executed, the mapping that allows g4
must be kept, which delays all the remaining gates.
In this case, it is not desirable to delay all the remaining
gates as they are on critical path. Delaying gates that are
critical will have a more detrimental impact than delaying
gates not on critical path. After g4 is resolved, the fastest
way to resolve g3 is to swap(q2, q4) before g3. This
strategy as a whole results in total circuit time of 13
cycles, which is 30% more than strategy one.
This example shows that non-critical gates resolved later
are highly likely to overlap with the gates
on the critical path, and thus have less impact on overall circuit
execution time.
IV. IMPLEMENTATION
Based on the design considerations in Section III, we implement
a slack-aware qubit mapping framework called SlackQ.
A. Overview of SlackQ
Our algorithm is an iterative gate scheduler which dynam-
ically resolves the connectivity issues encountered during the
scheduling process. Initially, a dependency graph of the circuit
is built. Then we traverse the dependency graph of the circuit
[Figure 7 graphic: (a) coupling graph over X1, X2, T1, T2 with hub qubit C; (b) original circuit on q1(X1), q2(X2), q3(C), q4(T1), q5(T2) with gates g1 to g4 and single-qubit gates H, S, H; (c) 10 cycles; (d) 13 cycles]
Fig. 7: (a) Qubit Connectivity for a five-qubit machine (b) Original Circuit before qubit mapping (c) Strategy One: Resolving
gates on critical path first (d) Strategy Two: Resolving gates on non-critical path first.
and schedule the gates one by one. We keep a frontier set of
gates ready to be scheduled. When resolving the connectivity
issues, we invoke a priority-queue based searcher for swap
candidates. It returns hardware-compliant candidates. Among
these hardware-compliant candidates, the one that has the
best slack utilization is chosen, and the scheduling process
proceeds. We describe the algorithm below with respect to the
pseudo-code shown in Algorithm 1:
Step One - Initialization This step prepares for the searching
process. It builds the dependency graph of circuit. It finds the
gates that do not depend on any other gates. Then it places
those gates into the frontier F . It also initializes the processed
gate set P as empty set, and the remaining gate sets R as the
entire circuit.
Step Two - Schedule Ready Gates This step goes through
frontier list F . It finds all gates in F that can be scheduled
immediately due to having no connectivity issues according
to the current mapping pi. It schedules all these gates. After
scheduling a gate, it checks each descendant
gate to see whether the descendant’s other parent has also been
scheduled. If so, the descendant gate’s dependencies
are resolved, and the descendant is placed into F . This
step is repeated until F contains no gate that can be scheduled
with respect to the current qubit mapping.
Step Three - Resolve Qubit Mapping Conflicts We go
through the frontier F again, finding the gates that have resolved
dependencies, are constrained by the current mapping, and
are on the critical path of the remaining circuit. These gates are put
into a set called Fcritical. We then run a priority-queue-based searcher
for hardware-compliant mappings. The searcher
returns a list of hardware-compliant mapping candidates,
called M . Among these candidates, it finds the one (call it m)
with the best slack utilization. Then we use the swap sequence
associated with m to update the mapping pi and add the swap
sequence into the processed circuit P.
Step Four - Repeat Steps Two and Three until all gates in the
circuit are scheduled. Return the transformed circuit.
In Sections IV-B to IV-D, we describe a few important
aspects of this algorithm.
B. Initialization
Before calling the scheduler described in Algorithm 1,
we initialize the frontier F and processed circuit P, and the
remaining circuit R. Initially, P is empty and R is the entire
circuit. To obtain F , we create a Directed Acyclic Graph (DAG) to
represent the dependencies between quantum operations. Fig. 8
(a) shows an example of a dependency graph from the circuit
illustrated in Fig. 1 (b).
Algorithm 1: Dynamic Gate Scheduler
Input : Frontier F, initial mapping pi, processed circuit P, remaining circuit R
Output: Transformed circuit T
while F not empty do
    E = getSchedulableGates(F, pi);
    while E not empty do
        F.remove(E);
        P.add(E);
        R.remove(E);
        for g ∈ E do
            for d ∈ g.children do
                if d’s dependency is resolved then
                    F.add(d);
                end
            end
        end
        E = getSchedulableGates(F, pi);
    end
    Fcritical = select_critical_gates(F);
    mapping_candidates = resolve_conflicts(Fcritical, pi);
    m = best_slack_utilization(mapping_candidates, P, R);
    pi = update_mapping_with_swaps(m, pi);
    P.add(m.swaps);
end
return P;
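The inner loop of Algorithm 1 (Step Two) can be sketched in a few lines. This is an illustrative reading of the pseudo-code, not the SlackQ implementation: the circuit encoding, the function name, and the four-qubit square coupling graph are assumptions, and conflict resolution (Step Three) is left out.

```python
def schedule_ready_gates(circuit, mapping, coupling):
    """Schedule gates until only blocked gates remain; return (processed, frontier)."""
    qubits_of = dict(circuit)
    parents = {name: set() for name, _ in circuit}
    children = {name: [] for name, _ in circuit}
    last = {}
    # Each gate depends on the previous gate on each of its qubits.
    for name, qubits in circuit:
        for q in qubits:
            if q in last and last[q] not in parents[name]:
                parents[name].add(last[q])
                children[last[q]].append(name)
            last[q] = name

    def compliant(name):
        qs = qubits_of[name]
        return len(qs) == 1 or frozenset(mapping[q] for q in qs) in coupling

    frontier = [name for name, _ in circuit if not parents[name]]
    processed = []
    ready = [g for g in frontier if compliant(g)]
    while ready:
        for g in ready:
            frontier.remove(g)
            processed.append(g)
            for d in children[g]:
                # A child joins the frontier once all of its parents are scheduled.
                if parents[d] <= set(processed):
                    frontier.append(d)
        ready = [g for g in frontier if compliant(g)]
    return processed, frontier

# Assumed square coupling graph in which Q2 and Q3 are not connected.
coupling = {frozenset(e) for e in [("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q4"), ("Q3", "Q4")]}
mapping = {f"q{i}": f"Q{i}" for i in range(1, 5)}
circuit = [
    ("c12", ("q1", "q2")), ("c34", ("q3", "q4")),
    ("h1", ("q1",)), ("s", ("q1",)), ("h2", ("q1",)),
    ("g1", ("q2", "q3")),
    ("g3", ("q1", "q2")),
]
processed, blocked = schedule_ready_gates(circuit, mapping, coupling)
```

For this input, the two initial CNOTs and the three single-qubit gates are scheduled, g1 remains in the frontier because of the connectivity constraint, and g3 never enters the frontier because it depends on g1.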
C. Choosing the Best Mapping Candidate
With multiple hardware-compliant mappings, it is necessary
to determine the candidate that has the best slack utilization.
The best slack utilization means the inserted swap sequence
makes best use of the slack currently existing in the circuit.
To evaluate these mapping candidates, for each of them, we
tentatively insert the associated swap sequence and monitor
how the swap insertions affect the dependence graph and the
critical path. We trace the nodes that are affected due to the
inserted swaps and detect how much their start/ending time
changes.
We again use the example circuit and qubit coupling graph
in Fig. 1 to show how our evaluation approach works. We
first show the original dependency graph in Fig. 8 (a). The
numbers on gates denote the start/end cycle of each gate. For
instance, [0, 1] represents the start cycle as 0 and the ending
cycle as 1. We assign a dummy gate node at the beginning of
the circuit for each of the qubits q1 to q5, for the sake of illustration.
The dummy node starts at time 0 and takes 0 cycles. In this
example, we have two possible mapping candidates whose
swap sequences are to be inserted on different qubits. We need
to choose the better one out of the two mapping candidates.
The first mapping candidate shown in Fig. 8 (b) inserts one
swap on logical qubits q2 (the qubit for g4) and q3 in between
gates g4 and g5, corresponding to the circuit in Fig. 1 (d). It
affects gates g5 and g6, which are marked in red. For each
affected node in the dependence graph, to calculate its earliest
start time, one needs to check each of its parent nodes’ ending
time, choose the maximum one, and use it as its own start time.
In this example, the added swap changes the start times
of both g5 and g6, and delays the entire
circuit by 3 cycles. Here we assume each gate takes one cycle.
We assume a swap is implemented using 3 CNOT gates and
thus is 3 cycles.
The second mapping candidate shown in Fig. 8 (c) inserts
one swap on logical qubits q3 and q5 placed right in front
of g5. It results in no changes to the start/end cycles of the
entire circuit, since there is slack on logical qubits q3 and q5.
The scheduler should therefore choose the mapping candidate illustrated
in Fig. 8 (c).
We use an algorithm to systematically analyze the
start/ending time of each gate due to inserted swaps. The
algorithm does not have to traverse the entire dependence
graph. Instead, it only traces the affected gates in the original
circuit to find the candidate that leads to the smallest increment
to total circuit time. We also add an optimization to quickly
terminate the tracing when a candidate is deemed hopeless. A
candidate is deemed hopeless if one of the affected gates is
on the critical path and the delay to that gate due to swaps
already exceeds the smallest increment found in a previous
candidate. In that case, it terminates the tracing and moves on
to the next candidate. The algorithm is described in Algorithm
2.
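The tracing and early-termination idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each candidate is reduced to a hypothetical dictionary `swap_ready` giving the cycle at which the inserted swap feeding a gate finishes, the baseline schedule is assumed to be as-soon-as-possible, and, unlike a stricter final check, the sketch accepts a candidate whose increment is zero. All function names and data are hypothetical.

```python
import math


def trace_candidate(base, parents, children, swap_ready, critical, best_inc):
    """Trace only the gates affected by one candidate's swaps.

    base:       {gate: (start, end)} schedule before inserting swaps
    swap_ready: {gate: cycle at which the swap placed before it ends}
    Returns the circuit-time increment, or None if the candidate is pruned.
    """
    base_time = max(end for _, end in base.values())
    new_end = {}                    # tentative end cycles of delayed gates
    finish = base_time
    frontier = list(swap_ready)     # gates placed right after a swap
    while frontier:
        next_frontier = []
        for g in frontier:
            ready = [new_end.get(p, base[p][1]) for p in parents.get(g, ())]
            ready.append(swap_ready.get(g, 0))
            start = max(ready)
            delta = start - base[g][0]
            if g in critical and delta > best_inc:
                return None         # hopeless: worse than the best so far
            if delta > 0:
                new_end[g] = start + (base[g][1] - base[g][0])
                finish = max(finish, new_end[g])
                next_frontier.extend(children.get(g, ()))
        frontier = next_frontier
    return finish - base_time


def pick_best(base, parents, children, candidates, critical):
    """Return (name, increment) of the candidate with the smallest increment."""
    best, best_inc = None, math.inf
    for name, swap_ready in candidates.items():
        inc = trace_candidate(base, parents, children, swap_ready,
                              critical, best_inc)
        if inc is not None and inc < best_inc:
            best, best_inc = name, inc
    return best, best_inc


# Hypothetical six-gate example in which every gate is on a critical path.
BASE = {"g1": (0, 1), "g2": (1, 2), "g3": (1, 2),
        "g4": (2, 3), "g5": (3, 4), "g6": (4, 5)}
PARENTS = {"g2": ["g1"], "g3": ["g1"], "g4": ["g2", "g3"],
           "g5": ["g4"], "g6": ["g5"]}
CHILDREN = {"g1": ["g2", "g3"], "g2": ["g4"], "g3": ["g4"],
            "g4": ["g5"], "g5": ["g6"]}
CRITICAL = set(BASE)
CANDIDATES = {"A": {"g5": 6},   # swap occupies cycles 3 to 6
              "B": {"g5": 3}}   # swap occupies cycles 0 to 3

best, best_inc = pick_best(BASE, PARENTS, CHILDREN, CANDIDATES, CRITICAL)
```

Only g5 and g6 are ever visited for candidate A, and nothing beyond g5 is visited for candidate B, which is the point of tracing affected gates rather than rescheduling the whole graph.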
D. Navigating the Candidate Search Space
We use a priority queue based searcher for qubit mapping
candidates. The search space consists of state nodes that
represent possible mappings from logical qubits to physical
qubits. A mapping can be represented as π : {q1, q2, ..., qn} →
{Q1, Q2, ..., Qn}. Applying swaps on top of a mapping can
convert it into another mapping. Specifically, if we apply
"swap qi, qj" on a certain mapping π_old and create the
resulting mapping π_new, we will have π_new[qi] = π_old[qj]
and π_new[qj] = π_old[qi].
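As a small illustration of this update rule, the sketch below represents a mapping as a Python dictionary from logical to physical qubit names. The function name `apply_swap` and the three-qubit mapping are assumptions made for the example.

```python
def apply_swap(mapping, qi, qj):
    """Return a new mapping in which logical qubits qi and qj trade places."""
    new_mapping = dict(mapping)          # leave the old mapping untouched
    new_mapping[qi], new_mapping[qj] = mapping[qj], mapping[qi]
    return new_mapping


pi_old = {"q1": "Q1", "q2": "Q2", "q3": "Q3"}
pi_new = apply_swap(pi_old, "q1", "q3")
```

Returning a fresh dictionary keeps each state node of the search independent of its predecessor.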
Algorithm 2: Find the best slack-utilizing mapping
Input : Mapping candidates M, dependency graph G,
        processed circuit P
Output: Best slack-utilizing mapping m_best
smallest_inc = ∞;
CP = getCriticalPath(G);
for m ∈ M do
    RG = getLastScheduledGateOnEachQubit(P);
    G' = G;
    G'.addGates(m.swaps);
    graph_updated = True;
    circuit_time = 0;
    while RG not empty do
        RG' = [];
        for g ∈ RG do
            g.updateTentativeStartAndEndCycle();
            delta = g.tentativeStart - g.originalStart;
            if g is on critical path & delta > smallest_inc then
                graph_updated = False;
                break the while loop;
            end
            if delta > 0 then
                RG'.add(g.children);
            end
            circuit_time = max(circuit_time, g.tentativeEnd);
        end
        RG = RG';
    end
    if (graph_updated == True & circuit_time > CP) then
        if ((circuit_time - CP) < smallest_inc) then
            smallest_inc = circuit_time - CP;
            m_best = m;
        end
    end
end
return m_best;
Fig. 8: (a) The dependency graph generated from the example of Fig. 1.
The numbers displayed on each gate are the start/end cycles of that
gate. (b) One mapping candidate whose inserted swap delays the
start/end cycles of two later gates. (c) Another mapping candidate
whose inserted swap does not affect the start/end cycles of the
entire circuit.
Given Fcritical and the current mapping π, the searcher explores
the state space of all feasible mappings that satisfy Fcritical.
It picks a node to expand and enumerates all possible parallel
one-step swaps as the node's successors. We use a priority
queue that is similar to that in [16]. Unlike the work in [16],
where the search stops when the first state node that resolves
all connectivity conflicts is retrieved from the priority queue, our
search stops m expansions after the mapping candidate
with the minimal swap count is found, or when the mapping
candidate that has just been retrieved has at most
k times more gates than the minimal swap count.
We set m = 20 and k = 2 so that the returned mapping
candidates have reasonable gate counts. After all mapping
candidates have been retrieved, we rank them with respect to
the metric of best slack utilization discussed above.
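The candidate search can be illustrated with a heavily simplified uniform-cost search. The sketch below is not the authors' searcher and makes several assumptions: successors are single swaps rather than sets of parallel swaps, the priority is the plain swap count rather than the cost function of [16], and the stopping rule follows one reading of the text, namely that the search allows m further expansions after the first (minimal-swap) candidate and cuts off once a retrieved candidate needs more than k times the minimal number of swaps. The function name `find_candidates` and the line-shaped coupling graph are hypothetical.

```python
import heapq
from itertools import count


def find_candidates(mapping, coupling, critical_pairs, m=20, k=2):
    """Collect mapping candidates via a uniform-cost search over swaps.

    mapping[l] is the physical qubit holding logical qubit l; coupling is a
    list of physical-qubit pairs; critical_pairs lists the logical pairs
    that must end up on coupled physical qubits.
    """
    adjacent = {frozenset(e) for e in coupling}

    def satisfied(mp):
        return all(frozenset((mp[a], mp[b])) in adjacent
                   for a, b in critical_pairs)

    tie = count()                        # tie-breaker for equal swap counts
    heap = [(0, next(tie), tuple(mapping), ())]
    seen = {tuple(mapping)}
    found, min_swaps, budget = [], None, m
    while heap:
        cost, _, mp, swaps = heapq.heappop(heap)
        if satisfied(mp):
            if min_swaps is None:
                min_swaps = cost         # first hit uses the fewest swaps
            elif cost > k * min_swaps:
                break                    # assumed cut-off on swap count
            found.append((mp, swaps))
            continue
        if min_swaps is not None:
            if budget == 0:
                break                    # m expansions after the first hit
            budget -= 1
        for a, b in coupling:            # one successor per single swap
            nxt = tuple(b if p == a else a if p == b else p for p in mp)
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(
                    heap, (cost + 1, next(tie), nxt, swaps + ((a, b),)))
    return found


# Four physical qubits on a line; logical qubits 0 and 3 must become adjacent.
line = [(0, 1), (1, 2), (2, 3)]
candidates = find_candidates((0, 1, 2, 3), line, [(0, 3)])
```

Each returned entry pairs a final mapping with the swap sequence that produced it, so the candidates can then be ranked by a separate slack-based metric.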
V. EVALUATION
In this section, we evaluate our slack-aware swap insertion
scheme (SlackQ) and compare it with two state-of-the-art
qubit mappers, those of [16] and [13].
A. Experiment Setup
Benchmark. We use 106 benchmarks from RevLib [1],
IBM Qiskit [2], and ScaffCC [3]. RevLib comprises a collection
of benchmarks in the domain of reversible and quantum
circuit design. Qiskit is a programming framework for quantum
computing provided by IBM. ScaffCC is a compilation
framework for the Scaffold quantum programming language.
These benchmarks cover functionality ranging from ALU logic,
comparison of inputs with constant values, and ternary
counters to classic quantum algorithms such as the Quantum
Fourier Transform (QFT) and the Ising model.
Baseline. We compare our work with two of the best known qubit
mapping solutions: the mapper from [16] (denoted as Zulehner)
and the Sabre qubit mapper from [13] (denoted as Sabre). We also
compare our results with IBM's stochastic mapper in Qiskit [2].
Since IBM's Qiskit mapper is significantly worse in terms of
circuit time than all other mappers we have evaluated, we do not
show the results. The performance of the Qiskit mapper is also
noted in [16].
Metrics. We compare the execution time of the transformed
circuits generated by different qubit mapping strategies. Our
approach can take arbitrary gate latencies as input parameters
and generate transformed circuits accordingly. However, to keep
the evaluation as close to real machines as possible, we use the
results from the studies in [22], [23]. These studies investigate
different types of quantum architecture and reveal that two-qubit
gates usually take around twice as much time as single-qubit
gates. Hence we assume single-qubit gates take 1 cycle
and two-qubit CNOT gates take 2 cycles in our experiments.
The time is reported as the total number of executed cycles.
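The latency model can be written down as a small parameter table. The sketch below is only an illustration: the names `LATENCY`, `CNOTS_PER_SWAP`, `gate_cycles`, and `serial_cycles` are hypothetical, and the 6-cycle SWAP is an inference that assumes a swap is three back-to-back CNOT gates on the same qubit pair, each at the 2-cycle CNOT latency.

```python
# Gate latencies in cycles, as assumed in the experiments.
LATENCY = {"single": 1, "cnot": 2}
CNOTS_PER_SWAP = 3   # a swap decomposed into three sequential CNOTs


def gate_cycles(kind, latency=LATENCY):
    """Cycles taken by one gate of the given kind."""
    if kind == "swap":
        return CNOTS_PER_SWAP * latency["cnot"]
    return latency[kind]


def serial_cycles(kinds, latency=LATENCY):
    """Cycles for gates that must run one after another on shared qubits."""
    return sum(gate_cycles(k, latency) for k in kinds)


swap_cost = gate_cycles("swap")
chain_cost = serial_cycles(["single", "cnot", "swap"])
```

Passing a different `latency` table models other hardware, for instance unit latencies as in the illustrative example where a swap takes 3 cycles.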
Platform. We use IBM's 20-qubit Q20 Tokyo architecture
[13] as the underlying quantum hardware. The qubit mapping
approach is implemented in C++ and executed on an Intel 2.4
GHz Core i5 machine with 8 GB of 1600 MHz DDR3 memory.
B. Experiment Analysis
We divide the 106 benchmarks into four categories by gate
count. The first category, mini benchmarks, contains the 22
benchmarks with fewer than 200 gates. The second, small
benchmarks, contains the 39 benchmarks with 200 to 1,000
gates. The third, medium benchmarks, contains the 21
benchmarks with 1,000 to 10,000 gates. The fourth, large
benchmarks, contains the 24 benchmarks with 10,000 to
200,000 gates. The results for mini, small, medium, and large
benchmarks are presented in Fig. 9, Fig. 10, Fig. 11, and
Fig. 12, respectively.
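The categorization can be expressed as a simple threshold function. In the sketch below the function name `category` is hypothetical, and treating each upper bound as exclusive is an assumption based on the "<" bounds in the figure captions; the text itself does not say which category a benchmark with exactly 1,000 or 10,000 gates belongs to.

```python
# Upper bounds (exclusive, by assumption) on the gate count per category.
BOUNDS = [("mini", 200), ("small", 1_000),
          ("medium", 10_000), ("large", 200_000)]

# Number of benchmarks reported for each category.
SIZES = {"mini": 22, "small": 39, "medium": 21, "large": 24}


def category(gate_count):
    """Return the benchmark category for a circuit with gate_count gates."""
    for name, upper in BOUNDS:
        if gate_count < upper:
            return name
    raise ValueError("gate count outside the evaluated range")


total_benchmarks = sum(SIZES.values())
```

The four category sizes add up to the 106 benchmarks used in the evaluation.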
Fig. 9: Speedup for Mini Benchmarks (< 200 gates). (Series: Speedup Over Zulehner, Speedup Over Sabre; vertical axis from 0 to 2.5.)
Fig. 10: Speedup for Small Benchmarks (< 1000 gates). (Series: Speedup Over Zulehner, Speedup Over Sabre; vertical axis from 0 to 2.5.)
Fig. 11: Speedup for Medium Benchmarks (< 10000 gates). (Series: Speedup Over Zulehner, Speedup Over Sabre; vertical axis from 0 to 2.5.)
Fig. 12: Speedup for Large Benchmarks (< 200000 gates). (Series: Speedup Over Zulehner, Speedup Over Sabre; vertical axis from 0 to 2.5.)
The results show that the performance improvement brought
by SlackQ grows as the problem size scales. For most
benchmarks in the mini and small categories, the speedup is
between 1.1X and 1.5X, whereas for the medium and large
categories the speedup for most benchmarks is around or
above 1.5X. The average speedup is 1.45X for mini benchmarks
and 1.55X, 1.70X, and 1.86X for small, medium, and large
benchmarks, respectively. The results show that our approach works
well in general, and in particular for larger benchmarks.
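As a quick consistency check on these figures, the category averages can be combined into an overall average, weighted by the number of benchmarks per category. This assumes each category average is an arithmetic mean over its benchmarks, which the text does not state explicitly; the variable names are hypothetical.

```python
# Benchmarks per category and the reported average speedup of each.
SIZES = {"mini": 22, "small": 39, "medium": 21, "large": 24}
AVG_SPEEDUP = {"mini": 1.45, "small": 1.55, "medium": 1.70, "large": 1.86}

weighted_sum = sum(SIZES[c] * AVG_SPEEDUP[c] for c in SIZES)
overall = weighted_sum / sum(SIZES.values())
```

The result is roughly 1.63, close to the 1.62X average speedup quoted in the abstract; a gap of this size is what one would expect from rounding the per-category averages to two decimals.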
We compare against two baselines: the Zulehner approach and
the Sabre approach. For mini and small benchmarks, the
Zulehner approach does not seem to perform as well as the
Sabre approach, since the speedup of SlackQ over Zulehner is
usually larger than that of SlackQ over Sabre. For medium and
large benchmarks, however, the Sabre approach performs worse
than the Zulehner approach. Zulehner and Sabre thus each
perform well in different scenarios when compared against
each other. Regardless, SlackQ outperforms both of them.
VI. CONCLUSION
The physical layout of contemporary quantum devices imposes
limitations on mapping a high-level quantum program to
the hardware. It is critical to develop an efficient qubit mapper
in the NISQ era. Existing studies aim to reduce the gate
count but are oblivious to the depth of the transformed circuit.
This paper presents the design of the first time-efficient
slack-aware swap insertion scheme. Experimental results show
that our proposed solution generates hardware-compliant
circuits with shorter execution times than state-of-the-art
mapping schemes.
REFERENCES
[1] R. Wille, D. Große, L. Teuber, G. W. Dueck, and R. Drechsler, “Revlib:
An online resource for reversible functions and reversible circuits,” in
38th International Symposium on Multiple Valued Logic (ismvl 2008).
IEEE, 2008, pp. 220–225.
[2] QISKit: Open Source Quantum Information Science Kit,
https://qiskit.org/.
[3] A. JavadiAbhari, S. Patil, D. Kudrow, J. Heckey, A. Lvov, F. T. Chong,
and M. Martonosi, “Scaffcc: a framework for compilation and analysis
of quantum computing programs,” in Proceedings of the 11th ACM
Conference on Computing Frontiers. ACM, 2014, p. 1.
[4] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends,
R. Biswas, S. Boixo, F. G. S. L. Brandao, D. A. Buell, B. Burkett,
Y. Chen, Z. Chen, B. Chiaro, R. Collins, W. Courtney, A. Dunsworth,
E. Farhi, B. Foxen, A. Fowler, C. Gidney, M. Giustina, R. Graff,
K. Guerin, S. Habegger, M. P. Harrigan, M. J. Hartmann, A. Ho,
M. Hoffmann, T. Huang, T. S. Humble, S. V. Isakov, E. Jeffrey, Z. Jiang,
D. Kafri, K. Kechedzhi, J. Kelly, P. V. Klimov, S. Knysh, A. Korotkov,
F. Kostritsa, D. Landhuis, M. Lindmark, E. Lucero, D. Lyakh, S. Mandrà,
J. R. McClean, M. McEwen, A. Megrant, X. Mi, K. Michielsen,
M. Mohseni, J. Mutus, O. Naaman, M. Neeley, C. Neill, M. Y. Niu,
E. Ostby, A. Petukhov, J. C. Platt, C. Quintana, E. G. Rieffel, P. Roushan,
N. C. Rubin, D. Sank, K. J. Satzinger, V. Smelyanskiy, K. J. Sung, M. D.
Trevithick, A. Vainsencher, B. Villalonga, T. White, Z. J. Yao, P. Yeh,
A. Zalcman, H. Neven, and J. M. Martinis, “Quantum supremacy using
a programmable superconducting processor,” Nature, vol. 574, no. 7779,
pp. 505–510, 2019.
[5] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms
and factoring,” in Proceedings 35th annual symposium on foundations
of computer science. IEEE, 1994, pp. 124–134.
[6] L. K. Grover, “A fast quantum mechanical algorithm for database
search,” in Proceedings of the Twenty-eighth Annual ACM Symposium
on Theory of Computing, ser. STOC ’96. New York, NY, USA:
ACM, 1996, pp. 212–219. [Online]. Available:
http://doi.acm.org/10.1145/237814.237866
[7] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou,
P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien, “A variational
eigenvalue solver on a photonic quantum processor,” Nature
Communications, vol. 5, no. 1, p. 4213, 2014. [Online]. Available:
https://doi.org/10.1038/ncomms5213
[8] W. Knight, “IBM Raises the Bar with a 50-Qubit Quantum
Computer,”
https://www.technologyreview.com/s/609451/ibm-raises-the-bar-with-a-50-qubit-quantum-computer,
2017.
[9] J. Kelly, “A Preview of Bristlecone, Google’s New Quantum
Processor,”
https://ai.googleblog.com/2018/03/a-preview-of-bristlecone-googles-new.html,
2018.
[10] J. Hsu, “Intel’s 49-Qubit Chip Shoots for Quantum Supremacy,”
https://spectrum.ieee.org/tech-talk/computing/hardware/intels-49qubit-chip-aims-for-quantum-supremacy,
2018.
[11] Rigetti, https://www.rigetti.com/.
[12] IBM Q, https://www.ibm.com/quantum-computing.
[13] G. Li, Y. Ding, and Y. Xie, “Tackling the qubit mapping problem
for nisq-era quantum devices,” in Proceedings of the Twenty-Fourth
International Conference on Architectural Support for Programming
Languages and Operating Systems. ACM, 2019, pp. 1001–1014.
[14] R. Wille, L. Burgholzer, and A. Zulehner, “Mapping quantum circuits
to ibm qx architectures using the minimal number of swap and h
operations,” in Proceedings of the 56th Annual Design Automation
Conference 2019. ACM, 2019, p. 142.
[15] A. Zulehner, S. Gasser, and R. Wille, “Exact global reordering for near-
est neighbor quantum circuits using A∗,” in International Conference
on Reversible Computation. Springer, 2017, pp. 185–201.
[16] A. Zulehner, A. Paler, and R. Wille, “Efficient mapping of quantum
circuits to the ibm qx architectures,” in 2018 Design, Automation &
Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp.
1135–1138.
[17] M. Y. Siraichi, V. F. d. Santos, S. Collange, and F. M. Q. Pereira,
“Qubit allocation,” in Proceedings of the 2018 International Symposium
on Code Generation and Optimization. ACM, 2018, pp. 113–125.
[18] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M.
Gambetta, “Validating quantum computers using randomized model
circuits,” Physical Review A, vol. 100, no. 3, Sep 2019. [Online].
Available: http://dx.doi.org/10.1103/PhysRevA.100.032328
[19] M. A. Nielsen and I. Chuang, “Quantum computation and quantum
information,” 2002.
[20] M. Amy, D. Maslov, M. Mosca, and M. Roetteler, “A meet-in-the-
middle algorithm for fast synthesis of depth-optimal quantum circuits,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 32, no. 6, pp. 818–830, Jun 2013. [Online]. Available:
http://dx.doi.org/10.1109/TCAD.2013.2244643
[21] A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, “Open
quantum assembly language,” arXiv preprint arXiv:1707.03429, 2017.
[22] P. Murali, J. M. Baker, A. Javadi-Abhari, F. T. Chong, and M. Martonosi,
“Noise-adaptive compiler mappings for noisy intermediate-scale
quantum computers,” in Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages
and Operating Systems, ser. ASPLOS ’19. New York, NY,
USA: ACM, 2019, pp. 1015–1029. [Online]. Available:
http://doi.acm.org/10.1145/3297858.3304075
[23] N. M. Linke, D. Maslov, M. Roetteler, S. Debnath, C. Figgatt, K. A.
Landsman, K. Wright, and C. Monroe, “Experimental comparison of
two quantum computing architectures,” Proceedings of the National
Academy of Sciences, vol. 114, no. 13, pp. 3305–3310, 2017.
