Software Pipelining for Quantum Loop Programs. by Guo, J & Ying, M
Software Pipelining for Quantum Loop Programs
Guo Jingzhe






CQSI, FEIT, University of Technology Sydney
Australia




We propose a method for performing software pipelining on
quantum for-loop programs, exploiting parallelism in and
across iterations. We redefine concepts that are useful in program
optimization, including array aliasing, instruction dependency
and resource conflict, in the setting of quantum programs.
Using the redefined concepts, we present
a software pipelining algorithm exploiting instruction-level
parallelism in quantum loop programs. The optimization
method is then evaluated on some test cases, including pop-
ular applications like QAOA, and compared with several
baseline results. The evaluation results show that our ap-
proach outperforms loop optimizers exploiting only in-loop
optimization chances by reducing total depth of the loop
program to close to the optimal program depth obtained by
full loop unrolling, while generating much smaller code in
size. To the best of our knowledge, this is the first step towards
optimizing quantum programs with such loop control flow.
Keywords: quantum program scheduling, quantum program
compilation
1 Introduction
Quantum computer hardware has reached so-called quantum
supremacy, showing that quantum computation can actually
outperform classical computation for certain tasks, but
it is still in the NISQ (Noisy Intermediate-Scale Quantum)
era, where there are not enough quantum bits (qubits, for
short) for quantum error correction.
Program optimization is particularly important for executing
a quantum program on NISQ hardware, in order to
reduce the number of required qubits, shorten the gate
pipeline, and mitigate quantum noise. Indeed, there has
already been plenty of work on optimization and parallelization
of quantum programs. Theoretically, it was proved in
[5] that compilation of quantum circuits with discretized
time and parallel execution can be NP-complete. Practically,
quantum hardware architectures, especially those based on
superconducting qubits, provide instruction level support for
exploiting parallelism in quantum programs; for example,
Rigetti’s Quil [20] allows programmers to explicitly specify
multiple instructions that do not involve the same qubits
to be executed together, while in Qiskit, ASAP or ALAP
scheduling is performed implicitly [23]. Furthermore, several
compilers have been implemented that can optimize
quantum circuits by exploiting instruction level parallelism;
for example, ScaffCC [11] introduces critical path analysis
to find the “depth” of a quantum program efficiently, re-
vealing how much parallelism there is in a quantum circuit;
commutativity-aware logic scheduling is proposed in [18]
to adopt a more relaxed quantum dependency graph than
“qubit dependency” by taking into account the commutativity
between the 𝑅𝑍 gates and CNOT gates, as well as high-level
commutative blocks, while scheduling circuits. There are also
some more sophisticated optimization strategies reported
in previous works [10, 13, 19, 22].
Quantum hardware will soon be able to execute quan-
tum programs with more complex program constructs, e.g.
for-loops. However, most of the optimization techniques in
previous work only deal with sequential quantum circuits.
Some methods allow loop programs as their input, but those
loops will be unrolled immediately and optimization will
be performed on the unrolled code. Loop unrolling is a
technique that allows optimization across all iterations of
a loop, but it comes at the price of long compilation time,
redundant final code and run-time compulsory cache misses.
As quantum hardware in the near future may allow up to
hundreds of qubits, it will often be helpful to preserve the loop
structure during optimization, since the growth in the number
of qubits will also lead to an increase in total gate count, as
well as in the difficulty of unrolling the entire program.
Software pipelining [12] is a common technique for optimizing
classical loop programs. Inspired by the execution
of an unrolled loop on an out-of-order machine, software
pipelining reorganizes the loop by a software compiler in-
stead of by hardware. There are two major approaches for
software pipelining:
• Unrolling-based software pipelining usually unrolls
the loop for several iterations and finds a repeating pattern
in the unrolled part; see for example [2].
• Modulo scheduling guesses an initiation interval first
and tries to schedule instructions one by one under
dependency constraints and resource constraints; see
for example [12].
Our Contributions: We present a software pipelining
algorithm for parallelizing a certain kind of quantum
























A Preprint, December 23, 2020 Guo, et al.
novel and more relaxed set of dependency rules on a CZ-
architecture (Theorems 1 and 2). The algorithm is essentially
a combination of unrolling-based software pipelining and
modulo scheduling [12], with several modifications to make
it work on quantum loop programs.
We carried out experiments on several examples and compared
the results with the baseline result obtained by loop
unrolling. Our approach is a steady step toward
bridging the gap between optimization that ignores
across-loop opportunities and full unrolling,
while restraining the increase in code size.
Organization of the Paper: In Section 2, we review some
basic definitions used in this paper. The theoretical tools for
defining and exploiting parallelism in quantum loop programs
are developed in Section 3. In Section 4, we present our approach
of rescheduling instructions across loops, extracting
a prologue and an epilogue so that the depth of the loop kernel can
be reduced. The evaluation results of our experiments are
given in Section 5. The conclusion is drawn in Section 6.
[For conciseness, all proofs are given in the Appendices.]
2 Preliminaries and Examples
This section provides some background [14, 25] on quantum
computing and quantum programming.
2.1 Basics of quantum computing
The quantum counterparts of bits are qubits. Mathematically,
a state of a single qubit is represented by a 2-dimensional
complex column vector (𝛼, 𝛽)^𝑇, where 𝑇 stands for transpose.
It is often written in Dirac notation as |𝜓⟩ = 𝛼|0⟩ + 𝛽|1⟩
with |0⟩ = (1, 0)^𝑇 and |1⟩ = (0, 1)^𝑇 corresponding to classical
bits 0 and 1, respectively. It is required that |𝜓⟩ be a unit vector:
∥𝛼∥^2 + ∥𝛽∥^2 = 1. Intuitively, the qubit is in a superposition of
0 and 1, and when measuring it, we will get 0 with probability
∥𝛼∥^2 and 1 with probability ∥𝛽∥^2. A gate on the qubit is
then modelled by a 2 × 2 complex matrix 𝑈. The output of
𝑈 on an input |𝜓⟩ is a quantum state |𝜓′⟩. Its mathematical
representation as a vector is obtained by the ordinary matrix
multiplication 𝑈|𝜓⟩. To guarantee that |𝜓′⟩ is always a unit vector,
𝑈 must be unitary in the sense that 𝑈†𝑈 = 𝐼, where 𝑈† is
the adjoint of 𝑈 obtained by transposing and then complex
conjugating 𝑈. In general, a state of 𝑛 qubits is represented
by a 2^𝑛-dimensional unit vector, and a gate on 𝑛 qubits is
described by a 2^𝑛 × 2^𝑛 unitary matrix. [For the convenience of
the reader, we present the basic gates used in this paper in
Appendix A.]
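The definitions above can be illustrated with a minimal sketch in plain Python. The helper names (`apply_gate`, `is_unitary`, `measurement_probabilities`) are hypothetical and chosen only for this illustration; the Hadamard matrix is the standard one.

```python
import math

def apply_gate(u, psi):
    """Multiply a 2x2 matrix (list of rows) by a 2-component state vector."""
    return [u[0][0] * psi[0] + u[0][1] * psi[1],
            u[1][0] * psi[0] + u[1][1] * psi[1]]

def is_unitary(u, tol=1e-12):
    """Check the condition U†U = I for a 2x2 matrix."""
    for r in range(2):
        for c in range(2):
            # (U†U)[r][c] = sum_k conj(U[k][r]) * U[k][c]
            entry = sum(u[k][r].conjugate() * u[k][c] for k in range(2))
            if abs(entry - (1 if r == c else 0)) > tol:
                return False
    return True

def measurement_probabilities(psi):
    """Probabilities of observing 0 and 1 for the state alpha|0> + beta|1>."""
    return [abs(amplitude) ** 2 for amplitude in psi]

s = 1 / math.sqrt(2)
H = [[complex(s), complex(s)], [complex(s), complex(-s)]]  # Hadamard gate
ket0 = [1 + 0j, 0j]                                         # |0> = (1, 0)^T

psi = apply_gate(H, ket0)
probs = measurement_probabilities(psi)
```

Applying the Hadamard gate to |0⟩ yields an equal superposition, so both measurement outcomes have probability 1/2 up to floating-point rounding.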
2.2 Quantum execution environment
Software pipelining is a highly machine-dependent optimization
approach, so we must state some basic assumptions
about the underlying machine that our algorithm requires.
State-of-the-art universal quantum computers differ in many
ways:
• Instruction set: A quantum computer chooses a uni-
versal set of quantum gates as its low-level instructions.
For example, IBM Q [4] uses controlled-NOT (CNOT)
and three one-qubit gates 𝑈1, 𝑈2, 𝑈3, but Rigetti Quil [20]
uses controlled-Z (CZ) and one-qubit rotations 𝑅𝑋, 𝑅𝑍.
We use the universal gate set {𝑈3, 𝐶𝑍} because
𝑈3 itself is universal for single-qubit gates, which
allows us to merge single-qubit gates at compile time.
[See Appendix A for the definition of these gates.]
• Instruction parallelism: Different quantum comput-
ers are implemented on different technologies, con-
straining their power to execute multiple instructions
simultaneously. Usually, superconducting quantum computers
support parallelism while ion-trap ones do not.
We assume qubit-level parallelism: instructions on dif-
ferent qubits can always be executed simultaneously.
• Timing: Different quantum computers may use differ-
ent timing strategies, using continuous time or discrete
time. Also execution time of different instructions may
differ and is highly machine-dependent. Usually a two-qubit
gate (e.g. CZ or CNOT) takes much longer
than a single-qubit gate. We use a discrete time
model in which every gate takes exactly 1 tick.
• Qubit connectivity: Different machines may have
different qubit topologies. However, we assume that all
gates in the input are directly executable, which may
require a layout synthesis step before our optimization.
• Classical control: The support for classical control
flow varies among different quantum computers; for
example, IBM Q does not support any complex control
flow, while Rigetti Quil supports branch statements.
We assume that such classical control is available [see Appendix C].
The above assumptions do not fit existing quantum
hardware architectures perfectly (for instance, IBM Q
requires CNOT and Quil disallows 𝑈3), while the architecture
of Google’s devices [22] fits these requirements best.
With some slight modifications, however, our method can be
easily adapted to unsupported architectures [see Appendix
L].
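Under the unit-time and qubit-level parallelism assumptions above, the depth of a straight-line circuit can be sketched as follows. This is only an illustrative ASAP computation based on plain qubit dependency; it ignores the commutation rules developed later in the paper, and the function name `circuit_depth` and the tuple encoding of gates are assumptions of this sketch.

```python
def circuit_depth(instructions):
    """Depth of a gate list under a 1-tick-per-gate model.

    `instructions` is a list of tuples of qubit indices, in program order;
    a one-qubit gate is (q,), a CZ gate is (a, b). Gates on disjoint
    qubits may share a tick; gates sharing a qubit are serialized.
    """
    ready = {}   # qubit index -> first tick at which the qubit is free
    depth = 0
    for qubits in instructions:
        start = max((ready.get(q, 0) for q in qubits), default=0)
        finish = start + 1           # every gate takes exactly one tick
        for q in qubits:
            ready[q] = finish
        depth = max(depth, finish)
    return depth

# Two one-qubit gates in parallel, then a CZ on both qubits, plus an
# independent gate on a third qubit: the depth is 2.
example = [(0,), (1,), (0, 1), (2,)]
```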
2.3 Quantum loop programs
We focus on a special family of quantum loop programs,
called one-dimensional for-loop programs, defined as below:
program := header statement∗
header := [(qdef | udef)∗]
qdef := 𝑞𝑢𝑏𝑖𝑡 ident[N];
udef := 𝑑𝑒𝑓𝑔𝑎𝑡𝑒 ident[N] = gate;
gate := [(C^(2×2))∗] | 𝑅𝑍 | 𝑅+𝑍 | 𝑈𝑛𝑘𝑛𝑜𝑤𝑛
gateref := ident[expr]
qubit := ident[expr]
op := 𝑆𝑄(gateref) qubit | 𝐶𝑍 qubit, qubit;
statement := op | 𝑓𝑜𝑟 ident 𝑖𝑛 Z 𝑡𝑜 Z {op∗}
| 𝑓𝑜𝑟 ident 𝑖𝑛 ident 𝑡𝑜 ident {op∗}
expr := Z ∗ ident + Z
where:
• The loop involves a group of one-dimensional qubit
array variables defined by qdef.
• The loop has only one iteration variable 𝑖 starting from
𝑎 to 𝑏 with stride 1. The range [𝑎, 𝑏] is completely
known at compile time, or completely unknown until
execution. This allows our algorithm to be performed
on a program with parametric loop range.
• All array index expressions are in the form (𝑘𝑖 + 𝑏),
where 𝑖 is the iteration variable, and𝑘, 𝑏 ∈ Z are known
constants.
• All operations in the loop body are either a one-qubit
gate or a 𝐶𝑍 gate on two qubits. We do not consider
measurement operations.
• One-qubit gates are defined by udef. They are given
as known matrices, or “an element in an array of un-
known matrices” when a hint on whether the matrix
array is diagonal or antidiagonal can be given. This
allows our algorithm to be performed on a program
with parametric gates or performing different gates on
different iterations.
At the very start of the entire program, all qubit arrays are ini-
tialized as |0⟩. Our optimization may introduce some branch
statements if the endpoints 𝑎 and 𝑏 are unknown before code
execution. As a result, the output language of the compiler
is a superset of the input language above, with support for
branch statements [see Appendix C for one possible definition
of output language]. To show versatility of the above loop,
let us consider several popular quantum algorithms.
Example 1. The Grover algorithm [9] is designed for the black-box
search problem: given a function 𝑓 : {0, 1}^𝑛 → {0, 1},
find a bitstring 𝑥 ∈ {0, 1}^𝑛 such that 𝑓(𝑥) = 1. While a classical
algorithm requires Ω(𝑁) calls to the oracle, where 𝑁 = 2^𝑛, Grover search can
find a solution in 𝑂(√𝑁) calls to the quantum oracle 𝑈𝑓(|𝑥⟩ ⊗
|𝑏⟩) = |𝑥⟩ ⊗ |𝑏 ⊕ 𝑓(𝑥)⟩. This is done by repeating a series of
quantum gates, called the Grover iteration. Grover search can be
written as the loop program:
for i in 1 to 𝑂(√𝑁) do
𝑈𝑓[𝑞, 𝑞𝑤𝑜𝑟𝑘]; (2|𝜓⟩⟨𝜓| − 𝐼)[𝑞]
end for
Example 2. A Quantum Approximate Optimization Algo-
rithm (QAOA for short) is designed in [8] to solve the MaxCut
problem on a given graph 𝐺 = ⟨𝑉 , 𝐸⟩. It can be written as a
parametric quantum loop program:
for i=0 to (N-1) do
𝐻 [𝑞 [𝑖]]
end for
for i=1 to p do
for (𝑎, 𝑏) ∈ 𝐸 do
𝐶𝑁𝑂𝑇 [𝑞 [𝑎], 𝑞[𝑏]];𝑈𝐵 [𝑖] [𝑞 [𝑏]]; 𝐶𝑁𝑂𝑇 [𝑞 [𝑎], 𝑞[𝑏]]
end for
for j=0 to (N-1) do
𝑈𝐶 [ 𝑗] [𝑞 [ 𝑗]]
end for
end for
Here, we use parametric gate arrays 𝑈𝐶 [𝑖] = 𝑅𝑋 (𝛽𝑖 , 𝑗) and
𝑈𝐵 [𝑖] = 𝑅𝑍 (−𝜔𝑎𝑏𝛾𝑖 ) of rotations. The two innermost loops can
be unrolled to satisfy our input language requirements. Since
QAOA repeatedly executes the circuit but each time with dif-
ferent sets of angles {𝛽𝑖 } and {𝛾𝑖 }, an optimizer has to support
compilation of the circuit above without knowing all parame-
ters in advance. Note that the compiler can know in advance
that 𝑈𝐵 [𝑖] are diagonal matrices, and this hint might be used
during optimization. [For a further explanation of QAOA,
see Appendix B.]
3 Theoretical tools
In this section, we develop a handful of theoretical techniques
required in our optimization. To start, let us identify some
of the most critical challenges in optimizing quantum loop
programs:
• Instructions may be merged together at compile time,
potentially reducing the total depth. However, merging
instructions requires knowing which instructions may be
adjacent in the unrolled pattern, thus requiring us to
resolve all possible qubit aliasings.
• The data dependency graph of a quantum program is
usually much denser than that of a classical program,
since two matrices generally do not commute, that
is, 𝐴𝐵 ≠ 𝐵𝐴.
• Resource constraints, which prevent instructions
that have no dependency from executing together,
are quite different in the quantum case from the classical case.
We will show how much optimization can be done by miti-
gating these challenges in loop reordering.
3.1 Gate merging
Our assumptions allow several instructions to be merged
into a single instruction with the same effect:
• Two adjacent one-qubit gates on the same qubit can
be merged, since we are using𝑈3.
• Two adjacent 𝐶𝑍 gates on the same qubits can cancel
each other.
Example 3. Figure 1 is a simple case of a periodic gate-merging
pattern. The two one-qubit gates in different iterations may
merge with each other, thus simplifying the dependency graph
and introducing new opportunities for optimization.
Figure 1 (loop program): for i=0 to 3 do U q[i]; V q[i+1]; W q[i+2];
(unrolled circuit, row |q2⟩: W V U).
Figure 2. The 𝐶𝑍 gate prevents the two Hadamard gates
from merging, due to potential qubit aliasing. Panels:
(a) 𝑖 ≠ 0 ∧ 𝑖 ≠ −1; (b) 𝑖 = 0; (c) 𝑖 = −1.
Gate merging allows us to decrease count of gates, and
thus reduce total execution time. However, the existence of
potential aliasing adds to the difficulty of finding “adjacent”
pairs of gates. Figuring out pairs of gates that can be safely
merged is one of the critical problems when scheduling the
program.
Example 4. Even for a simple program, it can be hard to
decide whether two adjacent instructions on a qubit can be
merged. Consider the simple program:
for i=a to b do
𝐻 [𝑞 [0]]; 𝐶𝑍 [𝑞 [𝑖], 𝑞[𝑖 + 1]]; 𝐻 [𝑞 [0]];
end for
We can merge the Hadamard gates if and only if ∀𝑖, 𝑖 ≠
0 ∧ (𝑖 + 1) ≠ 0. Three possible cases of 𝑖 lead to three different
results, as Figure 2 shows.
The above example reveals that resolving qubit aliasings
is crucial in gate merging.
for i=0 to 3 do
  H q[1];
  CZ q[i], q[i+1];
  H q[1];
end for
(a) Loop program.
Figure 3. The unrolled loop does not reveal a periodic feature, due
to qubit aliasing.
for i=0 to 3 do
  H q[i];
  CZ q[i], q[i+1];
  H q[i+1];
end for
(a) Loop program.
|q0〉 H •
|q1〉 • H H •
|q2〉 • H H •
|q3〉 • H H •
|q4〉 • H
(b) Unrolled circuit.
Figure 4. Periodic feature in the unrolled loop can be cap-
tured.
3.2 Qubit aliasing resolution
Allowing arbitrary linear expressions to be used to index
qubit arrays introduces the problem of qubit aliasing, both
within a single iteration and across iterations. Potential aliasing
in quantum programs leads to two kinds of problems: a lack of
periodic features in the unrolled schedule, and extra complexity in
detecting aliasings.
The first problem is that non-periodic features cannot be
captured using software-pipelining (or other loop schedul-
ing methods). For example, in Figure 3, the situation where
𝐶𝑍 blocks two Hadamards from merging only occurs in one
or two iterations of the loop program, but it prevents the
merging in all iterations, since software pipelining can only
generate a periodic pattern and has to generate conservative
code. The only kind of aliasing (two different qubit expressions
referring to the same qubit) that software pipelining
can capture is between expressions on the same qubit array
with the same slope, as shown in Figure 4.
Figure 5. An example for across-loop qubit aliasing with
𝑘1 = 3 and 𝑘2 = 2. For 𝑇 = Z, Δ𝑖 = 1, while for 𝑇 = [4, 10],
Δ𝑖 = 2.
To see the second problem, we note that detection of memory
aliasing [1] is usually handled by an Integer Linear
Programming (ILP) problem solver such as Z3 [7]. However, a
general ILP problem is NP-complete in theory and may take
a long time to solve in practice. Fortunately, we will see that
all problems that we are facing can be solved efficiently in
𝑂(1) time without an ILP solver.
We consider two references to the same qubit array, 𝑞[𝑘1𝑖 + 𝑏1] and
𝑞[𝑘2𝑖 + 𝑏2], with 𝑖 ∈ 𝑇, where 𝑇 is the loop interval when the
loop range is known and Z when it is unknown.
Definition 1. In-loop qubit aliasing: To check whether
two instructions can always be executed together, we have
to check if one qubit reference may be an alias of another, that
is, (∃𝑖 ∈ 𝑇 ) (𝑘1𝑖 + 𝑏1 = 𝑘2𝑖 + 𝑏2) .
This problem can be easily solved by checking whether
(𝑏2 − 𝑏1) is a multiple of (𝑘1 − 𝑘2) and (𝑏2 − 𝑏1)/(𝑘1 − 𝑘2) lies in 𝑇.
When 𝑘1 = 𝑘2, the two references alias exactly when 𝑏1 = 𝑏2.
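The in-loop check of Definition 1 can be sketched as a constant-time function. The name `in_loop_alias` and the convention of passing `lo`/`hi` as `None` for an unknown range (𝑇 = Z) are assumptions of this sketch, not part of the paper's implementation.

```python
def in_loop_alias(k1, b1, k2, b2, lo=None, hi=None):
    """True if q[k1*i + b1] and q[k2*i + b2] may name the same qubit
    for some iteration i in T.

    T is the closed interval [lo, hi] when the loop range is known,
    and all integers when lo and hi are both None. A non-empty T is
    assumed.
    """
    if k1 == k2:
        # Same slope: the references coincide in every iteration or in none.
        return b1 == b2
    num, den = b2 - b1, k1 - k2      # solve (k1 - k2) * i = b2 - b1
    if num % den != 0:
        return False                 # no integer solution
    i = num // den
    return lo is None or lo <= i <= hi
```

For the program of Example 4, the reference `q[0]` (𝑘 = 0, 𝑏 = 0) aliases with `q[i]` at 𝑖 = 0 and with `q[i+1]` at 𝑖 = −1, so the Hadamard gates can be merged on a known range such as [1, 5] but not on an unknown range.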
Definition 2. Across-loop qubit aliasing: To check whether
there is an across-loop dependency between two instructions,
we have to check if one qubit reference may be an alias of another
qubit reference several iterations later. Thus, we need
to find the minimal increment Δ𝑖 ⩾ 1 such that
(∃𝑖 ∈ 𝑇 ) ((𝑖 + Δ𝑖 ∈ 𝑇 ) ∧ (𝑘1𝑖 + 𝑏1 = 𝑘2 (𝑖 + Δ𝑖) + 𝑏2)) . (1)
This issue can be reduced to the Diophantine equation
(𝑘2 − 𝑘1)𝑖 + 𝑘2 (Δ𝑖) = 𝑏1 − 𝑏2, 𝑖 ∈ 𝑇, 𝑖 + Δ𝑖 ∈ 𝑇,Δ𝑖 ⩾ 1, (2)
which can be solved in 𝑂(1) time [see Appendix D]. We solve
the equation every time it is needed rather than storing
its solution. A visualization of across-loop qubit aliasing
is presented in Figure 5.
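To make Definition 2 concrete, the following sketch searches for the minimal Δ𝑖 by direct enumeration of Equation (2). It is deliberately a brute-force illustration and not the 𝑂(1) method of Appendix D; the function name, the `None` convention for an unknown range, and the search bound are assumptions of this sketch.

```python
def min_across_loop_delta(k1, b1, k2, b2, lo=None, hi=None):
    """Smallest delta >= 1 with k1*i + b1 == k2*(i + delta) + b2 for some
    i such that i and i + delta both lie in T, or None if there is none.

    T = [lo, hi] when the range is known, all integers when both are None.
    """
    a = k2 - k1
    if lo is None:
        # For T = Z, solvability depends only on delta modulo |a| (or, when
        # a == 0, delta is determined by b1 - b2), so this bound suffices.
        bound = max(abs(a), abs(b1 - b2), 1)
    else:
        bound = hi - lo              # i and i + delta must both fit in T
    for delta in range(1, bound + 1):
        rhs = b1 - b2 - k2 * delta   # (k2 - k1) * i = rhs, from Equation (2)
        if a == 0:
            if rhs == 0:
                return delta         # every i satisfies the equation
            continue
        if rhs % a != 0:
            continue
        i = rhs // a
        if lo is None or (lo <= i and i + delta <= hi):
            return delta
    return None
```

With 𝑘1 = 3, 𝑘2 = 2 as in Figure 5, and taking 𝑏1 = 𝑏2 = 0 as an assumed choice of offsets, the sketch gives Δ𝑖 = 1 for 𝑇 = Z and Δ𝑖 = 2 for 𝑇 = [4, 10].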
3.3 Instruction data dependency
One of the most important steps in rescheduling a loop is to find the
data dependencies: instructions that cannot be reordered
while scheduling. Previous work mostly defined instruction
dependency according to matrix commutativity: the order of
two instructions can change if their unitary matrices satisfy
𝐴𝐵 = 𝐵𝐴. This captures most commutativity between gates,
but not all. Here, we relax this requirement by establishing
several novel and more relaxed commutativity rules between
quantum instructions. Since the 𝐶𝑍 gate is the only two-qubit
gate we use and any two 𝐶𝑍 gates commute with each other,
what we need to care about is commutativity between 𝐶𝑍
gates and one-qubit gates.
Definition 3. (CZ conjugation) If for one-qubit gates 𝑈𝐴, 𝑈𝐵,
𝑉𝐴 and 𝑉𝐵 we have 𝐶𝑍 (𝑈𝐴 ⊗ 𝑈𝐵) 𝐶𝑍 = 𝑉𝐴 ⊗ 𝑉𝐵, we say that 𝐶𝑍
conjugates 𝑈𝐴 ⊗ 𝑈𝐵 into 𝑉𝐴 ⊗ 𝑉𝐵.
Conjugation allows us to swap a 𝐶𝑍 gate with a pair of
one-qubit gates, at the price of changing 𝑈𝐴 and 𝑈𝐵 to 𝑉𝐴
and 𝑉𝐵 correspondingly. The following theorem identifies
all possible conjugations.
Theorem 1. (CZ conjugation of single qubit gates) 𝐶𝑍 conjugates
𝑈𝐴 ⊗ 𝑈𝐵 into some 𝑉𝐴 ⊗ 𝑉𝐵 if and only if 𝑈𝐴 and 𝑈𝐵
are diagonal or anti-diagonal: 𝑈𝑖 = 𝑅𝑍(𝜃) or 𝑈𝑖 = 𝑅+𝑍(𝜃) for
𝑖 ∈ {𝐴, 𝐵}.
Note 1. The antidiagonal rule has been named “EjectPhasedPaulis”
in [22]. However, our rules are both necessary
and sufficient: no more commutation rules can be obtained
at the gate level.
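Theorem 1 can be spot-checked numerically on small instances. The sketch below uses plain Python matrices; the Pauli 𝑋 gate stands in for an antidiagonal gate, and the phase convention chosen for `rz` is an assumption of this sketch (the paper's own definitions are in Appendix A). All helper names are hypothetical.

```python
import cmath
import math

def kron(a, b):
    """Kronecker product of two square matrices given as lists of rows."""
    m = len(b)
    size = len(a) * m
    return [[a[r // m][c // m] * b[r % m][c % m] for c in range(size)]
            for r in range(size)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def close(a, b, tol=1e-12):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def is_tensor_product(m4, tol=1e-12):
    """True if the 4x4 matrix equals A (x) B for some 2x2 matrices A, B."""
    # Flatten the four 2x2 blocks. For a product all blocks are multiples
    # of B, so every 2x2 minor of the resulting 4x4 table vanishes.
    blocks = [[m4[2 * br + r][2 * bc + c] for r in range(2) for c in range(2)]
              for br in range(2) for bc in range(2)]
    return all(abs(p[c] * q[d] - p[d] * q[c]) <= tol
               for p in blocks for q in blocks
               for c in range(4) for d in range(4))

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]                  # antidiagonal example
Z = [[1, 0], [0, -1]]
S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]                 # neither diagonal nor antidiagonal
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

def rz(theta):
    """Diagonal rotation; the phase convention is assumed for illustration."""
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]

def conjugate_by_cz(u4):
    return matmul(matmul(CZ, u4), CZ)
```

A diagonal gate passes through 𝐶𝑍 unchanged, 𝑋 on one qubit picks up a 𝑍 on the other qubit, and conjugating 𝐼 ⊗ 𝐻 does not yield a tensor product of one-qubit gates, in line with the theorem.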
Since the identity matrix 𝐼 is diagonal, 𝑈𝐴 and 𝑈𝐵 can be
thought of as undergoing conjugation separately. Thus, we
only need to consider two special cases: 𝐼 ⊗ 𝑅𝑍 and 𝐼 ⊗ 𝑅+𝑍.
Note that in the conjugation rules, 𝑅+𝑍 always introduces a 𝑍
gate on the other qubit. This inspires us to extend
Theorem 1 to a generalized form of 𝐶𝑍, defined as follows:
Definition 4. (Generalized 𝐶𝑍 gates) For 𝑥, 𝑦 ∈ {0, 1}, we
define the following variants of the 𝐶𝑍 gate:
𝐶𝑍11[𝑎, 𝑏] = 𝐶𝑍[𝑎, 𝑏], 𝐶𝑍00[𝑎, 𝑏] = −𝑍[𝑎]𝑍[𝑏]𝐶𝑍[𝑎, 𝑏]
𝐶𝑍10[𝑎, 𝑏] = 𝑍[𝑎]𝐶𝑍[𝑎, 𝑏], 𝐶𝑍01[𝑎, 𝑏] = 𝑍[𝑏]𝐶𝑍[𝑎, 𝑏]
Equivalently, 𝐶𝑍𝑥𝑦 can be defined as follows: 𝐶𝑍𝑥𝑦|𝑎𝑏⟩ =
(−1)^(𝛿𝑎𝑥 𝛿𝑏𝑦) |𝑎𝑏⟩, where 𝛿𝑖𝑗 is the Kronecker delta. Now we have
the following commutativity rules for generalized 𝐶𝑍:
Theorem 2. (Generalized 𝐶𝑍 conjugation of single qubit
gates) When exchanged with 𝑅+𝑍, a 𝐶𝑍 gate changes into one of
its variants by toggling the corresponding bit.
1. 𝑅𝑍(𝛼)[𝑏] 𝐶𝑍𝑥𝑦[𝑎, 𝑏] = 𝐶𝑍𝑥𝑦[𝑎, 𝑏] 𝑅𝑍(𝛼)[𝑏];
2. 𝑅+𝑍(𝛼)[𝑏] 𝐶𝑍𝑥𝑦[𝑎, 𝑏] = 𝐶𝑍𝑥(1−𝑦)[𝑎, 𝑏] 𝑅+𝑍(𝛼)[𝑏].
Since generalized 𝐶𝑍 gates are also diagonal, they com-
mute with each other and can be scheduled just as ordinary
𝐶𝑍 gate and converted back to 𝐶𝑍 by adding 𝑍 gates.
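Because every generalized 𝐶𝑍 gate is diagonal, Definition 4 and rule 2 of Theorem 2 can be checked on diagonals alone. In the sketch below a diagonal gate is stored as its four diagonal entries in the basis order |00⟩, |01⟩, |10⟩, |11⟩, and the Pauli 𝑋 gate again stands in for a general antidiagonal gate; both choices, and all function names, are assumptions made for illustration.

```python
def cz_variant(x, y):
    """Diagonal of CZ_xy: -1 on the basis state |xy>, +1 elsewhere."""
    return [-1 if (a == x and b == y) else 1 for a in (0, 1) for b in (0, 1)]

def times(*diagonals):
    """Entry-wise product, i.e. the product of diagonal matrices."""
    out = [1, 1, 1, 1]
    for d in diagonals:
        out = [p * q for p, q in zip(out, d)]
    return out

Z_A = [1, 1, -1, -1]      # Z on qubit a
Z_B = [1, -1, 1, -1]      # Z on qubit b
CZ = cz_variant(1, 1)

def x_on_b_toggles(x, y):
    """Check X[b] * CZ_xy == CZ_x(1-y) * X[b].

    X on qubit b maps basis index i to i ^ 1, so X[b] * D sends column i
    to row i ^ 1 with entry D[i], while D' * X[b] gives entry D'[i ^ 1].
    """
    d, d_toggled = cz_variant(x, y), cz_variant(x, 1 - y)
    return all(d[i] == d_toggled[i ^ 1] for i in range(4))
```

The four variants agree with the product formulas of Definition 4, and exchanging 𝑋 on qubit 𝑏 with 𝐶𝑍𝑥𝑦 toggles the second index for every choice of 𝑥 and 𝑦.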
3.4 Instruction resource constraint
Qubits have properties that resemble both data and resource:
qubits work as quantum data registers and carry quantum
data; meanwhile, qubit-level parallelism allows all instruc-
tions, if they operate on different qubits, to be executed simul-
taneously. This results in a surprising property for quantum
programs: the resources should be described using linear ex-
pressions, instead of by a static “resource reservation table”
as in the classical case. Using the rules for detecting qubit
aliasings, we simply check if there is an aliasing between
the qubit references from two instructions, that is, whether the two
instructions share a same qubit at some iteration and cannot
be executed simultaneously.
Figure 6. The entire compilation flow of our approach
(diagram label: “Branch by m mod C”).
4 Rescheduling loop body
Now we are ready to present the main algorithm for pipelin-
ing quantum loop programs. It is based on modulo schedul-
ing via hierarchical reduction [3], but several modifications
to the original algorithm are required to fit into scheduling
quantum instructions on qubits. The entire flow of our ap-
proach is depicted in Figure 6. For simplicity we suppose the
number of iterations is large enough so that we don’t worry
about generating a long prologue/epilogue.
4.1 Loop body compaction
At first we compact the loop kernel to merge the gates that
can be trivially merged, including: (a) adjacent single qubit
gates; (b) diagonal or antidiagonal single qubit gates and
their nearby single qubit gates, maybe at the other side of a
𝐶𝑍 gate; and (c) adjacent 𝐶𝑍 gates. To this end, we define
the following compaction procedure, which considers the
potential aliasing between qubits:
Definition 5. A greedy procedure for compacting loop kernel:
• Initialize all qubits with an identity gate.
• Place all instructions one by one. Initialize operation
to “Blocked”. Check the new instruction (A) against all
placed instructions (B). Update operation according to
Table 1.
• Perform the last operation according to the table.
– “Blocked” means the instruction is put at the end of
the instruction list.
– “Merge with B” means the single qubit instruction
is merged with the placed single qubit gate B. If the
placed gate is an antidiagonal,𝑍 gates should be added
for uncancelled 𝐶𝑍 gates that occur earlier but are
placed after the antidiagonal.
– “Cancelled” means two 𝐶𝑍 gates are cancelled. Note
that the added 𝑍 gates are not cancelled. Also, a third
arriving 𝐶𝑍 can “uncancel” a cancelled 𝐶𝑍, which we
also call “Cancelled”.
This compaction can be done in two directions: compacting
to the left or to the right. These can be seen as the results
of ASAP and ALAP scheduling, respectively. However, a single
application of the procedure is not guaranteed to converge:
not all outputs of the procedure are fixpoints of
the procedure. For example, the circuit in Figure 7 only converges
after three applications of left compaction. In general,
we have the following:
Theorem 3. Compacting three times results in a fixpoint of
the compaction procedure.
Note that we allow using unknown single-qubit gates. If
all components are known to be diagonal or antidiagonal,
the product of these matrices is also diagonal or antidiagonal
[see Appendix F]. Otherwise, we can only see the product
as a general matrix. However, this does not affect our result
of three-time compaction. Also, compacting in one direction
does not capture all chances of merging. Figure 8 shows that
some single-qubit merging chances are missed. In practice
we perform a left compaction after a right compaction.
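The flavour of left compaction can be conveyed by a heavily simplified sketch. It works on concrete qubit indices (so there is no aliasing to resolve), merges one-qubit gates only symbolically by concatenating their names, and omits the rules of Table 1 that let diagonal and antidiagonal gates pass through 𝐶𝑍. The instruction encoding and the name `compact_left` are assumptions of this sketch.

```python
def compact_left(instructions):
    """Greedy left compaction on concrete qubits.

    Instructions are ("SQ", qubit, (name, ...)) or ("CZ", a, b).
    A one-qubit gate merges into the previous one-qubit gate on its qubit
    unless a CZ on that qubit lies in between; a CZ cancels an earlier CZ
    on the same pair unless a one-qubit gate on either qubit intervenes
    (other CZ gates commute with it and are skipped).
    """
    placed = []
    for ins in instructions:
        if ins[0] == "SQ":
            _, q, names = ins
            target = None
            for idx in range(len(placed) - 1, -1, -1):
                other = placed[idx]
                if other[0] == "SQ" and other[1] == q:
                    target = idx
                    break
                if other[0] == "CZ" and q in other[1:]:
                    break                         # blocked by a CZ on q
            if target is None:
                placed.append(("SQ", q, tuple(names)))
            else:
                merged = placed[target][2] + tuple(names)
                placed[target] = ("SQ", q, merged)
        else:
            _, a, b = ins
            pair = frozenset((a, b))
            cancel = None
            for idx in range(len(placed) - 1, -1, -1):
                other = placed[idx]
                if other[0] == "CZ":
                    if frozenset(other[1:]) == pair:
                        cancel = idx
                        break
                elif other[1] in pair:
                    break                         # blocked by a one-qubit gate
            if cancel is None:
                placed.append(("CZ", a, b))
            else:
                del placed[cancel]                # the two CZ gates cancel
    return placed

program = [("SQ", 0, ("H",)), ("CZ", 0, 1), ("CZ", 1, 2),
           ("CZ", 0, 1), ("SQ", 0, ("H",))]
```

In `program`, the two 𝐶𝑍 gates on qubits 0 and 1 cancel across the commuting 𝐶𝑍 on qubits 1 and 2, after which the two Hadamard gates on qubit 0 become adjacent and merge.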
4.1.1 Loop unrolling and rotation. Loop kernel com-
paction can only discover gate merging and cancellation
in one iteration. However, gate merging and cancellation
can also occur across iterations. For example, in Figure 4
the last 𝐻 gate in the previous iteration can be merged and
cancelled with the first𝐻 gate in the next iteration. This kind
of cancellation cannot be discovered by software pipelining
either, since it is a reordering technique and cannot cancel
instructions out.
An instruction 𝑖 in one iteration may merge or cancel
with instruction 𝑗 from 𝑡 ⩾ 1 iterations later. All poten-
tial merging of single qubit gates and cancellable 𝐶𝑍 gates
can be written out by enumerating all pairs of instructions.
Loop rotation [15] is an optimization technique that converts
across-loop dependency into in-loop dependency (so that some
variables can be privatized and optimized out). Consider a
loop ranging from 𝑚 to 𝑛: {𝐴_𝑖 𝐵_𝑖 𝐶_𝑖} for 𝑖 = 𝑚, …, 𝑛. Here, 𝐴_𝑖 can be
rotated to the tail of the loop: 𝐴_𝑚 {𝐵_𝑖 𝐶_𝑖 𝐴_(𝑖+1)} for 𝑖 = 𝑚, …, 𝑛 − 1,
followed by 𝐵_𝑛 𝐶_𝑛, so that 𝐶_𝑖 and 𝐴_(𝑖+1) are now in one iteration.
If 𝐶_𝑖 writes into a temporary variable and 𝐴_(𝑖+1) reads from it,
this variable can be
privatized. For merging candidates with 𝑡 = 1, we can use a
similar procedure:
Definition 6. An instruction is considered movable if it sat-
isfies one of following conditions:
• The instruction is a single-qubit gate, and there are no
gates on the same qubit or on an aliasing qubit before it;
in this case the instruction can be rotated to the right.
A\B SQ with same qubit SQ with in-loop aliasing CZ with same qubit CZ with aliasing qubit
Diagonal SQ Merge with B Blocked
AntiDiagonal SQ Merge with B Blocked Blocked
General SQ Blocked Blocked Blocked Blocked
CZ Blocked Blocked If exactly-same then Cancel
Table 1. Operation table for loop kernel compaction. Empty cell means using previous operation. Check is performed from
left to right, so antidiagonal can pass through 𝐶𝑍 with a same qubit and an aliasing qubit.
|a〉 Z •
|b〉 X • H Z H
(a) Original circuit
|a〉 Z •
|b〉 X • X
(b) Compacting #1






Figure 7. Compacting more than once yields better result.
|a〉 •
|b〉 Z • H
Figure 8. Left compaction will miss the chance of compact-
ing the 𝑍 gate and the 𝐻 gate.
• The instruction is a 𝐶𝑍 gate, and there are no single-qubit
gates on the same qubit or on aliasing qubits; in
this case the instruction can be rotated to the right.
• The instruction is a 𝐶𝑍 gate, and there are no single-qubit
gates on the same qubit or on aliasing qubits,
except that the 𝐶𝑍 gate has exactly one linear offset reference
with 𝑘 = 0 and there is a single-qubit gate on this qubit.
In this case, the instruction will be rotated to the right
along with this single-qubit gate.
This definition of movable instructions guarantees the
programs before and after the rotation are equivalent. We
use the following procedure to rotate one instruction from
left to right:
1. Find the first unmarked movable instruction for which
there exists another instruction to merge or cancel
with at 𝑡 = 1.
2. Mark the chosen instruction, and rotate the instruction
to the right. The instruction is added to the prologue
and the others are added to the epilogue.
3. Perform left compaction on the new loop kernel. Note
that the left-compaction algorithm is modified, so that
merging single-qubit gates or cancelling 𝐶𝑍 gates will
clear the mark.
4. If there is no rotatable instruction, stop the procedure;
otherwise, repeat from step 1.
Corollary 4. If the original loop has only candidates with
𝑡 = 1 and no one-qubit gate merges with itself, this procedure
eliminates all across-loop merging or cancellation. That is, if
we unroll the loop after rotation, the unrolled quantum “circuit”
should be a fixpoint of compaction procedure.
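The classical rotation identity that this procedure builds on can be checked directly by unrolling both forms of the loop. In the sketch below instructions are modelled as labelled pairs such as `("A", i)`; the function names are hypothetical.

```python
def original_loop(m, n):
    """Unrolled instruction sequence of the loop {A_i B_i C_i}, i = m..n."""
    seq = []
    for i in range(m, n + 1):
        seq += [("A", i), ("B", i), ("C", i)]
    return seq

def rotated_loop(m, n):
    """Unrolled sequence after rotating A to the tail of the kernel."""
    seq = [("A", m)]                           # prologue: first A
    for i in range(m, n):                      # kernel runs for i = m..n-1
        seq += [("B", i), ("C", i), ("A", i + 1)]
    seq += [("B", n), ("C", n)]                # epilogue: remaining B and C
    return seq
```

Both functions produce the same instruction sequence for any 𝑛 ⩾ 𝑚, while in the rotated kernel 𝐶 of iteration 𝑖 and 𝐴 of iteration 𝑖 + 1 sit in the same loop body, which is what exposes them to in-loop compaction.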
However, loop rotation can only handle potential gate
merging across one iteration (i.e. between neighbouring iterations). To
handle potential merging across many iterations, we adopt
loop unrolling from classical loop optimization. While the
major objective of loop unrolling is usually to reduce branch
delay, Aiken et al. [2] also used loop unrolling to unroll the first
few iterations of a loop and schedule them ASAP, so that repeating
patterns can be recognized and turned into an optimal software
pipelining schedule. Our approach uses modulo scheduling
instead of kernel recognition, but we can still exploit the
power of loop unrolling to capture patterns that require
many iterations to reveal. The key point is that unrolling
decreases 𝑡. Suppose we use a graph to represent all “candidates
for instruction merging”, with an edge 𝐴 −𝑡→ 𝐵 indicating that
instruction 𝐴 will merge with or cancel out instruction 𝐵
from 𝑡 iterations later. If we unroll the loop 𝐶 times, the
weights of the edges in the graph will decrease.
Example 5. Figure 9 gives an example showing the connection
between the “merging graph” before unrolling and the one
after unrolling: if 𝐶 ⩾ 𝑡 for every edge weight 𝑡, there are
no edges with 𝑡 > 1 after unrolling.
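The effect of unrolling on a single merging edge can be worked out explicitly. The sketch assumes that copy 𝑐 (0 ⩽ 𝑐 < 𝐶) of the unrolled body in new iteration 𝐼 corresponds to original iteration 𝐶·𝐼 + 𝑐 relative to the loop start; the function name is hypothetical.

```python
def unrolled_edges(t, C):
    """Edges replacing one original edge A -t-> B after unrolling by C.

    Returns (source_copy, target_copy, new_weight) for each copy of A.
    Copy c of A in new iteration I is original iteration C*I + c, and its
    partner B lives in original iteration C*I + c + t, i.e. in copy
    (c + t) % C of new iteration I + (c + t) // C.
    """
    return [(c, (c + t) % C, (c + t) // C) for c in range(C)]

def max_weight(t, C):
    return max(weight for _, _, weight in unrolled_edges(t, C))
```

For an original edge of weight 𝑡 = 4, unrolling by 𝐶 = 2 still leaves weight 2, whereas any 𝐶 ⩾ 4 brings every new edge down to weight at most 1 (weight 0 meaning that the pair now sits inside one unrolled iteration).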
There is a tradeoff between generated code length (determined
by 𝐶) and the remaining 𝑡 > 1 edges. For example, if there
is an edge with 𝑡 = 10000, we are not likely to unroll the
loop 10000 times just to merge two single-qubit gates.
Also, for eliminating self-cancelling 𝐶𝑍 gates (i.e. 𝐶𝑍 gates
on a pair of constant qubits), we may want 𝐶 ⩾ 2 and 𝐶
even. In the following discussion we use 𝐶 as a configurable
variable in our algorithm determining the maximal allowed
unrolling factor (and the minimal number of iterations of the loop).
The new unrolled loop will be in the form
𝑓𝑜𝑟 (𝑖 = 𝑚; 𝑖 ⩽ 𝑛; 𝑖 += 𝐶) {𝑜𝑝(𝑘𝑡𝑖 + 𝑏𝑡)}
𝑓𝑜𝑟 (𝑖 = 𝑚′; 𝑖 ⩽ 𝑛; 𝑖++) {𝑜𝑝(𝑘𝑡𝑖 + 𝑏𝑡)}
(3)
and the first loop should be rewritten as
𝑓𝑜𝑟 (𝑖 = 0; 𝑖 ⩽ 𝑛′; 𝑖++) {𝑜𝑝(𝐶𝑘𝑡𝑖 + 𝑏𝑡 + 𝑚𝑘𝑡)}
where 𝑛′ = ⌊(𝑛 − 𝑚 + 1)/𝐶⌋ − 1 and 𝑚′ = 𝐶(𝑛′ + 1) + 𝑚. This step
of transformation makes sure the loop stride is still 1 after
loop unrolling. Note that the term (𝑚𝑘𝑡) appears in every offset
of the loop body. If 𝑚 is unknown, we cannot proceed with our
algorithm. Fortunately, since 𝑚 = 𝑝𝐶 + 𝑞 with 𝑞 = 𝑚 mod 𝐶, we
have 𝐶𝑘𝑡𝑖 + 𝑏𝑡 + 𝑚𝑘𝑡 = 𝐶𝑘𝑡(𝑖 + 𝑝) + 𝑏𝑡 + 𝑞𝑘𝑡, showing that
when the range is unknown, the results of array dependency
depend only on the Euclidean remainder 𝑞 = 𝑚 mod 𝐶. In this
case, we can generate 𝐶 copies of the code, one for each value of 𝑞,
and perform the following parts of the algorithm on each copy.
Figure 9. Example for the QDGs of loop “𝐴 −4→ 𝐵” unrolled
2, 3, 4 and 5 times. Unrolling the loop decreases the edge
weight 𝑡. When 𝐶 = max{𝑡}, all edges will be decreased to
weight 1.
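The index arithmetic of the unrolling step can be sanity-checked with a short sketch. It assumes 𝑛 ⩾ 𝑚, a single reference 𝑞[𝑘𝑖 + 𝑏] per iteration, that copy 𝑐 of the unrolled body uses offset 𝑏 + 𝑘𝑐, and a main-loop bound 𝑛′ = ⌊(𝑛 − 𝑚 + 1)/𝐶⌋ − 1; these are assumptions made for illustration, and all function names are hypothetical.

```python
def original_indices(m, n, k, b):
    """Qubit indices k*i + b touched by the original loop, i = m..n."""
    return [k * i + b for i in range(m, n + 1)]

def unrolled_indices(m, n, C, k, b):
    """Indices touched by a stride-1 unrolled main loop plus remainder loop."""
    n_prime = (n - m + 1) // C - 1         # last iteration of the main loop
    m_prime = C * (n_prime + 1) + m        # first iteration of the remainder
    out = []
    for i in range(0, n_prime + 1):        # main loop, stride 1 from 0
        for c in range(C):                 # C copies of the original body
            b_t = b + k * c                # offset of copy c
            out.append(C * k * i + b_t + m * k)
    for i in range(m_prime, n + 1):        # fewer than C leftover iterations
        out.append(k * i + b)
    return out

def offset_depends_only_on_residue(m, C, k, b, i):
    """Check C*k*i + b + m*k == C*k*(i + p) + b + q*k with m = p*C + q."""
    p, q = divmod(m, C)                    # 0 <= q < C for positive C
    return C * k * i + b + m * k == C * k * (i + p) + b + q * k
```

The two index sequences coincide for every start, end and unrolling factor tried in the check, and the offset identity holds for negative 𝑚 as well, since `divmod` returns the Euclidean remainder for a positive divisor.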
Let us briefly summarize our compilation flow till now:
we compact the loop kernel, unroll the loop by 𝐶 , and rotate
some instructions in the unrolled loop kernel. The unrolling
step may copy the loop by 𝐶 times, and steps after unrolling
(including rotation) will be performed on each copy.
4.2 Modulo scheduling
Our next step is modulo scheduling borrowed from [12]:
1. Find in-loop and loop-carried dependencies.
2. Estimate an initiation interval 𝐼𝐼. For simplicity
we use binary search, with the total instruction count
as the maximum 𝐼𝐼, and use Floyd's algorithm to check
validity.
3. Use Tarjan's algorithm to find strongly connected
components (SCCs) and schedule all SCCs by the in-loop
dependency subgraph.
4. Merge every SCC in DDG into one node, obtaining a
new DDG.
5. Schedule the new DDG by list scheduling.
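Step 2 above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: edges are given as hypothetical tuples `(u, v, min, dif)`, an 𝐼𝐼 is taken to be valid when the graph with edge weights (𝑚𝑖𝑛 − 𝐼𝐼 · 𝑑𝑖𝑓) has no positive-weight cycle, and the check uses a Floyd-style all-pairs longest-path computation.

```python
NEG = float("-inf")


def is_valid_ii(n, edges, ii):
    """Check an initiation interval: no cycle may have positive total
    weight when every edge (u, v, mn, dif) weighs mn - ii * dif."""
    dist = [[NEG] * n for _ in range(n)]
    for u, v, mn, dif in edges:
        dist[u][v] = max(dist[u][v], mn - ii * dif)
    # Floyd-style relaxation, maximizing instead of minimizing.
    for k in range(n):
        for i in range(n):
            if dist[i][k] == NEG:
                continue
            for j in range(n):
                if dist[k][j] == NEG:
                    continue
                cand = dist[i][k] + dist[k][j]
                if cand > dist[i][j]:
                    dist[i][j] = cand
    return all(dist[i][i] <= 0 for i in range(n))


def min_ii(n, edges):
    """Binary search for the smallest valid II in [1, n]; None if even
    II = n (the total instruction count) is invalid."""
    lo, hi = 1, n
    if not is_valid_ii(n, edges, hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_valid_ii(n, edges, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Binary search is applicable because, with 𝑑𝑖𝑓 ⩾ 0, increasing 𝐼𝐼 can only decrease edge weights. For two instructions with an in-loop edge (1, 0) and a loop-carried edge (1, 1) back, `min_ii(2, [(0, 1, 1, 0), (1, 0, 1, 1)])` evaluates to 2.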
There are some major differences between quantum pro-
grams and the classical programs considered in [12]:
4.2.1 Quantum dependency graph. The instruction dependency
for quantum programs is described by a QDG
(Quantum Dependency Graph) as a generalization of the DDG
(Data Dependency Graph), where vertices represent instructions
and edges represent precedence constraints that must
be satisfied while reordering. In modulo scheduling, a
dependency edge is described by two integers: 𝑚𝑖𝑛 and 𝑑𝑖𝑓.
An edge pointing from instruction 𝐴 to instruction 𝐵 with
parameters (𝑚𝑖𝑛, 𝑑𝑖𝑓) means “instruction 𝐵 from 𝑑𝑖𝑓 iterations
later should be scheduled at least 𝑚𝑖𝑛 ticks later than
instruction 𝐴 in this iteration”. Recall from
Sections 3.2 and 3.3 that our dependency is defined by the rules:
1. There are no dependencies between 𝐶𝑍 gates, or be-
tween a 𝐶𝑍 and a diagonal single qubit gate.
2. In-loop dependency: if two offsets are on the same
qubit array and reveal in-loop qubit aliasing, there is
a dependency edge (1, 0) between the corresponding
instructions. To unify with across-loop, we set Δ𝑖 = 0.
3. Across-loop dependency: if two offsets are on the same
qubit array and reveal across-loop qubit aliasing with
Δ𝑖 , there is a dependency edge (1,Δ𝑖) between the
corresponding instructions.
4. Exception on antidiagonal gates: if the qubit (𝑘1𝑖 +𝑏1)
of an antidiagonal gate aliases with one operand 𝑘2𝑖 +
𝑏2 of a 𝐶𝑍 gate and 𝑘1 = 𝑘2, we remove the edge if
there’s no aliasing on the other operand.
5. Exception on single qubit gates: if two single qubit
gates operate on the same qubit array where offsets
(𝑘1𝑖 + 𝑏1) and (𝑘2𝑖 + 𝑏2) alias with each other and
𝑘1 = 𝑘2, we set the dependency edge to
(0, Δ𝑖), that is, 𝑚𝑖𝑛 = 0 rather than 𝑚𝑖𝑛 = 1.
There may be multiple edges in the graph connecting the
same pair of instructions; for example, an in-loop dependency
and an across-loop dependency between the two instructions.
Since we are going to use Floyd's algorithm on the
graph to compute the largest distance in modulo scheduling, we
only need the edge with the maximal (𝑚𝑖𝑛 − 𝐼𝐼 · 𝑑𝑖𝑓) after
assigning 𝐼𝐼. Fortunately we do not need to save all multiple
edges, since the following theorem guarantees that we can
compare (𝑚𝑖𝑛 − 𝐼𝐼 · 𝑑𝑖𝑓) before a concrete 𝐼𝐼 is assigned.
Theorem 5. Suppose (𝑚𝑖𝑛1, 𝑑𝑖 𝑓1), (𝑚𝑖𝑛2, 𝑑𝑖 𝑓2) are two edges
with𝑚𝑖𝑛1 ⩽ 1,𝑚𝑖𝑛2 ⩽ 1 and 𝑑𝑖 𝑓1 > 𝑑𝑖 𝑓2. Then for all 𝐼 𝐼 ⩾ 1,
we have:𝑚𝑖𝑛1 − 𝐼 𝐼 · 𝑑𝑖 𝑓1 ⩽ 𝑚𝑖𝑛2 − 𝐼 𝐼 · 𝑑𝑖 𝑓2.
This theorem allows us to sort multiple edges by lexicographic
ordering of (𝑑𝑖𝑓, −𝑚𝑖𝑛) (i.e. compare 𝑑𝑖𝑓 first, and compare
(−𝑚𝑖𝑛) if 𝑑𝑖𝑓1 = 𝑑𝑖𝑓2); the smallest one is exactly the
edge with maximal (𝑚𝑖𝑛 − 𝐼𝐼 · 𝑑𝑖𝑓).
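The selection rule can be illustrated with a short sketch. The names `dominant_edge` and `edge_weight` are hypothetical, and edges are assumed to be `(min, dif)` pairs with 𝑚𝑖𝑛 ∈ {0, 1} and 𝑑𝑖𝑓 ⩾ 0, as produced by the rules above.

```python
def edge_weight(edge, ii):
    """Weight (min - II * dif) of a dependency edge for a given II."""
    mn, dif = edge
    return mn - ii * dif


def dominant_edge(edges):
    """Pick the edge that is smallest in lexicographic order of
    (dif, -min); by Theorem 5 it has maximal weight for every II >= 1."""
    return min(edges, key=lambda e: (e[1], -e[0]))
```

For the parallel edges `[(1, 0), (1, 3), (0, 0), (0, 2)]` the sketch keeps `(1, 0)`, and comparing against `max(edge_weight(e, ii) for e in edges)` for several values of 𝐼𝐼 gives the same weight each time.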
4.2.2 Resource conflict handling. Another important
issue when inserting an instruction into the modulo scheduling
table or merging two strongly connected components is
resource conflict: there is no dependency between two 𝐶𝑍
gates, yet they may not be executed together because they
may share a qubit. To solve this issue, let us first introduce
several notations:
for x=m to n do
  CNOT q1[x-50], q0[x+0];
  CNOT q1[x-50], q0[x+0];
end for
(a) Loop program.
(b) Corresponding QDG.
Figure 10. Quantum dependency graph example. Tuples
represent (𝑚𝑖𝑛,𝑑𝑖 𝑓 ).
1. 𝐼𝐼 is the current initiation interval being tested.
2. 𝐿 is the length of the original loop kernel.
3. The 𝑐-th instruction in the original loop is placed in
the modulo scheduling table at tick 𝑡 = 𝑝𝐼𝐼 + 𝑞, where
𝑝 ⩾ 0, 0 ⩽ 𝑞 < 𝐼 𝐼 .
Example 6. Figure 11 is a simple example for modulo scheduling.
In this case, 𝐼𝐼 = 2 and 𝐿 = 4. Instructions are placed
at time slots 0, 2, 3, 4. Thus, 𝐴 from one iteration, 𝐵 from the
previous iteration, and 𝐷 from two iterations earlier are executed
simultaneously, while 𝐶 is executed alone.
We use the retrying scheme: if a resource conflict is detected,
try the next tick. The basic approach to detecting a resource
conflict is detecting in-loop qubit aliasing. This leads to two
new problems that do not exist in the classical case:
1. The array offsets of instruction operands may change.
As 𝑡 increases, 𝑝 also increases, so the instruction
comes from one more iteration earlier, which changes
its array offsets.
2. The pair of instructions for resource conflict checking
may not both exist in some iterations. Increasing 𝑡
leads to a long prologue and a long epilogue, shrinking
the range of the loop kernel, and may eliminate a
resource conflict that once existed (when the loop range
is known).
(a) Rescheduled sin[…] (b) […]ing table. Column index represents
original iteration.
Figure 11. Example for modulo scheduling loop 𝐴𝑖𝐵𝑖𝐶𝑖𝐷𝑖.
In this case 𝐼𝐼 = 2, 𝐿 = 4, 𝑇 = [0, 2].
Example 7. Suppose that, when generating the schedule in Figure
11, we have inserted instructions 𝐴, 𝐵 and 𝐶, and are ready to
insert 𝐷 at time slot 4.
1. Since 4 = 2𝐼𝐼 + 0, the 𝐷 in the loop kernel is from two
iterations earlier compared with the iteration that
𝐴 is in. We have to shift the loop variable in the operands
of 𝐷 from 𝑖 to 𝑖 − 2. The shifted index may no longer
conflict with 𝐴.
2. When checking whether there is a resource conflict between 𝐷
and 𝐴, we only need to check the case where both iterations
are valid; that is, 𝑖 = 2. This means the scheduling is still
valid even if 𝐴0 has a resource conflict with 𝐷−2, since
𝐷−2 does not even exist.
In the original modulo scheduling and other classical
scheduling algorithms, the retry strategy only allows 𝐼𝐼
retries. For example, if there are not enough 𝐴𝐿𝑈s or 𝐹𝑃𝑈s for
instruction 𝐴𝑖 at modulo scheduling table tick 𝑞, there are
also not enough resources for instruction 𝐴𝑖−1 from the previous
iteration. However, this is not true in our case, and we have
to modify the strategy.
Example 8. Suppose we perform modulo scheduling on the
program in Figure 12. Since the three 𝐶𝑍s are exactly the same,
we may expect 𝐼𝐼 = 3 due to resource conflict. However, if we
allow more retries, these 𝐶𝑍s can be separated into different
iterations and can be executed concurrently with 𝐶𝑍s from
other iterations.
We consider the general case where the loop range is unknown.
When placing an instruction in the modulo scheduling
table, we check its operands against all operands scheduled
at this tick. Suppose we now check operand (𝑘2(𝑖 − 𝑝2) + 𝑏2)
against operand (𝑘1(𝑖 − 𝑝1) + 𝑏1), and we find an aliasing, that is,
∃𝑖0 ∈ Z, 𝑘2(𝑖0 − 𝑝2) + 𝑏2 = 𝑘1(𝑖0 − 𝑝1) + 𝑏1. In case 𝑘1 = 𝑘2,
∀𝑖 ∈ Z, 𝑘2(𝑖 − 𝑝2) + 𝑏2 = 𝑘1(𝑖 − 𝑝1) + 𝑏1. When 𝑘1 = 0,
for x=0 to 6 do
  CZ q[x], q[x+1];
  CZ q[x], q[x+1];
  CZ q[x], q[x+1];
end for
(a) Original program.
[Circuit diagram: 𝐶𝑍 gates on qubits |q0⟩ to |q7⟩.]
(b) Unrolled program, for a clearer view.
CZ q[0], q[1];
CZ q[1], q[2];
CZ q[0], q[1]; CZ q[2], q[3];
CZ q[1], q[2]; CZ q[3], q[4];
for x=4 to 6 parallel do
  CZ q[x-4], q[x-3];
  CZ q[x-2], q[x-1];
  CZ q[x], q[x+1];
end for
CZ q[3], q[4]; CZ q[5], q[6];
CZ q[4], q[5]; CZ q[6], q[7];
CZ q[5], q[6];
CZ q[6], q[7];
(c) Software pipelined version.
[Circuit diagram: 𝐶𝑍 gates on qubits |q0⟩ to |q7⟩.]
(d) Software pipelined version, unrolled.
Figure 12. Three 𝐶𝑍 gates in a row. Although there seem to be resource conflicts, the minimal 𝐼𝐼 = 1.
this is the same as classical resource scheduling; otherwise,
∀Δ𝑝 ≠ 0, ∀𝑖 ∈ Z, 𝑘2(𝑖 − 𝑝2 − Δ𝑝) + 𝑏2 ≠ 𝑘1(𝑖 − 𝑝1) + 𝑏1. This
means that if we delay the instruction by Δ𝑝 · 𝐼𝐼 ticks, the conflict
will be resolved. We call it a false conflict. In case 𝑘1 ≠ 𝑘2,
after Δ𝑝 · 𝐼𝐼 ticks the instruction falls in the same time slot.
There is still a conflict iff ∃𝑖1 ∈ Z, 𝑘2(𝑖1 − 𝑝2 − Δ𝑝) + 𝑏2 = 𝑘1(𝑖1 − 𝑝1) + 𝑏1;
that is, 𝑖1 = 𝑖0 + Δ𝑝𝑘2/(𝑘2 − 𝑘1), which means (𝑘2 − 𝑘1) | Δ𝑝𝑘2. The
conflict appears periodically as Δ𝑝 increases. However, in
the worst case where (𝑘2 − 𝑘1) | 𝑘2, there is always a conflict,
and this case can be treated as classical resource scheduling.
We call it, together with the case where 𝑘1 = 𝑘2 = 0, a true conflict.
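The case analysis above can be summarized in a small sketch. It is an illustration only: `classify_conflict` is a hypothetical name, the loop range is treated as unknown (all integers), and the two operands are 𝑘1(𝑖 − 𝑝1) + 𝑏1 and 𝑘2(𝑖 − 𝑝2) + 𝑏2 on the same qubit array.

```python
def classify_conflict(k1, b1, p1, k2, b2, p2):
    """Return "none", "false" or "true" for two operands placed in the
    same modulo-scheduling slot."""
    # Aliasing means (k2 - k1) * i0 == c for some integer i0.
    c = k2 * p2 - k1 * p1 + b1 - b2
    dk = k2 - k1
    if dk == 0:
        if c != 0:
            return "none"
        # Same stride: constant qubits always collide; otherwise a delay
        # by any nonzero number of stages separates the operands.
        return "true" if k1 == 0 else "false"
    if c % dk != 0:
        return "none"
    # Different strides: the conflict recurs whenever (k2 - k1) | dp * k2,
    # hence for every delay dp when (k2 - k1) | k2.
    return "true" if k2 % dk == 0 else "false"
```

For the program of Figure 12, `classify_conflict(1, 0, 0, 1, 0, 0)` compares two copies of `q[x]` in the same stage and yields `"false"`, which is why extra retries can separate the three identical 𝐶𝑍s.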
We insert an instruction or an entire schedule into the
modulo scheduling table in the following way: if there is
no conflict, we insert the instructions; if there is only a false
conflict, we try the next tick (as an exception, false conflicts
between two single qubit gates are treated as no conflict);
and if there is a true conflict, we start a “death countdown”
before trying the next tick: if the next (𝐼𝐼 − 1) retries do not
succeed, we give up, as in the classical retry scheme.
4.2.3 Inversion pair correction. The commutativity between
antidiagonal 𝑅+𝑍 gates and 𝐶𝑍 gates comes at the price
of a 𝑍 gate. In the modulo scheduling stage we allowed them
to commute freely, ignoring the generated 𝑍 gates. Now we
have to fill them back in to ensure equivalence. By the term
“inversion”, we mean that our scheduling alters the execution
order of instructions compared with the original ordering:
Definition 7. If the original 𝑐th instruction is modulo-scheduled
at 𝑡 = 𝑝𝐼𝐼 + 𝑞 in the new loop (in the iteration where the 𝑘th
original iteration is issued), we define the absolute order of the
instruction to be
𝑇 = (𝑘 − 𝑝)𝐿 + 𝑐 = 𝑘𝐿 + (𝑐 − 𝑝𝐿).
Example 9. Suppose 𝐿 = 4 and 𝐵 in Figure 11 is the second
instruction in the original loop (𝑐 = 1). 𝐵 is placed in the
modulo scheduling table at 𝑝 = 1 and 𝑞 = 0.
1. The first 𝐵 instruction is issued in the prologue (incom-
plete loop kernel) where the second (𝑘 = 1) iteration
is issued. Thus the absolute order of the instruction is
𝑇 = 1.
2. The second 𝐵 instruction is issued in the loop kernel
where the third (𝑘 = 2) iteration is issued. Thus the
absolute order is 𝑇 = 5.
3. The third 𝐵 instruction is issued in the epilogue (again
incomplete loop kernel) where the fourth (𝑘 = 3) itera-
tion is issued (or, should be issued). The absolute order is
𝑇 = 9.
We see that the absolute order is exactly the time when the
instruction is executed in the original loop.
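The numbers in Example 9 follow directly from Definition 7; the one-line helper below (a hypothetical name, shown only as a sanity check) reproduces them.

```python
def absolute_order(k, c, p, L):
    """Absolute order T = (k - p) * L + c of the c-th instruction that is
    placed p stages deep, executed in the new iteration issuing k."""
    return (k - p) * L + c


# Instruction B of Example 9: c = 1, p = 1, L = 4, issued for k = 1, 2, 3.
orders_of_B = [absolute_order(k, c=1, p=1, L=4) for k in (1, 2, 3)]
print(orders_of_B)  # [1, 5, 9]
```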
Our idea is to check all inversion pairs in the modulo
schedule. There are two kinds of order inversions:
Figure 13. An example of inverted pairs of instructions
across loop iterations.
Definition 8. 1. In-loop inversion: For two instructions
in the 𝑘-iteration of the new schedule (i.e. the iteration
where the 𝑘th iteration of the original loop is issued), if the first
precedes the second while its absolute order succeeds the
absolute order of the second instruction, i.e.
𝑘𝐿 + (𝑐1 − 𝑝1𝐿) > 𝑘𝐿 + (𝑐2 − 𝑝2𝐿), there is an in-loop inversion.
2. Loop-carried inversion: For two instructions in the 𝑘-iteration
and the (𝑘 + 𝑟)-iteration (𝑟 ⩾ 1), if 𝑘𝐿 + (𝑐1 − 𝑝1𝐿) >
(𝑘 + 𝑟)𝐿 + (𝑐2 − 𝑝2𝐿), there is a loop-carried (across-loop) inversion.
Since the 𝑘𝐿 term can be cancelled, inversion pairs in the
modulo schedule also exhibit periodicity. Figure 13 shows an
example with periodic 𝑟 = 1 inversions and 𝑟 = 2 inversions.
Since the term (𝑘 + 𝑟)𝐿 + (𝑐2 − 𝑝2𝐿) increases as 𝑟 increases,
there exists 𝑟0 s.t. ∀𝑟 > 𝑟0 there is no across-loop inversion.
We can increase 𝑟 and find pairs of inversions from iterations
𝑘 and (𝑘 + 𝑟), until there is no inversion pair. After finding
all inversion pairs, we check each pair to see whether one is a 𝐶𝑍
and the other is antidiagonal on one of the 𝐶𝑍's operands. If so,
we add a 𝑍 gate at the tick where the 𝐶𝑍 is placed.
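A possible way to enumerate inversion pairs is sketched below. It is only an illustration of Definition 8: the schedule format `(name, c, p, q)` and the function name are assumptions, gate types are ignored (so the 𝑍-gate insertion itself is not shown), and instructions placed at the same tick are not counted as one preceding the other.

```python
def find_inversions(schedule, L):
    """schedule: list of (name, c, p, q) for a modulo schedule of a loop
    with L instructions. Returns (in_loop, loop_carried) inversion pairs;
    loop-carried entries carry the iteration distance r."""
    # k-independent part (c - p*L) of the absolute order kL + (c - pL).
    rel = {name: c - p * L for name, c, p, q in schedule}
    tick = {name: q for name, c, p, q in schedule}
    names = [entry[0] for entry in schedule]

    in_loop = [(a, b) for a in names for b in names
               if tick[a] < tick[b] and rel[a] > rel[b]]

    # An inversion at distance r needs rel[a] > r*L + rel[b], which is
    # impossible once r*L reaches the spread of the relative orders.
    spread = max(rel.values()) - min(rel.values())
    loop_carried = []
    r = 1
    while r * L < spread:
        loop_carried += [(a, b, r) for a in names for b in names
                         if rel[a] > r * L + rel[b]]
        r += 1
    return in_loop, loop_carried


# Schedule of Figure 11: time slots 0, 2, 3, 4 with II = 2 and L = 4.
figure11 = [("A", 0, 0, 0), ("B", 1, 1, 0), ("C", 2, 1, 1), ("D", 3, 2, 0)]
print(find_inversions(figure11, L=4))
# ([('A', 'C')], [('A', 'D', 1)])
```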
4.2.4 Code generation for kernel, prologue and epi-
logue. We generate prologue and epilogue by removing
non-existing instructions from the loop kernel.
Example 10. Consider the schedule in Figure 11 (recall 𝑇 = [0, 2]).
We enumerate 𝑘 from −∞ to ∞ and look at the new iteration where
the 𝑘th original iteration is issued (or should be issued):
1. For 𝑘 < 0, {𝑘, 𝑘 − 1, 𝑘 − 2} ∩ 𝑇 = Φ, no instruction is
put.
2. For 𝑘 = 0, {𝑘, 𝑘 − 1, 𝑘 − 2} ∩𝑇 = {𝑘}, only 𝐴 is put.
3. For 𝑘 = 1, {𝑘, 𝑘 − 1, 𝑘 − 2} ∩𝑇 = {𝑘, 𝑘 − 1}, 𝐴, 𝐵,𝐶 are
put.
4. For 𝑘 = 2, {𝑘, 𝑘 − 1, 𝑘 − 2} ∩𝑇 = {𝑘, 𝑘 − 1, 𝑘 − 2}. This
is the complete loop kernel.
5. For 𝑘 = 3, {𝑘, 𝑘 − 1, 𝑘 − 2}∩𝑇 = {𝑘 − 1, 𝑘 − 2}, 𝐵,𝐶, 𝐷
are put.
6. For 𝑘 = 4, {𝑘, 𝑘 − 1, 𝑘 − 2} ∩𝑇 = {𝑘 − 2}, 𝐷 is put.
7. For 𝑘 > 4, {𝑘, 𝑘 − 1, 𝑘 − 2} ∩ 𝑇 = Φ, no instruction is
put.
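The enumeration in Example 10 can be reproduced with a short sketch. The representation of the schedule as `(name, p)` pairs and the function names are assumptions made for illustration; the labels assume the range is long enough for a complete kernel to exist.

```python
def issued(stages, k, lo, hi):
    """Instructions present in the new iteration that issues original
    iteration k: an instruction at stage p belongs to iteration k - p,
    which must lie in the original range T = [lo, hi]."""
    return [name for name, p in stages if lo <= k - p <= hi]


def generate(stages, lo, hi):
    """List (k, part, instructions) for every non-empty new iteration."""
    pmin = min(p for _, p in stages)
    pmax = max(p for _, p in stages)
    out = []
    for k in range(lo + pmin, hi + pmax + 1):
        instrs = issued(stages, k, lo, hi)
        if len(instrs) == len(stages):
            part = "kernel"
        elif k < lo + pmax:
            part = "prologue"
        else:
            part = "epilogue"
        out.append((k, part, instrs))
    return out


# Figure 11: A at stage 0, B and C at stage 1, D at stage 2; T = [0, 2].
for row in generate([("A", 0), ("B", 1), ("C", 1), ("D", 2)], 0, 2):
    print(row)
```

The five printed rows correspond to items 2 to 6 of the example: two prologue iterations, one complete kernel, and two epilogue iterations.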
For prologue and epilogue, we have to remove instructions
from iterations that do not exist; for extra 𝑍 gates from the
inversion of a 𝐶𝑍 and an antidiagonal, removing either gate
will make the 𝑍 gate disappear. After removing non-existing
instructions, we perform compaction and ASAP schedule on
the two parts.
For loop kernel, we need to merge the single qubit gates
on the same qubit in the same time slot (from the resource
conflict exception) by their absolute order.
4.3 Modulo scheduling again
In the first round of modulo scheduling, inversion of 𝐶𝑍
and antidiagonal gates may introduce 𝑍 gates overlapping
𝐶𝑍s, resulting in an illegal schedule. To generate an executable
schedule, we perform modulo scheduling again, but this time
we no longer allow “commutativity” between antidiagonals
and 𝐶𝑍s, and thus the inversion-fix step can be skipped. The
loop scheduled by this second round of modulo scheduling
is directly executable on the device.
[An analysis on the complexity of our algorithm presented
in this section is given in Appendix K.]
5 Evaluation
We have implemented our method and carried out exper-
iments on several quantum programs. Some of them are
intrinsically parallel, while others are not. Baselines for our
evaluation come from the following sources:
• Kernel-ASAP performs compaction and ASAP sched-
uling on the loop kernel. We expect our work to out-
perform this naive approach.
• Unroll unrolls the loop and performs compaction as
well as ASAP scheduling on the unrolled circuit. The
software-pipelined version should generate a program
with similar depth but much smaller code size.
• Cirq uses the optimization passes in [22] to unroll the
loop. This gives another perspective of loop unrolling
besides our implementation.
The experiment results are in Table 2. We hereby analyze
some of the important examples:
5.1 Grover Search
Grover search is a test case with a long dependency chain and
little space for optimization. Yet our approach can reduce
the overall depth by merging adjacent gates within an iteration
and across iterations. We use the 𝐶𝐶𝑁𝑂𝑇 case from [6] and the
Sudoku solver from [4]. Since Grover search is a hard-to-optimize
case, we inspected the optimized code and made the
following findings:
Although the examples do not reveal much optimization chance,
there is a pitfall for ASAP optimizers that may cause a
diagonal 𝑇† gate to be scheduled at the first tick alone. This
               Input Loop          Output Loop         Known range results
Test case      ASAP  C  C-ASAP     Pre    K   Post     #Iter  K-ASAP  Unroll   Cirq  QSP#Iter    QSP
Cluster           4  2       5       4    1      4       200     800     203    203        96    104
Array 1           5  2      10       8    4      5       100     500     500    500        48    205
Array 2           3  2       5       4    1      4       100     300     201    201        46     54
Array 3          11  2      17      12   12     17       100    1100     605    606        48    605
Grover 1         13  2      26      26   24    871        99    1287    1287   1288        15   1257
Grover 2         71  2     141     141  135  40881      1000   71000   70001  71001       207  68967
QAOA-Hard 1      21  2      41      41   40   2021      1001   21021   20021  20021       449  20022
QAOA-Hard 2      21  2      41      41   40   2061      1001   21021   20021  20021       448  20022
QAOA-Hard 3      16  2      27      41   18   1121      1001   16016   11016  11016       448   9226
QAOA-Hard 4      33  2      47      60   31   3882      1000   33000   14019  14019       360  15102
QAOA-Par 1       15  2      26      46   20    943       201    3015    2215   2215        56   2109
QAOA-Par 2       15  2      26      45   20   1009       201    3015    2215   2215        53   2114
QAOA-Par 3       18  2      29      43   18   1080       201    3618    2218   2218        50   2023
QAOA-Par 4       15  2      29      29   25   3668      1000   15000   14001  14001       368  12897
Table 2. Evaluation results. ASAP is the minimal depth of the original loop body. 𝐶-ASAP is the minimal depth of the original
loop body unrolled 𝐶 times. Pre, K and Post represent prologue, kernel and epilogue. For each test case a range sized #Iter









(b) New program by accidental inversion of two 𝐶𝑍s, depth=2.
Figure 14. The accidental inversion of 𝐶𝑍s reduced kernel
depth by 1.
is prevented in our approach by performing bidirectional
compactions. Moreover, the depth cut mainly comes from the
inversion of a pair of 𝐶𝑍s while scheduling, which our
approach indeed does not consider (see Figure 14). This inspires us
to find more optimization chances while placing instructions
without dependency, as in a program with many 𝐶𝑍s.
5.2 QAOA
The QAOA programs in [8] (in Figure 15), as well as the
QAOA example in [22], are used in our experiment, but with
a 𝑝 (i.e. the number of iterations) large enough. Since the
decomposition of QAOA into gates affects how it can be
optimized on our architecture, we consider two different ways:
QAOA-Par, where QAOA is decomposed to expose more
commutativity (see the details in Appendix J), and QAOA-Hard,
where QAOA is decomposed into a harder form, with
a long dependency chain formed by cross-qubit operations
that cannot be detected by gate-level optimizers.
(a) (b) (c)
Figure 15. QAOA-MaxCut examples in [8].
The evaluation results in Table 2 show that in all cases,
our approach can reduce the loop kernel size compared with
Kernel-ASAP, and can sometimes outperform the unrolling
results. This advantage is more evident in the QAOA-Par
cases than in the QAOA-Hard cases, since QAOA-Par reveals
more commutativity chances than QAOA-Hard. Another
finding is that QAOA-Hard generates larger code than QAOA-Par,
and thus requires more iterations for software pipelining
to take effect.
[More discussions on examples are in Appendix M.]
6 Conclusion
We proposed a compilation flow for optimizing quantum
programs with control flow of for-loops. In particular, data
dependencies and resource dependencies are redefined to
expose more chances for optimization algorithms. Our
approach is tested against several important quantum
algorithms, revealing code-size advantages over the existing
approaches while keeping depth close to that of loop unrolling.
Yet there is still room for optimization of more complex quantum
programs, on different architectures, and with lower
complexity, which could be addressed in future work.
References
[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006.
Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-
Wesley Longman Publishing Co., Inc., USA.
[2] Alexander Aiken and Alexandru Nicolau. 1988. Optimal Loop Paral-
lelization. Technical Report. 308–317 pages. https://doi.org/10.1145/
53990.54021
[3] Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan.
1995. Software Pipelining. ACM Comput. Surv. 27, 3 (1995), 367–432.
https://doi.org/10.1145/212094.212131
[4] Abraham Asfaw, Luciano Bello, Yael Ben-Haim, Sergey Bravyi,
Nicholas Bronn, Lauren Capelluto, Almudena Carrera Vazquez, Jack
Ceroni, Richard Chen, Albert Frisch, Jay Gambetta, Shelly Garion,
Leron Gil, Salvador De La Puente Gonzalez, Francis Harkins, Takashi
Imamichi, David McKay, Antonio Mezzacapo, Zlatko Minev, Ramis
Movassagh, Giacomo Nannicni, Paul Nation, Anna Phan, Marco Pis-
toia, Arthur Rattew, Joachim Schaefer, Javad Shabani, John Smolin,
Kristan Temme, Madeleine Tod, Stephen Wood, and James Woot-
ton. 2020. Learn Quantum Computation Using Qiskit. http:
//community.qiskit.org/textbook
[5] Adi Botea, Akihiro Kishimoto, and Radu Marinescu. 2018. On the
Complexity of Quantum Circuit Compilation. In Proceedings of the
Eleventh International Symposium on Combinatorial Search, SOCS 2018,
Stockholm, Sweden - 14-15 July 2018, Vadim Bulitko and Sabine Storandt
(Eds.). AAAI Press, 138–142. https://aaai.org/ocs/index.php/SOCS/
SOCS18/paper/view/17959
[6] Patrick J. Coles, Stephan J. Eidenbenz, Scott Pakin, Adetokunbo Ade-
doyin, John Ambrosiano, Petr M. Anisimov, William Casper, Gopinath
Chennupati, Carleton Coffrin, Hristo Djidjev, David Gunter, Satish
Karra, Nathan Lemons, Shizeng Lin, Andrey Y. Lokhov, Alexander Ma-
lyzhenkov, David Dennis Lee Mascarenas, Susan M. Mniszewski, Balu
Nadiga, Dan O’Malley, Diane Oyen, Lakshman Prasad, Randy Roberts,
Philip Romero, Nandakishore Santhi, Nikolai Sinitsyn, Pieter Swart,
Marc Vuffray, Jim Wendelberger, Boram Yoon, Richard J. Zamora, and
Wei Zhu. 2018. Quantum Algorithm Implementations for Beginners.
CoRR abs/1804.03719 (2018). arXiv:1804.03719 http://arxiv.org/abs/
1804.03719
[7] Leonardo de Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT
Solver. In Tools and Algorithms for the Construction and Analysis of
Systems, C. R. Ramakrishnan and Jakob Rehof (Eds.). Springer Berlin
Heidelberg, Berlin, Heidelberg, 337–340.
[8] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A Quantum
Approximate Optimization Algorithm. arXiv:quant-ph/1411.4028
[9] Lov K. Grover. 1996. A Fast Quantum Mechanical Algorithm for
Database Search. In Proceedings of the Twenty-Eighth Annual ACM
Symposium on Theory of Computing (Philadelphia, Pennsylvania, USA)
(STOC ’96). Association for Computing Machinery, New York, NY, USA,
212–219. https://doi.org/10.1145/237814.237866
[10] Gian Giacomo Guerreschi and Jongsoo Park. 2018. Two-step approach
to scheduling quantum circuits. Quantum Science and Technology 3, 4
(Jul 2018), 045003. https://doi.org/10.1088/2058-9565/aacf0b
[11] Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, Alexey
Lvov, Frederic T. Chong, and Margaret Martonosi. 2015. ScaffCC:
Scalable compilation and analysis of quantum programs. Parallel
Comput. 45 (2015), 2–17. https://doi.org/10.1016/j.parco.2014.12.001
[12] Monica S. Lam. 1988. Software Pipelining: An Effective Scheduling
Technique for VLIW Machines. In Proceedings of the ACM SIGPLAN’88
Conference on Programming Language Design and Implementation
(PLDI), Atlanta, Georgia, USA, June 22-24, 1988, Richard L. Wexelblat
(Ed.). ACM, 318–328. https://doi.org/10.1145/53990.54022
[13] Prakash Murali, David C. McKay, Margaret Martonosi, and Ali Javadi-
Abhari. 2020. Software Mitigation of Crosstalk on Noisy Intermediate-
Scale Quantum Computers. In ASPLOS ’20: Architectural Support for
Programming Languages and Operating Systems, Lausanne, Switzerland,
March 16-20, 2020 [ASPLOS 2020 was canceled because of COVID-19],
James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1001–1016.
https://doi.org/10.1145/3373376.3378477
[14] Michael A. Nielsen and Isaac L. Chuang. 2011. Quantum Computa-
tion and Quantum Information: 10th Anniversary Edition (10th ed.).
Cambridge University Press, USA.
[15] Bill Pottenger. [n.d.]. Loop Rotation. http://polaris.cs.uiuc.edu/projects/
rec/node8.html
[16] Robert Raussendorf, Daniel E Browne, and Hans J Briegel. 2003.
Measurement-based quantum computation on cluster states. Physical
review A 68, 2 (2003), 022312.
[17] Vivek V. Shende, Stephen S. Bullock, and Igor L. Markov. 2006. Synthe-
sis of quantum-logic circuits. IEEE Trans. on CAD of Integrated Circuits
and Systems 25, 6 (2006), 1000–1010. https://doi.org/10.1109/TCAD.
2005.855930
[18] Yunong Shi, Nelson Leung, Pranav Gokhale, Zane Rossi, David I. Schus-
ter, Henry Hoffmann, and Frederic T. Chong. 2019. Optimized Compi-
lation of Aggregated Instructions for Realistic Quantum Computers.
In Proceedings of the Twenty-Fourth International Conference on Archi-
tectural Support for Programming Languages and Operating Systems,
ASPLOS 2019, Providence, RI, USA, April 13-17, 2019, Iris Bahar, Maurice
Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 1031–1044.
https://doi.org/10.1145/3297858.3304018
[19] Seyon Sivarajah, Silas Dilkes, Alexander Cowtan, Will Simmons, Alec
Edgington, and Ross Duncan. 2020. t|ket⟩: a retargetable compiler
for NISQ devices. Quantum Science and Technology 6, 1 (nov 2020),
014003. https://doi.org/10.1088/2058-9565/ab8e92
[20] Robert S. Smith, Michael J. Curtis, and William J. Zeng. 2016. A
Practical Quantum Instruction Set Architecture. CoRR abs/1608.03355
(2016). arXiv:1608.03355 http://arxiv.org/abs/1608.03355
[21] Bochen Tan and Jason Cong. 2020. Optimal Layout Synthesis for
Quantum Computing. arXiv:cs.AR/2007.15671
[22] Quantum AI team and collaborators. 2020. Cirq. https://doi.org/10.
5281/zenodo.4062499
[23] Qiskit Development Team. [n.d.]. Qiskit Terra basic sched-
ulers. https://qiskit.org/documentation/stubs/qiskit.scheduler.
methods.basic.html#module-qiskit.scheduler.methods.basic
[24] Mingsheng Ying. 2009. Commutativity between CNOT and one-qubit
gates (Unpublished notes). (2009).
[25] Mingsheng Ying. 2016. Foundations of Quantum Programming. Mor-
gan Kaufmann, Boston. https://doi.org/10.1016/B978-0-12-802306-
8.00002-1
A Basic quantum gates
The following are the frequently-used one-qubit gates repre-
sented in 2 × 2 unitary matrices:

























































































are universal for quantum computing; that is, they can be
used to construct arbitrary quantum gates of any size.
Besides the above, we will use the following auxiliary gates
to simplify the presentation of our approach:
𝑅−𝑋 (𝛼) =
[
cos \2 −𝑖 sin
\
2
























= 𝑅𝑍 (𝛼)𝐻𝑍 .
Note that parameter 𝛼 in the above gates is a real number.
The 𝑅+𝑍(𝛼) gate can represent all single qubit gates that are
anti-diagonal, i.e. only anti-diagonal entries are nonzero. The
other three notations are used in Appendix I.
For real-world quantum computers, a quantum device
may only support a discrete or continuous set of single qubit
gates while keeping the device universal. For example, IBM's
devices allow the following three kinds of single qubit gates
















𝑈3(𝜃, 𝜙, 𝜆) =
[ cos(𝜃/2)           −e^{𝑖𝜆} sin(𝜃/2)      ]
[ e^{𝑖𝜙} sin(𝜃/2)    e^{𝑖(𝜆+𝜙)} cos(𝜃/2)  ]
Note that 𝑈2(𝜙, 𝜆) = 𝑈3(𝜋/2, 𝜙, 𝜆) and 𝑈1(𝜆) = 𝑈3(0, 0, 𝜆).
Also note that gate 𝑈3 itself is universal for single-qubit
gates, and the main reason for supporting 𝑈1 and 𝑈2 is to
mitigate error, which is beyond our consideration.
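The two identities can be checked numerically. The sketch below uses only the Python standard library; the matrix for 𝑈3 follows the definition above, while the explicit forms of 𝑈1 and 𝑈2 are the commonly used conventions and are an assumption here.

```python
import cmath
import math


def u3(theta, phi, lam):
    """U3(theta, phi, lambda) as a 2x2 nested list."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -cmath.exp(1j * lam) * s],
            [cmath.exp(1j * phi) * s, cmath.exp(1j * (lam + phi)) * c]]


def u2(phi, lam):
    """Assumed convention: U2 = (1/sqrt 2) [[1, -e^{i lam}], [e^{i phi}, e^{i(phi+lam)}]]."""
    r = 1 / math.sqrt(2)
    return [[r, -r * cmath.exp(1j * lam)],
            [r * cmath.exp(1j * phi), r * cmath.exp(1j * (phi + lam))]]


def u1(lam):
    """Assumed convention: U1 = diag(1, e^{i lam})."""
    return [[1, 0], [0, cmath.exp(1j * lam)]]


def close(a, b, tol=1e-12):
    """Entry-wise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


print(close(u3(math.pi / 2, 0.3, 1.1), u2(0.3, 1.1)))  # True
print(close(u3(0, 0, 1.1), u1(1.1)))                   # True
```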
B More Examples for quantum loop
programs
We hereby present more quantum algorithms that can be
written as quantum loop programs and can thus potentially
be optimized by our approach.
B.1 One-way quantum computing
The preparation circuit for simulating one-way quantum
computation on a quantum circuit is another example that allows
each iteration to be performed on different qubits.
Example 11. One-way quantum computing 𝑄𝐶C [16] is a
quantum computing scheme that is quite different from the
commonly used quantum-circuit based schemes. Instead of
starting from |0⟩, 𝑄𝐶C initializes all qubits (on a 2-dimensional
qubit grid) in a highly-entangled state, called the cluster state.
After the preparation step, 𝑄𝐶C performs single-qubit
measurements on all qubits and extracts the computation result
from these measurement outcomes.
To simulate one-way quantum computing with a quantum
circuit, we first need to prepare the cluster state from |0⟩. This
can be done by first performing Hadamard gates on all qubits,
then performing a 𝐶𝑍 gate on each pair of adjacent qubits on
the qubit grid.
The preparation circuit can be written in a nested loop man-
ner. If we assume the grid has a fixed width (3 in our case), we




𝐶𝑍 [𝑞 [0], 𝑞[1]]
𝐶𝑍 [𝑞 [1], 𝑞[2]]
for i=1 to (L-1) do
𝐻 [𝑞 [3𝑖]]
𝐻 [𝑞 [3𝑖 + 1]]
𝐻 [𝑞 [3𝑖 + 2]]
𝐶𝑍 [𝑞 [3𝑖], 𝑞[3𝑖 + 1]]
𝐶𝑍 [𝑞 [3𝑖 + 1], 𝑞[3𝑖 + 2]]
𝐶𝑍 [𝑞 [3𝑖], 𝑞[3𝑖 − 3]]
𝐶𝑍 [𝑞 [3𝑖 + 1], 𝑞[3𝑖 − 2]]
𝐶𝑍 [𝑞 [3𝑖 + 2], 𝑞[3𝑖 − 1]]
end for
Figure 16 shows the gates and qubits involved in each iter-
ation where 𝐿 = 5. The optimization of this program will be
discussed in Appendix M.
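As a sanity check of the loop form, the sketch below expands the program into a flat gate list and compares its 𝐶𝑍 gates with the adjacency of a width-3 grid. It rests on assumptions: the part of the program before the loop is taken to apply 𝐻 to 𝑞[0..2] followed by the two 𝐶𝑍s shown, qubit (row, col) is stored at index 3·row + col, and the vertical 𝐶𝑍s act within the single array 𝑞.

```python
def cluster_program(L):
    """Flat gate list of the cluster-state preparation loop (width 3)."""
    gates = [("H", 0), ("H", 1), ("H", 2), ("CZ", 0, 1), ("CZ", 1, 2)]
    for i in range(1, L):
        gates += [
            ("H", 3 * i), ("H", 3 * i + 1), ("H", 3 * i + 2),
            ("CZ", 3 * i, 3 * i + 1), ("CZ", 3 * i + 1, 3 * i + 2),
            ("CZ", 3 * i, 3 * i - 3),
            ("CZ", 3 * i + 1, 3 * i - 2),
            ("CZ", 3 * i + 2, 3 * i - 1),
        ]
    return gates


def grid_edges(L, width=3):
    """Unordered pairs of adjacent qubits on an L-row, width-column grid."""
    edges = set()
    for row in range(L):
        for col in range(width):
            q = width * row + col
            if col + 1 < width:
                edges.add(frozenset((q, q + 1)))
            if row + 1 < L:
                edges.add(frozenset((q, q + width)))
    return edges


cz_pairs = [frozenset(g[1:]) for g in cluster_program(5) if g[0] == "CZ"]
print(len(cz_pairs), set(cz_pairs) == grid_edges(5))  # 22 True
```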
B.2 Quantum Approximate Optimization
Algorithm
Example 12. Quantum Approximate Optimization Algorithm
(QAOA) [8] can be used to solve MaxSat problems, for example,
MaxCut problems on 3-regular graphs, say 𝐺 = ⟨𝑉, 𝐸⟩. QAOA
performs quantum computation and classical computation al-





𝑈 (𝐵, 𝛽𝑖 )𝑈 (𝐶,𝛾𝑖 ) |+⟩ (5)
where:









𝑈(𝐵, 𝛽𝑖) = ∏_{𝑗=0}^{𝑛−1} 𝑅𝑋(𝛽𝑖, 𝑗). (7)
The sets of parameters {𝛽𝑖} and {𝛾𝑖} are computed in the
classical computation between every two quantum epochs. This
requires the optimizer to support compilation of the circuit
above without knowing all parameters in advance.
𝑈(𝐵, 𝛽𝑖) are products of Pauli 𝑋 rotations on all qubits. Since







[circuit on |b⟩: ⊕, 𝑅𝑍(−𝜔𝑎𝑏𝛾𝑖), ⊕],    (8)
we can define parametric gate arrays 𝑈𝐶 [𝑖] = 𝑅𝑋 (𝛽𝑖 , 𝑗) and
𝑈𝐵 [𝑖] = 𝑅𝑍 (−𝜔𝑎𝑏𝛾𝑖 ), and the QAOA quantum part can be
written as a parametric quantum loop program:
for i=0 to (N-1) do
𝐻 [𝑞 [𝑖]]
end for
for i=1 to p do
for (𝑎, 𝑏) ∈ 𝐸 do
𝐶𝑁𝑂𝑇 [𝑞 [𝑎], 𝑞[𝑏]]
𝑈𝐵 [𝑖] [𝑞 [𝑏]]
𝐶𝑁𝑂𝑇 [𝑞 [𝑎], 𝑞[𝑏]]
end for
for j=0 to (N-1) do
𝑈𝐶 [ 𝑗] [𝑞 [ 𝑗]]
end for
end for
The two nested loops can be fully unrolled by hand, and
the resulting loop satisfies our requirements for optimization.
C Output language
If the input range of the loop program is unknown, we may
have to add guard statements to the original program, for
example, when we want to check whether the range is large enough
for us to use the software-pipelined version. Features
such as guard statements, unfortunately, are not supported
in our definition of the input language. So we have to define the
following language for the optimization result:
program :=header statement∗
header :=[(qdef | udef)∗]
qdef :=𝑞𝑢𝑏𝑖𝑡 ident[N];
udef :=𝑑𝑒 𝑓 𝑔𝑎𝑡𝑒 ident[N] = gate;
gate :=[(C2×2)∗] | 𝑅𝑍 | 𝑅+𝑍 | 𝑈𝑛𝑘𝑛𝑜𝑤𝑛
gateref :=ident[expr]
qubit :=ident[expr]
op :=𝑆𝑄 (gateref) qubit;
| 𝐶𝑍 qubit, qubit;
statement :=op






expr :=ident | 𝑒𝑥𝑝𝑟 + 𝑒𝑥𝑝𝑟 | 𝑒𝑥𝑝𝑟 − 𝑒𝑥𝑝𝑟
| 𝑒𝑥𝑝𝑟 ∗ 𝑒𝑥𝑝𝑟 | 𝑒𝑥𝑝𝑟/𝑒𝑥𝑝𝑟 | 𝑒𝑥𝑝𝑟%𝑒𝑥𝑝𝑟 | Z
compare :=expr ordering expr
ordering := == | ! = | > | < | >= | <=
The main differences between the input language and the
output language are:
1. The 𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 notation is added to explicitly point out
which instructions are scheduled together.
2. The 𝑔𝑢𝑎𝑟𝑑 statement is added to check whether the
input range is suitable for the software-pipelined version
if the range is unknown at compilation time, and
to separate cases with different (𝑚 mod 𝐶). The 𝑔𝑢𝑎𝑟𝑑
statement executes the first statement block with a
satisfied guard condition.
Figure 16. Converting cluster state preparation circuit into loop program. Fig (a) is a 3 × 5 two-dimensional qubit network.
The preparation is done by performing a layer of Hadamard gates (Fig (b)) and a layer of𝐶𝑍 gates (Fig (c)). One way to perform
those 𝐶𝑍 gates without qubit conflict is to split them into four non-overlapping groups and execute each group separately, as
in Fig (d) to Fig (g). The procedure can also be written into loop program, as in Fig (h) to Fig (l).
3. The 𝑒𝑥𝑝𝑟 allows for more general indexing into qubit
arrays and gate arrays. Note that the division and modulo
operators are Euclidean, i.e. it always holds that
    𝑠𝑖𝑔𝑛(𝑎%𝑏) = 𝑠𝑖𝑔𝑛(𝑏),
    𝑎%𝑏 + (𝑎/𝑏) ∗ 𝑏 = 𝑎.        (9)
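As an illustration, Python's built-in `//` and `%` operators satisfy both conditions of (9) when the sign condition is read as applying to nonzero remainders (an assumption of this sketch), whereas C-style truncating division does not for negative operands. The helper names are hypothetical.

```python
def sign(x):
    """Sign of an integer: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def satisfies_eq9(a, b, q, r):
    """Check a == q*b + r and sign(r) == sign(b) for nonzero r."""
    return q * b + r == a and (r == 0 or sign(r) == sign(b))


def trunc_divmod(a, b):
    """C-style division: the quotient is rounded toward zero."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b


print(divmod(-7, 3), satisfies_eq9(-7, 3, *divmod(-7, 3)))
# (-3, 2) True
print(trunc_divmod(-7, 3), satisfies_eq9(-7, 3, *trunc_divmod(-7, 3)))
# (-2, -1) False
```

For the positive divisor 𝐶 used in (𝑚 mod 𝐶), this convention always yields a remainder in [0, 𝐶 − 1], including for negative 𝑚.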
D Solving Diophantine equations
In this appendix we focus on solving the Diophantine equa-
tion:
(𝑘2 −𝑘1)𝑖 +𝑘2 (Δ𝑖) = 𝑏1 −𝑏2, 𝑖 ∈ 𝑇, 𝑖 +Δ𝑖 ∈ 𝑇,Δ𝑖 ⩾ 1. (10)
We rewrite it into:
𝑎𝑥 + 𝑏𝑦 = 𝑐, 𝑥 ∈ 𝑇, 𝑥 + 𝑦 ∈ 𝑇,𝑦 ⩾ 1. (11)
We recall the solutions 𝑆 for linear Diophantine equations
with two variables:
Lemma 1. Solutions for linear Diophantine equations
with two variables
𝑎𝑥 + 𝑏𝑦 = 𝑐, 𝑥 ∈ Z, 𝑦 ∈ Z. (12)
1. If 𝑎 = 0 and 𝑏 = 0, 𝑆 = Φ if 𝑐 ≠ 0 and 𝑆 = Z × Z if
𝑐 = 0.
2. If 𝑎 = 0 but 𝑏 ≠ 0 (similar for 𝑏 = 0 but 𝑎 ≠ 0),






b. Otherwise, 𝑆 = Φ.
3. If 𝑎 ≠ 0 and 𝑏 ≠ 0:
a. If 𝑐 = 𝑑 · 𝑔𝑐𝑑 (𝑎, 𝑏),
• Special solution (𝑥0, 𝑦0) where
𝑎𝑥0 + 𝑏𝑦0 = 𝑔𝑐𝑑 (𝑎, 𝑏) (13)










𝑎𝑥 + 𝑏𝑦 = 0 (14)
is known.












We rewrite the equation into:
𝑆 = {(𝑥0 + 𝑘Δ𝑥,𝑦0 + 𝑘Δ𝑦) |𝑘 ∈ Z} . (16)
b. Otherwise, 𝑆 = Φ.
For our original question with constraints, we only consider
the cases where 𝑎 ≠ 0 and 𝑏 ≠ 0.
When 𝑇 = Z, the constraints no longer exist and we only
need to find the minimal positive integer in the set {𝑦0 + 𝑘Δ𝑦},
which can be found by a Euclidean division. Without loss of
generality, we can just let 𝑘 = 0 by choosing 𝑦0 to be exactly
the smallest positive integer in {𝑦0 + 𝑘Δ𝑦} and adjusting 𝑥0
accordingly, without affecting the solution set 𝑆.
When 𝑇 = [𝑎, 𝑏] (here 𝑎 and 𝑏 denote the endpoints of 𝑇 ,
not the coefficients of Equation 11), the corresponding 𝑥0 may
not lie in 𝑇 . In this case we look for the next smallest
positive integer. Without loss of generality we assume Δ𝑦 > 0
(otherwise replace (Δ𝑥, Δ𝑦) with (−Δ𝑥, −Δ𝑦)). Then the
problem becomes: find the minimal 𝑘 ∈ 𝑁+ s.t.
\[ \begin{cases} x_0 + k\Delta x \geq a \\ x_0 + k\Delta x \leq b \end{cases} \tag{17} \]
which is equivalent to
\[ \begin{cases} k\Delta x \geq a - x_0 \\ k\Delta x \leq b - x_0 \end{cases} \tag{18} \]
which can thus be solved by a routine calculation: a minimal
𝑘 exists, or does not exist at all.
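The procedure of this appendix can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: the function names `ext_gcd` and `min_positive_step` and the parameters `lo`, `hi` (the endpoints of 𝑇) are assumptions made for this example, and the final search is a plain linear scan over 𝑘 instead of the closed-form calculation mentioned above. The scan also checks the constraint 𝑥 + 𝑦 ∈ 𝑇 from Equation 11.

```python
def ext_gcd(a, b):
    """Extended Euclid: return (g, x, y) with a*x + b*y == g == gcd(a, b) >= 0."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    if old_r < 0:  # normalise the sign of the gcd
        old_r, old_x, old_y = -old_r, -old_x, -old_y
    return old_r, old_x, old_y


def min_positive_step(a, b, c, lo, hi):
    """Return (x, y) with a*x + b*y == c, y >= 1 minimal, and both x and
    x + y inside [lo, hi]; return None if no such pair exists.
    Assumes a != 0 and b != 0, as in the text."""
    g, x0, y0 = ext_gcd(a, b)
    if c % g != 0:
        return None                      # case 3b: no integer solution
    d = c // g
    x0, y0 = d * x0, d * y0              # special solution of a*x + b*y == c
    dx, dy = b // g, -(a // g)           # generator of the homogeneous solutions
    if dy < 0:                           # make dy positive without loss of generality
        dx, dy = -dx, -dy
    k = -((y0 - 1) // dy)                # ceil((1 - y0) / dy)
    x0, y0 = x0 + k * dx, y0 + k * dy    # y0 is now the smallest positive y
    # y can be at most hi - lo when both x and x + y lie in [lo, hi]
    while y0 <= hi - lo:
        if lo <= x0 <= hi and lo <= x0 + y0 <= hi:
            return x0, y0
        x0, y0 = x0 + dx, y0 + dy
    return None
```

For instance, with 𝑘1 = 1, 𝑘2 = 2, 𝑏1 = 3, 𝑏2 = 0 in Equation 10, the call `min_positive_step(1, 2, 3, 0, 10)` looks for the smallest Δ𝑖 on the range [0, 10].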
E Proof of Theorem 1 (CZ conjugation rules)
In this section we give the proof of our new rules of instruction
data dependency. We will show that our definition of
dependency is “sufficient and necessary” for quantum gate
sets using 𝐶𝑍 .
We first restate Theorem 1 as follows:
𝐶𝑍𝑈𝐴𝑈𝐵𝐶𝑍 = 𝑉𝐴𝑉𝐵,
if and only if 𝑈𝐴 and 𝑈𝐵 are diagonal or anti-diagonal. That
is, $U_i = R_Z(\theta)$ or $U_i = R_Z^{+}(\theta)$ for 𝑖 ∈ {𝐴, 𝐵}.
Proof. We here introduce our methodology of proving quantum
gate algebra equations: first we derive a necessary condition
by trying several input states, and then we show that the
condition is also sufficient for the equation to hold.
The first lemma is a criterion for deciding whether a state
is separable or entangled:
Lemma 2. Two-qubit state |𝜓 ⟩ = (𝑎, 𝑏, 𝑐, 𝑑)𝑇 is separable if
and only if:
𝑎𝑑 − 𝑏𝑐 = 0. (19)
Proof. (Necessity) If |𝜓 ⟩ is separable, there exist two
single-qubit states |𝜓1⟩ and |𝜓2⟩, s.t.
|𝜓 ⟩ = |𝜓1⟩ ⊗ |𝜓2⟩ (20)
Suppose
|𝜓1⟩ = (𝛼1, 𝛽1)𝑇 , (21)
|𝜓2⟩ = (𝛼2, 𝛽2)𝑇 , (22)
We have
|𝜓 ⟩ = (𝛼1𝛼2, 𝛼1𝛽2, 𝛽1𝛼2, 𝛽1𝛽2)𝑇 , (23)
and it can be easily verified that 𝑎𝑑 − 𝑏𝑐 = 0.
(Sufficiency) If
|𝜓 ⟩ = (𝑎, 𝑏, 𝑐, 𝑑)𝑇 (24)
with 𝑎𝑑 − 𝑏𝑐 = 0,
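1. If 𝑏 = 0, this indicates 𝑎 = 0 or 𝑑 = 0. If 𝑎 = 0, let
\[ \begin{cases} |\psi_1\rangle = |1\rangle \\ |\psi_2\rangle = c|0\rangle + d|1\rangle \end{cases}; \tag{25} \]
otherwise 𝑑 = 0, and let
\[ \begin{cases} |\psi_1\rangle = a|0\rangle + c|1\rangle \\ |\psi_2\rangle = |0\rangle \end{cases}. \tag{26} \]
2. If 𝑐 = 0, this indicates 𝑎 = 0 or 𝑑 = 0. If 𝑎 = 0, let
\[ \begin{cases} |\psi_1\rangle = b|0\rangle + d|1\rangle \\ |\psi_2\rangle = |1\rangle \end{cases}; \tag{27} \]
otherwise 𝑑 = 0, and let
\[ \begin{cases} |\psi_1\rangle = |0\rangle \\ |\psi_2\rangle = a|0\rangle + b|1\rangle \end{cases}. \tag{28} \]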
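3. Otherwise 𝑏 ≠ 0 and 𝑐 ≠ 0, and thus 𝑎 ≠ 0 and 𝑑 ≠ 0
since 𝑎𝑑 = 𝑏𝑐. Let
\[ \begin{cases} |\psi_1\rangle = \dfrac{a|0\rangle + c|1\rangle}{\sqrt{|a|^2 + |c|^2}} \\[2ex] |\psi_2\rangle = \dfrac{|0\rangle + (b/a)|1\rangle}{\sqrt{1 + |b/a|^2}} \end{cases}. \tag{29} \]
It can be verified that ∥ |𝜓1⟩ ∥ = ∥ |𝜓2⟩ ∥ = 1, and that
\[ |\psi_1\rangle \otimes |\psi_2\rangle = \frac{(a, b, c, d)^T}{\sqrt{(|a|^2 + |c|^2)(1 + |b/a|^2)}}, \tag{30} \]
which is exactly (𝑎, 𝑏, 𝑐, 𝑑)𝑇 since tensor product
preserves norm.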
□
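Lemma 3. (Necessity) For the equation to hold, 𝑈𝐴 and 𝑈𝐵
have to be diagonal or anti-diagonal. This means 𝑈𝑖 transforms
|0⟩ to |0⟩ or |1⟩, up to a global phase.
Proof. Suppose |𝜙⟩ = 𝑈𝐴 |0⟩ = (𝑎, 𝑏)𝑇 , thus
\begin{align}
& CZ\, U_A U_B\, CZ\, (|0\rangle \otimes (U_B^{\dagger} |\phi\rangle)) \tag{31} \\
&= CZ\, (|\phi\rangle \otimes |\phi\rangle) \tag{32} \\
&= (a^2, ab, ab, -b^2)^T, \tag{33}
\end{align}
which should be a separable state since this is also
$V_A V_B (|0\rangle \otimes (U_B^{\dagger} |\phi\rangle))$, which is separable. By Lemma 2,
$a^2 \cdot (-b^2) - ab \cdot ab = -2a^2b^2 = 0$. Thus $a^2 b^2 = 0$, so 𝑎 = 0 ($R_Z^{+}$
case) or 𝑏 = 0 ($R_Z$ case). This is the same for 𝑈𝐵 . □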
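The separability criterion of Lemma 2 and the test state used in the proof of Lemma 3 can be checked numerically. The sketch below is only an illustration; the helper names `is_separable`, `tensor` and `cz` are assumptions made for this example, and states are plain tuples of amplitudes in the order (|00⟩, |01⟩, |10⟩, |11⟩).

```python
def is_separable(psi, tol=1e-12):
    """Lemma 2: a two-qubit state (a, b, c, d) is separable iff a*d - b*c == 0."""
    a, b, c, d = psi
    return abs(a * d - b * c) < tol


def tensor(psi1, psi2):
    """Tensor product of two single-qubit states, as in Equation 23."""
    (a1, b1), (a2, b2) = psi1, psi2
    return (a1 * a2, a1 * b2, b1 * a2, b1 * b2)


def cz(psi):
    """Apply CZ: only the |11> amplitude changes sign."""
    a, b, c, d = psi
    return (a, b, c, -d)


s = 2 ** -0.5
plus = (s, s)                      # U_A|0> for a Hadamard, neither diagonal nor anti-diagonal
print(is_separable(cz(tensor(plus, plus))))      # entangled: Lemma 3 rules out Hadamard
print(is_separable(cz(tensor((0, 1), (0, 1)))))  # anti-diagonal case (a = 0) stays separable
```

The first state is (𝑎², 𝑎𝑏, 𝑎𝑏, −𝑏²) with 𝑎 = 𝑏 = 1/√2, for which 𝑎𝑑 − 𝑏𝑐 = −1/2, so CZ leaves it entangled.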
Lemma 4. (Sufficiency) $R_Z$ and $R_Z^{+}$ satisfy the conjugation
rules.
Proof. Note that $R_Z^{+} = X R_Z$ and $CZ\, X_A = X_A Z_B\, CZ$. By simple
computation we can see the conjugation holds. □
□
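The "simple computation" in the proof of Lemma 4 can be reproduced with small explicit matrices. The following is a sketch under stated assumptions: the convention $R_Z(\theta) = \mathrm{diag}(e^{-i\theta/2}, e^{i\theta/2})$ and the helper names `matmul`, `kron`, `close`, `rz` and `conjugate_by_cz` are choices made for this example, not taken from the paper.

```python
import cmath


def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(A, B):
    """Kronecker product A (x) B; A acts on qubit a, B on qubit b."""
    return [[x * y for x in row_a for y in row_b] for row_a in A for row_b in B]


def close(A, B, tol=1e-12):
    """Entry-wise comparison of two matrices of the same shape."""
    return all(abs(x - y) < tol for ra, rb in zip(A, B) for x, y in zip(ra, rb))


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]


def rz(theta):
    """Diagonal rotation; this phase convention is an assumption of the sketch."""
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]


def conjugate_by_cz(UA, UB):
    """Return CZ (UA (x) UB) CZ."""
    return matmul(CZ, matmul(kron(UA, UB), CZ))


# Diagonal gates pass through CZ unchanged.
diag_ok = close(conjugate_by_cz(rz(0.3), rz(1.1)), kron(rz(0.3), rz(1.1)))
# An anti-diagonal gate X*RZ on qubit a leaves an extra Z on qubit b.
UA = matmul(X, rz(0.3))
anti_ok = close(conjugate_by_cz(UA, I2), kron(UA, Z))
print(diag_ok, anti_ok)
```

In the anti-diagonal case the result is still a tensor product, with 𝑉𝐴 = 𝑋𝑅𝑍 and 𝑉𝐵 = 𝑍.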
F Proof of Theorem 3 (Convergence of
compaction)
We show that the compaction procedure converges after
being applied three times.
If we look at the factors that prevent the compaction
procedure from reaching its fixpoint, there are two main reasons:
1. Single qubit merging results in new diagonal gates
or antidiagonal gates, which are not recognized when
the first gate is placed. Compacting #1 in Figure 7
shows an example where three gates merge into an
antidiagonal 𝑋 gate, which can merge through the 𝐶𝑍
gate on the next compaction.
2. Swapping the order of an antidiagonal gate and a 𝐶𝑍
gate adds 𝑍 gates to the circuit. Compacting #2 in
Figure 7 shows an example.
Fortunately, these problems will not occur at the third
compaction. This is because diagonal gates and
antidiagonal gates form a subgroup of 𝑈2:
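Lemma 5. Let
\[ G_Z = \{ R_Z(\theta) \mid \theta \in [0, 2\pi) \}, \tag{34} \]
\[ G_Z^{+} = \{ R_Z^{+}(\theta) \mid \theta \in [0, 2\pi) \}, \tag{35} \]
\[ G = G_Z \cup G_Z^{+}, \tag{36} \]
then $G_Z$ and $G$ are subgroups of $U_2$, while $\forall g_1, g_2 \in G_Z^{+},\ g_1 g_2 \in G_Z$.
Corollary 6. $\forall g_1 \in U_2 \setminus G,\ g_2 \in G,\ g_2 g_1 \in U_2 \setminus G$.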
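The closure properties in Lemma 5 and Corollary 6 can be spot-checked on concrete matrices. The sketch below is illustrative only; the classifier `kind`, the helper `matmul2` and the phase convention of `rz` are assumptions of this example.

```python
import cmath


def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def kind(U, tol=1e-12):
    """Classify a 2x2 matrix as 'diagonal', 'antidiagonal' or 'other'."""
    if abs(U[0][1]) < tol and abs(U[1][0]) < tol:
        return "diagonal"
    if abs(U[0][0]) < tol and abs(U[1][1]) < tol:
        return "antidiagonal"
    return "other"


X = [[0, 1], [1, 0]]
s = 2 ** -0.5
H = [[s, s], [s, -s]]          # Hadamard: an element of U2 outside G


def rz(theta):
    """Diagonal rotation; the phase convention is an assumption of the sketch."""
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]


def rz_plus(theta):
    """Anti-diagonal gate X * RZ(theta)."""
    return matmul2(X, rz(theta))


print(kind(matmul2(rz_plus(0.3), rz_plus(1.2))))  # two antidiagonal gates
print(kind(matmul2(rz(0.3), rz_plus(1.2))))       # diagonal times antidiagonal
print(kind(matmul2(rz_plus(0.7), H)))             # element of G times element outside G
```

The three products fall into 𝐺𝑍 , 𝐺+𝑍 and 𝑈2\𝐺 respectively, which is the behaviour the convergence argument relies on.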
On compaction #2, single qubit gates can only merge
when they are on different sides of a 𝐶𝑍 gate and one is
diagonal or antidiagonal (otherwise they would have been
merged on compaction #1). According to Corollary 6, this
merging will not add new diagonal or antidiagonal gates, and
all new gates from compaction #2 come from moving
antidiagonal gates through 𝐶𝑍 gates. The last compaction merges
these additional 𝑍 gates to their left.
G Proof of Theorem 5 (Remove multiple
edges)
In the QDG defined in Section 4, Theorem 5 is proposed so
that multiple edges can be removed before 𝐼 𝐼 is assigned.
The proof of Theorem 5 is listed below:
Proof. Since 𝑑𝑖 𝑓1 and 𝑑𝑖 𝑓2 are integers,
1 + 𝑑𝑖 𝑓2 ⩽ 𝑑𝑖 𝑓1, (37)
Since 𝐼 𝐼 ⩾ 1,
− 𝐼 𝐼 · 𝑑𝑖 𝑓1 ⩽ −𝐼 𝐼 − 𝐼 𝐼 · 𝑑𝑖 𝑓2 ⩽ −1 − 𝐼 𝐼 · 𝑑𝑖 𝑓2. (38)
Since𝑚𝑖𝑛1 ⩽ 1 and𝑚𝑖𝑛2 ⩽ 1,
𝑚𝑖𝑛1 ⩽ 𝑚𝑖𝑛2 + 1. (39)
Adding up Equations 38 and 39 gives the result. □
H Resource scheduling complexity
analysis
In Section 4 we mentioned that we can keep retrying if there
is a “resource conflict” and the death countdown has not timed
out (i.e. the resource conflicts are all caused by false conflicts).
This may lead to so many retries that they dominate the
complexity of the algorithm. We therefore need an upper bound
on the maximum number of retries to estimate the
total complexity.
Recall how we perform resource checking when inserting
instructions into the schedule:
• For every time slot, we have scheduled a bunch of
instructions in this time slot.
• When adding an instruction or a group of instructions,
we check the operands of each instruction to be added
against instructions in the time slot where it will be
added.
• If there is a resource conflict, we have to try the next tick
(and perhaps start a death countdown).
We first show that if there are only false conflicts, the loop
can be written into an equivalent form where all 𝑘 = 1. This
follows from the identity
𝑘𝑖 + 𝑏 = 𝑘 (𝑖 + (𝑏/𝑘)) + (𝑏 mod 𝑘), (40)
where
(𝑏 mod 𝑘) ∈ [0, ∥𝑘 ∥) , 𝑘 (𝑏/𝑘) + (𝑏 mod 𝑘) = 𝑏. (41)
According to this identity, the array can be split into ∥𝑘 ∥ slices,
and a resource conflict can occur only if the two qubit references
fall into the same slice. Figure 17 is an example for 𝑘 = 3.
Offsets 3𝑖 and (3𝑖 − 1) will never conflict with each other,
since they fall into different slices 𝑞0 and 𝑞2.
This splitting allows us to use one integer 𝑏 ′ = (𝑏/𝑘) to
represent an expression in the slice: in the Figure 17 case we
can use 0 for 𝑞 [3𝑖] in slice 𝑞0, 0 for 𝑞 [3𝑖 + 1] in slice 𝑞1, and
(−1) for 𝑞 [3𝑖 − 1] in slice 𝑞2.
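The slice decomposition of Equations 40 and 41 can be written as a small helper. This is a sketch for illustration; the names `slice_and_offset` and `may_conflict` are assumptions of this example. Python's `%` is a floored modulo, so the code takes the remainder with respect to ∥𝑘∥ to obtain the Euclidean remainder required by Equation 41.

```python
def slice_and_offset(k, b):
    """Rewrite k*i + b as k*(i + b') + s with s in [0, |k|).

    Returns (s, b'): the slice index and the integer representing
    the expression inside that slice."""
    if k == 0:
        raise ValueError("k must be non-zero")
    s = b % abs(k)            # Euclidean remainder, always in [0, |k|)
    b_prime = (b - s) // k    # exact division, so k*b' + s == b
    return s, b_prime


def may_conflict(k, b1, b2):
    """Two references q[k*i + b1] and q[k*i + b2] can only collide
    across iterations if they fall into the same slice."""
    return slice_and_offset(k, b1)[0] == slice_and_offset(k, b2)[0]


# The Figure 17 example with k = 3:
print(slice_and_offset(3, 0))    # q[3i]
print(slice_and_offset(3, 1))    # q[3i + 1]
print(slice_and_offset(3, -1))   # q[3i - 1]
print(may_conflict(3, 0, -1))    # 3i and 3i - 1 live in different slices
```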
Corollary 7. For the modulo scheduling, if a resource is sched-
uled 𝐼 𝐼 ticks later, the integer 𝑏 ′ representing the resource de-
creases by 1.
This allows us to use a stricter model for the upper-bound
estimation:
• For the entire schedule, we use a universal set to store
all integer representations {𝑏 ′} of linear expressions.
• When adding an instruction or a group of instructions,
we check the operands to be added against the univer-
sal set, rather than the time-slot set. This means two
instructions with the same operand but scheduled at
different ticks will also be seen as conflicted.
• If the integer representation of an operand is already in
the set, there is a resource conflict. To find the worst
case, we suppose the next (𝐼𝐼 − 1) tries will definitely
fail. The next retry that can possibly succeed is the
𝐼𝐼 -th retry, where the instruction is going to be placed
in the same time slot again.
• The array index 𝑞 and slice index 𝑏 mod 𝑘 are ignored.
For example, operands 𝑞 [3𝑖] and 𝑞 [3𝑖 + 1] will be seen
as conflicted since they have the same representation
0, even though the two expressions will never be equal
to each other.
Figure 17. Example for splitting the qubit array when 𝑘 = 3. Resource conflict can only occur inside each slice, and resources
in each slice can be represented by one integer.
This strict set of rules reduces our upper bound problem
to a clearer problem:
Theorem 8. Let the finite set 𝐴 ⊂ Z stand for the resources
(integers representing each resource) already scheduled, and let
𝐵 ⊂ Z be the resources to be scheduled. Define
𝐵 − 𝑘 = {𝑥 − 𝑘 |𝑥 ∈ 𝐵} , 𝑘 ∈ 𝑁, (42)
to be the resource set of 𝐵 after 𝑘 · 𝐼𝐼 retries. Let 𝑘𝑚𝑖𝑛 be the
minimal 𝑘 s.t.
𝐴 ∩ (𝐵 − 𝑘) = Φ, (43)
then at most 𝑘𝑚𝑖𝑛 · 𝐼𝐼 retries are required in our algorithm.
Figure 18. Resources 3 (𝑞 [𝑥 + 3]) and 5 (𝑞 [𝑥 + 5]) are now
occupied, and resources 4 to 6 are required to be scheduled. Now
𝑘𝑚𝑖𝑛 = 4.
A naive estimation of 𝑘𝑚𝑖𝑛 would be
𝑘𝑚𝑖𝑛 ⩽ 𝑚𝑎𝑥 (0, 𝑚𝑎𝑥 (𝐵) −𝑚𝑖𝑛(𝐴) + 1), (44)
which depends on the values stored in the sets and is therefore
not acceptable. Fortunately, we can give a more precise
estimation in terms of the sizes of the sets only, rather than
the values in 𝐴 or 𝐵.
Theorem 9. Let ∥𝑆 ∥ be the size of set 𝑆 . Then
𝑘𝑚𝑖𝑛 ⩽ ∥𝐴∥∥𝐵∥. (45)
Proof. Consider the set
𝐷 = {𝑏 − 𝑎 |𝑎 ∈ 𝐴,𝑏 ∈ 𝐵, (𝑏 − 𝑎) ⩾ 0} . (46)
For 𝑘 ∈ 𝑁 , 𝑘 ∉ 𝐷 if and only if 𝐴 ∩ (𝐵 − 𝑘) = Φ. Thus 𝑘𝑚𝑖𝑛 is the
first natural number not appearing in 𝐷 . However, ∥𝐷 ∥ ⩽
∥𝐴∥∥𝐵∥ according to its definition, so 𝑘𝑚𝑖𝑛 ⩽ ∥𝐷 ∥ ⩽ ∥𝐴∥∥𝐵∥. □
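The construction in the proof translates directly into code. The sketch below is an illustration under the strict model of this appendix; the function name `k_min` is an assumption of this example. It computes 𝑘𝑚𝑖𝑛 as the first natural number missing from 𝐷 and can be compared against the bound of Theorem 9 on the Figure 18 instance.

```python
def k_min(A, B):
    """Smallest k >= 0 such that A and {x - k | x in B} are disjoint.

    Follows the proof of Theorem 9: k_min is the first natural
    number that does not appear in D = {b - a | b - a >= 0}."""
    D = {b - a for a in A for b in B if b - a >= 0}
    k = 0
    while k in D:
        k += 1
    return k


# Figure 18: resources 3 and 5 are occupied, resources 4 to 6 are requested.
A = {3, 5}
B = {4, 5, 6}
k = k_min(A, B)
print(k, len(A) * len(B))   # k_min and the bound |A| * |B|
```

Since the loop runs at most ∥𝐷∥ + 1 times, the bound 𝑘𝑚𝑖𝑛 ⩽ ∥𝐴∥∥𝐵∥ is visible in the code itself.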
Corollary 10. Inserting 𝑚 instructions at one time (e.g. merging
two scheduled blocks) into a schedule with 𝑛 instructions
requires at most $O(mn \cdot \mathit{II})$ retries. If each retry takes 𝑂 (𝑚𝑛)
queries to find a conflict, the total complexity is at most $O(m^2 n^2 \cdot \mathit{II})$.
According to the theorem, we can get several important
results on the complexity:
Corollary 11. 1. Inserting one instruction into the modulo
scheduling table sized 𝑏 requires $O(b \cdot \mathit{II})$ retries and
$O(b^2 \cdot \mathit{II})$ time. Thus inserting all 𝑏 instructions requires
$O(b^3 \cdot \mathit{II})$ time.
2. The span of the modulo scheduling table above is bounded
by $O(b^2 \cdot \mathit{II})$.
3. Suppose the loop kernel sized 𝑛 is split into 𝑎 ⩾ 2 strongly
connected components (SCCs) sized 𝑏. The total complexity for
scheduling all SCCs is $a \cdot O(b^3 \cdot \mathit{II}) = O(ab^3 \cdot \mathit{II}) = O(n^4)$,
and the total time required to merge all SCCs together is
\[ \sum_{i=1}^{a-1} O(b^2 (ib)^2 \cdot \mathit{II}) = O(a^3 b^4 \cdot \mathit{II}) = O(n^5). \tag{47} \]
4. The span of the total schedule is
\[ a \cdot O(b^2 \cdot \mathit{II}) + \sum_{i=1}^{a-1} b(ib) \cdot \mathit{II} = O(ab^2 \cdot \mathit{II} + a^2 b^2 \cdot \mathit{II}) = O(n^2 \cdot \mathit{II}). \tag{48} \]
Thus we expect the length of prologue and epilogue to be
\[ \sum_{i=1}^{O(n^2)} i \cdot \mathit{II} = O(n^3). \tag{49} \]
I CNOT conjugation rules
These results are taken directly from [24].
Theorem 12. (𝐶𝑁𝑂𝑇 conjugation) 𝐶𝑁𝑂𝑇 conjugates single
qubit gates if and only if the conjugation satisfies one of the
following eight cases:
[Circuit identities for the eight cases, Equations (50) to (57): the circuit diagrams are not reproduced here; see [24].]
It is easy to check that 𝐶𝑁𝑂𝑇 conjugation rules and 𝐶𝑍
conjugation rules are equivalent to each other, by converting
𝐶𝑁𝑂𝑇 to 𝐶𝑍 and vice versa.
J Parallel QAOA Decomposition
QAOA is one of the popular algorithms in the NISQ era. We
use the QAOA program for solving MaxCut problems
as our optimization test cases.
However, we face the problem of lost commutativity
when optimizing 𝑄𝐴𝑂𝐴 programs: our device cannot
execute the 𝑈 (𝐵, 𝛽𝑖 ) operation directly, so it has to be
decomposed into basic gates according to Equation 8, and the
optimization chances offered by block-level commutativity
between 𝑈 (𝐵, 𝛽𝑖 ) matrices are missed.
There have been different ways to optimize QAOA circuits
that take the commutativity of the 𝑈 (𝐵, 𝛽𝑖 ) blocks into account.
For example, [18] detects all two-qubit diagonal structures
in the circuit and aggregates them, so that commutativity
detection can be performed on aggregated blocks. Another
approach, the layout synthesis algorithm (scheduling considering
device layout) QAOA-OLSQ [21], schedules QAOA circuits twice:
the first time at a large granularity (named TB-OLSQ) and
the second time at a small granularity (named OLSQ). The
large-granularity pass allows block commutativity to be
considered, and gates are placed in blocks. The small-granularity
pass finishes the scheduling.
However, these two approaches both require the optimiza-
tion algorithm to perform coarse-grain block-level schedul-
ing in addition to fine-grain gate-level scheduling. We may
want to find another way to give commutativity hints to a
gate-scheduling algorithm without modifying the algorithm
itself.
Equation 8 suggests that the decomposed form of
𝑈 (𝐵, 𝛽𝑖 ) is shaped somewhat like a 𝐶𝑁𝑂𝑇 gate: it
has a “controller” qubit and a “controlled” qubit; multiple
blocks with the same “controller” qubit can be commuted
and interleaved freely at gate level, and can be finished in 2
ticks on average instead of 3, as in Figure 19.
|a〉 ⊕ RZ(−ωabγi) ⊕
|b〉 • • • •
|c〉 ⊕ RZ(−ωbcγi) ⊕
Figure 19. The two blocks can be executed in an interleaved manner.
The “blocks” suggested by the observation above
can be derived by directing and colouring all edges in the
undirected graph 𝐺 = ⟨𝑉 , 𝐸⟩:
• First, we assign every edge the direction in which
we would perform the decomposition of Equation 8 (i.e. assign
the graph an orientation). Suppose the direction
points from the controller qubit to the controlled qubit.
• Then, we colour all edges with the minimal number of
colours under the following constraints:
(a) Graph for QAOA. (b) One orientation for the graph.
(c) One coloring satisfying the constraints.
(d) The equivalent vertex-coloring problem.
Figure 20. Example for one possible orientation and layering
of a graph.
1. All in-degree edges of a vertex should be coloured
differently from each other.
2. Out-degree edges of a vertex should be coloured
differently from all in-degree edges of the vertex.
The minimal number of required colors over all possible
orientations is the minimal number of layers we can put
these gates into.
Note that finding the minimal edge colouring under the
constraints can be reduced to the problem of finding a minimal
vertex colouring of a new graph. In the new graph, vertices
represent original edges; vertices for in-degree edges of the
same original vertex are fully connected; vertices for in-degree
edges are connected with those for out-degree edges of the same
original vertex. Figure 20 is an example
of assigning directions and colours for edges in the graph,
and the equivalent vertex-colouring problem to the
edge-colouring one.
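The reduction can be illustrated with a few lines of Python. This sketch is not the SMT-based procedure used for the evaluation: it only builds the conflict relation implied by the two colouring constraints for a fixed orientation and colours it greedily, which yields a valid but not necessarily minimal colouring. The names `conflict_pairs` and `greedy_colouring` and the edge encoding (controller, controlled) are assumptions of this example.

```python
from itertools import combinations


def conflict_pairs(edges):
    """edges: list of directed edges (controller, controlled).

    Returns the set of index pairs (i, j), i < j, that must receive
    different colours under the two constraints."""
    pairs = set()
    for (i, (u1, v1)), (j, (u2, v2)) in combinations(enumerate(edges), 2):
        same_target = v1 == v2                 # constraint 1: two in-edges of one vertex
        in_meets_out = v1 == u2 or v2 == u1    # constraint 2: in-edge vs out-edge of one vertex
        if same_target or in_meets_out:
            pairs.add((i, j))
    return pairs


def greedy_colouring(edges):
    """Colour edges in order with the smallest colour not used by a
    conflicting earlier edge. Valid, but not guaranteed minimal."""
    pairs = conflict_pairs(edges)
    colours = []
    for j in range(len(edges)):
        used = {colours[i] for i in range(j) if (i, j) in pairs}
        c = 0
        while c in used:
            c += 1
        colours.append(c)
    return colours


# Two blocks sharing the controller b (as in Figure 19) fit in one layer.
print(greedy_colouring([("b", "a"), ("b", "c")]))
# Two blocks sharing the controlled qubit b need two layers.
print(greedy_colouring([("a", "b"), ("c", "b")]))
```

Minimising the number of colours over all orientations is the harder problem that the text delegates to a solver.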
One direct way to compute the block placement strategy
is to use an SMT solver; for example, the 𝑄𝐴𝑂𝐴-𝑃𝑎𝑟 test cases
in our evaluation are generated using the Z3 solver [7]. We leave
it as an open problem whether there is an efficient approach.
K Complexity Analysis
In this section we give a rough estimation of the complexity of
the scheduling algorithm above. We put the main complexity
results in Table 3, with some notes below to explain.
K.1 Complexity of loop compaction
Complexity for compacting a piece of loop program sized
𝑂 (𝑛) once is𝑂 (𝑛2), since when adding every instruction we
check it against all instructions that are previously added.
K.2 Complexity of loop unrolling
Finding merging or cancelling candidates requires $O(n^2)$
time. If the loop range is unknown, we have to perform
the following steps on 𝐶 loops sized 𝑚 = 𝑂 (𝐶𝑛).
Step            Time                     Code Size
Compaction      O(n^2)                   O(n)
Unrolling       O(n^2 + C^2 n)           C loops sized O(Cn)
For each loop sized O(m):
  Rotation      O(m^3)                   O(m^3)
  Try II        O(log m)                 -
  Tarjan        O(m^2)                   -
  Floyd         O(m^3)                   -
  Scheduling    O(m^5)                   Span = O(m^3)
  Add Z         O(m^4)                   -
  Codegen       O(m^6)                   O(m^3)
  Total         O(m^6 log m)             O(m^3)
In total:
  Overall       O(C^6 n^6 log(Cn))       O(C^4 n^3)
Table 3. Complexity of our software pipelining approach.
K.3 Complexity of loop rotation
A loop sized 𝑂 (𝑛) can be rotated at most $O(n^2)$ times,
since loop rotation will not introduce a new qubit into the
loop, and the 𝑂 (𝑛) qubits can be placed in a partial order:
𝑞𝑎 ≺ 𝑞𝑏 if a single qubit gate on 𝑞𝑎 will be on 𝑞𝑏 after rotation.
This will create a prologue sized $O(n^2)$, an epilogue sized
$O(n^3)$ and a new loop sized 𝑂 (𝑛). Each rotation requires
$O(n^2)$ time (to find a rotatable gate) so the total complexity
is $O(n^4)$.
K.4 Complexity of modulo scheduling
We need 𝑂 (𝑙𝑜𝑔𝑚) retries to binary-search the minimal 𝐼 𝐼 .
Complexity of Tarjan algorithm on a dense graph is 𝑂 (𝑚2),
and complexity of Floyd algorithm is 𝑂 (𝑚3).
The complexity of retrying due to resource conflicts is
analyzed in Appendix H.
K.5 Inversion pair detection
The complexity for detecting in-loop inversion pairs is $O(m^2)$.
The complexity for detecting across-loop inversion depends
on the span of the total schedule. Note that according to
Definition 8:




where 𝑝1, 𝑝2 = 𝑂 (𝑚2). Thus
𝑟 = 𝑂 (𝑚2). (59)
The total complexity of checking 𝑂 (𝑚2) pairs of instruc-
tions across 𝑟 iterations is 𝑂 (𝑚4).
K.6 Code generation
The complexity for code generation is just the length of pro-
logue and epilogue, 𝑂 (𝑚3). The compaction is of quadratic
complexity so the total complexity is 𝑂 (𝑚6). However, for
cases where the loop range is known, using a hash set to store
the last operation on each qubit can reduce the complexity
to 𝑂 (𝑚3).
Theorem 13. The total time complexity for our algorithm is
\[ O(C^6 n^6 \log(Cn)), \tag{60} \]
and the size of the generated code is
\[ O(C^4 n^3). \tag{61} \]
L Adapting to existing architectures
Note that we are building our approach of optimization based
on a specific quantum circuit model as specified in Section
2.2. Recall some of the features of the model that we use:
• Classical computation and loop guards can be carried
out instantly.
• The hardware can execute arbitrary single qubit oper-
ations and 𝐶𝑍 gates between arbitrary qubit pairs. All
instructions can finish in one cycle.
• Instructions on totally different qubits can be carried
out at the same time.
L.1 Powerful classical control
A quantum processor is usually split into a classical part and
a quantum part, and all classical logic (e.g. branch
statements) runs on the classical part.
To implement fast classical guards for 𝑓 𝑜𝑟 -loops, we can
use several classical architecture mechanisms, such as
superscalar execution, classical branch prediction and speculative
execution. As long as the classical part commits instructions
faster than the quantum part executes them, we can keep the
quantum part fully loaded without introducing unnecessary
bubbles.
If we want classical operations that affect the control flow
of the quantum part (e.g. classical branch statements), one way
is to convert them to their quantum versions. One
practical example is measurement with feedback:
if we want to use the measurement outcome to control the
following operations, we can use a qubit array to replace
classical memory, a 𝐶𝑁𝑂𝑇 gate to replace measurement,
and controlled gates to replace classical control. The
classical trick of register renaming can be adopted when
converting measurement to quantum gates: different iterations can
“measure to” different qubits to prevent unnecessary name
dependency.
Also, on real quantum processors full parallelism is
not likely to be achieved; for example, there may be a limit
on the instruction issue width of the device. In this case, we
can just limit the maximal issue width in resource conflict
checking.
L.2 CNOT-based instruction set
One major difference between our assumptions and the real-
world architectures is that most existing models and archi-
tectures adopt a 𝐶𝑁𝑂𝑇 -based instruction set, instead of a
𝐶𝑍 -based one. We provide two possible approaches for ex-
tending our method to the 𝐶𝑁𝑂𝑇 -architecture case.
One approach is to convert the original circuit to its 𝐶𝑍
version directly, using the equation 𝐻 [𝑏]𝐶𝑍 [𝑎, 𝑏]𝐻 [𝑏] =
𝐶𝑁𝑂𝑇 [𝑎, 𝑏]. After optimization, an additional step is required
to convert each 𝐶𝑍 gate into a 𝐶𝑁𝑂𝑇 gate by adding Hadamard
gates. Note that the way of adding Hadamard gates can affect
the depth of the kernel.
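The conversion between 𝐶𝑍 and 𝐶𝑁𝑂𝑇 by Hadamard gates, and the freedom in choosing which qubit receives them, can be checked on explicit matrices. The following is a sketch for illustration; the helper names `matmul`, `kron` and `close` and the basis ordering (|00⟩, |01⟩, |10⟩, |11⟩ with qubit 𝑎 first) are assumptions of this example.

```python
def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(A, B):
    """Kronecker product A (x) B; A acts on qubit a, B on qubit b."""
    return [[x * y for x in row_a for y in row_b] for row_a in A for row_b in B]


def close(A, B, tol=1e-12):
    """Entry-wise comparison of two matrices of the same shape."""
    return all(abs(x - y) < tol for ra, rb in zip(A, B) for x, y in zip(ra, rb))


s = 2 ** -0.5
I2 = [[1, 0], [0, 1]]
H = [[s, s], [s, -s]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]
CNOT_AB = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control a, target b
CNOT_BA = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control b, target a

H_b = kron(I2, H)   # Hadamard on qubit b
H_a = kron(H, I2)   # Hadamard on qubit a

# Hadamards on b turn CZ into a CNOT targeting b; Hadamards on a flip the direction.
print(close(matmul(H_b, matmul(CZ, H_b)), CNOT_AB))
print(close(matmul(H_a, matmul(CZ, H_a)), CNOT_BA))
```

Because 𝐶𝑍 is symmetric in its two qubits, either placement is valid, which is why the direction of each 𝐶𝑁𝑂𝑇 is a free choice when converting back.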
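Example 13. Adding Hadamard gates on the same qubit of
two adjacent 𝐶𝑍 gates saves gate depth by 1, compared to the […]
[Equation (62): circuit diagrams on qubits |a〉, |b〉 and |c〉 illustrating the placement of Hadamard gates around the two 𝐶𝑁𝑂𝑇 gates.]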
However, deciding all directions of 𝐶𝑁𝑂𝑇 gates can be
a hard problem. We can formulate the problem as an ILP
problem. A rough description is as follows:
• Each 𝐶𝑍 is given a boolean variable, indicating the di-
rection of 𝐶𝑁𝑂𝑇 (and where to add Hadamard gates).
• If one 𝐶𝑍 is adjacent to a single qubit gate, the 𝐻 can
be absorbed.
• If one 𝐶𝑍 is adjacent to another 𝐶𝑍 and they add
Hadamard gates on the same qubit, the two Hadamard gates
cancel and no depth is added.
• Otherwise the depth is increased by 1 by the Hadamard gate.
If there is an aliasing, the depth needs to be increased by
more than 1 so that 𝐻 gates on qubits with aliasing
will be placed at two different ticks.
• The objective is to minimize the depth on all qubits.
We leave the best conversion from a 𝐶𝑍 program into a 𝐶𝑁𝑂𝑇
program with minimal depth as an open problem.
Another way to port our approach is to modify our QDG
definition to the 𝐶𝑁𝑂𝑇 -based instruction set. But in fact,
the most commonly used 𝐶𝑁𝑂𝑇 commutation rules that
are based on intuition are only part of the complete 𝐶𝑁𝑂𝑇
conjugation rules:
Lemma 6. (𝐶𝑁𝑂𝑇 conjugation rules)[24] There are 8 rules
in total for 𝐶𝑁𝑂𝑇 conjugation rules, similar to 𝐶𝑍 rules. See
Appendix I.
If we want to exploit full power of these rules, we have
to consider all these rules while building QDG, instead of
considering only the intuitive rules (usually the first 4 rules).
(a) Before. (b) After. The numbers correspond to the inter-
cept 𝑏 in expression 𝑞 [6𝑖 + 𝑏].
Figure 21. Loop kernel for cluster state preparation (𝑁 = 3).
Shaded dots are qubits for Hadamard operands and closed
dots are 𝐶𝑍 operands.
But this time, the rewriting trick in Theorem 2 no longer
works for 𝐶𝑁𝑂𝑇 rules. How to use these rules directly for
QDG construction remains an open problem.
L.3 Working with device topology
One problem with a controlled-Z architecture is that it
can be hard to perform long-distance 𝐶𝑍 operations. For the
𝐶𝑁𝑂𝑇 case, a long-distance 𝐶𝑁𝑂𝑇 gate with length 𝑘 can
be implemented using (4𝑘 − 4) gates according to [17]. However,
this is not true for 𝐶𝑍 gates, as “amplitude” can’t propagate
through 𝐶𝑍 gates.
A direct conversion approach can be taken by converting
𝐶𝑍 to 𝐶𝑁𝑂𝑇 and back. Since every 𝐶𝑁𝑂𝑇 is on the critical
path and no adjacent controlled bits can be found on the critical
path, this would require (8𝑘 − 8 + 1) = (8𝑘 − 7) gates on the
critical path. The exception is 𝑘 = 2, since the last 𝐻𝑎𝑑𝑎𝑚𝑎𝑟𝑑
on the critical path can be removed and the total depth is 8.
M Optimization of Cluster State
Preparation, etc.
This appendix introduces the Cluster and Array test cases
used in our evaluation.
Cluster is an example of a cluster-state preparation
program, which is a for-all loop: increasing the number of iterations
does not add to the overall depth of the program, which on
the 2-dimensional grid is a constant 5 (4 for 𝐶𝑍s in four
directions and 1 for Hadamard). Despite that, we can still
perform loop optimization on this program to get a loop with
a kernel of size 1.
For 𝐶 = 2, the loop kernels before and after rotation
followed by software pipelining are given in Figure 21. Our
approach splits 𝐶𝑍 gates that conflict with each other into
different iterations so that they can be executed together,
and the kernel size is reduced to 1, the best result for any
loop-optimization approach except full unrolling.
The Array series consists of several artificially crafted loop programs
on qubit arrays. Array 1 performs three 𝐶𝑍 gates as in Figure
12, while two Hadamard gates are added between 𝐶𝑍s to
prevent cancellation. Array 2 performs non-cancelling 𝐶𝑍
gates so that they can be parallelized maximally. Array 3
constructs a huge Toffoli gate using Toffoli gates and ancillas:
in each iteration, a Toffoli is performed on a source qubit, an
ancilla and the next ancilla.
The instruction operands of these examples contain the
iteration variable and are thus simpler to optimize compared
with those on a fixed set of qubits.
