Quantum Fan-out: Circuit Optimizations and Technology Modeling by Gokhale, Pranav et al.
Quantum Fan-out: Circuit Optimizations and Technology Modeling
Pranav Gokhale∗
University of Chicago
Samantha Koretsky
University of Chicago
Shilin Huang
Duke University
Swarnadeep Majumder
Duke University
Andrew Drucker
University of Chicago
Kenneth R. Brown
Duke University
Frederic T. Chong
University of Chicago
ABSTRACT
Instruction scheduling is a key compiler optimization in quan-
tum computing, just as it is for classical computing. Current
schedulers optimize for data parallelism by allowing simul-
taneous execution of instructions, as long as their qubits do
not overlap. However, on many quantum hardware platforms,
instructions on overlapping qubits can be executed simul-
taneously through global interactions. For example, while
fan-out in traditional quantum circuits can only be imple-
mented sequentially when viewed at the logical level, global
interactions at the physical level allow fan-out to be achieved
in one step. We leverage this simultaneous fan-out primitive
to optimize circuit synthesis for NISQ (Noisy Intermediate-
Scale Quantum) workloads. In addition, we introduce novel
quantum memory architectures based on fan-out.
Our work also addresses hardware implementation of the
fan-out primitive. We perform realistic simulations for trapped
ion quantum computers. We also demonstrate experimental
proof-of-concept of fan-out with superconducting qubits. We
perform depth (runtime) and fidelity estimation for NISQ
application circuits and quantum memory architectures under
realistic noise models. Our simulations indicate promising
results with an asymptotic advantage in runtime, as well as
7–24% reduction in error.
1. INTRODUCTION
Instruction scheduling is a powerful compiler technique in
both classical and quantum computing. In the classical realm,
scheduling techniques such as pipelining, Single Instruction
Multiple Data (SIMD), and Out-of-order execution have led
to continued gains in processing power. These scheduling
techniques are designed to preserve a program’s logical cor-
rectness by respecting constraints known as hazards.
Just as in the classical setting, quantum computing is also
amenable to instruction scheduling. In fact, due to the short
lifetimes of qubits in the NISQ (Noisy Intermediate-Scale
Quantum) era [73], scheduling to reduce latency is critical
for successful program execution. The potential of quantum
instruction scheduling was recently exemplified by Google’s
Quantum Supremacy result [7], which experimentally demonstrated
a task soluble in seconds on a 53-qubit computer that
is argued to likely require days [72] on a supercomputer. A
core aspect of the Supremacy result is a coupler activation
schedule that maximizes simultaneous quantum resource utilization.
∗pranav@super.tech
A number of papers [37, 38, 47, 63] in the architecture
community have studied quantum scheduling, inspired by
techniques from the classical setting. One principle under-
lying these papers is exclusive activation: a qubit can be
involved in at most one operation per timestep [38]. In ar-
chitectural terms, this is a structural hazard [44]. Under
exclusive activation, schedulers optimize for data parallelism
by simultaneously executing instructions on disjoint qubits.
However, there are natural limits to such schedulers, since
instructions on overlapping qubits must be serialized.
Our work begins with a simple but consequential obser-
vation: the structural hazard of exclusive activation is not
actually enforced by most quantum hardware. In fact, it can
be more natural for a quantum processor to simultaneously
execute multiple operations on shared qubits through global
interactions. The building block of our work is the fan-out
operation depicted in Figure 1. This operation can be under-
stood purely classically. The depiction on the left in Figure 1
has four CNOT (Controlled-NOT) gates, each comprising a
control (•) and a target (⊕). The target is flipped iff the con-
trol qubit is 1. This operation performs fan-out for classical
input states: when the targets are initialized to 0, the state of
the control gets copied to the targets.
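The classical behavior described above can be sketched in a few lines of Python. This is only an illustration on classical bit lists, not the authors' implementation; the function names and the step counter are hypothetical and exist only to contrast one-CNOT-per-step execution with a single simultaneous fan-out step.

```python
def cnot(bits, control, target):
    """Classical action of a CNOT: flip bits[target] iff bits[control] is 1."""
    out = list(bits)
    out[target] ^= out[control]
    return out

def serialized_fanout(bits, control, targets):
    """Apply one CNOT per target, one after another; returns (bits, steps)."""
    steps = 0
    for t in targets:
        bits = cnot(bits, control, t)
        steps += 1
    return bits, steps

def simultaneous_fanout(bits, control, targets):
    """Flip every target at once iff the control is 1; modeled as a single step."""
    out = list(bits)
    for t in targets:
        out[t] ^= bits[control]
    return out, 1

# Control on, four targets initialized to 0: the control value is copied to all targets.
serial_bits, serial_steps = serialized_fanout([1, 0, 0, 0, 0], 0, [1, 2, 3, 4])
fan_bits, fan_steps = simultaneous_fanout([1, 0, 0, 0, 0], 0, [1, 2, 3, 4])
print(serial_bits, serial_steps)
print(fan_bits, fan_steps)
```

Both routines produce the same bits; they differ only in the number of steps charged, which is the quantity the rest of the paper optimizes.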
While exclusive activation would serialize the four CNOT
instructions as depicted on the left, underlying quantum hardware
can naturally perform these interactions simultaneously,
as depicted on the right. This form of Single Instruction
Multiple Data (SIMD) parallelism arises only after discarding
structural hazards that don’t manifest in hardware. As
we demonstrate later, the fan-out building block generalizes
to efficiently-scheduled circuit synthesis for the ubiquitous
Controlled-U operation. Henceforth in this paper, fan-out
will refer to the simultaneous operation on the right of Figure 1.
Figure 1: Device level fan-out allows a NOT to the bottom
four targets iff the top control is on. While exclusive activation
induces serialization (left), quantum hardware can
implement fan-out simultaneously (right) in a single step.
arXiv:2007.04246v1 [quant-ph] 8 Jul 2020
We begin in Section 2 with sufficient background on quan-
tum computation so that this paper is self-contained. Sec-
tion 3 surveys relevant prior work. The three subsequent
sections capture our core contributions:
• Section 4: We generalize the simultaneous fan-out prim-
itive into a circuit synthesis procedure to schedule
Controlled-U operations with an asymptotic depth ad-
vantage.
• Section 5: We leverage this circuit synthesis procedure
to optimize NISQ circuits, and we introduce novel
quantum memory architectures.
• Section 6: We perform technology modeling of simul-
taneous fan-out on trapped ion qubits.
Section 7 presents results for several benchmarks. Sec-
tion 8 proposes an implementation of fan-out on supercon-
ducting qubits and demonstrates experimental proof-of-concept.
Section 9 concludes.
2. BACKGROUND
In order to keep this paper self-contained, we begin with
necessary background on quantum computing. To maintain
accessibility, we emphasize the circuit model of quantum
computing, rather than its linear algebraic formulation.
2.1 Qubits
A qubit (quantum bit) is defined by two states, denoted |0〉
and |1〉. Just as classical bits can be implemented by a variety
of underlying physical representations like magnetization in
disks or capacitor charge in RAM, qubits can be fabricated
from a variety of underlying quantum technologies. This
includes discrete charge levels in superconducting qubits or
motional modes in trapped ion qubits.
The state of a qubit can be written as the linear combination
a |0〉 + b |1〉, subject to the normalization condition
$|a|^2 + |b|^2 = 1$. This is a richer state space than can be captured
by a classical bit, which is either |0〉 or |1〉. For example,
$\frac{1}{\sqrt{2}}\left[|0\rangle + |1\rangle\right]$ is a superposition qubit state: the qubit has
equal components in |0〉 and |1〉.
The state of a qubit can be changed by a gate. Figure 2
depicts the gates we use in this paper. Each gate has one or more input
wires and the same number of output wires. The first gate is just the
classical NOT gate, which interchanges |0〉 and |1〉. The next
gate is the Hadamard gate, which is an intrinsically quantum
gate that creates superposition. For example, applying H to
|0〉 creates the equal superposition $\frac{1}{\sqrt{2}}\left[|0\rangle + |1\rangle\right]$. The Rz(θ)
gate is another quantum gate which applies a phase. Phase
is helpfully visualized as a θ displacement in longitude on a
sphere; however, for our purposes its underlying meaning is
unimportant. Gates (d) and (e) act on pairs of qubits (wires).
The CNOT is a Controlled-NOT, which applies a NOT to the
bottom qubit, iff the top qubit is |1〉. The top qubit itself is un-
affected. The Controlled-U is simply a generalization, where
U represents some single qubit gate that is only activated if
the top qubit is |1〉.
Figure 2: Gates used in this paper: (a) NOT, (b) Hadamard, (c) Rz rotation, (d) CNOT, (e) Controlled-U.
While a qubit can carry a rich state space, it snaps to either
|0〉 or |1〉 upon measurement. This process is fundamentally
stochastic: the probability of measuring |0〉 is $|a|^2$ and that of measuring |1〉 is
$|b|^2$, which justifies the normalization condition. For example,
under the $\frac{1}{\sqrt{2}}\left[|0\rangle + |1\rangle\right]$ state, the probability of measuring |0〉
is $\left(\frac{1}{\sqrt{2}}\right)^2 = \frac{1}{2}$, as expected for an equal superposition. The
measurement operation is visually denoted by a meter symbol, which
terminates a wire.
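To make the amplitude and measurement rules concrete, the following NumPy sketch applies a Hadamard gate to |0〉 and reads off the measurement probabilities. It is an illustration under the usual state-vector convention (amplitudes ordered as [a, b]); the helper name `probabilities` and the sampling seed are assumptions made for this example.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)                       # |0>
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard gate

def probabilities(state):
    """Return (P(0), P(1)) = (|a|^2, |b|^2) for the state a|0> + b|1>."""
    a, b = state
    return abs(a) ** 2, abs(b) ** 2

plus = H @ ket0                     # equal superposition of |0> and |1>
p0, p1 = probabilities(plus)

# Measurement is stochastic: each shot yields 0 or 1 with these probabilities.
rng = np.random.default_rng(seed=0)
samples = rng.choice([0, 1], size=1000, p=[p0, p1])
print(p0, p1, samples.mean())
```

The normalization condition is what makes `p0 + p1` a valid probability distribution for the sampler.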
2.2 Quantum circuits
Quantum programs are expressed as quantum circuits which,
like Boolean circuits, carry wires representing qubits through
a sequence of gates. An example quantum circuit is shown
below in Figure 3. It can be read as a timeline from left
to right. As indicated, the width is the number of qubits
the circuit acts on. In addition to data qubits that encode
input/output, quantum circuits often use extra ancilla qubits
that store temporary results. The depth is the length of the
critical path. Thus, width and depth respectively capture the
space and time costs of a quantum circuit.
Figure 3: An example quantum circuit with a width of 4
qubits and a depth of 3 layers.
Figure 3 exemplifies data parallelism—no qubit is ever
idle. Such speedups are especially important in quantum
computing because qubits generally have short coherence
windows for useful computation.
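Width and depth can be computed mechanically from a gate list. The sketch below schedules each gate as early as possible under the assumption that a qubit takes part in at most one gate per layer; the gate lists are hypothetical examples (the first is not a transcription of Figure 3, only a circuit with the same width and depth).

```python
def depth_and_width(gates):
    """gates: tuples of qubit indices in program order.

    Each gate is placed in the earliest layer after all of its qubits are free,
    assuming a qubit can be involved in at most one gate per layer.
    Returns (depth, width).
    """
    busy_until = {}   # qubit -> index of the last layer that uses it
    depth = 0
    for qubits in gates:
        layer = 1 + max(busy_until.get(q, 0) for q in qubits)
        for q in qubits:
            busy_until[q] = layer
        depth = max(depth, layer)
    return depth, len(busy_until)

# Hypothetical 4-qubit circuit in which no qubit is ever idle.
dense = [(0,), (1, 2), (3,), (0, 1), (2, 3), (0,), (1, 2), (3,)]
# Four CNOTs sharing control qubit 0: every gate must wait for the previous one.
shared_control = [(0, 1), (0, 2), (0, 3), (0, 4)]

print(depth_and_width(dense))
print(depth_and_width(shared_control))
```

The second circuit shows how a shared qubit alone forces the depth to grow with the number of gates under this scheduling rule.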
2.3 Commutativity
Every quantum circuit has an underlying program de-
pendency graph that enforces the execution order of gates.
Naively, one can construct a program dependency graph that
simply adds forward dependencies from each quantum gate
to subsequent quantum gates in the circuit-timeline. However,
this dependency graph can often be relaxed due to commu-
tativity, where two quantum gates can be applied in either
order.
Many commutativity relationships exist between gates. For
our work, we only rely on the two relationships depicted in
Figure 4. The left equivalence shows that two Controlled-Ui
gates commute when they have different targets. This is clear
because controlled gates leave the control qubit unchanged,
so their order is unimportant. The right equivalence shows
that Rz-type gates commute with controls of controlled gates
(such as CNOTs). This relationship has no classical analogue,
but the underlying intuition is that Rz gates don’t affect the |0〉
vs. |1〉 balance of a qubit, so their order relative to a control
is unimportant. Both of these commutativity rules are used in
our Controlled-U circuit synthesis procedure in Section 4.
Figure 4: Two commutativity rules encountered in this paper. (a) Controlled gates with a shared control and different targets commute. (b) Rz gates commute with controls.
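Both commutativity rules can be checked numerically with small matrices. The sketch below assumes the convention Rz(θ) = diag(e^{-iθ/2}, e^{iθ/2}), takes the control as the first (most significant) qubit, and picks two arbitrary single-qubit gates as stand-ins for U1 and U2; these choices are illustrative only.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
P0 = np.diag([1, 0]).astype(complex)   # control is |0>
P1 = np.diag([0, 1]).astype(complex)   # control is |1>

def rz(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def kron3(a, b, c):
    return np.kron(a, np.kron(b, c))

# Rule (a): two controlled gates with a shared control (qubit 0) and different targets.
u1, u2 = H, rz(1.3) @ X
cu1 = kron3(P0, I2, I2) + kron3(P1, u1, I2)   # target is qubit 1
cu2 = kron3(P0, I2, I2) + kron3(P1, I2, u2)   # target is qubit 2
rule_a = np.allclose(cu1 @ cu2, cu2 @ cu1)

# Rule (b): Rz on the control commutes with a CNOT.
cnot = np.kron(P0, I2) + np.kron(P1, X)
rz_on_control = np.kron(rz(0.7), I2)
rule_b = np.allclose(rz_on_control @ cnot, cnot @ rz_on_control)

# Contrast: Rz on the target does not commute with the CNOT.
rz_on_target = np.kron(I2, rz(0.7))
target_commutes = np.allclose(rz_on_target @ cnot, cnot @ rz_on_target)

print(rule_a, rule_b, target_commutes)
```

The final check is included as a contrast: the rule in (b) is specific to the control wire.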
3. PRIOR WORK
Our work builds on top of prior work from the (1) computer
architecture, (2) computer science theory, and (3) physics
communities. At a high level, the priorities of the work in
each community can be characterized as follows:
1. architects have devised intelligent schedulers/circuit
synthesis tools, but they assume a false structural hazard
by overlooking global interactions
2. theorists have devised intelligent circuit constructions
assuming global gates, but they don’t consider NISQ
workloads or device-level operation
3. physicists have studied global interactions, but usually
in an ad hoc fashion separated from computation and
NISQ workloads
Our work unites insights from all three disciplines to devise
a circuit synthesis tool that leverages global interactions to
accelerate NISQ workloads.
3.1 Computer Architecture
Amongst architects, a number of papers [37, 38, 43, 47, 63,
88] have studied instruction scheduling in quantum comput-
ers. These papers all assume some structural hazard against
simultaneous execution of operations on overlapping qubits. References [37, 38]
provide the most formal description of this hazard, terming
it the principle of exclusive activation, which forbids a
qubit from being involved in more than one operation per
timestep. Moreover, hardware-dependent considerations such
as crosstalk [56, 67] further narrow the scope of when opera-
tions can be parallelized. For example, on superconducting
hardware, executing CNOT(a,b) and CNOT(c,d) simultaneously may be forbidden
if the two pairs are neighbors, even though the
CNOT gates are disjoint.
In other architectural work such as [63] and [43], the au-
thors provide examples for obtaining data parallelism on
disjoint instructions. However, in both papers, the exam-
ples ultimately incur serialization upon encountering gates
on overlapping qubits. As we will demonstrate in Section 4,
this serialization is unnecessary.
Finally, [47] describes exclusive activation as a data depen-
dency, since the no-cloning theorem [95] prevents copying
a qubit to participate in multiple instructions simultaneously.
This is indeed a valid perspective. Regardless, we will demon-
strate that the underlying problem is in fact addressable with
the fan-out primitive.
3.2 Computer Science Theory
Quantum fan-out has also been studied from a complexity
theory lens. [45] proved that the $\mathrm{QNC}^0_f$ class of circuits with
unbounded fan-out is powerful for fault-tolerant applications
such as Shor’s Algorithm [81] for factoring. Other applica-
tions of fan-out to arithmetic operations such as addition, OR,
and modulus are considered in [86], [85], and [33] respec-
tively. Finally, [87] shows that under widely-held complexity
theory assumptions, fan-out in quantum circuits can increase
the hardness of classical simulability. Our work revisits these
theory results with NISQ workloads and underlying device
physics in mind.
3.3 Physics
The engineering of global interactions on N qubits has
been well studied in device physics communities. A com-
mon benchmark for global interactions is the preparation of
the GHZ state [34], a task which is essentially equivalent
to fan-out. Experimentally, global interactions have been
used to prepare the GHZ state on a variety of leading qubit
technologies including Trapped Ion [57], Neutral Atom [69],
and NMR [21]. Implementation on NV center qubits has
been proposed as well [32]. Notably, superconducting qubits,
which are the current leader in hardware scale, were not pre-
viously known to support simultaneous overlapping interac-
tions. However, in Section 8, we experimentally demonstrate
simultaneous fan-out on superconducting qubits.
Global interactions have already been noted by physicists
for their application to Hamiltonian simulation, an important
quantum algorithm. As early as 2005 [100], it was noted
that global interactions enable constant depth parity measure-
ment, an important building block for Hamiltonian simulation.
Later work [55, 61] further optimized and clarified the procedure.
This year, three papers [35, 74, 98] have
applied global interactions to building blocks of longer-term
fault tolerant quantum computers. These papers demonstrate
that the Generalized Toffoli operation can be performed in
constant time with global interactions, whereas otherwise
linear or logarithmic depth is required [28, 29].
Very recent papers have adopted an interdisciplinary ap-
proach, combining insights from physics and architecture.
For example, [61]—which inspired our work—describes fan-
out as SIMD parallelism. Also, a recent trapped ion hardware
paper [36] describes global interactions as a form of Multiple
Instruction Multiple Data (MIMD) parallelism. Our work
continues this architectural perspective, while also focusing
on NISQ circuit optimizations and further refining the un-
derlying technology models based on recent experimental
developments.
4. CONTROLLED-U SYNTHESIS
The basic building block of our work is the fan-out oper-
ation depicted on the right side of Figure 1. Two important
considerations arise in evaluating the applicability of fan-
out. The first is whether the simultaneous implementation
via global interactions truly achieves a linear speedup over
serialized CNOTs¹. As described in Sections 6 and 8, experimental
results from hardware indicate that this is indeed the
case. The second consideration is how fidelity is affected by
simultaneous fan-out versus serialized CNOTs. Our results
in Sections 6 and 8 indicate a modest improvement in fidelity
from simultaneous fan-out.
We focus on a circuit synthesis procedure that uses fan-
out to optimize the Controlled-U operation, described below.
This operation is ubiquitous in NISQ algorithms, and each
application in Section 5 is an instance of Controlled-U . As we
will describe in this section, our circuit synthesis procedure
yields a Controlled-U implementation that is scheduled to
align CNOT gates into a single fan-out step. This yields
asymptotic improvements in circuit depth. Our code was
implemented as a fork of Qiskit Terra [2].
The controlled-U operation is depicted at the left of Fig-
ure 5. As in other controlled operations like CNOT, the U
operation should be applied if and only if the top control
qubit is |1〉. However, unlike the single-qubit U in Figure 2e,
here we consider the case when U is an operation on multiple
qubits. Therefore, U itself has a decomposition into gates,
shown under the blue overlay. Our results are applicable for
any decomposition basis, but we focus on the decomposition
into the universal set of single-qubit + CNOT gates, since
quantum algorithms are typically expressed in this form. In
the example, U has a width of four qubits and a depth of two
layers. The first layer contains four disjoint single qubit gates,
and the second layer contains two disjoint CNOTs.
Under exclusive activation, implementation of Controlled-
U is bottlenecked by the dependence of each controlled gate
on the single control qubit. Thus, the parallel two-layer im-
plementation of U collapses into a serial implementation of
Controlled-U as depicted at the right of Figure 5. The amount
of serialization is proportional to the width of U , so that the
effective depth of a Controlled-U operation under serializa-
tion is O(Depth × Width). In many workloads, the width
greatly exceeds depth, so this serialization is very harmful.
It is not immediately obvious how fan-out can help speed
up Controlled-U . Whereas fan-out is a SIMD operation,
Controlled-U is a MIMD operation, since the gates in U are
arbitrary. However, we can resolve this difficulty by decomposing
gates into a form amenable to ‘alignment’ of CNOTs
into a single fan-out step. This circuit synthesis procedure
has two underlying cases. The first, Shared-Control Single
Qubit Gates, supports the simultaneous execution of mul-
tiple Controlled-Ui gates with a shared control qubit. This
procedure applies to the first layer of U in Figure 5. The
second, Shared-Control Toffoli’s, supports the simultaneous
execution of multiple Controlled-CNOTs with a shared con-
trol qubit. These double-controlled NOTs are referred to as
Toffoli’s. The Shared-Control Toffoli’s case applies to the
second layer of U in Figure 5.
¹ Fan-out can also be implemented in logarithmic depth via a recursive
tree structure, but we focus on the serialized case, which is
dominant in prior work.
Figure 5: Left: general form of controlled-U. Right: under
exclusive activation, adding the control induces serialization
and multiplies the effective Depth by the Width.
In practice, arbitrary U’s will also contain mixed layers
that contain both single-qubit gates and CNOTs. This general
case can be handled by unifying the synthesis procedures for
Shared-Control Single Qubit Gates and Shared-Control Tof-
foli’s. It is not presented here for brevity, but is implemented
in our code.
Table 1 compares the time (depth) and space (ancilla qubits)
costs of implementing Controlled-U . Our work, which uses
fan-out, is optimal with O(D) depth (and very small con-
stants) and 0 ancilla qubits. The status quo approach of
serialization incurs O(ND) depth which is harmful because
N >> D in many applications. Past work in [45] and [59]
has proposed alternative approaches for parallelizing circuits
using global interactions. In the best case, where a “basis-
change” is cheap and efficiently computable, [45] matches
our O(D) depth. However, it is extremely expensive in space,
requiring $O(N^2)$ ancilla qubits. Finally, [59] provides a numerical
optimization technique for compiling Controlled-U
down to the minimal possible depth. In this sense, it could
achieve the $O(D)$ lower bound. However, the numerical optimization
for compilation has exponential cost: simply defining
the optimization problem involves specifying a $2^N \times 2^N$
matrix. Moreover, the optimization itself is expensive,
and convergence to O(D) depth is not guaranteed.
Approach                          Depth      Ancilla Qubits
Our Work (with fan-out)           $O(D)$     0
Serialization                     $O(ND)$    0
[45] (if cheap basis-change)      $O(D)$     $O(N^2)$
[59] ($\Omega(2^N)$ compile time)  $O(D)$?    0
Table 1: Cost of implementing a controlled-U operation in
time (depth) and space (ancilla qubits). U has a depth of D
and width of N qubits.
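The gap between the first two rows of Table 1 can be illustrated with a crude depth model. The sketch below is an assumption-laden estimate, not the authors' cost model: it charges 5 layers per controlled single-qubit gate and 12 layers per Toffoli (the constants quoted in Sections 4.1 and 4.2), treats serialized gates as strictly back to back with no fine-grained overlap, and simply adds both costs for a mixed layer.

```python
SINGLE_QUBIT_COST = 5   # layers for one controlled single-qubit gate (Section 4.1)
TOFFOLI_COST = 12       # layers for one Toffoli, i.e. one controlled CNOT (Section 4.2)

def controlled_u_depth(layers, use_fanout):
    """Rough depth estimate for Controlled-U.

    layers: one (num_single_qubit_gates, num_cnots) pair per layer of U.
    With fan-out, all gates of one kind in a layer share a constant-depth block;
    without it, every controlled gate is charged separately (no overlap assumed).
    """
    total = 0
    for singles, cnots in layers:
        if use_fanout:
            total += SINGLE_QUBIT_COST * (singles > 0) + TOFFOLI_COST * (cnots > 0)
        else:
            total += SINGLE_QUBIT_COST * singles + TOFFOLI_COST * cnots
    return total

# The example U of Figure 5: four single-qubit gates, then two disjoint CNOTs.
figure5_layers = [(4, 0), (0, 2)]
serialized_depth = controlled_u_depth(figure5_layers, use_fanout=False)
fanout_depth = controlled_u_depth(figure5_layers, use_fanout=True)
print(serialized_depth, fanout_depth)
```

Under these assumptions the serialized estimate grows with the number of gates per layer, while the fan-out estimate depends only on the number of layers.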
While our procedure achieves the best possible asymptotic
spacetime costs, it is not as general as [45,59]. Our procedure
only addresses the special case of Controlled-U paralleliza-
tion, whereas [45] and [59] apply to the parallelization of any
commuting gates or the depth reduction of any unitary, respec-
tively. Nonetheless, our specialization is justified because the
Controlled-U template is ubiquitous in NISQ workloads.
4.1 Shared-Control Single Qubit Gates
Here, we consider how to simultaneously execute con-
trolled single-qubit gates with a shared control, as in the first
layer of U in Figure 5. This is a form of MIMD parallelism
with overlapping data, but we only have access to the fan-out
SIMD primitive. However, we can make progress by invok-
ing the following well-known identity [68] for decomposing
controlled single-qubit gates. It shows that for any single-
qubit gate U , the Controlled-U operation can be implemented
by using CNOT as the only two-qubit gate. Specifically there
exist (trivially computable) single-qubit gates A, B, C, and an
angle θ , such that
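[Circuit identity: Controlled-U equals, in time order, C on the target; a CNOT from control to target; B on the target; a second CNOT from control to target; then A on the target together with Rz(θ) on the control.]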
Let us consider applying this identity to a small example:
attempting to parallelize Controlled-U1 and Controlled-U2
targeting two different qubits. The result is shown below,
with colors used for disambiguation.
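[Circuit: Controlled-U1 (targeting the middle qubit) followed by Controlled-U2 (targeting the bottom qubit, shown in blue), both controlled by the top qubit, expands to C1, CNOT, B1, CNOT, A1 on the middle qubit with Rz(θ1) on the control, followed by C2, CNOT, B2, CNOT, A2 on the bottom qubit with Rz(θ2) on the control.]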
It appears that applying the circuit identity led to minimal
improvements—only C2 can slide left to execute simulta-
neously with controlled-U1. The rest of the blue gates are
unable to parallelize, because they are blocked by an appar-
ent dependency on the Rz(θ1) gate. However, recalling the
commutativity rule in Figure 4b, we see that the apparent
dependence of the blue CNOTs on the Rz(θ1) is actually a
false dependence. By commuting the Rz(θ1) gate to the end
of the circuit, we attain the final result in Figure 6.
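[Circuit: C1 and C2 in parallel; a fan-out from the control onto both targets; B1 and B2 in parallel; a second fan-out; then A1 and A2 in parallel with a single Rz(θ1+θ2) on the control.]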
Figure 6: Simultaneous execution of shared-control single
qubit gates, using the fan-out primitive. This decomposition
has constant (5 layer) depth, independent of width.
We have now demonstrated simultaneous execution of
shared-control U1 and U2 on overlapping data (top+middle
and top+bottom qubits respectively), using the fan-out primi-
tive. This pattern extends ad infinitum to more qubits—the
total depth will always consist of five layers: two fan-out lay-
ers and three single-qubit gate layers. For certain gates, the
cost could be reduced even further. For instance, for U = Z, it
is known that the Controlled-Z operation can be implemented
with just a single CNOT [68].
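The five-layer construction can be verified numerically for two targets. The sketch below is a check under stated assumptions rather than the authors' code: each U is built from hypothetical angles via the textbook parameterization U = e^{iα} Rz(β) Ry(γ) Rz(δ), with A, B, C chosen so that ABC = I and U = e^{iα} A X B X C, and the control phase is written as diag(1, e^{iα}), which equals Rz(α) up to a global phase.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
P0 = np.diag([1, 0]).astype(complex)   # control is |0>
P1 = np.diag([0, 1]).astype(complex)   # control is |1>

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def phase(a):
    return np.diag([1, np.exp(1j * a)])

def abc(alpha, beta, gamma, delta):
    """Return (U, A, B, C) with A B C = I and U = e^{i alpha} A X B X C."""
    U = np.exp(1j * alpha) * (rz(beta) @ ry(gamma) @ rz(delta))
    A = rz(beta) @ ry(gamma / 2)
    B = ry(-gamma / 2) @ rz(-(delta + beta) / 2)
    C = rz((delta - beta) / 2)
    return U, A, B, C

def kron3(a, b, c):
    return np.kron(a, np.kron(b, c))

alpha1, alpha2 = 0.3, -0.8                      # hypothetical angles
U1, A1, B1, C1 = abc(alpha1, 0.5, 1.1, -0.4)
U2, A2, B2, C2 = abc(alpha2, 1.7, 0.2, 0.9)

# Reference: Controlled-U1 on qubit 1, then Controlled-U2 on qubit 2 (control is qubit 0).
serial = (kron3(P0, I2, I2) + kron3(P1, I2, U2)) @ (kron3(P0, I2, I2) + kron3(P1, U1, I2))

# Fan-out: one control applies X to both targets in a single step.
fanout = kron3(P0, I2, I2) + kron3(P1, X, X)

# Five layers (rightmost factor acts first): C, fan-out, B, fan-out, A with merged phase.
parallel = (kron3(phase(alpha1 + alpha2), A1, A2) @ fanout
            @ kron3(I2, B1, B2) @ fanout @ kron3(I2, C1, C2))

print(np.allclose(serial, parallel))
```

The merged phase on the control mirrors the Rz(θ1+θ2) gate of Figure 6: once the phase gates are commuted past the controls, they combine into a single gate.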
4.2 Shared-Control Toffoli’s
The second piece needed for optimized Controlled-U syn-
thesis is simultaneous execution of shared-control Toffoli’s.
Here, we seek to simultaneously execute multiple Toffoli
(Controlled-CNOT) gates, where the CNOTs are disjoint but
the additional control is shared across the CNOTs, as in the
second layer of U in Figure 5. Since Toffoli is a three-qubit
operation, it must first be decomposed into single-qubit gates
and CNOTs. The standard [68] decomposition is shown next.
T and T† are shorthand for $R_z(\pi/8)$ and $R_z(-\pi/8)$, respectively.
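[Circuit: the Toffoli gate decomposed into six CNOTs, two H gates, and seven T/T† gates. The top qubit carries four CNOT controls and one T; the middle qubit carries two CNOT controls, one T, and one T†; the bottom (target) qubit carries H, T†, T, T†, T, H interleaved with CNOT targets. One group of T and T† gates is boxed.]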
The boxed group with T and T † is one example of data
parallelism. This level of data parallelism is referred to as a
coarse-grained schedule in past architectural work [43]. Next,
let us consider applying the Toffoli decomposition to a small
example: attempting to simultaneously execute two shared-
control Toffoli’s, where the CNOTs are disjoint. This exact
example is also considered in Figure 4 of [43]. The result is
shown below, with colors used again for disambiguation.
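[Circuit: two shared-control Toffoli’s (red, then blue), each expanded with the decomposition above. The top qubit is the shared control. The two decompositions run largely back to back, with boxed layers marking the few blue gates that overlap with the red Toffoli.]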
As indicated by the boxed layers, only three gates from
the blue Toffoli were able to parallelize with the execution
of the red Toffoli. This level of parallelization, which results
in 21 layers of depth, is referred to as fine-grained schedul-
ing in [43]. While it is slightly better than coarse-grained
scheduling, it still linearly serializes the depth. However, we
can again leverage commutativity relationships to proceed
further and exploit our fan-out primitive.
Notice that the dependency between the right-most red
CNOT and the subsequent blue CNOT is in fact a false de-
pendency. These two gates commute per the rule in Figure 4a
since their targets are different. After transposing the two
gates, we encounter a T gate that commutes with the control
of the blue CNOT, per the rule in Figure 4b. Repeating such
commutative transpositions, we can push the blue CNOT to
the left to align into a single fan-out. The rest of the blue
circuit can be handled similarly, resulting in the final form
presented in Figure 7. Since T = $R_z(\pi/8)$, the T×T gate at
the top right is just a single $R_z(\pi/4)$ gate.
The design in Figure 7 extends naturally to more qubits.
Regardless of the number of qubits, the depth of the circuit is
always 12 layers. Since the depth of a single Toffoli operation
is also 12 layers, this means that our shared-control Toffoli’s
synthesis is optimal. For the circuits we will encounter in the
following sections, the number of Toffoli’s spans the entire
circuit. Therefore the depth cost of the other approaches is
O(N), versus our 12 = O(1) constant depth.
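[Circuit: both Toffoli decompositions run in lockstep. Each pair of CNOTs controlled by the top qubit is aligned into a single fan-out, and the two T gates on the shared control merge into T×T.]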
Figure 7: Simultaneous execution of shared-control Toffoli’s
using the fan-out primitive. This decomposition has constant
(12 layer) depth, independent of width. Quirk Link.
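The 12-layer shared-control construction can be checked numerically for two Toffoli’s. The sketch below makes several assumptions that are not taken from the paper: it uses the textbook gate ordering of the Toffoli decomposition, the convention T = diag(1, e^{iπ/4}) (equal to the paper's Rz form up to a global phase), and a hypothetical qubit labeling (0 = shared control a, 1 = b1, 2 = t1, 3 = b2, 4 = t2).

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
Tdg = T.conj().T
P0 = np.diag([1, 0]).astype(complex)
P1 = np.diag([0, 1]).astype(complex)
N = 5  # qubits: 0 = shared control a, 1 = b1, 2 = t1, 3 = b2, 4 = t2

def layer(gates):
    """One circuit layer: gates maps qubit index -> 2x2 matrix; other qubits idle."""
    return reduce(np.kron, [gates.get(q, I2) for q in range(N)])

def fanout(control, targets):
    """Simultaneous CNOTs from one control onto one or more targets."""
    return layer({control: P0}) + layer({control: P1, **{t: X for t in targets}})

def toffoli(c1, c2, t):
    """Reference doubly-controlled NOT built from projectors."""
    both = {c1: P1, c2: P1}
    return layer({}) - layer(both) + layer({**both, t: X})

# The 12 layers in time order; CNOTs from qubit 0 are aligned into fan-outs.
layers = [
    layer({2: H, 4: H}),
    fanout(1, [2]) @ fanout(3, [4]),          # disjoint CNOTs b_i -> t_i
    layer({2: Tdg, 4: Tdg}),
    fanout(0, [2, 4]),                        # fan-out a -> {t1, t2}
    layer({2: T, 4: T}),
    fanout(1, [2]) @ fanout(3, [4]),
    layer({2: Tdg, 4: Tdg}),
    fanout(0, [2, 4]),
    layer({1: T, 2: T, 3: T, 4: T}),
    fanout(0, [1, 3]) @ layer({2: H, 4: H}),  # fan-out a -> {b1, b2} alongside H
    layer({0: T @ T, 1: Tdg, 3: Tdg}),        # merged T x T on the shared control
    fanout(0, [1, 3]),
]

circuit = reduce(lambda acc, gate_layer: gate_layer @ acc, layers, np.eye(2 ** N))
reference = toffoli(0, 1, 2) @ toffoli(0, 3, 4)
print(len(layers), np.allclose(circuit, reference))
```

Adding a third shared-control Toffoli would only add entries to each layer's dictionary and target list; the number of layers would stay at 12.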
The combination of simultaneous shared-control single
qubit gates and Toffoli’s enables a depth-optimized execution
schedule for any Controlled-U . Moreover, the multiplica-
tive constants for our circuit synthesis are small. Shared-
control single qubit gates incur a depth of just 5 layers, which
matches the worst-case depth of a single controlled single-qubit gate. Shared-control Toffoli’s incur
no depth expansion relative to a single Toffoli and are thus
optimal. The resulting Controlled-U circuit synthesis proce-
dure is implemented in our code. In the following section,
we apply the Controlled-U procedure to optimize several
NISQ-important quantum circuits, which are all fundamen-
tally Controlled-U operations. While our approach is already
asymptotically optimal with low constants, in some cases we
can reduce the depth constants even further. This is exempli-
fied by the SWAP Test, which we discuss next.
5. APPLICATIONS
We now examine how Controlled-U circuit synthesis can
be leveraged to optimize NISQ circuits. We also apply fan-
out to develop novel quantum memory architectures. Table 2
summarizes the spacetime advantages of our work (using
simultaneous fan-out) for the applications surveyed in this
Section.
5.1 SWAP Test
One of the most important [79] procedures in quantum
computing, especially NISQ machine learning algorithms,
is the calculation of inner products between quantum states.
This inner product reports the overlap or similarity between
states. For two qubit registers |A〉 and |B〉, this overlap is
denoted by $|\langle A|B\rangle|^2$. For equal states $|\langle A|B\rangle|^2 = 1$, and for
orthogonal states $|\langle A|B\rangle|^2 = 0$.
The calculation of this overlap is a procedure known as
the SWAP Test. The SWAP Test features heavily in NISQ
applications such as quantum kernel classification, which
was introduced in [78] and realized experimentally on IBM’s
quantum hardware in [42]. These quantum kernel methods
are noise resilient and amenable to noise mitigation [42].
Further work [25] has introduced kernels that have strong
complexity theory foundations for hardness of classical sim-
ulability. All of these kernel methods require the evaluation
of inner product overlaps. The SWAP Test is also integral
to cost function evaluation in NISQ-friendly deep quantum
neural networks [9]. In the near-term (and in fact current-term),
experimental sequences in quantum sensing [99] are
essentially overlap measurements.
Application / Approach                        Spacetime costs
SWAP Test between two k = (N−1)/2 qubit registers (0 ancilla for all)
  Our work                                    14 = O(1) depth
  Serialized                                  ∼14k = O(N) depth
  Coarse-grained sched. [43]                  ∼12k = O(N) depth
  Fine-grained sched. [43]                    ∼9k = O(N) depth
Hadamard Test; N-qubit circuit; U has depth D
  Our work                                    $O(D)$ depth, 0 ancilla
  Other approaches (Table 1)                  $O(ND)$ depth, $O(N^2)$ ancilla, or $\Omega(2^N)$ compile time
Explicit Memory with n index qubits and bitwidth W
  Our work                                    $O(n)$ depth, 0 ancilla
  Bucket-Brigade QRAM [6]                     $O(W 2^n)$ depth, 0 ancilla
  Parallel QRAM [19]                          $O(Wn)$ depth, $O(2^n)$ ancilla
Implicit Memory with n index qubits and bitwidth W (∼1·n ancilla for both)
  Our work                                    $O(2^n)$ depth
  QROM [8]                                    $O(W 2^n)$ depth
Table 2: Summary of space (ancilla qubits) and time (depth)
costs for different applications. Our work leverages simultaneous
fan-out to attain asymptotic advantages.
The SWAP Test has a very simple form. It is essentially
just the case of Controlled-U with U = SWAP. First, we
examine the decomposition of a SWAP between two qubits:
[Circuit: a SWAP between two qubits equals three alternating CNOTs: bottom-to-top, top-to-bottom, then bottom-to-top.]
This decomposition is equivalent to the triple XOR sequence
for in-place SWAPs of classical bits. For a SWAP Test, we
need to perform this U = SWAP sequence not just between
two individual qubits, but between two registers of qubits.
Moreover, the SWAP is controlled on an ancilla qubit. The
SWAP Test also requires a Hadamard-sandwich around the
controls, and a measurement of the ancilla. After executing
such a circuit, the overlap between the two registers is related
by a simple function to the probability of measuring |0〉 on
the ancilla. Repeated executions can therefore estimate the
overlap to a desired precision.
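The "simple function" relating the ancilla statistics to the overlap can be made explicit with a small state-vector simulation. The sketch below assumes the standard SWAP Test relation P(0) = 1/2 + |⟨A|B⟩|²/2, which follows from the Hadamard, controlled-SWAP, Hadamard sequence described above; the function names are hypothetical and the registers are passed as normalized amplitude vectors.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def swap_test_p0(a, b):
    """Probability of measuring |0> on the ancilla of a SWAP Test on |A>, |B>."""
    a = np.asarray(a, dtype=complex)
    b = np.asarray(b, dtype=complex)
    d = a.size
    # State |0>|A>|B> stored as a tensor indexed by (ancilla, register A, register B).
    psi = np.zeros((2, d, d), dtype=complex)
    psi[0] = np.outer(a, b)
    psi = np.einsum("ij,jab->iab", H, psi)   # Hadamard on the ancilla
    psi[1] = psi[1].T.copy()                 # controlled-SWAP: exchange registers if ancilla is 1
    psi = np.einsum("ij,jab->iab", H, psi)   # second Hadamard
    return float(np.sum(np.abs(psi[0]) ** 2))

def overlap_from_p0(p0):
    """Invert P(0) = 1/2 + |<A|B>|^2 / 2."""
    return 2 * p0 - 1

zero = [1, 0]
one = [0, 1]
plus = [1 / np.sqrt(2), 1 / np.sqrt(2)]

p_equal = swap_test_p0(zero, zero)
p_orthogonal = swap_test_p0(zero, one)
p_half = swap_test_p0(zero, plus)
print(p_equal, p_orthogonal, p_half)
```

On hardware, P(0) is estimated from repeated shots, so the precision of the recovered overlap is set by the number of repetitions.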
Let us concretely consider the example of a SWAP Test
on two two-qubit registers, |A = A1A0〉 and |B = B1B0〉. To
disambiguate the gates, we have used colors and interleaved
the bit ordering of the |A〉 and |B〉 registers below:
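[Circuit: the ancilla starts in |0〉 and is sandwiched by H gates. Between them, each pair (A0, B0) and (A1, B1) undergoes the three-CNOT SWAP sequence, with every CNOT additionally controlled by the ancilla, giving three layers of shared-control Toffoli’s.]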
Under standard serialization of the shared-control gates,
the depth is 63 at best from fine-grained scheduling. However,
our Controlled-U synthesis procedure, specifically the shared-
control Toffoli’s decomposition, is directly applicable here.
The resulting SWAP Test depth is 3×12 = 36 (ignoring the
two Hadamard gates). Moreover, our procedure always yields
a constant depth of 36 layers regardless of the circuit width
N, whereas serialized approaches scale as O(N).
While this asymptotic advantage is already appealing, we
can attain even further cost reductions to our constants via a
circuit identity. It can be shown that the outer two controls
on the ancilla qubit can be removed [24, 68]. After this
optimization, the final circuit has a depth of just 14 layers,
regardless of the size of the SWAP Test. To illustrate for
larger N, this Quirk Link shows an interactive SWAP Test
circuit for computing the overlap of two four-qubit registers,
with an ancilla qubit on the top.
5.1.1 Interference Circuit
Recent work has explored alternatives to the traditional
SWAP Test, with the aim of reducing spacetime costs. The
most promising one is the interference circuit [77, 79], which
halves the qubit width requirement. Whereas the traditional
SWAP Test requires 2k+1 qubits to compute the overlap of
two k-qubit registers, the interference circuit only requires
k+1 qubits. In order to use the interference circuit, we must
know the sequences of gates UA and UB that can create |A〉
and |B〉, respectively. In practice, this is indeed the case for
useful applications. The interference circuit has the
simple form shown in Figure 8. As in the traditional SWAP
Test, the overlap is a simple function of the probability of
measuring |0〉 on the ancilla.
The open-control (open circle) on UB activates on |0〉 and
can therefore be replaced with an ordinary control surrounded
by NOT (⊕) gates. Therefore our Controlled-U is directly
applicable to the interference circuit, and it allows overlap
calculation with no asymptotic depth overhead relative to UA
and UB. This is again a linear O(N) speedup via fan-out.
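[Figure 8 circuit: the ancilla qubit, placed between two H gates, controls UA and (through an open control) UB on the k-qubit register.]
Figure 8: The interference circuit computes the overlap between
k-qubit states, |A〉 and |B〉, with just k+1 qubits.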
5.2 Hadamard Test
The SWAP Test is a specific case of a more general pro-
cedure called the Hadamard Test. The Hadamard Test has
a very simple and familiar form shown in Figure 9. This
is essentially just the Controlled-U operation we focused
on in Section 4. Moreover, the SWAP Test is just the case
where U = SWAP. Selecting other U makes the Hadamard
Test give rise to a wide variety of applications. We list our
benchmarked applications in Table 3. There are numerous
additional applications of the Hadamard Test, such as training
Quantum Boltzmann Machines [93], gradient evaluation
[20, 39, 64, 76], and Jones polynomial approximation [3].
[Figure 9 circuit: the ancilla qubit, placed between two H gates, controls U on the register.]
Figure 9: Circuit for the Hadamard Test.
Application | Description
Variational Quantum Linear Solver [12, 46, 97] | Algorithm for solving large linear systems using NISQ hardware
Matrix elements of group representation [16, 50] | Group theory problem; U is essentially the Quantum Fourier Transform
Entanglement spectroscopy [49] | Computation of entanglement spectrum of arbitrary quantum states
Controlled Density Matrix Exponentiation (DME) [52] | Several applications, e.g. for private quantum software [60]
Table 3: Applications of the Hadamard Test. Each corresponds
to a different choice of U.
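To make the role of U concrete, here is a small state-vector sketch of the Hadamard Test. It assumes the textbook result that the ancilla reads |0〉 with probability (1 + Re〈ψ|U|ψ〉)/2; the helper names are illustrative, and matrices are plain nested lists so the example has no dependencies.

```python
def apply(U, psi):
    """Multiply matrix U (nested lists) by state vector psi."""
    return [sum(U[r][c] * psi[c] for c in range(len(psi)))
            for r in range(len(U))]

def hadamard_test_p0(U, psi):
    """Probability of measuring |0> on the ancilla.

    After H, controlled-U, H, the ancilla-|0> branch holds
    (psi + U psi) / 2, so P(0) is that branch's squared norm.
    """
    u_psi = apply(U, psi)
    branch0 = [(p + q) / 2 for p, q in zip(psi, u_psi)]
    return sum(abs(x) ** 2 for x in branch0)

# U = SWAP on two qubits, basis order |00>, |01>, |10>, |11>.
SWAP = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]
```

Choosing U = SWAP and the product state |01〉 gives P(0) = 1/2, the SWAP Test value for orthogonal single-qubit states, which illustrates why the SWAP Test is a special case.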
5.3 Quantum Memory Architectures
Next, we investigate the use of fan-out to improve the
implementation of quantum memory, which speeds up or
enables many quantum algorithms [89, 92]. The high-level
function of a quantum memory is similar to that of a classical
memory: n index bits enumerate over 2^n memory cells.
Following the notation of [5], we denote the n index bits as
the |b〉 register and the 2^n memory cells as the |m〉 register.
As in the classical case, we expect that setting the index
register to |i〉 should allow us to retrieve the ith memory cell,
|mi〉. However, for a quantum memory, we also require the
retrieval to work over superpositions of index qubits. For
example, setting |b〉 to (1/√2)[|000〉 + |111〉] should retrieve the
superposition (1/√2)[|m0〉 + |m7〉].
In this section, we apply the fan-out primitive to both ex-
plicit and implicit quantum memories, which we define below.
We demonstrate considerable improvements—exponential
and linear respectively—over prior work, as summarized in
Table 2. These improvements are important because the cost
of quantum memory is often the principal bottleneck for real-
izing practical speedups. While it remains unclear if quantum
memory architectures will be feasible [1, 6, 11, 73] even for
future fault-tolerant devices, our proposed improvements at
least justify a re-assessment of the feasibility.
5.3.1 Explicit Quantum Memory
In an explicit quantum memory, the 2^n memory cells are
each explicitly stored in qubit registers. In this sense, an
explicit quantum memory is akin to a 2^n-to-1 multiplexer
or data selector from classical electronics. As discussed, the
quantum variant should extend to the case where select lines
are in superposition. Moreover, each of the 2^n memory cells
is stored in a qubit register, so each memory cell can itself
contain a quantum (superposition) state.
[Figure 10 circuit: index qubits |b0〉, |b1〉, |b2〉 control SWAPs among memory cells |m0〉 through |m7〉, with a SWAP between |m0〉 and the |load/store〉 qubit at the center.]
Figure 10: Architecture for an explicit quantum memory
with n = 3 index qubits and 2^n = 8 memory cells of bitwidth
W = 1. Quirk demo.
The dominant architecture for this explicit quantum mem-
ory is termed Quantum Random Access Memory. The bucket
brigade design of QRAM was introduced in [26, 27] and cast
to the quantum circuit model in [6]. This bucket brigade
QRAM requires ∼2·2^n qubits and O(W·2^n) depth. Later
work [19] was able to parallelize execution to achieve O(Wn)
depth, but requires an additional ∼6·2^n ancilla qubits. We
now present a novel architecture for explicit quantum memory
that requires only O(n) depth, with 0 ancilla qubits.
Figure 10 shows our architecture for n = 3 index qubits.
There are 2^3 = 8 explicit memory cells, each of single-qubit
bitwidth W = 1. At a high level, the circuit performs a
“migration” of the target memory cell into |m0〉. Consider for
example |b〉 = |101〉, which should access |m5〉. The control
on the MSB performs a SWAP between |m7654〉 and |m3210〉,
moving |m5〉 into |m1〉. The control on the middle index
does not activate, but the control on the LSB is activated and
SWAPs |m1〉 into the |m0〉 destination. Finally, this qubit
is swapped into the |load/store〉 register. The right half of
the circuit reverses the earlier migrations, restoring the other
memory cells to their original locations.
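The migration can be traced classically for a basis-state index. The sketch below is illustrative only: the function name and the layer counter are assumptions, and it counts each shared-control SWAP layer as a single step, which is the idealization that simultaneous fan-out is meant to justify.

```python
def read_cell(cells, index, load_store=None):
    """Trace the migration circuit for a classical (basis-state) index.

    Returns (value now in load/store, memory after the read, layer count).
    """
    cells = list(cells)
    n = (len(cells) - 1).bit_length()
    assert len(cells) == 1 << n, "need 2^n memory cells"

    def controlled_swap_layer(bit):
        # Index bit `bit` controls a SWAP of the two lowest blocks of
        # size 2^bit; all of these SWAPs share one control qubit.
        half = 1 << bit
        if (index >> bit) & 1:
            for j in range(half):
                cells[j], cells[j + half] = cells[j + half], cells[j]

    layers = 0
    for bit in reversed(range(n)):      # migrate toward m0, MSB first
        controlled_swap_layer(bit)
        layers += 1
    cells[0], load_store = load_store, cells[0]
    layers += 1
    for bit in range(n):                # undo the migration, LSB first
        controlled_swap_layer(bit)
        layers += 1
    return load_store, cells, layers
```

For n = 3 and index 101 the trace reproduces the walk-through above: the requested cell reaches the load/store register in 2n + 1 = 7 layers, and every other cell returns to its original location.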
The efficiency of this architecture is enabled by the si-
multaneous execution of controlled SWAPs, which in turn
is enabled by the fan-out primitive. As a result, the circuit
depth is only O(n). Moreover, while our example shows the
W = 1 bitwidth case, it is apparent that with simultaneous
fan-out, W is irrelevant to depth. By contrast, serialization
would impose an additional linear factor in W.
During the preparation of this paper, another proposal was
published for O(n)-depth and ancilla-free explicit quantum
memory [70], which matches our asymptotic costs.
5.3.2 Implicit Quantum Memory
Next we consider implicit quantum memory. In this model,
the 2^n memory cells represent classical (non-superposition)
data that is known in advance. In such a case, there is no need
to waste qubits to represent the classically-known memory
cells. Instead, the memory can be stored implicitly through
the classical control, a memory architecture that has been
referred to as Quantum Read Only Memory [8].
Figure 11 shows an example implicit memory storing the
first four prime numbers: {00 → 2, 01 → 3, 10 → 5, 11 → 7}.
The resulting circuit has a simple form, enumerating all
2^n indices and associating each index with a corresponding
pattern of ⊕ gates. Without fan-out, implicit memory has
O(W·2^n) depth via the unary iteration optimization in [8].
However, simultaneous fan-out obviates the scaling in W .
This is appealing, because for datasets such as images, the
bitwidth (W ) of each record exceeds the number of records.
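The lookup and the source of the W dependence can be sketched as follows. The cost model is a deliberate simplification assumed for illustration: it counts one step per ⊕ target when the targets are serialized and one step per index when all targets of an index fire together, and it ignores the cost of the index controls.

```python
PRIMES = {0b00: 2, 0b01: 3, 0b10: 5, 0b11: 7}

def qrom_lookup(table, index, m=0):
    """Apply the index-controlled XOR pattern to the output register m."""
    for i, value in table.items():
        if i == index:          # controls fire only when |b> equals i
            m ^= value          # the pattern of XOR gates for this index
    return m

def xor_step_counts(table):
    """Coarse (serial, simultaneous) step counts for the XOR patterns."""
    serial = sum(bin(v).count("1") for v in table.values())
    simultaneous = len(table)
    return serial, simultaneous
```

The serial count grows with the number of set bits in each record, and therefore with W, while the simultaneous count depends only on the number of indices.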
[Figure 11 circuit: index qubits |b0〉 and |b1〉 control four patterns of ⊕ gates on the |m〉 register, writing m = 2, m = 3, m = 5, and m = 7.]
Figure 11: Implicit memory storing the first four prime numbers.
The W = 3 bitwidth memory is implicitly defined
through classical control, based on the pattern of ⊕’s. For
anticipated applications, W can be large.
6. TECHNOLOGY MODELING: TRAPPED ION
In this section, we model the implementation of fan-out
on trapped ion quantum computers. Trapped ions feature
long qubit coherence times [91] and gate fidelities exceeding
99.99% and 99.9% for single- and two-qubit gates, respectively,
on current hardware [14, 23]. Furthermore, all N qubits can be
simultaneously entangled via a global interaction known as
the Global Mølmer–Sørensen (GMS) gate [65, 83]. Recent
work [10, 61] has explicitly demonstrated how GMS is essentially
equivalent to simultaneous fan-out. Moreover, in the
past year, experimental work has emerged demonstrating pulse
shaping for global interactions [22, 36, 57] to support the use
of GMS both for fan-out and for parallel two-qubit gates on
disjoint qubits. Our focus here is on studying differences in
speed and fidelity between simultaneous fan-out and N−1
serialized CNOTs. For brevity and to maintain a focus on
architectural themes, we omit many physical implementation
details here.
Regarding the potential speedup, [10, 36, 84] assert that
simultaneous fan-out via GMS is indeed linearly faster than
serialized CNOTs. To evaluate the fidelity impact, we per-
formed numerical simulations of fan-out via GMS for N = 2
to N = 8 qubits. We constructed a realistic error model that
accounts for two sources of noise: overrotation and laser
dephasing. Overrotation occurs due to the fact that the angle
θ of the Mølmer-Sørensen rotation is sensitive to motional
frequency drifts, and it has higher-order dependence on the
motional states [18, 90, 96]. An overrotation error can be
modeled by replacing θ by (1+ ε)θ , where ε denotes the
overrotation rate. Laser dephasing results from fluctuations
of the optical path length [54, 90, 96].
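The effect of replacing θ by (1+ε)θ can be illustrated with a small closed-form calculation. This is not the master-equation model used for the results below: it ignores laser dephasing entirely, and it assumes the convention GMS(θ) = exp(−iθ Σ_{i<j} X_i X_j) acting on |0...0〉. The function name and the example angle are likewise assumptions.

```python
import cmath
import math
from itertools import product

def overrotation_fidelity(n_qubits, theta, eps):
    """Fidelity between ideal and overrotated GMS applied to |0...0>.

    In the X eigenbasis the gate is diagonal, and |0...0> has uniform
    weight 1/2^N on every eigenstate, so the overlap is an average of
    phases exp(-i * eps * theta * E(s)) over all sign patterns s.
    """
    amp = 0j
    for signs in product((1, -1), repeat=n_qubits):
        total = sum(signs)
        # Sum over pairs i<j of s_i * s_j equals ((sum s)^2 - N) / 2.
        pair_sum = (total * total - n_qubits) // 2
        amp += cmath.exp(-1j * eps * theta * pair_sum)
    amp /= 2 ** n_qubits
    return abs(amp) ** 2
```

For two qubits the expression reduces to cos^2(εθ), and for a fixed overrotation rate the fidelity falls as more qubits participate in the global interaction.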
For current trapped ion hardware, we conservatively es-
timate typical overrotation rates of 5%. We modeled GMS
interaction times of 100 µs [10], contrasted against 80 ms
laser coherence time [90]. To evaluate the sensitivity of our
results to these parameters, we also modeled under three fu-
ture scenarios: 5x lower overrotation rate, 5x longer laser
coherence, and both improvements. Our simulations were
performed using master-equation simulation in QuTiP [48].
We performed stochastic simulation over 100k runs per sce-
nario. The fidelity results are shown in Figure 12.
[Figure 12 plot: fidelity (98.0% to 100.0%) versus N = 2 to 8 for N-qubit trapped ion fan-out, Simultaneous versus Serial, under four scenarios: 1% overrotation with 400 ms laser coherence, 5% overrotation with 400 ms, 1% overrotation with 80 ms, and 5% overrotation with 80 ms.]
Figure 12: Simulation results for fan-out on trapped ion hardware.
Sensitivity analysis performed under four {overrotation
rate, laser coherence time} scenarios. For each scenario, we
simulated fidelity for simultaneous versus serial. Results
averaged across 100k stochastic runs per scenario, executed
with 50k CPU-core hours on a large computing cluster.
Conceptually, overrotation errors affect simultaneous and
serial equally. Meanwhile, laser dephasing affects serial more
adversely, because the laser dephasing effect on the control
qubit accumulates over the additional time required for N−1
consecutive CNOTs. Although simultaneous always outper-
forms serial on our simulations, the exact fidelity advantage is
dependent on the parameter settings. For current technology
(5% overrotation, 80 ms laser coherence), simultaneous has an
almost 1% higher fidelity for N = 8. For the scenario with 5x
longer laser coherence (5% overrotation, 400 ms), simultaneous
has almost no fidelity advantage over serial. For the
scenario with 5x lower overrotation (1% overrotation, 80 ms),
simultaneous again
has a nearly 1% fidelity advantage over serial. Also, across
all scenarios, the advantage of simultaneous fan-out increases
for larger N, which is encouraging. While our simulation
results are based on a realistic noise model, experimental
evaluation is necessary to conclude any definitive fidelity ad-
vantage. As cloud access to trapped ion hardware emerges
over the coming year, it will be possible to experimentally
test these simulated results.
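The timing argument behind the dephasing asymmetry can be written out as a back-of-the-envelope sketch. Two assumptions are made purely for illustration and are not taken from the simulations above: each serialized CNOT is taken to last as long as one 100 µs GMS interaction, and the control qubit's coherence is taken to decay as exp(−t/T) with T the 80 ms laser coherence time.

```python
import math

def control_exposure_us(n_qubits, gate_us=100.0, simultaneous=True):
    """Time the control qubit spends inside the fan-out operation."""
    steps = 1 if simultaneous else n_qubits - 1
    return steps * gate_us

def coherence_factor(t_us, coherence_ms=80.0):
    """Assumed exponential decay of coherence over an exposure t_us."""
    return math.exp(-t_us / (coherence_ms * 1000.0))
```

Under these assumptions the serial exposure for N = 8 is seven times the simultaneous one, and lengthening the coherence time shrinks the gap, which matches the qualitative trend described above.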
7. RESULTS
7.1 Methodology
We evaluated the exact depth reduction for eight applica-
tions: SWAP Tests (both traditional and interference circuit),
Hadamard Tests (all four applications in Table 3), and mem-
ory architectures (both explicit and implicit). We compiled
each benchmark, across a wide range of circuit widths, using
both our fan-out based approach (Simultaneous) and the stan-
dard serialized approach with no fan-out (Serial). The results
are plotted in Figure 13.
We also evaluated the fidelity advantage of simultaneous
fan-out for the five most NISQ-friendly benchmarks. For
each benchmark type, we found the largest circuit instance
with fan-out of at most 8 qubits, matching the largest fan-out
we simulated in Figure 12. Then, we estimated fidelity with
a coarse metric: for each circuit, we assigned each gate a
fidelity based on the current hardware “5% overrotation, 80
ms laser coherence” simulation in Figure 12. Multiplying
together these gate fidelities gives an approximation for the
total circuit fidelity (i.e. 1 - infidelity). We also performed
this multiplication under the “1% overrotation, 80 ms laser
coherence” future scenario with 5x lower overrotation. While
these estimates are less accurate than full density matrix sim-
ulation, as we performed in Figure 12, they are informative
from an Amdahl’s Law perspective. In particular, single- and
two-qubit gates are equally penalized in the Simultaneous
and Serial circuits, so the Simultaneous circuits can only
perform better when there are large fan-out gates.
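The coarse metric amounts to a product over gates. In the sketch below the function names are illustrative, any gate fidelities passed in are placeholders rather than values from Figure 12, and the advantage is assumed to be reported as a relative reduction in infidelity.

```python
import math

def circuit_fidelity(gate_fidelities):
    """Approximate circuit fidelity as the product of gate fidelities."""
    return math.prod(gate_fidelities)

def infidelity_reduction(serial_fids, simultaneous_fids):
    """Relative reduction in infidelity going from Serial to Simultaneous."""
    inf_serial = 1.0 - circuit_fidelity(serial_fids)
    inf_simultaneous = 1.0 - circuit_fidelity(simultaneous_fids)
    return (inf_serial - inf_simultaneous) / inf_serial
```

Gates that appear with the same fidelity in both circuits multiply both products equally, so only the fan-out layers can separate the two estimates.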
7.2 Discussion
As mentioned in Section 6, simultaneous fan-out does
genuinely give a linear speedup over serialization. Therefore,
the depth reductions in Figure 13 translate directly to faster
time-to-solution. This is particularly important since NISQ
algorithms need to be variationally executed millions of times
[30]. For four of the eight benchmarked applications, the
underlying U has constant depth, so our Simultaneous circuit
also has constant depth. For the other four benchmarks, the
underlying U has Ω(N) depth, so both Simultaneous and
Serial have increasing depth with N. However, Simultaneous’
scaling is still lower than Serial’s by a linear factor.
The infidelity estimates in Figure 14 have an Amdahl’s
Law interpretation. The reduction in infidelity from Serial
to Simultaneous is greater when the Simultaneous circuit is
dominated by fan-out layers. Among our benchmarks, Varia-
tional Quantum Linear Solver and Controlled Density Matrix
Exponentiation have particularly high fidelity advantages.
Our results also demonstrate the sensitivity to the underlying
trapped ion hardware’s error parameters. For example, VQLS
has a 13.9% Serial→Simultaneous infidelity reduction on cur-
rent hardware and a 20.9% reduction on future hardware with
5x lower overrotation.
On current hardware, fidelity is the primary system bot-
tleneck. As such, the fidelity improvement of simultaneous
fan-out justifies its use in NISQ machines. 7–24% reduc-
tions in infidelity on 8-qubit circuits are equivalent to months
of hardware progress, but our optimization requires no new
hardware. As a practical message to hardware providers, we
emphasize that exposing global interactions to software will
lead to substantial improvements in both fidelity and speed
for NISQ applications.
8. FUTURE WORK: SUPERCONDUCTING QUBITS
Global interaction can be realized for many technologies,
but superconducting qubits—which are currently the fron-
trunner in commercial activity—are a notable exception. To
the best of our knowledge, there are no prior implementations
of fan-out on superconducting devices. In this section, we
demonstrate that superconducting quantum computers can in
fact perform simultaneous fan-out. Physical implementation
details will be presented in a follow-up paper.
[Figure 13 plots: depth versus N (circuit width) or n (number of index qubits), Simultaneous versus Serial, in eight panels: SWAP Test; Interference Circuit (for Sensing); Variational Quantum Linear Solver; Matrix Elements of Group Representation; Entanglement Spectroscopy for Kth Renyi Entropy (K = 2, 3, 4); Controlled Density Matrix Expon. for K copies (K = 2, 3, 4); Explicit Quantum Memory, W = 1; and Implicit Quantum Memory (asymptotic scaling with leading coefficients set to 1; Simultaneous has no W dependence, Serial shown for W = 2, 5, 10).]
Figure 13: Depth (lower is better) for SWAP Test, Hadamard Test, and memory architecture benchmarks. We compare circuits
compiled with our Controlled-U circuit synthesis procedure (which uses simultaneous fan-out) versus circuits that serialize the
CNOTs.
[Figure 14 plots: infidelity (lower is better) of Simultaneous versus Serial for SWAP Test, Intrf. Circ., VQLS, Ent. Spec., and Ctrl-DME in two panels, “Current Trapped Ion Params” and “5x Lower Overrotation”. Each benchmark is annotated with its percentage Simultaneous advantage, ranging from 6.89% to 13.9% in the first panel and from 12% to 23.6% in the second.]
Figure 14: Infidelity estimates for five benchmarks.
We first examine the implementation of a CNOT with
superconducting qubits. The CNOT is not a natural physical
interaction between qubits. Instead, it is performed through
a sequence of more primitive physical interactions between
qubits. On Google and Rigetti superconducting quantum
hardware, CNOT can be realized by a sequence of iSWAP
interactions, which are similar to ordinary SWAPs. However,
this seems incompatible with simultaneous fan-out, which
conceptually requires concurrent reads on the control qubit.
By contrast, iSWAP performs both reads and writes on the
control qubit since its state is swapped with the target.
An alternative two-qubit interaction called Cross-Resonance
[71, 75] is better suited. The Cross-Resonance interaction
is used to perform CNOT gates on IBM’s devices. It has
less restrictive hardware requirements than iSWAP, so Cross-
Resonance could be performed on Google and Rigetti hard-
ware as well. Critically, the Cross-Resonance interaction only
reads the control qubit, so it does not suffer the immediate
barrier to fan-out that iSWAP does.
Although the control qubit state is unaffected during Cross-
Resonance, the interaction requires (somewhat counterintu-
itively) driving the control qubit with a microwave pulse.
However, by setting this microwave pulse to the frequency
of the target qubit, the target qubit will rotate conditioned on
the state of the control qubit. This physical interaction easily
converts to CNOT through a single-qubit postprocessing gate.
[Figure 15 schematic: Qubit 2 (freq. ω2), Qubit 3 (freq. ω3), and Qubit 5 (freq. ω5); qubit 3 is driven with ω2 + ω5.]
Figure 15: Schematic of fan-out using Cross-Resonance on
superconducting qubits. The control qubit (3) is driven with
the sum of waves at the targets’ frequencies, ω2 and ω5.
Figure 15 illustrates how we can extend this Cross-Resonance
interaction to engineer fan-out. In this example, qubit 3 is the
control and qubits 2 and 5 are the two targets. To perform
the CNOT from 3 to 2 (5), we would drive qubit 3 with mi-
crowave at frequency ω2 (ω5). However, if we instead drive
qubit 3 with the summation of two sine waves at frequencies
ω2 and ω5, then we effectively perform both CNOTs simultaneously.
The resulting pulse sequence has a linear speedup
over serialization, as desired. Again, this technique only
works because Cross-Resonance merely “reads” the control
qubit, unlike the iSWAP interaction.
We experimentally realized this specific example of fan-out
from qubit 3 to qubits 2 and 5 using IBM’s Paris quantum
computer. We performed our experiment using OpenPulse
[4, 31, 62], an interface that enables low-level access
to quantum computers through Arbitrary Waveform Generators.
This level of access is required since we need to drive
qubit 3 with an unconventional sum-of-waves pulse. We also
use sideband modulation, which is needed since the qubit 3
drive is configured to oscillate at ω3 by default. Moreover, in
practice, high fidelity Cross-Resonance interactions require
an echo sequence [17] and active cancellation pulses on the
target qubits [58, 80]. Additionally, we had to calibrate a
phase offset for the sideband to compensate for accumulated
phase on the coaxial cable transitioning from room temperature
electronics to the fridge [4, 53]. All of these experimental
details were handled and will be explained in a follow-up
paper.
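The sum-of-waves drive with sideband modulation can be sketched as a complex baseband envelope. Everything concrete here is an assumption for illustration: the function name, the frequencies used in any call, the equal weighting of tones, and the normalization. The sketch does not use the OpenPulse API and omits the echo sequence, cancellation pulses, and phase-offset calibration mentioned above.

```python
import cmath
import math

def fanout_envelope(t_ns, drive_ghz, target_ghz, amp=1.0):
    """Complex envelope at time t_ns for a multi-tone control drive.

    The drive channel oscillates at drive_ghz by default, so each target
    tone enters as a sideband at the detuning (target - drive).
    """
    total = 0j
    for f in target_ghz:
        detuning = f - drive_ghz               # GHz, so GHz * ns = cycles
        total += cmath.exp(2j * math.pi * detuning * t_ns)
    # Divide by the tone count so the peak magnitude never exceeds amp.
    return amp * total / len(target_ghz)
```

Because the envelope is normalized to a fixed peak amplitude, each tone receives only a 1/k share when k targets are driven at once, which is one way to see the power constraint discussed in Section 8.1.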
Figure 16 shows our experimental results. We attempted
to generate the GHZ state, (|000〉 + |111〉)/√2, by first performing a
NOT gate on qubit 3 and then fanning out its state to qubits 2
and 5. Ideally, this would result in |000〉 and |111〉 each with
50% probability. With simultaneous fan-out, we achieved
31% and 29% respectively. Serialization achieved 42% and
36% respectively.
[Figure 16 plot: measured probability (0% to 50%) of each outcome |000〉 through |111〉 for the GHZ state on qubits 2-3-5 (IBM Paris), comparing Simultaneous, Serial, and Ideal.]
Figure 16: OpenPulse results from 8000×2 repetitions on
IBM Q Paris. The ideal output is 50% |000〉 and 50% |111〉.
While the GHZ state produced with serial fan-out is better
than the one produced with simultaneous fan-out, we empha-
size that the simultaneous version ran almost twice as fast.
This is encouraging, because superconducting qubits have
short coherence lifetimes, so faster operations lead to signifi-
cant fidelity improvements [30]. Moreover, when we consider
larger width circuits, faster fan-out on a subset of qubits can
improve the quality of the other qubits which decohere for
less time. Finally, anticipated increases to the sampling rate
of Arbitrary Waveform Generators should improve the fidelity
of the simultaneous fan-out operation.
Most importantly, our experiment affirms that simultane-
ous fan-out is possible at all on superconducting quantum
hardware. Recent papers have also proposed different tech-
niques that could be used to realize many-body interactions
in superconducting systems [15, 51, 74, 94], but to our knowledge,
our work is the first experimental proof-of-concept. Our
work can be viewed as a way to engineer crosstalk (unwanted
interference between neighboring qubits) for good.
8.1 Scalability
An immediate barrier to scaling this simultaneous fan-out
procedure to more target qubits is that each control-target pair
must be connected in hardware. On superconducting qubit
platforms, connectivity is typically sparse. For example, on
IBM Q Paris’s device topology, the maximum degree is 3,
and most qubits are connected to just one or two neighbors.
Scaling the connectivity will be a challenge. However, we
note that fan-out does not require all-to-all connectivity. In-
stead, we require a star topology, where a single (control)
qubit is connected to every other qubit. Such star topologies
have been realized experimentally with 10 qubits connected
to a single bus [82]. Moreover, star topology is also useful for
Hamiltonian simulation circuits [40], so there are numerous
other quantum subroutines that would also benefit.
A second consideration is that summing waves for each
target qubit’s frequency (as in Figure 15) will not scale since
the maximum amplitude of Arbitrary Waveform Generators
is power-constrained. We propose two possible solutions to
this. On frequency tunable devices (where ωq for each qubit
can be controlled), we can simply tune all target qubits to a
common frequency during fan-out. Then, the control qubit
can be driven at this single common frequency, bypassing the
summation of multiple waves. The other solution pertains to
fixed-frequency devices. Here, we propose that rectangular-
topology qubits could be fabricated with frequencies accord-
ing to a checkerboard pattern. In such an arrangement, just
two colors (frequencies) are needed to ensure no frequency
collisions between neighboring qubits. During fan-out, the
control qubit can be driven at the sum of just two frequencies,
averting the scalability issue.
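The checkerboard assignment for fixed-frequency devices can be sketched as a two-coloring of a rectangular grid. The function names and the frequency labels "f_a" and "f_b" are placeholders; the sketch checks only the stated property that nearest neighbors never share a frequency.

```python
def checkerboard(rows, cols, freqs=("f_a", "f_b")):
    """Assign one of two frequencies to each qubit on a rectangular grid."""
    return [[freqs[(r + c) % 2] for c in range(cols)] for r in range(rows)]

def has_neighbor_collision(grid):
    """True if any horizontally or vertically adjacent pair matches."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows and grid[r][c] == grid[r + 1][c]:
                return True
            if c + 1 < cols and grid[r][c] == grid[r][c + 1]:
                return True
    return False
```

Since only two distinct frequencies exist on the whole device, the number of tones a control qubit must sum stays bounded by two regardless of how many targets it addresses.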
While these proposed solutions are sound in theory, prac-
tical realization will be challenging due to experimental nu-
ances. For example, current qubit fabrication technologies are
imprecise and stochastic [13], so fabricating qubit frequen-
cies in a checkerboard pattern will be difficult. Thus, more
experimental progress will be needed to scale fan-out on su-
perconducting hardware. These hardware-software codesign
considerations are valuable in closing the gap from NISQ
hardware to practical applications. We propose further work
to evaluate simultaneous fan-out with superconducting qubits.
9. CONCLUSION
At a high level, this work validates the importance of
hardware-software codesign. Our core result derives from
the hardware→software observation that the exclusive activation
structural hazard is not necessary in quantum computing.
By exploiting simultaneous fan-out, we are able to synthesize
optimized circuit schedules for Controlled-U, which
is important in NISQ workloads. In the software→ hardware
direction, our results suggest a number of priorities for future
hardware development—in particular, the importance of ex-
posing global interactions. Moreover, our demonstration of
simultaneous fan-out in superconducting qubits suggests that
the star architecture could bring superconducting systems to
parity with hardware platforms such as trapped ions.
In current systems, our results affirm a linear speedup from
fan-out. In the NISQ era, algorithms will require millions
of iterations [29], so quantum execution speedups translate
to direct reductions in time-to-solution. This opportunity is
particularly pronounced on trapped ions, which operate at rel-
atively slow kHz speeds. In addition to the circuit execution
speedup, our simulations show 7–24% infidelity reductions
from simultaneous fan-out. This is validated by our trapped
ion simulation with a realistic noise model. Our experimen-
tal results from superconducting qubits are also promising,
though our emphasis is on the mere fact that simultaneous
fan-out is possible at all on superconducting qubits.
A number of interesting future directions arise from this
work. On the hardware side, we propose experimental re-
alization of our circuits on larger machines, especially in
light of recent work noting power law decays for interaction
strengths between distant qubits [41]. In addition to super-
conducting and trapped ion qubits, neutral atom qubits may
be promising since global interactions via ‘Rydberg gates’
are natural [66]. On the software side, we propose further in-
vestigation of compilation in view of global interactions. [36]
suggests that global interactions could in fact give an O(N^2)
speedup, whereas we only explore linear speedups in this
work. Finding such quadratic speedups could further acceler-
ate the realization of practical quantum computing.
Acknowledgements
We are grateful to Ali Javadi-Abhari, Dave Schuster, and
Dripto Debroy for helpful suggestions. This work is funded
in part by EPiQC, an NSF Expedition in Computing, un-
der grants CCF-1730449/1832377; in part by STAQ under
grant NSF Phy-1818914; and in part by DOE grants DE-
SC0020289, DE-SC0020331, and DE-SC0019294. We also
acknowledge the University of Chicago’s Research Comput-
ing Center for their support of this work. Pranav Gokhale is
supported by the Department of Defense (DoD) through the
National Defense Science & Engineering Graduate Fellow-
ship (NDSEG) Program.
REFERENCES
[1] S. Aaronson, “Read the fine print,” Nature Physics, vol. 11, no. 4, pp.
291–293, 2015.
[2] H. Abraham, I. Y. Akhalwaya, G. Aleksandrowicz, T. Alexander,
G. Alexandrowics, E. Arbel, A. Asfaw, C. Azaustre, AzizNgoueya,
P. Barkoutsos, G. Barron, L. Bello, Y. Ben-Haim, D. Bevenius, L. S.
Bishop, S. Bosch, S. Bravyi, D. Bucher, F. Cabrera, P. Calpin,
L. Capelluto, J. Carballo, G. Carrascal, A. Chen, C.-F. Chen,
R. Chen, J. M. Chow, C. Claus, C. Clauss, A. J. Cross, A. W. Cross,
S. Cross, J. Cruz-Benito, C. Culver, A. D. Córcoles-Gonzales,
S. Dague, T. E. Dandachi, M. Dartiailh, DavideFrr, A. R. Davila,
D. Ding, J. Doi, E. Drechsler, Drew, E. Dumitrescu, K. Dumon,
I. Duran, K. EL-Safty, E. Eastman, P. Eendebak, D. Egger,
M. Everitt, P. M. Fernández, A. H. Ferrera, A. Frisch, A. Fuhrer,
M. GEORGE, J. Gacon, Gadi, B. G. Gago, J. M. Gambetta,
A. Gammanpila, L. Garcia, S. Garion, J. Gomez-Mosquera, S. de la
Puente González, I. Gould, D. Greenberg, D. Grinko, W. Guan, J. A.
Gunnels, I. Haide, I. Hamamura, V. Havlicek, J. Hellmers, Ł. Herok,
S. Hillmich, H. Horii, C. Howington, S. Hu, W. Hu, H. Imai,
T. Imamichi, K. Ishizaki, R. Iten, T. Itoko, A. Javadi-Abhari, Jessica,
K. Johns, T. Kachmann, N. Kanazawa, Kang-Bae, A. Karazeev,
P. Kassebaum, S. King, Knabberjoe, A. Kovyrshin, V. Krishnan,
K. Krsulich, G. Kus, R. LaRose, R. Lambert, J. Latone, S. Lawrence,
D. Liu, P. Liu, Y. Maeng, A. Malyshev, J. Marecek, M. Marques,
D. Mathews, A. Matsuo, D. T. McClure, C. McGarry, D. McKay,
D. McPherson, S. Meesala, M. Mevissen, A. Mezzacapo, R. Midha,
Z. Minev, A. Mitchell, N. Moll, M. D. Mooring, R. Morales,
N. Moran, P. Murali, J. Müggenburg, D. Nadlinger, G. Nannicini,
P. Nation, Y. Naveh, P. Neuweiler, P. Niroula, H. Norlen, L. J.
O’Riordan, O. Ogunbayo, P. Ollitrault, S. Oud, D. Padilha, H. Paik,
S. Perriello, A. Phan, M. Pistoia, A. Pozas-iKerstjens, V. Prutyanov,
D. Puzzuoli, J. Pérez, Quintiii, R. Raymond, R. M.-C. Redondo,
M. Reuter, J. Rice, D. M. Rodríguez, M. Rossmannek, M. Ryu,
T. SAPV, SamFerracin, M. Sandberg, N. Sathaye, B. Schmitt,
C. Schnabel, Z. Schoenfeld, T. L. Scholten, E. Schoute, J. Schwarm,
I. F. Sertage, K. Setia, N. Shammah, Y. Shi, A. Silva, A. Simonetto,
N. Singstock, Y. Siraichi, I. Sitdikov, S. Sivarajah, M. B. Sletfjerding,
J. A. Smolin, M. Soeken, I. O. Sokolov, SooluThomas, D. Steenken,
M. Stypulkoski, J. Suen, H. Takahashi, I. Tavernelli, C. Taylor,
P. Taylour, S. Thomas, M. Tillet, M. Tod, E. de la Torre, K. Trabing,
M. Treinish, TrishaPe, W. Turner, Y. Vaknin, C. R. Valcarce,
F. Varchon, A. C. Vazquez, D. Vogt-Lee, C. Vuillot, J. Weaver,
R. Wieczorek, J. A. Wildstrom, R. Wille, E. Winston, J. J. Woehr,
S. Woerner, R. Woo, C. J. Wood, R. Wood, S. Wood, J. Wootton,
D. Yeralin, R. Young, J. Yu, C. Zachow, L. Zdanski, C. Zoufal,
Zoufalc, azulehner, bcamorrison, brandhsn, chlorophyll zz, dan1pal,
dime10, drholmie, elfrocampeador, faisaldebouni, fanizzamarco,
gruu, kanejess, klinvill, kurarrr, lerongil, ma5x, merav aharoni,
ordmoj, sethmerkel, strickroman, sumitpuri, tigerjack, toural, vvilpas,
welien, willhbang, yang.luh, yelojakit, and yotamvakninibm, “Qiskit:
An open-source framework for quantum computing,” 2019.
[3] D. Aharonov, V. Jones, and Z. Landau, “A polynomial quantum
algorithm for approximating the jones polynomial,” Algorithmica,
vol. 55, no. 3, pp. 395–421, 2009.
[4] T. Alexander, N. Kanazawa, D. J. Egger, L. Capelluto, C. J. Wood,
A. Javadi-Abhari, and D. McKay, “Qiskit pulse: Programming
quantum computers through the cloud with pulses,” arXiv preprint
arXiv:2004.06755, 2020.
[5] I. Arad and Z. Landau, “Quantum computation and the evaluation of
tensor networks,” SIAM Journal on Computing, vol. 39, no. 7, pp.
3089–3121, 2010.
[6] S. Arunachalam, V. Gheorghiu, T. Jochym-O’Connor, M. Mosca, and
P. V. Srinivasan, “On the robustness of bucket brigade quantum ram,”
New Journal of Physics, vol. 17, no. 12, p. 123010, 2015.
[7] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends,
R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell et al., “Quantum
supremacy using a programmable superconducting processor,”
Nature, vol. 574, no. 7779, pp. 505–510, 2019.
[8] R. Babbush, C. Gidney, D. W. Berry, N. Wiebe, J. McClean, A. Paler,
A. Fowler, and H. Neven, “Encoding electronic spectra in quantum
circuits with linear t complexity,” Physical Review X, vol. 8, no. 4, p.
041015, 2018.
[9] K. Beer, D. Bondarenko, T. Farrelly, T. J. Osborne, R. Salzmann, and
R. Wolf, “Efficient learning for deep quantum neural networks,”
arXiv preprint arXiv:1902.10445, 2019.
[10] A. Bermudez, X. Xu, R. Nigmatullin, J. O’Gorman, V. Negnevitsky,
P. Schindler, T. Monz, U. Poschinger, C. Hempel, J. Home et al.,
“Assessing the progress of trapped-ion processors towards
fault-tolerant quantum computation,” Physical Review X, vol. 7,
no. 4, p. 041061, 2017.
[11] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and
S. Lloyd, “Quantum machine learning,” Nature, vol. 549, no. 7671,
pp. 195–202, 2017.
[12] C. Bravo-Prieto, R. LaRose, M. Cerezo, Y. Subasi, L. Cincio, and P. J.
Coles, “Variational quantum linear solver: A hybrid algorithm for
linear systems,” arXiv preprint arXiv:1909.05820, 2019.
[13] M. Brink, J. M. Chow, J. Hertzberg, E. Magesan, and S. Rosenblatt,
“Device challenges for near term superconducting quantum
processors: frequency collisions,” in 2018 IEEE International
Electron Devices Meeting (IEDM). IEEE, 2018, pp. 6–1.
[14] K. R. Brown, A. C. Wilson, Y. Colombe, C. Ospelkaus, A. M. Meier,
E. Knill, D. Leibfried, and D. J. Wineland, “Single-qubit-gate error
below 10⁻⁴ in a trapped ion,” Physical Review A, vol. 84, no. 3, p.
030303, 2011.
[15] N. Chancellor, S. Zohren, and P. A. Warburton, “Circuit design for
multi-body interactions in superconducting quantum annealing
systems with applications to a scalable architecture,” npj Quantum
Information, vol. 3, no. 1, pp. 1–7, 2017.
[16] P. J. Coles, S. Eidenbenz, S. Pakin, A. Adedoyin, J. Ambrosiano,
P. Anisimov, W. Casper, G. Chennupati, C. Coffrin, H. Djidjev et al.,
“Quantum algorithm implementations for beginners,” arXiv preprint
arXiv:1804.03719, 2018.
[17] A. D. Córcoles, J. M. Gambetta, J. M. Chow, J. A. Smolin, M. Ware,
J. Strand, B. L. Plourde, and M. Steffen, “Process verification of
two-qubit quantum gates by randomized benchmarking,” Physical
Review A, vol. 87, no. 3, p. 030301, 2013.
[18] D. Debroy, M. Li, M. Newman, and K. R. Brown, “Stabilizer slicing:
Coherent error cancellations in LDPC codes,” arXiv preprint
arXiv:1810.01040, 2018.
[19] O. Di Matteo, V. Gheorghiu, and M. Mosca, “Fault-tolerant resource
estimation of quantum random-access memories,” IEEE Transactions
on Quantum Engineering, vol. 1, pp. 1–13, 2020.
[20] M. Q. Documentation, “Estimategradient operation,” Available at
https://docs.microsoft.com/en-us/qsharp/api/qsharp/microsoft.quantum.machinelearning.estimategradient.
[21] S. Dogra, K. Dorai et al., “Experimental construction of generic
three-qubit states and their reconstruction from two-party reduced
states on an NMR quantum information processor,” Physical Review A,
vol. 91, no. 2, p. 022312, 2015.
[22] C. Figgatt, A. Ostrander, N. M. Linke, K. A. Landsman, D. Zhu,
D. Maslov, and C. Monroe, “Parallel entangling operations on a
universal ion-trap quantum computer,” Nature, vol. 572, no. 7769, pp.
368–372, 2019.
[23] J. P. Gaebler, T. R. Tan, Y. Lin, Y. Wan, R. Bowler, A. C. Keith,
S. Glancy, K. Coakley, E. Knill, D. Leibfried et al., “High-fidelity
universal gate set for ⁹Be⁺ ion qubits,” Physical Review Letters, vol.
117, no. 6, p. 060505, 2016.
[24] J. C. Garcia-Escartin and P. Chamorro-Posada, “Swap test and
Hong-Ou-Mandel effect are equivalent,” Physical Review A, vol. 87,
no. 5, p. 052330, 2013.
[25] R. Ghobadi, J. S. Oberoi, and E. Zahedinejhad, “The power of one
qubit in machine learning,” arXiv preprint arXiv:1905.01390, 2019.
[26] V. Giovannetti, S. Lloyd, and L. Maccone, “Architectures for a
quantum random access memory,” Physical Review A, vol. 78, no. 5,
p. 052310, 2008.
[27] V. Giovannetti, S. Lloyd, and L. Maccone, “Quantum random access
memory,” Physical Review Letters, vol. 100, no. 16, p. 160501, 2008.
[28] P. Gokhale, J. M. Baker, C. Duckering, N. C. Brown, K. R. Brown,
and F. Chong, “Extending the frontier of quantum computers with
qutrits,” IEEE Micro, 2020.
[29] P. Gokhale, J. M. Baker, C. Duckering, N. C. Brown, K. R. Brown,
and F. T. Chong, “Asymptotic improvements to quantum circuits via
qutrits,” in Proceedings of the 46th International Symposium on
Computer Architecture, 2019, pp. 554–566.
[30] P. Gokhale, Y. Ding, T. Propson, C. Winkler, N. Leung, Y. Shi, D. I.
Schuster, H. Hoffmann, and F. T. Chong, “Partial compilation of
variational algorithms for noisy intermediate-scale quantum
machines,” in Proceedings of the 52nd Annual IEEE/ACM
International Symposium on Microarchitecture, 2019, pp. 266–278.
[31] P. Gokhale, A. Javadi-Abhari, N. Earnest, Y. Shi, and F. T. Chong,
“Optimized quantum compilation for near-term algorithms with
OpenPulse,” arXiv preprint arXiv:2004.11205, 2020.
[32] G. Goldstein, P. Cappellaro, J. Maze, J. Hodges, L. Jiang, A. S.
Sørensen, and M. Lukin, “Environment-assisted precision
measurement,” Physical Review Letters, vol. 106, no. 14, p. 140502,
2011.
[33] F. Green, S. Homer, C. Moore, and C. Pollett, “Counting, fanout, and
the complexity of quantum ACC,” arXiv preprint quant-ph/0106017,
2001.
[34] D. M. Greenberger, M. A. Horne, and A. Zeilinger, “Going beyond
Bell’s theorem,” in Bell’s theorem, quantum theory and conceptions
of the universe. Springer, 1989, pp. 69–72.
[35] K. Groenland, F. Witteveen, K. Schoutens, and R. Gerritsma,
“Sequences of Mølmer-Sørensen gates can implement controlled
rotations using quantum signal processing techniques,” arXiv
preprint arXiv:2001.05231, 2020.
[36] N. Grzesiak, R. Blümel, K. Beck, K. Wright, V. Chaplin, J. M.
Amini, N. C. Pisenti, S. Debnath, J.-S. Chen, and Y. Nam, “Efficient
arbitrary simultaneously entangling gates on a trapped-ion quantum
computer,” arXiv preprint arXiv:1905.09294, 2019.
[37] G. G. Guerreschi, “Scheduler of quantum circuits based on
dynamical pattern improvement and its application to hardware
design,” arXiv preprint arXiv:1912.00035, 2019.
[38] G. G. Guerreschi and J. Park, “Two-step approach to scheduling
quantum circuits,” Quantum Science and Technology, vol. 3, no. 4, p.
045003, 2018.
[39] G. G. Guerreschi and M. Smelyanskiy, “Practical optimization for
hybrid quantum-classical algorithms,” arXiv preprint
arXiv:1701.01450, 2017.
[40] K. Gui, T. Tomesh, P. Gokhale, Y. Shi, F. T. Chong, M. Martonosi,
and M. Suchara, “Term grouping and travelling salesperson for
digital quantum simulation,” arXiv preprint arXiv:2001.05983, 2020.
[41] A. Y. Guo, A. Deshpande, S.-K. Chu, Z. Eldredge, P. Bienias,
D. Devulapalli, Y. Su, A. M. Childs, and A. V. Gorshkov,
“Implementing a fast unbounded quantum fanout gate using
power-law interactions,” arXiv preprint arXiv:2007.00662, 2020.
[42] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala,
J. M. Chow, and J. M. Gambetta, “Supervised learning with
quantum-enhanced feature spaces,” Nature, vol. 567, no. 7747, pp.
209–212, 2019.
[43] J. Heckey, S. Patil, A. JavadiAbhari, A. Holmes, D. Kudrow, K. R.
Brown, D. Franklin, F. T. Chong, and M. Martonosi, “Compiler
management of communication and parallelism for quantum
computation,” in Proceedings of the Twentieth International
Conference on Architectural Support for Programming Languages
and Operating Systems, 2015, pp. 445–456.
[44] J. L. Hennessy and D. A. Patterson, Computer architecture: a
quantitative approach. Elsevier, 2011.
[45] P. Høyer and R. Špalek, “Quantum fan-out is powerful,” Theory of
Computing, vol. 1, no. 1, pp. 81–103, 2005.
[46] H.-Y. Huang, K. Bharti, and P. Rebentrost, “Near-term quantum
algorithms for linear systems of equations,” arXiv preprint
arXiv:1909.07344, 2019.
[47] A. JavadiAbhari, S. Patil, D. Kudrow, J. Heckey, A. Lvov, F. T.
Chong, and M. Martonosi, “ScaffCC: Scalable compilation and
analysis of quantum programs,” Parallel Computing, vol. 45, pp.
2–17, 2015.
[48] J. R. Johansson, P. D. Nation, and F. Nori, “QuTiP 2: A Python
framework for the dynamics of open quantum systems,” Computer
Physics Communications, vol. 184, no. 4, pp. 1234–1240, 2013.
[49] S. Johri, D. S. Steiger, and M. Troyer, “Entanglement spectroscopy
on a quantum computer,” Physical Review B, vol. 96, no. 19, p.
195136, 2017.
[50] S. P. Jordan, “Fast quantum algorithms for approximating some
irreducible representations of groups,” arXiv preprint
arXiv:0811.0562, 2008.
[51] M. Khazali and K. Mølmer, “Fast multiqubit gates by adiabatic
evolution in interacting excited-state manifolds of Rydberg atoms and
superconducting circuits,” Physical Review X, vol. 10, no. 2, p.
021054, 2020.
[52] M. Kjaergaard, M. Schwartz, A. Greene, G. Samach, A. Bengtsson,
M. O’Keeffe, C. McNally, J. Braumüller, D. Kim, P. Krantz et al., “A
quantum instruction set implemented on a superconducting quantum
processor,” arXiv preprint arXiv:2001.08838, 2020.
[53] S. Krinner, S. Storz, P. Kurpiers, P. Magnard, J. Heinsoo, R. Keller,
J. Luetolf, C. Eichler, and A. Wallraff, “Engineering cryogenic setups
for 100-qubit scale superconducting circuit systems,” EPJ Quantum
Technology, vol. 6, no. 1, p. 2, 2019.
[54] P. J. Lee, K.-A. Brickman, L. Deslauriers, P. C. Haljan, L.-M. Duan,
and C. Monroe, “Phase control of trapped ion quantum gates,”
Journal of Optics B: Quantum and Semiclassical Optics, vol. 7,
no. 10, p. S371, 2005.
[55] D. Leibfried and D. J. Wineland, “Efficient eigenvalue determination
for arbitrary Pauli products based on generalized spin-spin
interactions,” Journal of Modern Optics, vol. 65, no. 5-6, pp.
774–779, 2018.
[56] G. Li, Y. Ding, and Y. Xie, “Towards efficient superconducting
quantum processor architecture design,” arXiv preprint
arXiv:1911.12879, 2019.
[57] Y. Lu, S. Zhang, K. Zhang, W. Chen, Y. Shen, J. Zhang, J.-N. Zhang,
and K. Kim, “Global entangling gates on arbitrary ion qubits,”
Nature, vol. 572, no. 7769, pp. 363–367, 2019.
[58] E. Magesan and J. M. Gambetta, “Effective Hamiltonian models of
the cross-resonance gate,” arXiv preprint arXiv:1804.04073, 2018.
[59] E. A. Martinez, T. Monz, D. Nigg, P. Schindler, and R. Blatt,
“Compiling quantum algorithms for architectures with multi-qubit
gates,” New Journal of Physics, vol. 18, no. 6, p. 063029, 2016.
[60] I. Marvian and S. Lloyd, “Universal quantum emulator,” arXiv
preprint arXiv:1606.02734, 2016.
[61] D. Maslov and Y. Nam, “Use of global interactions in efficient
quantum circuit constructions,” New Journal of Physics, vol. 20,
no. 3, p. 033018, 2018.
[62] D. C. McKay, T. Alexander, L. Bello, M. J. Biercuk, L. Bishop,
J. Chen, J. M. Chow, A. D. Córcoles, D. Egger, S. Filipp et al.,
“Qiskit backend specifications for OpenQASM and OpenPulse
experiments,” arXiv preprint arXiv:1809.03452, 2018.
[63] T. S. Metodi, D. D. Thaker, A. W. Cross, F. T. Chong, and I. L.
Chuang, “Scheduling physical operations in a quantum information
processor,” in Quantum Information and Computation IV, vol. 6244.
International Society for Optics and Photonics, 2006, p. 62440T.
[64] K. Mitarai and K. Fujii, “Methodology for replacing indirect
measurements with direct measurements,” Physical Review Research,
vol. 1, no. 1, p. 013006, 2019.
[65] K. Mølmer and A. Sørensen, “Multiparticle entanglement of hot
trapped ions,” Physical Review Letters, vol. 82, no. 9, p. 1835, 1999.
[66] M. Müller, I. Lesanovsky, H. Weimer, H. Büchler, and P. Zoller,
“Mesoscopic Rydberg gate based on electromagnetically induced
transparency,” Physical Review Letters, vol. 102, no. 17, p. 170502,
2009.
[67] P. Murali, D. C. McKay, M. Martonosi, and A. Javadi-Abhari,
“Software mitigation of crosstalk on noisy intermediate-scale
quantum computers,” arXiv preprint arXiv:2001.02826, 2020.
[68] M. A. Nielsen and I. Chuang, “Quantum computation and quantum
information,” 2002.
[69] A. Omran, H. Levine, A. Keesling, G. Semeghini, T. T. Wang,
S. Ebadi, H. Bernien, A. S. Zibrov, H. Pichler, S. Choi et al.,
“Generation and manipulation of Schrödinger cat states in Rydberg
atom arrays,” Science, vol. 365, no. 6453, pp. 570–574, 2019.
[70] A. Paler, O. Oumarou, and R. Basmadjian, “Constant depth bucket
brigade quantum RAM circuits without introducing ancillae,” arXiv
preprint arXiv:2002.09340, 2020.
[71] G. Paraoanu, “Microwave-induced coupling of superconducting
qubits,” Physical Review B, vol. 74, no. 14, p. 140504, 2006.
[72] E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh, and R. Wisnieff,
“Leveraging secondary storage to simulate deep 54-qubit Sycamore
circuits,” arXiv preprint arXiv:1910.09534, 2019.
[73] J. Preskill, “Quantum computing in the NISQ era and beyond,”
Quantum, vol. 2, p. 79, 2018.
[74] S. Rasmussen, K. Groenland, R. Gerritsma, K. Schoutens, and
N. Zinner, “Single-step implementation of high-fidelity n-bit Toffoli
gates,” Physical Review A, vol. 101, no. 2, p. 022308, 2020.
[75] C. Rigetti and M. Devoret, “Fully microwave-tunable universal gates
in superconducting qubits with linear couplings and fixed transition
frequencies,” Physical Review B, vol. 81, no. 13, p. 134507, 2010.
[76] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran,
“Evaluating analytic gradients on quantum hardware,” Physical
Review A, vol. 99, no. 3, p. 032331, 2019.
[77] M. Schuld, M. Fingerhuth, and F. Petruccione, “Implementing a
distance-based classifier with a quantum interference circuit,” arXiv
preprint arXiv:1703.10793, 2017.
[78] M. Schuld and N. Killoran, “Quantum machine learning in feature
Hilbert spaces,” Physical Review Letters, vol. 122, no. 4, p. 040504,
2019.
[79] M. Schuld and F. Petruccione, Supervised learning with quantum
computers. Springer, 2018, vol. 17.
[80] S. Sheldon, E. Magesan, J. M. Chow, and J. M. Gambetta, “Procedure
for systematically tuning up cross-talk in the cross-resonance gate,”
Physical Review A, vol. 93, no. 6, p. 060302, 2016.
[81] P. W. Shor, “Polynomial-time algorithms for prime factorization and
discrete logarithms on a quantum computer,” SIAM Review, vol. 41,
no. 2, pp. 303–332, 1999.
[82] C. Song, K. Xu, W. Liu, C.-p. Yang, S.-B. Zheng, H. Deng, Q. Xie,
K. Huang, Q. Guo, L. Zhang et al., “10-qubit entanglement and
parallel logic operations with a superconducting circuit,” Physical
Review Letters, vol. 119, no. 18, p. 180511, 2017.
[83] A. Sørensen and K. Mølmer, “Quantum computation with ions in
thermal motion,” Physical Review Letters, vol. 82, no. 9, p. 1971,
1999.
[84] R. J. Spiteri, M. Schmidt, J. Ghosh, E. Zahedinejad, and B. C.
Sanders, “Quantum control for high-fidelity multi-qubit gates,” New
Journal of Physics, vol. 20, no. 11, p. 113009, 2018.
[85] Y. Takahashi and S. Tani, “Collapse of the hierarchy of
constant-depth exact quantum circuits,” Computational Complexity,
vol. 25, no. 4, pp. 849–881, 2016.
[86] Y. Takahashi, S. Tani, and N. Kunihiro, “Quantum addition circuits
and unbounded fan-out,” arXiv preprint arXiv:0910.2530, 2009.
[87] Y. Takahashi, T. Yamazaki, and K. Tanaka, “Hardness of classically
simulating quantum circuits with unbounded Toffoli and fan-out
gates,” Quantum Information & Computation, vol. 14, no. 13-14, pp.
1149–1164, 2014.
[88] D. Venturelli, M. Do, E. Rieffel, and J. Frank, “Compiling quantum
circuits to realistic hardware architectures using temporal planners,”
Quantum Science and Technology, vol. 3, no. 2, p. 025004, 2018.
[89] G. Verdon, M. Broughton, and J. Biamonte, “A quantum algorithm to
train neural networks using low-depth circuits,” arXiv preprint
arXiv:1712.05304, 2017.
[90] Y. Wang, S. Crain, C. Fang, B. Zhang, S. Huang, Q. Liang, P. H.
Leung, K. R. Brown, and J. Kim, “High-fidelity two-qubit gates
using a MEMS-based beam steering system for individual qubit
addressing,” arXiv preprint arXiv:2003.12430, 2020.
[91] Y. Wang, M. Um, J. Zhang, S. An, M. Lyu, J.-N. Zhang, L.-M. Duan,
D. Yum, and K. Kim, “Single-qubit quantum memory exceeding
ten-minute coherence time,” Nature Photonics, vol. 11, no. 10, pp.
646–650, 2017.
[92] N. Wiebe, A. Kapoor, and K. M. Svore, “Quantum deep learning,”
arXiv preprint arXiv:1412.3489, 2014.
[93] N. Wiebe and L. Wossnig, “Generative training of quantum
Boltzmann machines with hidden units,” arXiv preprint
arXiv:1905.09902, 2019.
[94] S. A. Wilkinson and M. J. Hartmann, “Many-body quantum circuits
for quantum simulation and computing,” arXiv preprint
arXiv:2003.08838, 2020.
[95] W. K. Wootters and W. H. Zurek, “A single quantum cannot be
cloned,” Nature, vol. 299, no. 5886, pp. 802–803, 1982.
[96] Y. Wu, S.-T. Wang, and L.-M. Duan, “Noise analysis for high-fidelity
quantum entangling gates in an anharmonic linear Paul trap,”
Physical Review A, vol. 97, no. 6, p. 062325, 2018.
[97] X. Xu, J. Sun, S. Endo, Y. Li, S. C. Benjamin, and X. Yuan,
“Variational algorithms for linear algebra,” arXiv preprint
arXiv:1909.03898, 2019.
[98] D. Yu, Y. Gao, W. Zhang, J. Liu, and J. Qian, “Scalability and
high-efficiency of an (n+1)-qubit Toffoli gate sphere via blockaded
Rydberg atoms,” arXiv preprint arXiv:2001.04599, 2020.
[99] S. Zaiser, T. Rendler, I. Jakobi, T. Wolf, S.-Y. Lee, S. Wagner,
V. Bergholm, T. Schulte-Herbrüggen, P. Neumann, and J. Wrachtrup,
“Enhancing quantum sensing sensitivity by a quantum memory,”
Nature Communications, vol. 7, p. 12279, 2016.
[100] B. Zeng, D. Zhou, and L. You, “Measuring the parity of an n-qubit
state,” Physical Review Letters, vol. 95, no. 11, p. 110502, 2005.
