A flexible high-performance simulator for verifying and benchmarking
  quantum circuits implemented on real hardware by Villalonga, Benjamin et al.
A flexible high-performance simulator for verifying and benchmarking quantum
circuits implemented on real hardware
Benjamin Villalonga,1, 2, 3 Sergio Boixo,4 Bron Nelson,2, 5 Christopher
Henze,2 Eleanor Rieffel,2 Rupak Biswas,2 and Salvatore Mandra`2, 6, ∗
1Institute for Condensed Matter Theory and Department of Physics,
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
2Quantum Artificial Intelligence Lab. (QuAIL), NASA Ames Research Center, Moffett Field, CA 94035, USA
3USRA Research Institute for Advanced Computer Science (RIACS), 615 National, Mountain View, California 94043, USA
4Google Inc., Venice, CA 90291, USA
5ASRC Federal InuTeq, 7000 Muirkirk Meadows Drive, Suite 100, Beltsville, MD 20705
6Stinger Ghaffarian Technologies Inc., 7701 Greenbelt Rd., Suite 400, Greenbelt, MD 20770
(Dated: October 14, 2019)
Here we present qFlex, a flexible tensor network based quantum circuit simulator. qFlex can
compute both exact amplitudes, essential for the verification of the quantum hardware, as well
as low fidelity amplitudes, in order to mimic sampling from Noisy Intermediate-Scale Quantum
(NISQ) devices. In this work, we focus on random quantum circuits (RQCs) in the range of sizes
expected for supremacy experiments. Fidelity f simulations are performed at a cost that is 1/f
lower than perfect fidelity ones. We also present a technique to eliminate the overhead introduced
by rejection sampling in most tensor network approaches. We benchmark the simulation of square
lattices and Google’s Bristlecone QPU. Our analysis is supported by extensive simulations on NASA
HPC clusters Pleiades and Electra. For our most computationally demanding simulation, the two
clusters combined reached a peak of 20 PFLOPS (single precision), i.e., 64% of their maximum
achievable performance, which represents the largest numerical computation in terms of sustained
FLOPs and number of nodes utilized ever run on NASA HPC clusters. Finally, we introduce a novel
multithreaded, cache-efficient tensor index permutation algorithm of general application.
I. INTRODUCTION
Building a universal, noise-resistant quantum com-
puter is to date a long-term goal driven by the strong
evidence that such a machine will provide large amounts
of computational power, beyond classical capabilities
[1–9]. An imminent milestone in that direction is rep-
resented by Noisy Intermediate-Scale Quantum (NISQ)
devices [10] of about 50-100 qubits. Despite the lack
of error correction mechanisms to run arbitrarily deep
quantum circuits, NISQ devices may be able to perform
tasks which already surpass the capabilities of today’s
classical digital computers within reasonable time and
energy constraints [11–13], thereby achieving quantum
supremacy [11–21].
Quantum circuit simulation plays a dual role in demon-
strating quantum supremacy. First, it establishes a clas-
sical computational bar that quantum computation must
pass to demonstrate supremacy. Indeed, formal complex-
ity proofs related to quantum supremacy are asymptotic,
and therefore assume an arbitrarily large number of
qubits [11–21]. This is only possible with a fault tolerant
quantum computer [13, 16, 22–27] (Note that the polyno-
mial approximation algorithms in Refs. [16, 25, 27] never
produce a better approximation than trivially sampling
bit-strings uniformly at random, as shown in Ref. [26].),
∗ salvatore.mandra@nasa.gov
and therefore a near term practical demonstration of
quantum supremacy must rely on a careful comparison
with highly optimized classical algorithms on state-of-
the-art supercomputers. Second, it also provides verifi-
cation that the quantum hardware is indeed performing
as expected up to the limits of classical computational
capabilities.
The leading near-term proposal for a quantum
supremacy experiment on NISQ devices is based on
the sampling of bit-strings from a random quantum cir-
cuit (RQC) [13, 17, 19, 21]. Indeed, under reasonable
assumptions, sampling from large RQCs is classically
unfeasible [11, 13, 14, 16, 17, 19, 21]. Further, these
quantum circuits appear to become difficult to simulate
at relatively small sizes and within error tolerances that
are expected to be implementable on early NISQ hard-
ware [13]. Here, we present a flexible simulator that both
raises the bar for quantum supremacy demonstrations
and provides expanded verification of quantum hardware
through sampling.
It is important to emphasize the difference between the
two tasks at hand: the verification of a NISQ device and
the computational task proposed for quantum supremacy,
as well as the role that a classical simulator plays in both
of them.
On the one hand, the fidelity of NISQ devices can
be estimated by computing the cross-entropy difference
(cross-entropy benchmarking, or XEB) between the ac-
tual output from the hardware given an RQC, and the
ar
X
iv
:1
81
1.
09
59
9v
3 
 [q
ua
nt-
ph
]  
10
 O
ct 
20
19
2corresponding exact output of that particular RQC us-
ing classical simulators, as proposed in Boixo et al. [13].
Note that this calculation requires the sampling of about
one million bit-strings from the device, and the computa-
tion of their exact amplitudes using a classical simulator.
For quantum circuits beyond the ability to compute am-
plitudes classically, XEB can no longer be performed.
Alternatively, close correspondence between experiments,
numerics, and theory up to that point, for a variety of cir-
cuits with combinations of fewer qubits, shallower depth,
or simpler-to-simulate circuits (e.g., more Clifford gates)
or architectures (see Methods B.1) of the same size, may
suggest by extrapolation that the hardware is performing
correctly at a particular fidelity.
On the other hand, the computational task proposed
for supremacy experiments consists of sampling a million
amplitudes from either the NISQ device or its classical
competitor at the same fidelity, for example 0.5%. A
quantum computer performing this sampling task beyond
the capabilities of the best state-of-the-art algorithms in
supercomputers would therefore achieve practical quan-
tum supremacy.
Here, we propose a flexible quantum circuit simulator
(qFlex) that raises the bar in the classical simulation of
RQCs, including the simulation of the Google Bristlecone
QPU. By design, our simulator is “blind” to the ran-
domness in the choice of single-qubit gates of the RQCs.
Therefore, it presents no fluctuations in performance
from one RQC to another. Moreover, by expanding on
a technique introduced in [28], including introducing
fine-grained “cuts” that enable us to judiciously bal-
ance memory requirements with number of independent
computations that can be done in parallel, our simu-
lator can output 1/f amplitudes with a target fidelity
f at the same computational cost to compute a single
perfect-fidelity amplitude; furthermore, we present an
alternative technique to simulate RQC sampling with
target fidelity f with the same speedup factor of 1/f .
In the last few years, many different simulators have
been proposed, either based on the direct evolution of the
quantum wave-function [13, 28–34], Clifford + T gate
sets [35], and tensor network contraction [36–39]. Tensor
network contraction based simulators have been particu-
larly successful in simulating RQCs for sizes close to the
quantum supremacy regime. Some recent simulators ex-
ploited [34, 38, 39] weaknesses in the design of the RQCs
presented in [13], and even introduced small changes
in the circuits that make them significantly easier to
simulate. These designs have been revised to remove
these weaknesses (see Ref. [28] and Methods A). Making
RQCs as difficult as possible to simulate is a key point
in the route towards quantum supremacy. At the same
time, a thorough exploration of optimizations that make
classical simulators as efficient as possible is essential so
that supremacy is not overclaimed when a NISQ device
surpasses classical capabilities. It is also important to
Brist lecone-40
Brist lecone-48 Brist lecone-60Brist lecone-64
Brist lecone-70 Brist lecone-72
Figure 1. Sub-lattices of interest of the full Bristlecone-72
(bottom right), ordered by increasing hardness for a given
depth. Note that Bristlecone-72 (entire lattice) is not harder
to simulate than Bristlecone-70, since the two corner tensors
can be contracted trivially at a negligible cost (see Meth-
ods B). Note also that Bristlecone-64 is similar in hardness
to Bristlecone-48, and substantially easier to simulate than
Bristlecone-60, as is discussed in Methods B and Results.
We identify a family of sub-lattices of Bristlecone, namely
Bristlecone-24, -30, -40, -48, -60 and -70, that are hard to
simulate classically, while keeping the number of qubits as
low as possible.
reiterate that the quantum supremacy computational
task of interest consists of producing a sample of bit-
strings within some variational distance of the output
distribution defined by a quantum circuit [13, 17, 19, 21].
This is very different from computing a single output
amplitude, as done in Ref. [39] (see Methods C).
Among the proposed classical approaches, it is
worth mentioning Markov et al.’s simulator [28]. Their
method is based on splitting I × J grids of qubits in
halves, which are then independently simulated [38].
To make the simulator more competitive, Markov
et al. introduce checkpoint states and reuse them
for different branches of a tree where internal nodes
represent Schmidt decompositions of cross-gates and
leaves represent simulation results for each tree path.
The number of independent circuits to simulate is
exponential in the number of projected CZ-gates that
cross from one half to the other. As part of their study,
the authors propose for the first time a technique to
“match” the target fidelity f of the NISQ device, which
3actually reduces the classical computation cost by a
factor f . By matching the fidelity of a realistic quantum
hardware (f = 0.51%), Markov et al. [36] were able to
simulate 7× 7 and 7× 8 grids with depth 1 + 40 + 1 by
numerically computing 106 amplitudes in respectively
582,000 hours and 1,407,000 hours on single cores.
However, the algorithm in [28] becomes less efficient
than our algorithm for grids beyond 8×8 qubits because
of memory requirements. Moreover, it is not well suited
for the simulation of the Google Bristlecone QPU.
Indeed, as we show here, the Google Bristlecone QPU
implements circuit topologies with a large diameter,
which increases the run time exponentially. In both
cases, one could mitigate the memory requirements
by either using distributed memory protocols like
MPI, or by partitioning the RQCs in more sub-circuits.
However, the aforementioned approaches introduce a
non-negligible slow-down that make them unpractical
(see SI C for more details).
To summarize, our tensor network based simulator
relies on four different points of strength:
Robustness. RQCs are mapped onto regular tensor
networks, where each tensor corresponds to a block of
the circuit enclosing several gates; consequently, 2D
grids of qubits, including the Bristlecone architecture,
are mapped onto 2D grids of tensors. Since the block-
ing operation removes any randomness in the resulting
tensor network topology (the only randomness left is
in the tensor entries themselves), our simulator is ro-
bust against fluctuations from RQC to RQC and to
changes of the rules to generate RQCs.
Flexibility. By computing an appropriate fraction
of “paths”, it is possible to control the “fidelity” of
the simulated RQCs, as first introduced in Ref. [28].
Therefore, our simulator can output 1/f amplitudes
with target fidelity f with the same computational
cost to compute one perfect amplitude, for almost any
f . This property is very important to “mimick” the
sampling from NISQ devices.
Scalability. By carefully choosing which cuts to ap-
ply to the RQCs, we are able to control the maximum
size of tensors seen during tensor contraction. Thanks
to the regularity of the resulting tensor network, to-
gether with a better memory management and a novel
cache-efficient tensor index permutation routine, we
are able to simulate circuits of as many as 72 qubits
and realistic circuit depths on NISQ architectures such
as Bristlecone.
Performance. To the best of our knowledge, our
tensor contraction engine is optimized beyond all the
existing CPU-based alternatives for contracting the
RQCs with the largest number of qubits studied in
this work.
Our analyses are supported by extensive simulations
on Pleiades (27th in the November 2018 TOP500 list)
and Electra (33rd in the November 2018 TOP500 list)
supercomputers hosted at NASA Ames Research Center.
In total, we used over 3.2 million core-hours and ran
six different numerical simulations (see Fig. 1 for nomen-
clature of Google Bristlecone):
[Run1] Bristlecone-64 (1+32+1): 1.2M amplitudes
with target fidelity 0.78%,
[Run2] Bristlecone-48 (1+32+1): 1.2M amplitudes
with target fidelity 0.78%,
[Run3] Bristlecone-72 (1+32+1): 10 amplitudes with
perfect fidelity,
[Run4] Bristlecone-72 (1+24+1): 43K amplitudes with
target fidelity 12.5%,
[Run5] Bristlecone-72 (1+24+1): 6000 amplitudes
with perfect fidelity,
[Run6] Bristlecone-60 (1+32+1): 1.15M amplitudes
with target fidelity 0.51%.
For the most computationally demanding simula-
tion we have run, namely sampling from a 60-qubit
sub-lattice of Bristlecone, the two systems combined
reached a peak of 20 PFLOPS (single precision), that is
64% of their maximum achievable performance, while
running on about 90% of the nodes. To date, this is
the largest computation run on NASA HPC clusters
in terms of peak PFLOPS and number of nodes used.
All Bristlecone simulation data are publicly available
(see Data Availability) and we plan to open source our
simulator in the near future.
This paper is structured as follows. Our results – with
an emphasis on our ability to both simulate the compu-
tational task run on the quantum computer, as well as to
compute perfect fidelity amplitudes for the verification of
the experiments – and discussion are presented in their
respective sections. In Methods A we review the rules for
generating the revised RQCs [28], which are based on the
constraints of the quantum hardware, while attempting
to make classical simulations hard. The hardness of the
revised RQCs motivates in part our simulator’s approach,
which is explained in Methods B, where both conceptual
and implementation details are discussed; here, we also
introduce a novel, cache-efficient algorithm for tensor
index permutation which takes advantage of multithread-
ing. In Methods C we discuss two methods to classically
sample from an RQC mimicking the fidelity f of the
output of a real device, while achieving a speedup in
performance of a factor of 1/f (see Ref. [28]); in addition,
we present a method to speedup the classical sampling
by a factor of about 10× that, under reasonable assump-
tions, is well suited to tensor network based simulators.
4We also discuss the implications of classically sampling
from a non fully-thermalized RQC. Methods D discusses
the hardness of simulating RQCs implemented on the
Bristlecone QPU as compared to those implemented on
square grids of qubits.
II. RESULTS
In this section we review the performance and the
numerical results obtained by running our simulations
[Run1-6] on the NASA HPC clusters Pleiades and Elec-
tra.
In the time of exclusive access to large portions of the
NASA HPC clusters, we were able to run for over 3.2
million core-hours. Although most of the computation
ran on varying portions of the supercomputers, for a
period of time we were able to reach the peak of 20
PFLOPS (single precision), that corresponds to 64%
of the maximum achievable performance for Pleiades
and Electra combined. For a comparison, the peak
for the LINPACK benchmark is 23 PFLOPS (single
precision, projected), which is only 15% larger than the
peak we obtained with our simulator. This is to date
the largest simulation (in terms of number of nodes
and FLOPS rate) run on the NASA Ames Research
Center HPC clusters. This is not a surprise since both
LINPACK and our simulation do the majority of work
in MKL routines (dgemm or cgemm and similar), in our
case due in part to the fact that our cache-efficient
memory reordering routines lower the tensor indexes
permutation bottleneck to a minimum.
Fig. 2 reports the distribution of the runtimes for a
single instance of each of the six simulations [Run1-6] for
both Pleiades and Electra. Interestingly, we observe a
split in the distribution of runtimes (see SI D for further
details). For our simulations run on Pleiades, we used
all the four available node architectures:
• 2016 Broadwell (bro) nodes: Intel Xeon E5-2680v4,
28 cores, 128GB per node.
• 2088 Haswell (has) nodes: Intel Xeon E5-2680v3,
24 cores, 128GB per node.
• 5400 Ivy Bridge (ivy) nodes: Intel Xeon E5-2680v2,
20 cores, 64GB per node.
• 1936 Sandy Bridge (san) nodes: Intel Xeon E5-
2670, 16 cores, 32GB per node.
For the Electra system, we used its two available node
architectures:
• 1152 Broadwell (bro) nodes: same as above.
• 2304 Skylake (sky) nodes: 2 × 20-core Intel Xeon
Gold 6148, 40 cores, 192GB per node.
Note that the Skylake nodes at Electra form a much
smaller machine than Pleiades, but substantially more
efficient, both time and energy-wise.
In Table I we report runtime, memory footprint, and
number of cores (threads) used for all six cases run on
NASA Pleiades and Electra HPC clusters. As we de-
scribe in Methods B, instances (which involve a certain
number of paths given a cut prescription, as well as a
batch size Nc, as introduced in Methods C.2) can be col-
lected for a large number of low fidelity amplitudes or for
a smaller number of high fidelity amplitudes at the same
computational cost. [Run1-6] were performed sharing
the Pleiades and Electra clusters on maintenance period,
which made nodes on both supercomputers become avail-
able and unavailable for our simulations without prior
notice. For this reason, the ZeroMQ software (more
suited for this sort of scheduling than MPI) was used
for scheduling different instances of the simulations; all
instances were scheduled from a master task. In practice,
instances corresponding to different paths of the same
amplitude were grouped together onto a single instance
and run sequentially on the same group of cores of a
single node (except for [Run3], which requires a large
number of paths, whose computations were also paral-
lelized across nodes); this provides some advantage due
to the reuse of tensors across paths (see SI A). Due to
the inhomogeneous nature of our two clusters, with five
different types of nodes across two supercomputers, each
job instance included an estimate of its memory foot-
print (see Table I), and was scheduled on any available
node with enough available memory. Given that the
number of instances per node was always smaller than
the number of cores, each instance was multithreaded,
using as many threads as the number of physical cores
given; cores were assigned proportionally to the memory
footprint of the instance. Note that both matrix multi-
plications and tensor index permutations take advantage
of multithreading (see Methods B.2). All the numerical
data gathered during the simulations [Run1-6], includ-
ing all the amplitudes, are publicly available (see Data
Availability).
Note that, after running our simulations on Pleiades
and Electra we have identified for Bristlecone-48 and -70
a better contraction procedure ([Run2b] and [Run3b],
respectively). This new contraction is is about twice
as fast as the one used in [Run2-3], which was similar
in approach to the contraction used for Bristlecone-60
(see SI B for more details); we include the estimated
benchmark of these new contractions as well.
In Table II we estimate the effective runtime needed
for the computation of 106 batches of amplitudes
(i.e., sampling 106 bit-strings) with a target fidelity
close to 0.5% on a single core, for different node
types. As one can see, the Bristlecone-60 sub-lattice is
almost 10× harder to simulate than the Bristlecone-64
sub-lattice, while Bristlecone-64 is only 2× harder than
5Bristlecone-48.
In the following, we report the (estimated) runtime
and energy consumption for both the tasks of verification
and sampling for rectangular grids of qubits, up to 8× 9,
as well as the full Bristlecone-70 layout. The estimation
is obtained by computing a small percentage of the calcu-
lations required for the full task. We would like to stress
that our simulator’s runtimes are independent of any
particular RQC instance and, therefore, our estimations
are quite robust.
Table III shows the estimated performance (runtimes
and energy consumption) of our simulator in computing
perfect fidelity amplitudes of output bit-strings of an
RQC (rectangular lattices and Bristlecone-70), for both
Pleiades and Electra. Runtimes are estimated assuming
that fractions of the jobs are assigned to each group
of nodes of the same type in a way that they all finish
simultaneously, thus reducing the total real time of the
run. The power consumption of Pleiades is 5MW, and
a constant power consumption per core, regardless of
the node type, is assumed for our estimations. For
Electra, the 2304 Skylake nodes have an overall power
consumption of 1.2MW, while the 1152 Broadwell nodes
have an overall power consumption of 0.44MW.
Classically sampling bit-strings from the output
state of an RQC involves the computation of a large
number (approximately one million) of low-fidelity
(about 0.5%) batches of probability amplitudes, as
better described in Methods C.1. Table IV shows the
estimated performance of our simulator in this task,
with runtimes and energy consumption requirements on
the two HPC clusters, Pleiades and Electra.
Finally, we compare our approach to the two leading
previously existing simulators of RQCs, introduced in
Ref. [39] (Alibaba) and Ref. [28] (MFIB) (see also Ta-
ble V), as well as to the recently proposed simulation
methods of Ref. [40] (Teleportation-Inspired Algorithm,
or TIA) and Ref. [41] (General-purpose quantum circuit
simulator, or GPQS).
Compared to Ref. [39], our simulator is between 3.6×
and 100× slower (see SI D for complementary details),
depending on the case. However, it is important to stress
that Ref. [39] reports the computational cost to simulate
a class of RQCs which is much easier to simulate than
the class of RQCs reported in Ref. [13]. Indeed, Chen
et al. fail to include the final layer of Hadamards in
their RQCs and use more T gates at the beginning of
the circuit. For these reasons, we estimate that such
class is about 1000× easier to simulate than the new
prescription of RQCs we present in this work. The com-
putational cost of simulating a circuit using Alibaba’s
simulator scales as 2TW, where TW is the treewidth of
the undirected graphical model of the circuit [37]. We
show in Fig. 3 the treewidths of the circuits simulated in
Ref. [39], the old prescription of the circuits [13] (with
and without the final layer of Hadamards), and the re-
vised prescription, for RQCs on a 7 × 7 × (1 + 40 + 1)
square grid. Note that with this number of qubits and
depth, the circuits simulated in Ref. [39] are (on average)
1000× easier or more than the revised ones. Here we
are comparing the treewidth of the circuits to simulate,
while Alibaba’s simulator applies first a preprocessing
algorithm that projects a well chosen subset of variables
in the undirected graphical model of the circuit; this
generates a number of simulations that is exponential
in the number of projected variables, but that decreases
the runtime of each of these simulations, which also
allows for parallelization. In our comparison, we are
assuming that the trade-off between the number of simu-
lations generated after the projections and the decrease
in runtime for each simulation is comparable between
different versions of the circuits, and hence the treewidth
is directly a good measure of the hardness of the simu-
lation of the circuits using the Alibaba simulator. It is
worth noting that the revised RQCs have no variation
in treewidth from one instance to another. In addition,
note that Ref. [39] reports runtimes corresponding to the
80 percentile best results, excluding the worst runtimes.
On the contrary, our runtimes have little fluctuations
and are RQC independent. Finally, in the absence of
an implementation of the fast sampling technique intro-
duced in Methods C.2, this simulator would suffer from
a multiplicative runtime overhead when using rejection
sampling (see Methods C).
Compared to Ref. [28], our simulator is 7× less effi-
cient to compute 106 amplitudes with fidelity 0.51% for
7×7 grids of qubits with depth 1 + 40 + 1, using the new
prescription of RQCs. However, it is important to note
that the runtime of MFIB’s simulator and our simulator
scale in completely different ways. Indeed, MFIB’s ap-
proach has the advantage to compute a large number of
amplitudes with a small cost overhead. On the contrary,
our approach performs much better in the computation
of a smaller subset of amplitudes; both methods use com-
parable resources when computing about 105 amplitudes
of a 7× 7× (1 + 40 + 1) RQC. Note also that MFIB’s
approach is limited by memory usage, and it scales un-
favorably compared to our simulator for circuits with a
large number of qubits (e.g., beyond 8× 8 rectangular
grids), with a large diameter (e.g., Bristlecone-60 and
-70), or both. For instance, Bristlecone-70 would require
825GB per node, which is currently unavailable for most
of HPC clusters. To mitigate the memory requirements,
one could either partition the RQCs in more sub-circuits,
or use distributed memory protocols like MPI. However,
both approaches introduce a non-negligible slow-down
that make them unpractical (see SI C for more details
on the impact further partitions have on the runtime,
as well as Ref. [43] for insight on the strong and weak
scaling of distributed wave function simulators).
6(a) (b)
Figure 2. (a) Distribution of the runtimes for a single instance of each of the six simulations [Run1-6] run on different node
architectures. An instance refers to a certain number of paths for a particular number of amplitudes (output bit-strings); see
Methods B.1 and Table I for more details. For clarity, all distributions have been normalized so that their maxima are all at
the same height. The nodes used on NASA HPC clusters Pleiades and Electra are: Broadwell (bro), Intel Xeon E5-2680v4
; Haswell (has), Intel Xeon E5-2680v3; Ivy Bridge (ivy), Intel Xeon E5-2680v2; Sandy Bridge (san), Intel Xeon E5-2670;
Skylake (sky), 2 × 20-core Intel Xeon Gold 6148 processors per node. (b) Same distribution as above, but the runtimes are
multiplied by the number of cores per job on a single node, to provide a fairer comparison. As one can see, Skylake nodes
provide generally the best performance, and belong to Electra, an energy efficient HPC cluster. The split of runtimes into
groups is discussed in SI D.
Sometime after this manuscript appeared on the arXiv,
two other simulators have been posted: TIA [40] and
GPQS [41]. TIA is used to compute single amplitudes
of RQCs. The most challenging case benchmarked us-
ing TIA largest number of logical qubits is the simu-
lation involving the computation of an amplitude for
Bristlecone-72, with depth 1 + 32 + 1; Bristlecone-72 is
trivially equivalent to Bristlecone-70 (see Methods B.1),
and therefore a comparison to our computation times
of Bristlecone-70 is in place. TIA takes 14.1 minutes
to compute one amplitude on 16384 Sunway SW26010
260C nodes, with 256 cores each, and a theoretical peak
performance of 3.05 TFLOPS per node. Our simula-
tor computes an amplitude in 4121.49 core-hours using
Skylake nodes (see Table I). A direct comparison of
core-hours between both simulators running on their
respective architectures results on our simulator being
239× faster than TIA. However, Taihu Sunlight uses
a different architecture, with slower cores compared to
Electra’s cluster of Skylake nodes. Therefore, for a fairer
comparison, we use both node’s theoretical peak perfor-
mance: 3.05 TFLOPS for Sunway SW26010 260C nodes,
and 3.07 TFLOPS for Electra’s Skylake nodes. Both
node types deliver a similar performance, which leads to
the estimation of our simulator being 37× more efficient
than TIA for the circuit considered in this comparison,
i.e., Bristlecone-70 (72) with depth 1 + 32 + 1. Note
that, while potentially adaptable to this simulator, in
the absence of an implementation of the fast sampling
technique, this simulator needs of the computation of a
few amplitudes in order to sample each bitstring, with
the corresponding multiplicative runtime overhead (see
Methods C and Ref. [40]).
GPQS is also used to compute amplitudes one at a
time, although could potentially implement the fast sam-
pling technique, and is closely related to our simulator
in that it first contracts the circuit tensor network in the
time direction. However, it then opts for a distributed
contraction across several nodes of a supercomputer
using the Cyclops Tensor Framework, as opposed to per-
forming cuts to allow for single-node contractions. On
the 7 × 7 × (1 + 40 + 1) computation of single perfect
fidelity amplitudes, which can be directly compared with
our equivalent computation, GPQS takes 31 minutes us-
ing a fraction of the Tianhe-2 supercomputer, delivering
a theoretical peak of 1.73 PFLOPS (double precision), as
reported by the authors. Our simulator takes 1.16×10−2
hours to compute a single perfect fidelity amplitude on
Electra (see Table III), which delivers a theoretical peak
of 8.32 PFLOPS. From a direct comparison between
both simulators, we estimate that our simulator is 9.26×
more efficient than GPQS for the aforementioned cir-
cuit instance. However, GPQS performs calculations
using double precision. If this simulator were to be
adapted for single precision calculations, we estimate
ours would still be 4.63× more efficient. This compar-
ison gives meaningful insight on the cost of relying on
inter-node communication, instead of cuts, for tensor
7Ali 40 Ali 41 v1 no H v1 v2
30.0
32.5
35.0
37.5
40.0
42.5
45.0
47.5
50.0
Tr
ee
wi
dt
h
Figure 3. From left to right, upper bounds of treewidths of
RQCs on a 7×7 square lattice simulated in: [Ali 40] Ref. [39]
with depth (1+40) (note that no layer of Hadamards is added
at the end of the circuit); [Ali 41] Ref. [39] with depth (1+41);
[v1 no H] old prescription of the RQCs [13] without the final
layer of Hadamards and depth (1+41); [v1] old prescription
of the RQCs [13] with the final layer of Hadamards and depth
(1+40+1); [v2] revised prescription of the RQCs [28] with
depth (1+40+1). Note that in all cases the tree width of
the RQCs is substantially larger than that ones simulated in
Ref. [39], making the simulations about 213× or 214× harder
(on average). Moreover, fluctuations in the treewidth for the
revised prescriptions of RQCs are completely absent. The
upper bounds were obtained by running quickbb [42] with
settings --time 60 --min-fill-ordering.
network contractions of the sizes relevant to supremacy
circuit sizes.
III. DISCUSSION
In this work, we introduced a flexible simulator, based
on tensor contraction (qFlex), to compute both exact
and noisy (with a given target fidelity [28]) amplitudes
of the output wave-function of a quantum circuit. While
the simulator is general and can be used for a wide
range of circuit topologies, it is well optimized for quan-
tum circuits with a regular design, including rectangular
grids of qubits and the Google Bristlecone QPU. To
test the performance of our simulator, we focused on
the benchmark of random quantum circuits (RQCs) pre-
sented in Refs. [13, 28] for both the 2-D grids (7 × 7,
8× 8 and 8× 9) and the Google Bristlecone QPU (24,
48, 60, 64, and 70 qubits). Compared to some existing
methods [34, 38, 39], our approach is more robust to
modifications in the class of circuits to simulate and
performs well on the redesigned, harder class of RQCs.
While other benchmarks exploit [34], and sometimes
introduce [38, 39], weaknesses in particular ensembles
of random quantum circuits that affect their reported
performance significantly, our runtimes are directly de-
termined by the number of full lattices of two-qubit gates
at a given depth (see Fig. 6).
Our performance analyses are supported by extensive
simulations on Pleiades (24th in the November 2018
TOP500 list) and Electra (43rd in the November 2018
TOP500 list) supercomputers hosted at NASA Ames
Research Center. Among other “diamond-shaped”
lattices of qubits benchmarked, which are likely to
be used for supremacy experiments, our simulator is
able to compute exact amplitudes for the benchmark
of RQCs using the full Google Bristlecone QPU with
depth 1+32+1 in less than (f · 4200) hours on a single
core, with f the target fidelity. This corresponds to
210 hours in Pleiades or 264 hours in Electra for 106
amplitudes with fidelity close to 0.5,% a computation
needed to perform the RQC sampling task. All our
data are publicly available to use (see Data Availability).
At first sight, compared to Alibaba’s simulator [39],
our simulator is between 3.6× and 100× slower, depend-
ing on the case. However, Alibaba’s simulator heav-
ily exploits the structure of RQCs and its performance
widely varies from one RQC instance to another. Indeed,
Ref. [39] reports only runtimes corresponding to the 80th
percentile best results, excluding the worst runtime. In
contrast, our runtimes have little variation in perfor-
mance between instances and are independent of RQC
class. Moreover, Ref. [39] fails to include the final layer
of Hadamards and uses fewer non-diagonal gates at the
beginning of the circuit which, we estimate, makes the
corresponding circuits much easier to simulate: approxi-
mately 1000× easier for the 7×7×(1+40+1) circuit. We
would like to encourage the reporting of benchmarking
against the circuit instances publicly available in [44] in
order to arrive at meaningful conclusions.
Compared to Ref. [28], our simulator is 7× less efficient
(on Electra Skylake nodes) to compute 106 amplitudes
with fidelity 0.51% for 7× 7 grids of qubits with depth
1+40+1. However, compared to Ref. [28] our simulator
scales better on grids beyond 8× 8 and on circuits with
a large number of qubits and diameter, including the
Bristlecone QPU and its sub-lattices Bristlecone-60 and
-70.
Compared to Ref. [40], our simulator is 37× more
efficient in computing an amplitude of Bristlecone-70
at depth 1+32+1, which is equivalent in hardness to
Bristlecone-72 (see Methods B.1).
Compared to Ref. [41], our simulator is more than
9× more efficient in computing an amplitude of a 7× 7
circuit of depth 1 + 40 + 1.
In addition, we were able to simulate (i.e., compute
over 106 amplitudes) RQCs on classically hard sub-
lattices of the Bristlecone of up to 60 qubits with depth
(1+32+1) and fidelity comparable to the one expected in
8the experiments (around 0.50%) in effectively well below
half a day using both Pleiades and Electra combined.
We also discussed the classical hardness in simulating
sub-lattices of Bristlecone as compared to rectangular
grids with the same number of qubits. Our theoretical
study and numerical analyses show that simulating
the Bristlecone architecture is computationally more
demanding than rectangular grids with the same
number of qubits and we propose a family of sub-lattices
of Bristlecone to be used in experiments that make
classical simulations hard, while keeping the number
of qubits and gates involved as small as possible to
increase the overall fidelity.
As a final remark, we will explore using our approach
and extensions to simulate different classes of quantum
circuits, particularly those with a regular structure, in-
cluding quantum circuits for algorithms with potential
applications to challenging optimization and machine
learning problems arising in aeronautics, Earth science,
and space exploration, as well as to simulate many-body
systems for applications in material science and chem-
istry.
IV. METHODS
A. Revised set of Random Quantum Circuits
In this section, we review the prescription to generate
RQCs proposed originally by Google [13], and its revised
version [28]. This prescription can be used to generate
RQCs for 2D square grids, including the Bristlecone
architecture (which is a diamond shaped subset of a 2D
square grid). The circuit files used for the numerical
simulations in this paper are publicly available in [44].
Given a circuit depth and circuit topology of n qubits,
Google’s RQCs [13, 28] are an ensemble of quantum
circuits acting on a Hilbert space of dimension N = 2n.
The computational task consists of sampling bit-strings
as defined by the final output.
Due to the limitation of the current technology and the
constraints imposed by the quantum hardware, circuits
are randomly generated using the following prescription:
(1) Apply a first layer of Hadamard (H) gates to all
the qubits.
(2) After the initial layer (1), subsequent layers of
two-qubit gates are applied. There are 8 different
layers, which are cycled through in a consistent
order (see Fig. 4).
(3) Within these layers, for each qubit that is not being
acted upon by a two-qubit gate in the current layer,
and such that a two-qubit gate was acting on it
in the previous layer, randomly apply (with equal
probability) a gate in the set {X1/2, Y1/2}.
Figure 4. Layout of two-qubit gates and the corresponding
cycle order (from 1 to 8). This layout can be tiled over 2D
square grids of arbitrary size. The Bristlecone architecture
is a diamond shaped subset of such a 2D grid. For our
simulations, we use CZ gates as the two-qubit gate.
(4) Within these layers, for each qubit that is not being
acted upon by a two-qubit gate in the current
layer, and was acted upon by a gate in the set
{X1/2, Y1/2, H} in the previous layer, apply a T
gate.
(5) Apply a final layer of H gates to all the qubits.
The depth of a circuit will be expressed as 1 + t+ 1,
where the prefix and suffix of 1 explicitly denote the
presence of an initial and a final layer of Hadamard gates.
For our simulations, as was done in prior RQC works,
we use the CZ gate as our two-qubit gate. One of the
differences between the original prescription [13] and
this new prescription [28] for the generation of RQCs
is that we now avoid placing T gates after CZ gates.
If a T gate follows a CZ gate, this structure can be
exploited to effectively reduce the computational cost
to simulate the RQCs, as was done in [34, 38, 39]. The
revised RQC formulation ensures that each T gate is
preceded by a {X1/2, Y1/2, H} gate, which foils this
exploit. In addition, the layers of two-qubit gates have
been reordered, in order to avoid consecutive “horizontal”
or “vertical” layers, which is known to make simulations
easier. Finally, it is important to keep the final layer of
H gates, as otherwise multiple two-qubit gates at the
end of the circuit can be simplified away, making the
simulation easier [13].
Replacing CZ gates with iSWAP =
(|00〉〈00|+ |11〉〈11|+ i |01〉〈10|+ i |10〉〈01|) gates is
known to make the circuits yet harder to simulate. More
precisely, an RQC of depth 1 + t + 1 with CZ gates is
equivalent, in terms of simulation cost, to an RQC of
depth 1 + t/2 + 1 with iSWAPs. In future work, we will
benchmark our approach on these circuits as well.
9(a) (b)
Figure 5. (a) 3D grid of tensors obtained by contracting
8 consecutive layers of CZ gates, including the single qubit
gates. (b) Example of a typical block of 8 layers of gates on
a single qubit; note that the qubit shares one CZ gate with
each of its four neighbors per block.
B. Overview of the simulator
A given quantum circuit can always be represented
as a tensor network, where one-qubit gates are rank-2
tensors (tensors of 2 indexes with dimension 2 each),
two-qubit gates are rank-4 tensors (tensors of 4 indexes
with dimension 2 each), and in general n-qubit gates
are rank-2n tensors. The computational and memory
cost for the contraction of such networks is exponential
with the number of open indexes and, for large enough
circuits, the network contraction is unpractical; nonethe-
less, it is always possible to specify input and output
configurations in the computational basis through rank-
1 Kronecker deltas over all qubits, which can vastly
simplify the complexity of the tensor network. This rep-
resentation of quantum circuits gives rise to an efficient
simulation technique, first introduced in Ref. [36], where
the contraction of the network gives amplitudes of the
circuit at specified input and output configurations.
Our approach allows the calculation of amplitudes
of RQCs through the contraction of their correspond-
ing tensor networks, as discussed above, but with an
essential first step, which we now describe. One of the
characteristics of the layers of CZ gates shown in Fig. 4
is that it takes 8 cycles for each qubit to share one, and
only one, CZ gate with each of its neighbors. This prop-
erty holds for all subsets of a 2D square grid, including
the Bristlecone architecture. Therefore, it is possible to
contract every 8 layers of the tensor network correspond-
ing to an RQC of the form described in Methods A onto
an I × J two-dimensional grid of tensors, where I and J
are the dimensions of the grid of qubits. While in this
work we assume that the number of layers is a multiple
of 8, our simulator can be trivially used for RQCs with a
depth that is not a multiple of 8. The bond dimensions
between each tensor and its neighbors are the Schmidt
rank of a CZ gate, which (as for any diagonal two-qubit
(a) (b)
Figure 6. (a) Contraction of the 3D grid of tensors (see Fig. 5)
in the time direction to obtain (b) a 2D grid of tensors.
gate) is equal to 2 (note that for iSWAP the Schmidt
rank is equal to 4, thus effectively doubling the depth of
the circuit as compared to the CZ case). After contract-
ing each group of 8 layers in the time direction onto a
single, denser layer of tensors, the RQC is mapped onto
an I×J×K three-dimensional grid of tensors of indexes
of bond dimension 2, as shown in Fig. 5, where K = t/8,
and 1 + t+ 1 is the depth of the circuit (see Methods A).
Note that the initial (final) layer of Hadamard gates, as
well as the input (resp. output) delta tensors, can be
trivially contracted with the initial (resp. final) cycle of
8 layers of gates. At this point, the randomness of the
RQCs appears only in the entries of the tensors in the
tensor network, but not in its layout, which is largely
regular, and whose contraction complexity is therefore in-
dependent of the particular RQC instance at hand. This
approach contrasts with those taken in Refs. [34, 37, 39],
which propose simulators that either benefit from an
approach tailored for each random instance of an RQC,
or take advantage of the particular layout of the CZ
layers.
The contraction of the resulting 3D tensor network
described above (see Fig. 5) in order to compute the
amplitude corresponding to specified initial and final
bit-strings is described in the following Methods B.1.
1. Contraction of the 3D tensor network
In this section, we describe the contraction procedure
followed for the computation of single perfect-fidelity
output amplitudes for the 3D grid of tensors described
in the previous section.
Starting from the 3D grid of tensors of Fig. 5, we first
contract each vertical (K direction) column of tensors
onto a single tensor of at most 4 indexes of dimension 2K
each (see left panel of Fig. 6). Note that for the circuit
sizes and depths we simulate, K is always smaller than
I and J , and so this contraction is always feasible in
terms of memory, fast, and preferable to a contraction in
either the direction of I or J . This results in a 2D grid
10
(a)
(b)
Figure 7. (a) Sketch of the contraction procedure followed to
obtain one path of one amplitude of the Bristlecone-70 with
depth (1+32+1). We first make four cuts of dimension 24
each, leaving us with 216 paths; for each path, we contract all
tensors on region A, and all tensors on region B; then tensors
A and B are contracted together; finally, tensor C (which is
independent of chosen path, and can in addition be computed
very efficiently) is contracted with AB, which obtains the
contribution of this path to this particular amplitude. (b)
Corresponding regions A, B, and C for the Bristlecone-24,
-48, -60, and -64. Note that both the Bristlecone-48 and the
Bristlecone-64 need 2 cuts of dimension 24 each, while the
Bristlecone-60 needs three of such cuts, making it a factor
of 24 times harder than Bristlecone-64, even though it has 4
qubits less.
of tensors of size I×J , where all indexes have dimension
2K (see right panel of Fig. 6). Note that contracting in
the time direction first is done at a negligible cost, and
reduces the number of high complexity contractions to
only the ones left in the resulting 2D grid.
While we have focused so far on the steps leading to
the 2D square grid tensor network of Fig. 6, it is easy
to see that the Bristlecone topology (see Bristlecone-72
in Fig. 1) is a sub-lattice of a square grid or qubits,
and so all considerations discussed up to this point are
applicable. Even though Bristlecone has 72 qubits, the
top-left and bottom-right qubits of the network can be
contracted trivially with their respective only neighbor,
adding no complexity to our classical simulation of
RQCs. For this reason, without loss of generality, we
“turn off” those two qubits from the Bristlecone lattice,
and work only with the resulting sub-lattice, which we
call Bristlecone-70 (see Fig. 1). For the remainder of
this section, we will focus on Bristlecone-70 and other
sub-lattices of Bristlecone (see sub-lattices considered in
Fig. 1), and we will refer back to square grids of qubits
in later sections.
Cutting indexes and the contraction of the 2D tensor
network — From Fig. 1, it is easy to see that it is not
possible to contract the Bristlecone-70 tensor network
without generating tensors of rank 11, where each index
has dimension 2K . For a circuit of depth 1 + 32 + 1 and
K = 4, the dimension of the largest tensors is 211×4,
which needs over 140 TB of memory to be stored using
single precision floating point complex numbers, far be-
yond the RAM of a typical HPC cluster node (between 32
GB and 512 GB). Therefore, to avoid the memory bottle-
neck, we decompose the contraction of the Bristlecone-70
tensor network into independent contractions of several
easier-to-compute sub-networks. Each sub-network is
obtained by applying specific “cuts”, as is described
below.
Given a tensor network with n tensors and a set of
indexes to contract {il}l=1,...,
∑
i1,i2,...
T 1T 2 . . . Tn, we
define a cut over index ik as the explicit decomposition of
the contraction into
∑
ik
(∑
{il}l=1,...−{ik} T
1T 2 . . . Tn
)
.
This implies the contraction of dim(ik) many ten-
sor networks of lower complexity, namely each of the∑
{il}l=1,...−{ik} T
1T 2 . . . networks, where tensors involv-
ing index ik decrease their rank by 1, fixing the value
of ik to the particular value given by the term in∑
ik
. This procedure, equivalent to the ones used in
Refs. [28, 34, 38, 39], reduces the complexity of the re-
sulting tensor network contractions to computationally
manageable tasks (in terms of both memory and time),
at the expense of creating exponentially many contrac-
tions. The resulting tensor networks can be contracted
independently, which results in a computation that is
embarrassingly parallelizable. It is possible to make
more than one cut on a tensor network, in which case
ik refers to a multi-index; the contribution to the final
sum of each of the contractions (each of the values of the
multi-index cut) is called a “path”, and the final value
of the contraction is the sum of all path contributions.
For the Bristlecone-70 example with depth (1+32+1),
making four cuts, as shown in Fig. 7 (a), decreases the
size of the maximum tensor stored during the contrac-
tion from 211×4 to 27×4 entries, at the price of 24×4
contractions to be computed. At the same time, the
choice of cuts aims at lowering the number of high com-
plexity contractions needed per path, as well as lowering
the number of largest tensors held simultaneously in
memory. Note that for Bristlecone-60, tensors A and
B are both equally large, and that the number of high
complexity contractions is larger than for a single path
of Bristlecone-70.
After making these cuts, the contraction of each path
is carried out in the following way (see Fig. 7): first, we
contract all tensors within region A onto a tensor of rank
7 (tensor A); we do the same for tensor B; then tensors
A and B are contracted onto a rank-6 tensor, AB; finally,
tensor C is contracted, which does not depend on the
particular path at hand, followed by the contraction of
AB with C onto a scalar. In Fig. 7 (b) we depict the
corresponding A, B, and C regions for the sub-lattices of
Bristlecone we use in our simulations, as well as the cuts
needed to contract the resulting tensor networks using
the described method, in particular for Bristlecone-48,
11
-60, and -64. Note that that Bristlecone-48 and -64 need
both two cuts of depth 4, making them similar to each
other in complexity, while Bristlecone-60 needs three
cuts, making it substantially harder to simulate.
We identify a family of sub-lattices of Bristlecone,
namely Bristlecone-24, -30, -40, -48, -60 and -70, that
are hard to simulate classically, while keeping the number
of qubits and gates as low as possible. Indeed, the fidelity
of a quantum computer decreases with the number of
qubits and gates involved in the experiment [13], and so
finding classically hard sub-lattices with a small number
of qubits is essential for quantum supremacy experi-
ments. It is interesting to observe that Bristlecone-64 is
an example of a misleadingly large lattice that is easy
to simulate classically (see Results for our numerical
results).
Note that the rules for generating RQCs cycle over
the layers of two-qubit gates depicted in Fig. 4. In the
case that the cycles or the layers are perturbed, our
simulator can be trivially adapted. In particular: 1)
if the layers are applied in a different order, but the
number of two-qubit gates between all pairs of neighbors
is the same, then the 2D grid tensor network of Fig. 6
still holds, and the contraction method can be applied
as described; 2) if there is a higher count of two-qubit
gates between some pairs of neighbors than between
others, then the corresponding anisotropy in the bond
dimensions of the 2D tensor network can be exploited
through different cuts.
Remarks on the choice of cuts and contraction order-
ing — The cost of contracting a tensor network depends
strongly on the contraction ordering chosen and is a
topic covered in the literature (see Ref. [36, 37]); deter-
mining the optimal contraction ordering is an NP-hard
problem. Given an ordering, the leading term to the
time complexity of the contraction is given by the most
expensive contraction between two tensors encountered
along the contraction of the full network; the optimal
ordering given this cost model is closely related to the
treewidth of the line-graph of the tensor network. How-
ever, a more practical approach to determining the cost
of a particular contraction ordering is the FLOP count
of the contraction of the entire network (i.e., the num-
ber of scalar additions and multiplications needed); this
accounts for cases where a single high-complexity con-
traction is preferable to a large number of low-complexity
ones. The latter cost model is commonly used in order
to determine the optimal ordering for the contraction
of tensor networks (e.g. in Ref. [39]). Note that, al-
though the choice of a contraction ordering affects also
its memory complexity, by what we mean the memory
required to perform the contraction, this is usually a
smaller concern as compared to time complexity.
An even more realistic approach needs to consider the
performance efficiency of the different tensor contrac-
tions, i.e., the delivered FLOP/s of each contraction. In
particular, modern computing architectures benefit from
high arithmetically intensive contractions, i.e., contrac-
tions with a large ratio between the FLOP count and
the number of reads and writes from and to memory.
Highly unbalanced matrix multiplications, for example,
present low arithmetic intensity, while the multiplication
of large squared matrices shows high arithmetic inten-
sity, and therefore achieves a performance very close
to the theoretical peak FLOP/s of a particular CPU.
This principle (which prioritizes time-to-solution over
FLOP-count-to-solution) is at the heart of our choice to
contract the quantum circuits first into blocks and subse-
quently along the “time” direction. This choice, beyond
“regularizing” the tensor network, serves two purposes,
aimed at decreasing the overall time-to-solution in prac-
tice: 1) the vast majority of contractions are performed
in these first steps, and are done in a negligible amount
of time, leaving most of the computation to a small
number of subsequent high-complexity contractions; and
2) the remaining high-complexity contractions have high
arithmetic intensity and achieve therefore high efficiency.
In addition, contracting all blocks in the time direction
first reduces the memory requirement of the subsequent
contraction as compared to keeping the tensor network
in its three dimensional version.
The cost model discussed here becomes more intricate
when cuts are considered. Each choice of cuts not only
has a different impact on the memory complexity of the
resulting, simplified tensor networks, but it can also sub-
stantially affect their time complexity. As was explained
in Methods B.1, the main purpose of the cuts is reducing
the memory requirement of the resulting contractions to
fit the limits of each computation node; this is indeed
the main factor taken into account when choosing a set
of cuts. However, the right choice of cuts can also allow
us to, again, have a small number of high arithmetic
intensity contractions. This is the case with the con-
tractions depicted in Fig. 7, where the main bottleneck
is given by the contraction of A with B, which is very
arithmetically intensive; see also Fig. A1 for an example
on a square lattice.
Finally, it is worth noting two more factors on the run-
time of a contraction. First, if the memory requirement
of a contraction is well below the limits of a computing
node, then several contractions can be run in parallel.
In these cases, there is a trade-off between the number
of cuts made and the number of contractions run in par-
allel. Second, the choice of cuts can affect the number
of tensors reused across different paths (which is done at
a memory cost), which can have a substantial impact on
the computation time of an amplitude, as can be seen
in the examples of SI A.
Although a cost model involving all the factors dis-
cussed above can be well characterized, automatically
optimizing the cuts chosen and the contraction ordering,
12
together with the tensors reused across paths, is beyond
the scope of this work. In practice, we study different
configurations “by hand”. Note that, once the tensor
networks have a certain level of regularity, it is not hard
to make a choice that is close to optimal.
2. Implementation of the simulator
We implemented our tensor network contraction
engine for CPU-based supercomputers using C++. We
have planned to release our tensor contraction engine in
the near future. During the optimization, we were able
to identify two clear bottlenecks in the implementation
of the contractions: the matrix multiplication required
for each index (or multi-index) contraction, and the
reordering of tensors in memory needed to pass the
multiplication to a matrix multiplier in the appropriate
storage order (in particular, we always use row-major
storage). In addition, to avoid time-consuming alloca-
tions during the runs, we immediately allocate large
enough memory to be reused as scratch space in the
reordering of tensors and other operations.
Matrix multiplications with Intel R©MKL — For the
multiplication of two large matrices that are not
distributed over several computational nodes, Intel’s
MKL library is arguably the best performing library
on Intel CPU-based architectures. We therefore leave
this essential part of the contraction of tensor networks
to MKL’s efficient, hand-optimized implementation
of the BLAS matrix multiplication functions. Note
that, even though we have used MKL, any other BLAS
implementation could be used as well, and other linear
algebra libraries could be used straightforwardly.
Cache-efficient index permutations The permutation of
the indexes necessary as a preparatory step for efficient
matrix multiplications can be very costly for large ten-
sors, since it involves the reordering of virtually all entries
of the tensors in memory; similar issues have been an
area of study in other contexts [45–47]. In this section
we describe our novel cache-efficient implementation of
the permutation of tensor indexes.
Let Ai0,...,ik be a tensor with k indexes. In our imple-
mentation, we follow a row-major storage for tensors,
a natural generalization of matrix row-major storage
to an arbitrary number of indexes. In the tensor net-
work literature, a permutation of indexes formally does
not induce any change in tensor A. However, given a
storage prescription (e.g., row-major), we will consider
that a permutation of indexes induces the corresponding
reordering of the tensor entries in memory. A naive
implementation of this reordering routine will result in
an extensive number of cache misses, with poor perfor-
mance.
We implement the reorderings in a cache-efficient
way by designing two reordering routines that ap-
ply to two special index permutations. Let us di-
vide a tensor’s indexes into a left and a right group:
Ai0, . . . , ij︸ ︷︷ ︸ ij+1, . . . , ik︸ ︷︷ ︸. If a permutation involves only
indexes in the left (right) group, then the permutation
is called a left (resp. right) move. Let γ be the num-
ber of indexes in the right group. We will denote left
(resp. right) moves with γ indexes in the right group by
Lγ (resp. Rγ). The importance of these moves is that
they are both cache-efficient for a wide range of values
of γ and that an arbitrary permutation of the indexes
of a tensor can be decomposed into a small number of
left and right moves, as will be explained later in this
section. Let dγ be the dimension of all γ right indexes
together. Then left moves involve the reordering across
groups of dγ entries of the tensor, where each group of
dγ entries is contiguous in memory and is moved as a
whole, without any reordering within itself, therefore
largely reducing the number of cache misses in the rou-
tine. On the other hand, right moves involve reordering
within all of dγ entries that are contiguous in memory,
but involves no reordering across groups, hence greatly
reducing the number of cache misses, since all reorder-
ings take place in small contiguous chunks of memory.
Fig. 8 shows the efficiency of Rγ and Lγ as compared
to a naive (but still optimized) implementation of the
reordering that is comparable in performance to python’s
numpy implementation. A further advantage of the left
and right moves is that they can be parallelized over
multiple threads and remain cache-efficient in each of
the threads. This allows for a very efficient use of the
computation resources, while the naive implementation
does not benefit from multi-threading.
Let us introduce the decomposition of an arbitrary per-
mutation into left and right moves through an example.
Let Aabcdefg be a tensor with 7 indexes of dimension d
each. Let abcdefg → cfeadgb be the index permutation
we wish to perform. Furthermore, let us assume that
it is known that L2 and R4 are cache-efficient. Let us
also divide the list of 7 indexes of this example in three
groups: the last two (indexes 6 and 7), the next group
of two indexes from the right (indexes 4 and 5), and the
remaining three indexes on the left (1, 2, and 3). We
now proceed as follows. First, we apply an L2 move
that places all indexes in the left and middle groups that
need to end up in the rightmost group in the middle
group; in our case this is index b, and the L2 we have in
mind is abc|de︸ ︷︷ ︸
L2
|fg → cae|bd︸ ︷︷ ︸
L2
|fg; note that if the middle
group is at least as big as the rightmost group, then it
is always possible to do this. Second, we apply an R4
move that places all indexes that need to end up in the
rightmost group in their final positions; in our case, that
13
0 5 10 15 20 25
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
tim
e 
(s
)
Tensor with 25 indexes of dimension 2 each
Naive L R
5 10 15
0.0
0.1
0.2
2 4 6 8
num threads
0.00
0.05
0.10
0.15 L5
R10
Figure 8. Single thread computation times on Broadwell
nodes on Pleiades for an arbitrary permutation of the in-
dexes of a tensor of single precision complex entries (and 25
indexes of dimension 2 each) following an optimized, naive
implementation of the reordering (green), an arbitrary Lγ
move (red), and an arbitrary Rγ move (blue). The optimized,
naive approach performs comparably to python’s numpy im-
plementation of the reordering. Note that, for a wide range of
γ, left and right moves are very efficient. Left inset: zoomed
version of the main plot. For γ ∈ [5, 10], both right and left
moves are efficient. Right inset: computation times for L5
and R10 (used in practice) as a function of the number of
threads used.
is cae| bd|fg︸ ︷︷ ︸
R4
→ cae| fd|gb︸ ︷︷ ︸
R4
; note that, if the first move
was successful, then this one can always be done. Finally,
we take a final L2 move that places all indexes in the
leftmost and middle groups in their final positions, i.e.,
cae|fd︸ ︷︷ ︸
L2
|gb → cfe|ad︸ ︷︷ ︸
L2
|gb. We have decomposed the per-
mutation into three cache-efficient moves, Lµ−Rν−Lµ,
with µ = 2, ν = 4. Note that it is essential for this
particular decomposition that the middle group has at
least the same dimension as the rightmost group. It is
also crucial that both Lµ and Rν are cache-efficient.
In practice, we find that (beyond the above example,
where µ = 2 and ν = 4) for tensors with binary indexes,
µ = 5 and ν = 10 are good choices for our processors
(see Fig. 8). If the tensor indexes are not binary, this
approach can be generalized: if all indexes have a
dimension that is a power of 2, then mapping the
reordering onto one involving explicitly binary indexes
is trivial; in the case where indexes are not all powers of
2, then different values of µ and ν could be found, or
decompositions more general than Lµ−Rν − Lµ could
be thought of. In our case, we find good results for the
L5−R10− L5 decomposition. Note also that in many
cases a single R or a single L move is sufficient, and
sometimes a combination of only two of them is enough,
0 2 4 6 8
num. threads
0
100
200
300
400
500
600
700
tim
e 
(s
)
Bristlecone-60, depth 1+32+1
1 2 3
300
400
500
0 2 4 6 8 10 12
num. threads
0
2000
4000
6000
8000
10000
7x7, depth 1+40+1
12 13 14800
1000
1200
1400
Naive, single job
Cache-eff., single job
Naive, concurrent jobs
Cache-eff., concurrent jobs
Figure 9. Comparison of the runtime of the computation of
an amplitude using a naive index permutation implementa-
tion (comparable to python’s numpy implementation) and
the cache-efficient implementation described in Methods B.2,
using Skylake nodes on Electra. Left: computation of three
paths of an amplitude of Bristlecone-60 with depth 1+32+1,
corresponding to a target fidelity f = 3/212 (see Results).
Right: computation of three paths of an amplitude of a grid
of size 7× 7 with depth 1 + 40 + 1, corresponding to a tar-
get fidelity f = 3/210 (see also Results). Runtimes using a
varying number of threads (one threads per physical core) in
an otherwise idle node are presented (single job), as well as
runtimes for the case where several amplitudes are computed
on the same node, fitting as many concurrent computations
as possible (concurrent jobs, also shown in the insets for
clarity), a number that is constrained by memory require-
ments; the latter is the case for the simulations presented in
Results. Cache-efficient runs show always better runtimes.
The scaling of runtime with number of threads is better for
the cache-efficient runs, given multithreading; note that the
matrix-matrix multiplication part of the contraction is al-
ways multithreaded. Concurrent jobs suffer from contention,
which is larger for the cache-efficient runs, presumably due
to a larger use of bandwidth; however, runtimes are still
substantially improved in this case.
which can accelerate contractions by a large factor.
We apply a further optimization to our index permuta-
tion routines. A reordering of tensor entries in memory
(either a general one or some of Rγ or Lγ moves)
involves two procedures: generating a map between
the old and the new positions of each entry, which
has size equal to the dimension of all indexes involved,
and applying the map to actually move the entries in
memory. The generation of the map takes a large part
of the computation time, and so storing maps that have
already been used in a look-up table (memoization), in
order to reuse them in future reorderings, is a desirable
technique to use. While the size of such maps might
make this approach impractical in general, for left and
right moves memoization becomes feasible, since the size
of the maps is now exponentially smaller than in the
general case due to left and right moves only involving a
subset of indexes. In the contraction of regular tensor
networks we work with maps reappear often, and so
14
memoization proves very useful.
The implementation of the decomposition of general
permutations of indexes into left and right moves, with
all the details discussed above, give us speedups in the
contractions that range from under 5% in single-threaded
contractions that are dominated by matrix multiplica-
tions, to well over 50% in multithreaded contractions that
are dominated by reorderings. A detailed comparison of
runtimes using a naive approach and the cache-efficient
approach discussed is shown in Fig. 9. The contraction
corresponding to the simulation of Bristlecone-60 with
depth 1+32+1 (presented in Methods B.1) is dominated
by reorderings, since a large part of the runtime is spent
in building tensor A, which involves a large number of
“unbalanced” contractions, where the reordering of a large
tensor is followed by the multiplication of a large matrix
with a small one. The contraction corresponding the
a 7× 7 grid of depth 1 + 40 + 1 (presented in SI A) is
dominated by a few “balanced” contractions of large ten-
sors, and so the overall runtime is dominated by matrix-
matrix multiplication. In this case, larger speedups are
still achieved using a large number of threads, due to
multithreading, and the speedup using a single threads is
appreciable but small; in practice, we use 13 threads per
job on Skylake nodes, where we can only fit three jobs
per node due to memory constraints, and a larger num-
ber of threads per job on other node types. Finally, it
is worth mentioning that runtimes are still robust when
using cache-efficient, multithreaded index permutations,
showing small variation among runs.
C. Fast sampling of bit-strings from low fidelity
RQCs
While the computation of perfect fidelity amplitudes
of output bit-strings of RQCs is needed for the verifica-
tion of quantum supremacy experiments [13], classically
simulating sampling from low fidelity RQCs is essential
in order to benchmark the performance of classical su-
percomputers in carrying out the same task as a noisy
quantum computer. Indeed, present day quantum com-
puters suffer from noise and errors in each gate. In the
commonly used digital error model [13, 48–53], the total
error probability for a RQC is the sum of the probability
of error from each gate. The fidelity, or probability of
no errors, of a quantum computer or of a classical algo-
rithm for RQC sampling can be estimated with XEB [13].
Therefore, we only require the same value for the cross-
entropy or fidelity in the classical algorithm as in the
noisy quantum computer. A superconducting quantum
processor with present day technology is expected to
achieve a fidelity of around 0.5% for circuits with the
number of gates considered here [13].
In Methods C.1 we describe two methods to mimic the
fidelity f of the output wave-function of the quantum
computer with our simulator, providing a speedup of
a factor of 1/f to the simulation as compared to the
computation of exact amplitudes [28]. Both methods
can be adjusted to provide the same fidelity, and there-
fore the same cross entropy [13], as a noisy quantum
computer. That is, they result in an equivalent RQC
sampling. In Methods C.2 we describe a way to reduce
the computational cost of the sampling procedure on
tensor contraction type simulators by a factor of almost
10×, under reasonable assumptions. Finally, in Meth-
ods C.3 we discuss the implications of sampling from a
Porter-Thomas distribution that has not fully converged.
1. Simulating low fidelity RQCs
Here, we describe two methods to reduce the compu-
tational cost of classically sampling from an RQC given
a target fidelity.
Summing a fraction of the paths — This method, pre-
sented in Ref. [28], exploits the fact that, for RQCs,
the decomposition of the output wave-function of a cir-
cuit into paths |ψ〉 = ∑p∈{paths} |ψp〉 (see Methods B.1)
leads to terms |ψp〉 that have similar norm and that
are almost orthogonal to each other. For this reason,
summing only over a fraction f of the paths, one obtains
a wave-function |ψ˜〉 with norm 〈ψ˜|ψ˜〉 = f . Moreover,
|ψ˜〉 has fidelity f as compared to |ψ〉, that is:∣∣∣∣∣∣ 〈ψ|ψ˜〉√〈ψ˜|ψ˜〉
∣∣∣∣∣∣
2
= f. (1)
Therefore, amplitudes of a fidelity f wave-function can
be computed at a cost that is only a fraction f of that
of the perfect fidelity case.
We find empirically that, while the different contri-
butions |ψp〉 fulfill the orthogonality requirement (with
a negligible overlap; e.g., in the Bristlecone-60 simula-
tion, the mutual fidelity between pairs out of 4096 paths
is about 10−6), there is some non negligible variation
in their norms (see Results and Fig. 10), and thus the
fidelity achieved by |ψp〉 is equal to:∣∣∣∣∣ 〈ψ|ψp〉√〈ψp|ψp〉
∣∣∣∣∣
2
= 〈ψp|ψp〉, (2)
which is in general different than (#paths)−1. If an
extensive subset of paths is summed over, then the vari-
ations on the norm and the fidelity are suppressed, and
the target fidelity is achieved. This was the case in
Ref. [28]. However, in this work we aim at minimizing
the number of cuts on the circuits, and so low fidelity
simulations involve a small number of paths (between
1 and 21 in the cases simulated). In this case, some
15
(a) (b) (c)
Figure 10. Porter-Thomas distribution for the three different sub-lattices of Bristlecone simulated with depth (1+32+1). (a)
Bristlecone-64 ([Run1]); 1.2× 106 amplitudes with a target fidelity f = 0.78%; 〈ψ˜|ψ˜〉 = 0.87f . (b) Bristlecone-48 ([Run2]);
1.2× 106 amplitudes with target fidelity f = 0.78%; 〈ψ˜|ψ˜〉 = 0.77f . (c) Bristlecone-60 ([Run6]); 1.15× 106 amplitudes with a
target fidelity f = 0.51%; 〈ψ˜|ψ˜〉 = 1.38f . For reference, the theoretical value of the Porter-Thomas distribution is plotted
(solid line). For some simulations, the depth is not sufficient to fully converge to a Porter-Thomas distribution. Furthermore,
summing a small number of paths of low fidelity might lead to worse convergence than expected for a particular depth (see
Methods C.1).
“unlucky” randomly selected paths might contribute with
a fidelity that is below the target, while others might
achieve a higher fidelity than expected.
Finally, the low fidelity probability amplitudes re-
ported in Ref. [28], obtained using the method described
above, follow a Porter-Thomas distribution as expected
for perfect fidelity amplitudes. Again, this is presumably
true only in the case when a large number of paths is
considered. In our case, we find distributions that have
not fully converged to a Porter-Thomas, but rather have
a larger tail (see Results and Fig. 10). We attribute this
phenomenon to the cuts in the circuit acting as removed
gates between qubits, thus increasing the effective di-
ameter of the circuit, which needs higher depth to fully
thermalize. We discuss the implications of these tails for
the sampling procedure in Methods C.3.
Fraction of perfect fidelity amplitudes — There exists a
second method to simulate sampling from the output
wave-function |ψ〉 with a target fidelity f that avoids
summing over a fraction of paths.
The output density matrix of a random quantum cir-
cuit with fidelity f can be written as [13]
ρ = f |ψ〉〈ψ|+ (1− f) 1
N
. (3)
This means that to produce a sample with fidelity f
we can sample from the exact wave-function |ψ〉 with
probability f or produce a random bit-string with prob-
ability 1− f . The sample from the exact wave-function
can be simulated by calculating the required number of
amplitudes with perfect fidelity.
Note that the method presented in this section involves
the computation of the same number of paths as the one
described in Methods C.1 for a given f , circuit topology,
circuit depth, and set of cuts. However, this second
method is more robust in achieving a target fidelity. Note
that by this argument the 6000 amplitudes of [Run5] are
equivalent to 1.2M amplitudes at 0.5% fidelity.
Note also that, even though this method and the one
presented in Methods C.1 have the same computational
cost for tensor network based simulators, for Schro¨dinger-
Feynman simulators like the one presented in Ref. [28] it
is preferable to consider a fraction of paths as opposed
to a fraction of perfect fidelity amplitudes. This is due
to the small cost overhead of computing an arbitrary
number of amplitudes using these simulators.
2. Fast sampling technique
While 106 sampled amplitudes are necessary for cross
entropy verification of the sampling task [13], the fru-
gal rejection sampling proposed in Ref. [28] needs the
numerical computation of 10× 106 = 107 amplitudes in
order to carry out the correct sampling on a classical
supercomputer. This is due to the acceptance of 1/M
amplitudes (on average) of the rejection sampling, where
M = 10 when sampling from a given Porter-Thomas
distribution with statistical distance  of the order of
10−4 (negligible).
In this section, we propose a method to effectively
remove the 10× overhead in the sampling procedure for
tensor network based simulators, which normally com-
pute one amplitude at a time. For the sake of clarity, we
tailor the proposed fast sampling technique to the Bristle-
cone architecture. However, it can be straightforwardly
generalized to different architectures (see SI A). Given
the two regions of the Bristlecone (and sub-lattices) AB
and C of Fig. 7, and the contraction proposed (see Meth-
ods B.1), the construction of tensor C and its subsequent
contraction with AB are computationally efficient tasks
done in a small amount of time as compared to the full
16
(b)
(a)
(c)
Figure 11. (a)-(b): Pearson coefficient as a function of Ham-
ming distance for pairs generated at random on subsystem C
of sub-lattice Bristlecone-24, for samples of size 1000 of ran-
dom strings on subsystem A+B. All pairs between strings
of a set of 32 random strings on subsystem C are considered.
The average and standard deviation (error bars) for each
Hamming distance is plotted with a solid line. We can see
that, for depth (1+24+1) (a), the system has not thermalized,
and the correlation decreases with Hamming distance; for
depth (1+32+1) (b), correlations approach zero, and become
Hamming distance independent (on subsystem C). (c) we
compare the distribution of Pearson coefficients, obtained as
described above, to the distribution of Pearson coefficients
obtained (numerically) from probability amplitudes with the
same sample size as in the simulations above, drawn from
a Porter-Thomas distribution. At large enough depth the
system is expected to thermalize and the two distributions
match, meaning that the probability amplitudes obtained by
varying bit-strings only on subsystem C are uncorrelated.
computation of the particular path. This implies that
one can compute, for a given output bit-string on AB,
sAB, a set of 2
12 amplitudes generated by the concate-
nation of sAB with all possible sC bit-strings on C at a
small overhead cost per amplitude. We call this set of
amplitudes a “batch”, we denote its size by NC , and each
of the (concatenated) bit-strings by sABC . In practice,
we find that for the Bristlecone-64 and -60 with depth
(1+32+1), the computation of a batch of 30 amplitudes
is only around 10% more expensive than the computa-
tion of a single amplitude, while for the Bristlecone-48
and -70 with depth (1+32+1), the computation of a
batch of 256 amplitudes is around 15% more expensive
than the computation of a single amplitude, instead of
a theoretical overhead of 30× and 256×, respectively.
The sampling procedure we propose is a modification
of the frugal rejection sampling presented in Ref. [28]
and proceeds as follows. First, we choose slightly over
106 (see below) random bit-strings on AB, sAB. For
each sAB , we choose NC bit-strings on C, sC , at random
(without repetition). We then compute the probability
amplitudes corresponding to all sABC bit-strings on all
(slightly over) 106 batches. We now shuffle each batch
of bit-strings. For each batch, we proceed onto the bit-
strings in the order given by the shuffle; we accept a
bit-string sABC with probability min [1, p(sABC)N/M ],
where p(sABC) is the probability amplitude of sABC ,
and N is the dimension of the Hilbert space; once a
bit-string is accepted, or the bit-strings of the batch
have been exhausted without acceptance, we proceed to
the next batch. By accepting at most one bit-string per
batch we avoid introducing spurious correlations in the
final sample of bit-strings.
Given an M and a batch size NC , the probability
that a bit-string is accepted from a batch is (on average)
1 − (1 − 1/M)NC . For M = 10 and NC = 30, the
probability of acceptance in a batch is 95.76%, and one
would need to compute amplitudes for 1.045×106 batches
in order to sample 106 bit-strings; for M = 10 and
NC = 60, the probability goes up to 99.82%, and one only
needs 1.002 × 106 batches; for M = 10 and NC = 256,
the probability of acceptance is virtually 100%, and
1.00 × 106 batches are sufficient. There is an optimal
point, given by the overhead in computing batches of
different sizes NC and the probability of accepting a bit-
string on a batch given NC , that minimizes the runtime
of the algorithm.
There is a crucial condition for this sampling algorithm
to work, namely the absence of correlations between
the probability amplitudes of the bit-strings {sABC}
for fixed sAB, so that they are indistinguishable from
probability amplitudes taken from random bit-strings
over ABC. We expect this condition to be satisfied for
chaotic systems that have converged to a Porter-Thomas
distribution. In order to test this, we perform the follow-
ing test: for Bristlecone-24, we choose 1000 bit-strings
over AB (sAB) at random and for each of them we
generate a batch of size NC = 32, where we constrain
17
the bit-strings {sC} to be the same across batches. We
now compute the Pearson correlation coefficient between
the two sets of 1000 amplitudes gotten for each pair of
bit-strings in C, and we do this for all 32× 31/2 pairs.
If the probability amplitudes of each batch are really
uncorrelated to each other, we expect the correlation co-
efficient to vanish. We show the coefficient as a function
of Hamming distance between the pairs in Fig. 11 (a)
and (b). We can see that, for depth (1+24+1) (a) there
is a small but non negligible correlation, which in fact
decreases on average with Hamming distance. For depth
(1+32+1) (b), the correlation is Hamming distance inde-
pendent and approaches zero. In Fig. 11 (c) we compare
the distribution of Pearson coefficients obtained for both
depths analyzed to that one obtained from pairs of sets
of size 1000 sampled from a Porter-Thomas distribution.
While a fairer comparison would involve sampling from
the distribution of the output wave-function of the RQC,
which might differ from the Porter-Thomas in the ab-
sence of convergence, we still see a clear tendency of the
distributions to match for longer depth, i.e., closer to
convergence.
3. Sampling from a non fully-thermalized Porter-Thomas
distribution
In Ref. [28] an error of the order of 10−4 is computed
for a frugal rejection sampling with M = 10, assuming
Porter-Thomas statistics. When the distribution has not
converged to a Porter-Thomas, but rather has a larger
tail, we expect the error to increase. We can estimate
the error in sampling numerically for the cases simulated
here as the sum of the probability amplitudes larger
than M/N with N being the dimension of the Hilbert
space, multiplied by N and divided by the number of
amplitudes computed. For M = 10, we estimate an
error  = 9.3 × 10−4 for [Run1],  = 1.0 × 10−2 for
[Run2], and  = 2.5 × 10−3 for [Run6], respectively. If
instead we consider M = 15, this lowers the error to
 = 1.3 × 10−5 for [Run1],  = 1.0 × 10−3 for [Run2],
and  = 1.15× 10−4 for [Run6], respectively. Increasing
M , in order to reduce the error in the frugal sampling,
implies a lower acceptance rate in the fast sampling,
which is resolved by increasing the size of the batches
NC , which is done at a small cost.
D. Simulation of Bristlecone compared to
rectangular grids
The diamond shape of the topology of Bristlecone
and its hard sub-lattices (see Fig. 1: Bristlecone-24, -30,
-40, -48, -60, and -70) makes them particularly hard
to simulate classically when compared to rectangular
grids of the same (or smaller) number of qubits. Indeed,
these lattices are subsets of large rectangular grids, from
which they inherit their diameter; e.g., Bristlecone-70
is a sublattice of a 10 × 11 grid. When cutting the
lattice (see Methods B.1), one has to apply several cuts
in order to decrease the maximum size of the tensors
in the contraction to manageable sizes; in the case of
Bristlecone-70 and depth (1+32+1), four cuts are needed
in order to have tensors in the contraction of at most
dimension 27×4, while for a rectangular 8 × 9 lattice
(with 72 qubits) only 3 cuts are needed. Note that the
computational cost scales with the dimension of the
indexes cut, i.e., exponentially with the number of cuts.
The same applies to a simulator based on a full split
of the circuit into two parts, as in Refs. [12, 28, 38]. For
instance, the number of CZ gates for RQCs with depth
(1+32+1) which are cut when splitting Bristlecone-60 in
two halves is equal to 40. In comparison, 8× 8 grids of
qubits with the same depth have only 32 CZ gates cut.
See SI C for more details.
As was discussed in Methods B, identifying topologies
that are hard to simulate classically, but that minimize
the number of qubits involved, increases the chances of
success success of quantum supremacy experiments, due
to the decrease of the overall fidelity of the quantum
computer with the number of gates and qubits [13]. For
this reason, we find that Bristlecone is a good setup for
quantum supremacy experiments.
V. DATA AVAILABILITY
The simulation data that support the findings of this
study are available at https://data.nas.nasa.gov/
quail/data.php?dir=/quaildata/quantum/qcSim.
The circuit files used for the numerical simulations in
this paper are publicly available in [44].
ACKNOWLEDGMENTS
The authors would like to thank Edward Farhi, Bryan
A. O’Gorman, Alan Ho, Sergei V. Isakov, Norman M.
Tubman and Dmitry Lyakh for enlightening discussions
and Igor L. Markov for reviewing the manuscript.
The authors also thank Orion Martin, Alan Kao and
David Yonge-Mallo for helping open sourcing qFlex
code. We thank the authors of Ref. [39] for sharing
the instances used in that work. We are grateful for
support from NASA Ames Research Center, from the
NASA Advanced Exploration Systems (AES) program,
and the NASA Transformative Aeronautics Concepts
Program (TACP). We also appreciate support from the
AFRL Information Directorate under grant number
F4HBKC4162G001. We acknowledge support of the
NASA Advanced Division for providing access to the
NASA HPC systems, Pleiades and Electra, during
18
dedicated downtime.
The views and conclusions contained herein are those
of the authors and should not be interpreted as neces-
sarily representing the official policies or endorsements,
either expressed or implied, of the U.S. Government.
The U.S. Government is authorized to reproduce and
distribute reprints for governmental purpose notwith-
standing any copyright annotation thereon.
VI. AUTHOR CONTRIBUTION
B.V., S.B. and S.M. designed qFlex and the study;
B.V, S.B and S.M devised new improvements for approx-
imated sampling; B.V. wrote the original qFlex code
and B.V, B.N. and C.H. further optimized it; B.V, B.N
and C.H. ran simulations on NASA Pleiades and Electra;
B.V., S.B., E.R., R.B and S.M. performed the analysis
of data; all the authors contributed to the discussions
and wrote the paper.
19
Run Circuit Depth Paths per instance / total paths NC
Cores per instance / cores per node
bro has ivy san sky (Electra)
1 Bris.-64 (1+32+1) 2/28 30 - - - - 2/40
2 Bris.-48 (1+32+1) 2/28 30 - - - - 2/40
2b* Bris.-48 (1+32+1) 2/28 256 2/28 2/24 2/20 4/16 2/40
3 Bris.-70 (1+32+1) 1/216 512 2/24 4/20 8/16 2/40
3b* Bris.-70 (1+32+1) 1/216 256 2/28 2/24 2/20 4/16 2/40
4 Bris.-70 (1+24+1) 1/23 62 - 8/24 20/20 - -
5 Bris.-70 (1+24+1) 23/23 62 - 8/24 20/20 - -
6 Bris.-60 (1+32+1) 21/212 30 2/28 2/24 4/20 8/16 2/40
Run Memory footprint (GB)
Num. instances per node
bro has ivy san sky (Electra)
1 10 - - - - 19
2 10 - - - - 19
2b* 7 14 12 8 4 20
3 11 - 11 5 2 16
3b* 7 14 12 8 4 20
4 36 - 3 1 - -
5 36 - 3 1 - -
6 10 11 11 5 2 17
Run
Runtime (s)
bro has ivy san sky (Electra)
1 - - - - 335.8± 12.2
2 - - - - 246.0± 9.4
2b* 204.9± 6.5 215.6± 2.6 256.9± 0.5 149.9± 0.1 150.0± 3.5
3 - 1034.5± 91.3 700.9± 32.8 386.3± 4.0 708.4± 79.9
3b* 156.3± 4.3 161.7± 1.7 190.9± 1.7 113.3± 0.3 113.2± 2.8
4 - 226.5± 15.8 126.8± 5.2 - -
5 - 1683.2± 111.1 885.3± 4.4 - -
6 4555.4± 311.7 5092.2± 311.7 3606.5± 159.8 2040.9± 18.9 3054.0± 230.3
Table I. Number of paths per instance, size of batches of amplitudes (see Methods C.2), number of cores (threads) used
per instance, memory footprint, number of instances fit in a node, and runtime per instance for all six cases run and for all
five node types used on NASA Pleiades and Electra HPC clusters. We report single instances of a run, where an instance
corresponds to the computation of a number of paths given a cut prescription, and the computation of a batch of NC
amplitudes corresponding to output bit-strings chosen at random over subsystem C (see Methods C.2). Note that for [Run3]
NC = 512, and so computing NC amplitudes takes about three times the time of computing a single one. However, this is
strongly mitigated with the contraction used for [Run3b]. Note also that for [Run6] we ran 17 jobs per Skylake node, instead
of 19, as a conservative strategy to stay well below the total memory available on these nodes and hence avoid any unwanted
crash in our largest simulation. Instances can be collected for a large number of low fidelity amplitudes or for a smaller
number of high fidelity amplitudes at the same computational cost. *[Run2b] and [Run3b] refer to the benchmark of the
contraction procedure introduced in Methods B.1 for Bristlecone-48 and Bristlecone-70, respectively; [Run1] and [Run3] were
run were run using a less performing procedure, similar to the one used for Bristlecone-60 (see SI B).
Circuit Depth Fidelity (%)
Runtime × num. cores (h)
bro has ivy san sky (Electra)
Bris-48 (1+32+1) 0.78% 1.14× 105 1.20× 105 1.78× 105 1.67× 105 8.33× 104
Bris-64 (1+32+1) 0.78% - - - - 1.96× 105
Bris-60 (1+32+1) 0.51% 3.22× 106 3.09× 106 4.01× 106 4.54× 106 2.00× 106
Bris-70 (1+24+1) 0.50% - 1.87× 104 2.46× 104 - -
Bris-70 (1+32+1) 0.50% 2.85× 107 2.95× 107 4.35× 107 4.13× 107 2.06× 107
Table II. Estimated effective runtimes on a single core for the computation of 106 amplitudes with a target fidelity of about
0.5% for the Bristlecone sub-lattices (see Fig. 1 for nomenclature). This is an estimate of the computational cost for the
completion of the RQC sampling task. The estimate is based on the runtimes for single instances presented in Table I. For
Bristlecone-70 with depth (1+32+1), [Run3b] is used, since it offers better performance than [Run3].
20
Circuit size Fidelity (%)
Runtime (hours) Energy cost (MWh)
Pleiades Electra Pleiades Electra
7× 7× (1 + 40 + 1) 100 1.22× 10−2 1.16× 10−2 5.35× 10−2 1.89× 10−2
8× 8× (1 + 32 + 1) 100 1.77× 10−4 2.04× 10−4 8.86× 10−4 3.34× 10−4
8× 8× (1 + 40 + 1) 100 2.03 2.12 8.88 3.48
8× 9× (1 + 32 + 1) 100 2.90× 10−3 2.98× 10−3 1.45× 10−2 4.89× 10−3
Bris.-70 ×(1 + 32 + 1) 100 2.89× 10−2 3.57× 10−2 1.45× 10−1 5.85× 10−2
Table III. Estimated effective runtimes and energy cost for the computation of a single amplitude with perfect fidelity on
NASA HPC Pleiades and Electra systems. Note that for the 7× 7× (1 + 40 + 1) and 8× 8× (1 + 40 + 1) grids, jobs do not fit
in Sandy Bridge nodes, due to their memory requirements; for that reason, the portion of Pleiades with Sandy Bridge nodes
is not considered, and the energy cost estimations of these two cases do not include those nodes.
Circuit size Target fidelity (%)
Runtime (hours) Energy cost (MWh)
Pleiades Electra Pleiades Electra
7× 7× (1 + 40 + 1) 0.51 62.4 59.0 2.73× 102 96.8
8× 8× (1 + 32 + 1) 0.78 1.38 1.59 6.91 2.61
8× 8× (1 + 40 + 1) 0.58 1.18× 104 1.23× 104 5.15× 104 2.02× 104
8× 9× (1 + 32 + 1) 0.51 14.8 15.2 73.9 24.9
Bris.-70 ×(1 + 32 + 1) 0.50 145 178 723 293
Table IV. Estimated runtimes and energy cost for the computation of 106 amplitudes with fidelity close to 0.5% on NASA
HPC Pleiades and Electra systems. Note that for the 7× 7× (1 + 40 + 1) and 8× 8× (1 + 40 + 1) grids, jobs do not fit in
Sandy Bridge nodes, due to their memory requirements; for that reason, the portion of Pleiades with Sandy Bridge nodes is
not considered, and the energy cost estimations of these two cases do not include those nodes.
Circuit size Targ. fidelity (%) Num. amps.
Runtime (hours) Energy cost (MWh)
MFIB Electra (sky) MFIB Electra (sky)
7× 7× (1 + 40 + 1) 0.51 1 4.96× 105 6.63 6.46 8.64× 10−5
7× 7× (1 + 40 + 1) 0.51 105 5.05× 105 6.63× 105 6.58 8.64
7× 7× (1 + 40 + 1) 0.51 106 5.82× 105 6.63× 106 7.58 86.4
7× 7× (1 + 40 + 1) 100.0 1 9.73× 107 1.30× 103 1.27× 103 1.69× 10−2
7× 7× (1 + 40 + 1) 100.0 105 9.90× 107 1.30× 108 1.29× 103 1.69× 103
7× 7× (1 + 40 + 1) 100.0 106 1.14× 108 1.30× 109 1.49× 103 1.69× 104
Table V. Estimated runtime and energy cost consumption to compute the specified number of amplitudes for our simulator on
a single processor of the Skylake nodes portion of the NASA Electra system, compared to Ref. [28] (MFIB). The energy cost
for the MFIB simulations is estimated assuming the same power consumption per core as the Skylake nodes. In Ref. [28], the
authors report a number of cores P = 625× 16 = 105, since they use 625 nodes of 16 cores (32 vCPUs or hyper-threads) each.
For ours, P = 2304× 40 = 92,160 on the Skylake nodes of Electra (note that we consider 40 cores per node, even though
we use only 39 in practice for the 7× 7× (1 + 40 + 1) simulations; this is due to the ability of modern Intel processors to
“up-clock” their CPUs in favorable conditions (known as Dynamic Frequency Scaling), thus consuming a similar amount of
energy and achieving a similar performance as in the case where there are no idle cores. Note that MFIB’s approach has the
advantage to compute a large number of of amplitudes with a small cost overhead. On the contrary our approach performs
much better in the computation of a smaller subset of amplitudes; both methods use comparable resources when computing
about 105 amplitudes. The MFIB algorithm becomes less efficient than our algorithm as the number of qubits grows because
of memory requirements.
21
Appendix A: Contraction of RQCs on rectangular
grids
Here we describe the contraction scheme for I×J grids.
To simplify the discussion, we consider 7×7×(1+40+1)
RQCs. Nonetheless, the same procedure applies to other
rectangular circuits. For the 7× 7× (1 + 40 + 1) RQCs,
we use two cuts (22×5 paths) in the grid (see Fig. A1),
in order to divide it into four tensors, A, B, C, and D,
of dimension 26×5 = 230 each. Then, the contraction
proceeds as follows.
1) Tensors in region A are contracted onto tensor A,
which is path independent; we do the same for
tensors in regions pB (partial-B), pC (partial-C),
and ppD (partial-partial-D), which are all path in-
dependent; for this reason, all A, pB, pC, and ppD
are reused over the entire computation. We now
iterate over all paths with two nested iterations:
the outer one iterates over the right cut, while the
inner one iterates over the bottom cut.
2) For each of the 25 right paths, tensors in regions
B and pD are contracted.
3) Tensors A and B are contracted onto AB; this ten-
sor will be reused over all the inner loop iterations.
4) For each bottom path (given a right path), the
tensors on region C and D are contracted.
5) C and D are contracted onto CD.
6) AB and CD are contracted onto a scalar, which is
equal to this path’s contribution to the amplitude.
From here, the iteration over the inner loop takes us
back to step 4), for a different bottom path. After this
loop is exhausted, we go back to step 2), for the next
right path. After iterating over both loops, we have
obtained all path contributions.
Note that there is a large amount of potential reuse
of the tensors built at each step. However, taking
advantage of already built tensors requires a large
amount of memory in order to store all the tensors to be
reused. While there is a clear trade-off between tensor
reuse and memory usage, in practice we always found
the reuse profitable.
Finally, the fast sampling introduced in Methods C.2
can be also applied here, by using a slightly different
contraction than the one presented in Fig. A1. More
precisely, in Fig. A2 we present the final steps of the
alternative contraction. In 4) and 5) the size of the
tensor pC is still small, and we focus, for this particular
path, in contracting first D with AB onto ABD. This
leaves six qubits free (above pC) for the computation of
batches of as much as 26 = 64 amplitudes (or more if pC
Figure A1. Sketch of the contraction of the 7×7×(1+40+1)
tensor network with two cuts. The names of the tensors at key
steps shown in the contraction are referred to in Section A.
Figure A2. Alternative contraction for the use of fast sam-
pling (see Methods C.2).
is shrunk further); for each amplitude, contracting the
tensors in region C (where pC is reused) and computing
the scalar ABCD is left, i.e., steps 6) and 7) of Fig. A2.
Once this is done, then we go back to step 5), in order to
loop over all bit-strings in the batch. After exhausting
this loop, we go back to step 4) to compute the next
bottom path. Note that the contraction following this
procedure is dominated by the contraction of AB with
D; however, the tensor ABD is reused (for a path) for all
bit-strings in a batch, and so the computation of a batch
of amplitudes adds only a small cost to the computation
of a single amplitude.
Appendix B: Old tensor contractions for
Bristlecone-48 and Bristlecone-70
The contractions procedures followed for [Run2-3] are
presented in Fig. B3. Note that we have identified a faster
contraction (about half the runtime) with less memory re-
quirements for Bristlecone-48 and -70, which is described
in Methods B.1, and whose runtimes are reported in the
main text (see Results) and referred to as [Run2b] and
[Run3b], respectively. Note also that the new contrac-
tion scheme cannot be adapted to Bristlecone-60, due
to its topology. The old contractions are included here
for reference, and apply to our released simulation data.
22
C
A
B
C
A
B
Figure B3. Tensors A, B, and C used in the old contraction
schemes used for [Run2] and [Run3] (Bristlecone-48 and -70,
respectively).
two three, d=1 three, d=2 three, d=3 three, d=4 three, d=5 four
Partitions
71
72
73
74
75
76
77
Qu
bi
t c
om
pl
ex
ity
Figure C4. Qubit complexity of the simulation of Bristlecone-
60 with depth (1+32+1) with Schro¨dinger-Feynman simula-
tors [17, 28, 38] for different partition schemes. Top: From
left to right, partition schemes considered for two partitions,
three partitions, and four partitions; in the three partition
case, d is the number of cuts avoided in the diagonal towards
the top-right region, which for the example plotted is d = 2,
and is extended to d = 1 . . . 5 in our study. Bottom: Qubit
complexity for the two-partition, three-partition, and four-
partition schemes; we can see that the complexity is best in
the two-partition case.
Appendix C: Qubit complexity of square grids and
Bristlecone sub-lattices for
Schrod¨inger-Feynman-type simulators
To study the time complexity of Schro¨dinger-Feynman-
type simulators [17, 28, 38], i.e., those that partition the
circuit in sub-circuits and perform a full wave-function
evolution of the sub-circuits for all the paths defined by
the partition, we use the “qubit complexity” [38] for a
given depth defined as follows.
For a bipartition, sub-circuits A and B are generated;
if sub-circuit A has nA qubits, B has nB qubits, and
there are αAB CZ gates cut, then sub-circuits A and
B have to be simulated 2αAB times, with complexity
2nA and 2nB each, respectively, for the computation
of one amplitude, or batch of amplitudes at a small
overhead cost. The qubit complexity is in this case
log2 [2
αAB (2nA + 2nB )].
For a partition in three regions, sub-circuits A, B, and
C are generated, with nA, nB, and nC qubits, respec-
tively. The number of CZ gates cut is now αAB between
A and B, αAC between A and C, and αBC between
B and C. In a naive computation, the qubit complex-
ity would be log2 [(2
αAB2αAC2αBC ) (2nA + 2nB + 2nC )],
since each of the three sub-circuits has to be simu-
lated for each of the 2αAB2αAC2αBC paths. However,
it is possible to decrease the complexity by simulating
sub-circuit A only once for each particular choice of
path over cuts between A and B and cuts between A
and C, and iterate (within this choice) over all paths
over cuts between B and C. In that case, the cost is
log2 {(2αAB+αAC ) [2nA + 2αBC (2nB + 2nC )]}. The qubit
complexity is in this case minimized when A is the biggest
sub-circuit, which can always be assumed without loss
of generality.
For a partition in four sub-circuits (A, B, C, and
D), where sub-circuits share CZ gates only between
A and B, B and C, C and D, and between D and
A, as in Fig. C4, the naive computation has qubit
complexity log2 [(2
αAB+αBC+αCD+αAD )βABCD], where
βABCD ≡ 2nA + 2nB + 2nC + 2nD . However, it is
possible, for each combination of paths between A and
B, and C and D, to iterate over paths between B and
C, and those between A and D, lowering the qubit
complexity to log2 [2
αAB+αCD (2αBCβBC + 2
αADβAD)],
where βBC ≡ 2nB + 2nC and βAD ≡ 2nA + 2nD .
We can see in Fig. C4 that for Bristlecone-60 with
depth (1+32+1) the best performance is achieved for
a partition into two sub-circuits, as is the case for the
rectangular grids considered in Refs. [28, 38]. For a
square grid 8 × 8 × (1 + 32 + 1), the qubit complexity
is 65, which is lower than the best complexity found
in this section for Bristlecone-60 with depth (1+32+1),
i.e., 71, even though the 8× 8 square grid has four more
qubits. This suggests that hard Bristlecone sub-lattices
are harder to simulate than square (or rectangular) grids
of the same (or smaller) number of qubits. Similar
arguments apply to Bristlecone-70.
For the square grid 7× 7× (1 + 40 + 1) split into two
sub-circuits [28], the qubit complexity is slightly over 63,
and for the 7× 8× (1 + 40 + 1), the qubit complexity is
64.
23
Appendix D: Supplementary details on hardware
usage and runtimes
Here we discuss the split in runtimes as reported in
the main text (see Results), as well as the estimation of
core-hours necessary for the simulations:
Split in runtimes. Investigation showed that the
split in runtimes discussed in the main text (see Results)
appears to be due to differing amounts of contention
on the two sockets within a node (all the Pleiades and
Electra nodes are dual-socket). We enforced the rule
that all the jobs running on a particular node had the
same number of threads. This could lead to there being
a few unused cores. Individual threads were “pinned” to
run on individual cores to avoid interference. However,
the pinning strategy caused the unused cores to always
be the lowest numbered cores (which are all on the
first socket), and so fewer jobs ran concurrently on the
first socket, causing them to see less contention, and
therefore slightly higher performance than jobs that ran
on the second socket. Unfavorable thread counts and
core counts could also lead to one job per node having
it’s threads split across the two sockets; this creates yet
another source of anomalous timings.
Core hour estimation. In our simulations on the
Skylake nodes of Electra we used P = 2304× 40 = 92160
cores. Note that we consider 40 cores per node, even
though we use only 39 in practice for the 7×7×(1+40+1)
simulations and 36 in the 8×8× (1+40+1) simulations;
this is due to the ability of modern Intel processors to
“up-clock” their CPUs in favorable conditions (known as
Dynamic Frequency Scaling), thus achieving a perfor-
mance similar to the case where there are no idle cores.
Similar estimates apply to the benchark of the other
types of nodes used. Regarding our comparisons with
Alibaba [39], we take into account that the authors report
the usage of P = 217 = 131072 for their benchmark.
[1] M. A. Nielsen and I. L. Chuang, Quantum computation
and quantum information (Cambridge University Press,
Cambridge, 2000).
[2] E. Rieffel and W. Polak, Quantum Computing: A Gentle
Introduction (MIT Press, Cambridge, MA, 2011).
[3] P. Shor, in Proceedings 35th Annual Symposium on Foun-
dations of Computer Science (Ieee, 2002) pp. 124–134.
[4] L. K. Grover, in Proceedings of the Annual ACM Sympo-
sium on Theory of Computing , Vol. Part F1294 (ACM,
1996) pp. 212–219, arXiv:9605043 [quant-ph].
[5] R. P. Feynman, International Journal Of Theoretical
Physics 21, 467 (1982).
[6] A. Aspuru-Guzik, A. D. Dutoi, P. J. Love, and M. Head-
Gordon, Science 309, 1704 (2005), arXiv:0905.0887.
[7] R. Babbush, N. Wiebe, J. McClean, J. McClain,
H. Neven, and G. K. L. Chan, Physical Review X 8
(2018), 10.1103/PhysRevX.8.011044, arXiv:1706.00023.
[8] Z. Jiang, K. J. Sung, K. Kechedzhi, V. N. Smelyanskiy,
and S. Boixo, Physical Review Applied 9, 44036 (2018).
[9] R. Babbush, C. Gidney, D. W. Berry, N. Wiebe, J. Mc-
Clean, A. Paler, A. Fowler, and H. Neven, Physical
Review X 8, 41015 (2018), arXiv:1805.03662.
[10] J. Preskill, Quantum 2, 79 (2018), arXiv:1801.00862.
[11] M. J. Bremner, A. Montanaro, and D. J. Shepherd, Phys-
ical Review Letters 117, 80501 (2016), arXiv:1504.07999.
[12] A. W. Harrow and A. Montanaro, Nature, 549, 203
(2017), arXiv:1809.07442.
[13] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Bab-
bush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Mar-
tinis, and H. Neven, Nature Physics 14, 595 (2018),
arXiv:1608.00263.
[14] S. Aaronson and A. Arkhipov, in Proceedings of the An-
nual ACM Symposium on Theory of Computing (ACM,
2011) pp. 333–342, arXiv:1011.3245.
[15] J. Preskill, arXiv:1203.5813 (2012).
[16] M. J. Bremner, A. Montanaro, and D. J. Shepherd,
Quantum 1, 8 (2017).
[17] S. Aaronson and L. Chen, in Leibniz International
Proceedings in Informatics, LIPIcs, Vol. 79 (Schloss
Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017).
[18] C. Neill, P. Roushan, K. Kechedzhi, S. Boixo, S. V.
Isakov, V. Smelyanskiy, A. Megrant, B. Chiaro,
A. Dunsworth, K. Arya, R. Barends, B. Burkett,
Y. Chen, Z. Chen, A. Fowler, B. Foxen, M. Giustina,
R. Graff, E. Jeffrey, T. Huang, J. Kelly, P. Klimov,
E. Lucero, J. Mutus, M. Neeley, C. Quintana,
D. Sank, A. Vainsencher, J. Wenner, T. C. White,
H. Neven, and J. M. Martinis, Science 360, 195 (2018),
arXiv:1709.06678.
[19] A. Bouland, B. Fefferman, C. Nirkhe, and U. Vazirani,
Nature Physics 15, 159 (2019).
[20] A. Harrow and S. Mehraban, arXiv:1809.06957 (2018).
[21] R. Movassagh, arXiv:1810.04681 (2018).
[22] G. Kalai and G. Kindler, arXiv:1409.3093 (2014).
[23] A. Arkhipov, Physical Review A - Atomic, Molecular,
and Optical Physics 92, 62326 (2015).
[24] S. Rahimi-Keshari, T. C. Ralph, and C. M. Caves,
Physical Review X 6, 1 (2016), arXiv:1511.06526v2.
[25] M.-H. Yung and X. Gao, arXiv:1706.08913 (2017).
[26] S. Boixo, V. N. Smelyanskiy, and H. Neven,
arXiv:1708.01875 (2017).
[27] X. Gao and L. Duan, arXiv:1810.03176 (2018).
[28] I. L. Markov, A. Fatima, S. V. Isakov, and S. Boixo,
arXiv:1807.10749 (2018).
[29] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu,
G. Arnold, M. Richter, T. Lippert, H. Watanabe, and
N. Ito, Computer Physics Communications 176, 121
(2007), arXiv:0608239 [quant-ph].
[30] M. Smelyanskiy, N. P. D. Sawaya, and A. Aspuru-Guzik,
arXiv:1601.07195 (2016).
[31] T. Ha¨ner and D. S. Steiger, in Proceedings of the Inter-
national Conference for High Performance Computing,
Networking, Storage and Analysis, SC 2017 (ACM, 2017)
p. 33, arXiv:1704.01127.
24
[32] E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh,
T. Magerlein, E. Solomonik, E. W. Draeger, E. T. Hol-
land, and R. Wisnieff, arXiv:1710.05867 (2017).
[33] H. De Raedt, F. Jin, D. Willsch, M. Willsch, N. Yoshioka,
N. Ito, S. Yuan, and K. Michielsen, Computer Physics
Communications 237, 47 (2019).
[34] R. Li, B. Wu, M. Ying, X. Sun, and G. Yang,
arXiv:1804.04797 (2018).
[35] S. Bravyi and D. Gosset, Physical Review Letters 116,
250501 (2016), arXiv:1601.07601.
[36] I. L. Markov and Y. Shi, SIAM Journal on Computing
38, 963 (2008).
[37] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, and H. Neven,
arXiv:1712.05384 (2017).
[38] Z. Y. Chen, Q. Zhou, C. Xue, X. Yang, G. C.
Guo, and G. P. Guo, Science Bulletin 63, 964 (2018),
arXiv:1802.06952.
[39] J. Chen, F. Zhang, C. Huang, M. Newman, and Y. Shi,
arXiv:1805.01450 (2018).
[40] M.-C. Chen, R. Li, L. Gan, X. Zhu, G. Yang, C.-Y. Lu,
and J.-W. Pan, arXiv:1901.05003 (2019).
[41] C. Guo, Y. Liu, M. Xiong, S. Xue, X. Fu, A. Huang,
X. Qiang, P. Xu, J. Liu, S. Zheng, H.-L. Huang, M. Deng,
D. Poletti, W.-S. Bao, and J. Wu, arXiv:1905.08394
(2019).
[42] V. Gogate and R. Dechter, in Proc CUAI (2012) pp.
201–208, arXiv:1207.4109.
[43] T. Jones, A. Brown, I. Bush, and S. C. Benjamin,
Scientific Reports 9 (2019), 10.1038/s41598-019-47174-9.
[44] S. Boixo, GRCS .
[45] A. Lokhmotov and A. Mycroft, in Proceedings of the
19th ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA’ 07) (ACM, 2007) p. 198.
[46] T. H. Weng, S. W. Huang, R. K. Perng, C. H. Hsu, and
K. C. Li, in Lecture Notes of the Institute for Computer
Sciences, Social-Informatics and Telecommunications
Engineering , Vol. 18 LNICST (Springer, 2009) pp. 206–
216.
[47] G. Knittel, in 17th DSP 2011 International Conference
on Digital Signal Processing, Proceedings (IEEE, 2011)
pp. 1–6.
[48] J. Emerson, R. Alicki, and K. Zyczkowski, Journal of
Optics B: Quantum and Semiclassical Optics 7, S347
(2005).
[49] J. P. Gaebler, A. M. Meier, T. R. Tan, R. Bowler, Y. Lin,
D. Hanneke, J. D. Jost, J. P. Home, E. Knill, D. Leibfried,
and D. J. Wineland, Physical Review Letters 108, 12307
(2012).
[50] E. Magesan, J. M. Gambetta, and J. Emerson, Physical
Review Letters 106, 180504 (2011).
[51] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N.
Cleland, Physical Review A - Atomic, Molecular, and
Optical Physics 86, 32324 (2012).
[52] R. Barends, J. Kelly, A. Megrant, A. Veitia, D. Sank,
E. Jeffrey, T. C. White, J. Mutus, A. G. Fowler, B. Camp-
bell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth,
C. Neill, P. O’Malley, P. Roushan, A. Vainsencher,
J. Wenner, A. N. Korotkov, A. N. Cleland, and J. M.
Martinis, Nature 508, 500 (2014).
[53] R. Barends, L. Lamata, J. Kelly, L. Garc´ıa-A´lvarez,
A. G. Fowler, A. Megrant, E. Jeffrey, T. C. White,
D. Sank, J. Y. Mutus, B. Campbell, Y. Chen, Z. Chen,
B. Chiaro, A. Dunsworth, I. C. Hoi, C. Neill, P. J.
O’Malley, C. Quintana, P. Roushan, A. Vainsencher,
J. Wenner, E. Solano, and J. M. Martinis, Nature Com-
munications 6, 7654 (2015).
