Faster Schr\"odinger-style simulation of quantum circuits by Fatima, Aneeqa & Markov, Igor L.
Faster Schro¨dinger-style simulation of quantum circuits
Aneeqa Fatima
aneeqaf@umich.edu
University of Michigan
2260 Hayward St.
Ann Arbor, MI 48109-2121
Igor L. Markov
imarkov@umich.edu
University of Michigan
2260 Hayward St.
Ann Arbor, MI 48109-2121
ABSTRACT
Recent demonstrations of superconducting quantum computers
by Google and IBM and trapped-ion computers from IonQ fueled
new research in quantum algorithms, compilation into quantum
circuits, and empirical algorithmics. While online access to quan-
tum hardware remains too limited to meet the demand, simulating
quantum circuits on conventional computers satises many needs.
We advance Schro¨dinger-style simulation of quantum circuits that
is useful standalone and as a building block in layered simulation
algorithms, both cases are illustrated in our results. Our algorith-
mic contributions show how to simulate multiple quantum gates
at once, how to avoid oating-point multiplies, how to best use
instruction-level and thread-level parallelism as well as CPU cache,
and how to leverage these optimizations by reordering circuit gates.
While not described previously, these techniques implemented by
us supported published high-performance distributed simulations
up to 64 qubits. To show additional impact, we benchmark our
simulator against Microso, IBM and Google simulators on hard
circuits from Google.
1 INTRODUCTION
antum computation was rst developed theoretically to acceler-
ate computational bolenecks using quantum-mechanical phenom-
ena [1]. Among several promising quantummodels of computation,
quantum circuits have been implemented in several technologies,
for which end-to-end programmable computation has been demon-
strated at intermediate scale [2]. In 2019, Google claimed quantum-
computational supremacy [3, 4] by scaling quantum computation
to the point where simulating it becomes exceptionally challeng-
ing even on supercomputers. e signicance of quantum design
automation tools has been appreciated for many years, to the point
where IBM and Microso have developed extensive toolchains
(QISKit and QDK respectively), which include language support,
compilation, optimization, and a variety of execution back-ends for
physical quantum computers and circuit simulators. Just like in
conventional Electonic Design Automation, such soware allows
one to validate prototype designs before building the hardware.
antum circuit simulation in general is inherently dicult due
to the exponential growth in the number of internal parameters
and complexity-theoretic reasons [5, 6]. We distinguish several
categories and uses of quantum-circuit simulation:
(1) Polynomial-time simulation of ”easy” special-case quan-
tum circuits — those using Cliord gates [7], many in-
stances of Grover’s algorithm [8], and circuits with small
tree-width [9]. Such simulation is used to (i) rule out quan-
tum speed-up in specic algorithms, and (ii) for limited
initial testing of quantum computers in the lab and debug-
ging of failed tests.
(2) Best-eort simulation of unrestricted ”small” quantum cir-
cuits up to 40 qubits, illustrated in Section 6. Such simu-
lation is used (i) for broad testing and routine debugging
of quantum computers, (ii) to verify local quantum circuit
transformations and whole circuits, as well as (iii) to eval-
uate new quantum algorithms, quantum error-correcting
codes and device architectures [10]. In particular, VQE
algorithms common for quantum-chemistry applications
run numerous quantum circuits in a sequence, their devel-
opment particularly benets from fast simulation.
(3) Distributed quantum-circuit simulation on clusters and su-
percomputers [11, 12, 14–18], including the world’s largest
[19]. Such expensive and scarce resources are used to
verify quantum computation and set performance base-
lines when claiming quantum-computational supremacy
[3, 4, 17]. Whereas other uses entail repeated on-demand
simulations on readily available computing hardware, this
category targets a small number of ”expensive” one-o
simulations.
All three categories of quantum-circuit simulation are in demand
today, but high-performance simulation of unrestricted quantum
circuits in Categories 2 and 3 is challenging and motivates algorith-
mic improvements of the type we propose. Using fast simulation
to evaluate, verify, test and debug larger quantum circuits and
algorithms facilitates the development of many applications.
Schro¨dinger-style simulation is the mainstream technique for
general-case simulation of quantum algorithms, circuits and phys-
ical devices. It represents a quantum state (wave function) by a
vector of complex-valued amplitudes and modies this vector in
place by applying quantum transformations (quantum gates, laser
pulses, algorithm modules, etc). Schro¨dinger simulation
• scales linearly with computation (circuit) depth but ex-
ponentially with the number of qubits, or width (Section
6.3);
• is popular for small and mid-size quantum-circuit and de-
vice/technology simulations because its unoptimized vari-
ants are relatively straightforward to implement;
• dominates supercomputer-based quantum circuit simu-
lations because it can leverage distributed memory, fast
interconnect, GPUs, etc [11–16, 18, 19];
• has been extended for beer scalability via layered simu-
lation [18, 20, 21]. For example, combining Schro¨dinger
simulation with Feynman path summation enables scaling
tradeos between circuit depth and width [5] and orders-
of-magnitude resource savings in important cases [21].
ar
X
iv
:2
00
8.
00
21
6v
1 
 [q
ua
nt-
ph
]  
1 A
ug
 20
20
Our work accelerates both pure Schro¨dinger simulation and layered
algorithms that use it, as we illustrate empirically for Schro¨dinger-
Feynman simulation from [21]. Our algorithmic insights and in-
novations oer both constant-time implementation speed-ups and
algorithmic speed-ups that scale with qubit count:
• Careful selection of the oating-point type backed by nu-
merical accuracy improvements and checks, so as to reduce
memory footprint and memory bandwidth.
• Avoiding most oating-point multiplications, in favor of
faster additive and bitwise instructions.
• Batched simulation of diagonal quantum gates that sig-
nicantly reduces expensive memory traversals and the
overall runtime, while exposing thread-level parallelism.
• Encoding sets of same-type diagonal gates by bitmasks
(Figure 6), simulating them with bitwise and mod-p CPU
instructions, as well as leveraging Gray codes in such sim-
ulation.
• Encoding sets of same-type single-qubit (not necessarily
diagonal) gates by bitmasks (Figure 6) and simulating them
using a recursive FFT-like algorithm that improves cache
locality and exposes thread-level parallelism.
• e insight that some quantum gates (implemented in su-
perconducting quantum computers) are easier to simulate
in pairs because this simplies matrix elements and bene-
ts from batched load/store operations.
• Gate clustering by type, contrasted with the common gate
fusion that clusters heterogenous gates that share qubits.
We develop a reordering-based clustering algorithm that
nds larger homogenous clusters (Figure 6).
• Implementing our algorithms with extensive use of AVX-2
instructions that improves productivity per instruction.
Our empirical results start with comparisons on smaller circuits
where soware from IBM and Microso can be used, then study the
scalability of our techniques and explain how they enrich layered
simulation methods. On a MacBook Pro laptop with 16GiB RAM, we
simulate circuits with a 5×5-qubit array to any depth, with 20× and
12× speedups over simulators from Microso QDK and IBM QISKit
/ QASM, respectively. Our simulator Rollright uses 3.27× and 1.7×
less memory respectively and can also simulate 6 × 5-qubit circuits
of any depth. On a mid-range server, we simulate up to 6 × 6-qubit
circuits and the illustrate Schro¨dinger-Feynman simulation [21] that
assembles results of multiple half-sized Schro¨dinger simulations,
benets from our techniques, and shows 700−4000× speedups over
QDK and QISKit. Additional comparisons to the Qsim simulator
currently under development at Google help estimating the impact
of specic innovations in our work.
e remaining part of the paper is structured as follows. In
Section 2 we review minimal background in quantum circuits and
introduce relevant quantum gates. Section 3 introduces basic ideas
behind Schro¨dinger-style simulation. Our algorithmic framework
is presented in Section 4, including key design decisions and some
of the performance optimizations. Section 5 describes several ways
in which we leverage the CPU architecture and available hardware
resources. Empirical results and scalability studies are reported in
Section 6, followed by conclusions in Section 7. Key concepts are
illustrated by examples.
2 BACKGROUND
A quantum circuit onn qubits is a sequence of quantum gates that act
on quantum states represented by 2n-dimensional complex-valued
vectors [1]. e computation usually starts with the basis vector
(1, 0, . . . , 0) which sets each qubit in the |0〉 state rather than the
|1〉 state. antum gates transform this initial n-qubit state into a
superposition of 2n basis vectors (where each vector is labeled by
some n-bit binary number j and participates with the complex am-
plitude α j ). antum measurements are traditionally performed at
the end to stochastically read out non-quantum bits, while destroy-
ing the quantum state. e probabilities of measurement outcomes
depend on the values α j . In this work, we assume that there is
enough memory to represent all of these values, in which case
simulating measurements is straightforward. us, our task is to
nd the nal values of all α j , also known as strong simulation.
e types of quantum gates used by industry quantum comput-
ers include a handful of one- and two-qubit gates. In quantum
computers based on superconducting technologies, qubits are of-
ten arranged in a planar grid, and two-qubit gates are restricted to
nearest-neighbor qubits. Yet, ourmethods directly handle two-qubit
gates acting on any pair of qubits (as in ion-trap computers).
Specic quantum gates are dened by 2 × 2 or 4 × 4 unitary
matrices [1]. Single-qubit quantum gates include
NOT =
[
0 1
1 0
]
,H =
1√
2
[
1 1
1 −1
]
, and Z =
[
1 0
0 −1
]
,
where NOT negates the state of the qubit, H sets the qubit into a
superposition of |0〉 and |1〉, while Z shis the phase of the qubit.
Multiple qubits can be coupled using the Controlled-NOT (CNOT
and Controlled-Z (CZ ) gates:
CNOT =

1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
 CZ =

1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 −1
 .
Example 2.1. Figure 1 illustrates a two-qubit circuit with two
Hadamard gates around a CNOT gate, followed by measurements
on each qubit. e three-gate circuit is equivalent to a CZ gate:
(I ⊗ H ) CNOT (I ⊗ H ) = CZ , (1)
where ⊗ represents the Kronecker product and I represents the
identity matrix of appropriate dimension.1 Given that the CZ gate
is diagonal, it maps |11〉 into − |11〉 and the remaining three basis
vectors to themselves. erefore, if the circuit starts with any basis
vector, the measurement will deterministically produce this vector.
In general, diagonal operators/gates do not create superpositions.
In many circuits, one-qubit gates create separable superpositions,
which diagonal gates then turn into entangled superpositions.
antum circuits with only the gates introduces so far (along
with P = Z 1/2) can be simulated in polynomial time by a compact
algorithm, hence they do not oer quantum computational advan-
tage [7]. Among additional gates supported by Google Bristlecone
and Sycamore chips [3, 4, 25]
X
1/2 = 12
[
1 + i 1 − i
1 − i 1 + i
]
, Y
1/2 = 12
[
1 + i 1 + i
−1 − i 1 + i
]
, T =
[
1 0
0 epi i/4
]
,
1A similar equation expresses CNOT via CZ and two H gates.
2
|q0〉
|q1〉 Z
=
|q0〉
|q1〉 H H
Figure 1: antum circuit diagrams for equivalent circuits
using idx_size = unsigned long long;
template<typename function>
void Apply1QGate(cmplx* w, // wave function
int num_qubits, int target, function& gate_func)
{
const idx size w_size = 1ull << num_qubits,
block_size = 1ull << (num_qubuts - target)
num_iters_per_block = block_size / 2,
gate_bitmask = (1ull << ((num_qubits - 1) - target)),
offset_idx[2] = {0, 1ull << ((num_qubits - 1) - target)};
idx size idx[2] = {0, 0};
while (block_idx < w_size)
{ if (block_idx != 0 && block_idx % (2 * block_size) == 0)
block_idx += block_size
if ((block_idx & gate_bitmask) == 0)
{ idx[0] = offset_idx[0] + block_idx;
idx[1] = offset_idx[1] + block_idx;
cmplx* new_w[2] = gate_func(w[idx[0]], w[idx[1]]);
w[idx[0]] = new_w[0]; w[idx[1]] = new_w[1];
++block_idx; }
else block_idx += (block_idx & gate_bitmask); }
}
Figure 2: Simulating a one-qubit gate on a wave function.
X
1/2 = HPH and Y 1/2 = HZ , but T gates cannot be expressed this
way and therefore hamper polynomial-time simulation [7].2 Adding
T gates facilitates universal quantum computation [22, 23].
Circuit depth is dened as the length of a longest monotonic
path of unitary gates through the circuit.
Example 2.2. Figure 6 shows two equivalent circuits with depth
1+4+1 (1+ and +1 represent the initial and nal rounds of Hadamards).
3 BASELINE SIMULATION ALGORITHMS
Equation 1 illustrates how quantum circuits can be evaluated. First,
order the gates le to right (parallel gates can be ordered arbi-
trarily), pad each gate with an identity matrix of an appropriate
dimension via Kronecker products to obtain a 2n × 2n matrix, and
then multiply all those matrices in order (Equation 1). e resulting
operator represents the entire circuit and can be multiplied by input
state vectors to nd output vectors. While mathematically simple,
this method is enormously wasteful and usually infeasible in prac-
tice. Instead, one applies each gate to the state vector, to avoid
2In this context, we emphasize that the runtime of our simulation algorithms scales
linearly with the number of gates, regardless of gate type.
matrix-matrix multiplications. A key insight in high-performance
Schro¨dinger simulation is how not to pad gates with identity matri-
ces [11–16, 18, 19].
Fast Schro¨dinger simulation. For a q-bit gate dened by its
2q × 2q matrix and circuit qubits i0, . . . , iq−1 to which it is applied,
a typical simulation algorithm modies the 2n-dimensional state-
vector in-place. Such a simulation traverses the state-vector and
enumerates all 2n−q disjoint sets of 2q amplitudes, to which the gate
can be applied in-place. To specify these sets, we turn to the binary
(n-bit) representations of amplitude indices. Each set exhibits all 2q
combinations of bits indexed i0, . . . , iq−1, whereas the remaining
n − q bits are common and form a set id.
Since some gates modify only a fraction of amplitude sets, addi-
tional speedups are available by skipping the remaining sets.
Example 3.1. CZ gates are commonly used in circuit design and
favored because they are qubit-symmetric and because a CNOT
can be expressed via CZ and H gates. To simulate a single CZ gate
acting on qubits i0 and i1, note that it ips the sign of amplitude α j
when j has 1s at binary positions i0 and i1, but otherwise leaves α j
unchanged.3 Hence, the simulation traverses all amplitudes, and
for each α j decides whether to ip the sign based on the bits of
j. Such simple linear memory passes benet from standard CPU
caching and prefetching policies.
Universal quantum gate libraries oen complement the CZ gate
with one-qubit gates [22, 23]. erefore, aminimal circuit-simulation
framework can be completed with an algorithm to simulate an arbi-
trary one-qubit gate acting on qubit i0 (simulating measurements is
straightforward from the denition when all amplitudes are avail-
able). Diagonal one-qubit gates, such as Z and T can be simulated
similarly to how we simulate CZ , except that only one bit of the
amplitude index j needs to be considered.
Example 3.2. NOT gates are not diagonal and swap pairs of
amplitudes whose indices dier at bit i0. One simulates a NOT
gate in one pass over the state vector as follows: for each j , if bit i0
is zero, swap α j with α j′ , where j ′ diers from j at bit i0 only.
Example 3.3. A generic one-qubit gate can be simulated by iso-
lating the gate action to pairs of amplitudes whose indices dier
in one bit only, such as 46=b101110 and 38=b100110. Rather than
swap these amplitudes (as for NOT gates), it applies the 2 × 2 gate
matrix. Figure 2 illustrates this with our C++ code, which addition-
ally exploits bitwise instructions for eciency. To scale this code
beyond 64 qubits, our simulator redenes idx size.
4 OUR ALGORITHMIC FRAMEWORK
We now introduce key techniques and optimizations for advanced
Schro¨dinger-style simulation. First, we show how to achieve su-
cient numerical accuracy with the 32-bit float type and outline
the benets this brings. Second, we introduce a new gate-clustering
approach that forms clusters of gates of a kind. en we focus on
optimizations for each gate type, and point out that clustering can
be improved by circuit reordering. To be specic, we use the gate
library from Section 2, but the framework supports other common
gates too, e.g., by re-expressing them via the gates we consider.
3is test can be implemented using bitmasks by rst dening mask = 1ull << i0 —
1ull ¡¡ i1 and then checking j & mask == mask.
3
4.1 Data type selection and numerical accuracy
Given that state vectors consist of complex-valued amplitudes, the
code in Figure 2 assumes a complex-valued type, and we need to
choose a oating-point type to implement it. e two basic alter-
natives on modern computers are the 32-bit float and the 64-bit
double. Higher-precision types are also available and have been
used for quantum simulation, but signicantly increase the com-
putational load. Our choice of the 32-bit float ensures a reduced
memory footprint and not only helps to simulate more qubits with
limited memory, but also improves memory throughput, which hap-
pens to be a boleneck aer simulation algorithms are properly
optimized. Most other simulators rely on double, which handicaps
them in memory and runtime (see Section 6).
To make our use of float feasible, we need to maintain numeri-
cal accuracy during simulation in the face of potential underows.
Among our gates,NOT , Z , CZ , and even T gates do not signicantly
change themagnitudes ofα j values, butH , X
1/2and Y 1/2include 1/√2
or 1/2 factors which, aer hundreds of gates are applied, can lead to
numerous underows. To avoid underows, we maintain a global
power of 1/√2 to accumulate contributions from individual gates.
Specically, we store a single integer value (starting with 0) and in-
crement it whenever we encounter a factor of 1/√2 (and increment
twice for 1/2). is value can be accounted for when reading o
α j values at the end of the simulation, but that sometimes leads to
very large values. erefore, we “ush” accumulated (1/√2)p when
p > 100, back into α j . We use a similar counter s for global phase is ,
but it cycles through only four possible values. By inspecting gate
matrices in Section 2, one can see that, aer factoring out 1/√2, all
gates can be simulated without oating-point multiplies to improve
both speed and accuracy.
Example 4.1. To simulate a T gate without oating-point multi-
plies, note that the only nontrivial multiplication involves exp(pii/4) =
(i + 1)/√2. is multiplication can be realized by rst incrementing
the p count and then using (i + 1)z = iz + z = (Re(z) − Im(z)) +
i(Re(z)+Im(z)). In other words, add a complex number z to its prod-
uct by i , the laer computed by swapping the real and imaginary
parts and negating the real part. Also see Example 5.1.
As explained in Section 4.2, our simulator rarely deals with T
gates one by one, but rather clusters them and simulates entire clus-
ters, eliminating not only oating-point multiplies, but also most
oating-point additions and subtractions (using integer arithmetic
and bit-parallel instructions instead).
When relying on the compact float type, it is important to
explicitly check for accuracy loss. Since quantum states are rep-
resented by norm-one vectors, we compute the norm before mea-
surement and check how close it is to 1. In practice, the norm
computation itself can introduce greater errors than our simulation,
unless done carefully. Indeed, small contributions of individual am-
plitudes α j are accumulated in a much larger running sum. Adding
a very small number to a much larger number exposes mantissa
limitations. is pitfall can be avoided by representing the run-
ning sum during the norm computation by the higher precision
double type and/or by representing the running sum with a pair of
oating-point values — a common technique for robust arithmetic
in numerical analysis [26].
Using pairs of compact float values to represent complex ampli-
tudes oers an additional, less obvious benet: four pairs of complex
numbers can be added by one AVX-2 instruction (see Section 5).
4.2 Gate clustering and bitmask encoding
Prior quantum simulators [14] typically cluster adjacent gates act-
ing on the same qubits (when this is possible) up to 5 qubits, multiply
out gate matrices up to 32 × 32, and then optimize matrix-vector
multiplication with SIMD multiply-accumulate instructions [14].
Given that we have eliminated most oating-point multiplies for
individual gates, adopting this approach would be a step back.
However, simulating one gate at a time requires expensive memory
traversals. We propose a new gate clustering that considerably re-
duces memory traversals by forming larger clusters yet still avoids
oating-point multiplies and MAC instructions. is is accom-
plished by clustering adjacent gates of a kind — diagonal gates
(T and CZ ) separately from one-qubit non-diagonal gates (H , X 1/2,
Y 1/2). While the decision on how to cluster gates (Figure 6) may
not seem particularly insightful, it enables bitmask encodings and
downstream optimizations with profound impact on simulation
performance, as we show in the rest of the paper.
Clustering diagonal gates together and one-qubit gates together
brings a number of benets. Here we rely on the fact that diagonal
gates act on individual α j values without permuting or mixing
these values. In particular, the order in which diagonal gates are
applied (within the cluster) does not maer. Along these lines, the
order of one-qubit gates acting on dierent qubits does not maer.4
For each type of one-qubit gate, such as T gates, we encode circuit
gates in each cluster using bitmasks.
Example 4.2. In Figure 6, the four-qubit circuit on the right con-
tains a cluster of T gates on qubits 0-3. is cluster can be rep-
resented by the bitmask 15=b1111, neglecting the order in which
these gates were listed in the circuit. e same encoding is used for
X
1/2 gates (7=b0111) and Y 1/2 gates (14=b1110).
A single bitmask cannot encode multiple T gates on one qubit,
but such gates can be separated into adjacent layers (cycles) and
captured using one bitmask per layer.5 Bitmask encodings of CZ
gates are more involved and discussed in Section 4.3.
So far, we explained which gates we cluster and outlined the
logic behind the approach. As will be seen in Section 4.3, our
use of bitmasks signicantly reduces the number of oating-point
operations by (i) rst consolidating the phases contributed by CZ
andT gates to each amplitude index j , and (ii) applying the resulting
phases to the amplitudes α j , when the phases are , 1. Additionally,
simulating all diagonal gates in one memory pass over amplitudes
α j reduces memory trac that is oen the main limiting factor
when large amounts of memory are used.
Optimizations for non-diagonal gates are covered in Section 4.4.
For algorithmic details on gate clustering see Section 4.5.
4Clustering gates of a kind helps nd hidden gate cancellations. Such gate cancellations
sometimes appear in compiled circuits, but not inwell-designed simulation benchmarks
such as the ones we use in this work [3, 24].
5 e benchmarks used in this work [3, 24] do not include repeated gates.
4
/* Returns the 8 phases effected by a set of CZ gates
on 8 consecutive amplitudes */
unsigned char GetCZPhaseInBlockUsingGrayCodesAndBitmask(
idx size num_qubits,
// one bitmask per qubit
idx size* __restrict CZ_bitmasks,
idx size block_idx /* block of 8 floats */)
{
// gc : gray codes
idx size prev_gc = (block_idx - 1) ˆ ((block_idx -
1) >> 1), gc0 = prev_gc, gc4 = (block_idx + 4)
ˆ ((block_idx + 4) >> 1);
const idx size gc[8] = {gc0, gc0 ˆ 1, gc0 ˆ 3, gc0 ˆ
2, gc4, gc4 ˆ 1, gc4 ˆ 3, gc4 ˆ 2};
unsigned char z_phase_result = 0u;
/* In blocks of 4, indices 0,1,0 capture the
application of CZ on the least-sig qubit. */
const idx size bit_idx[8] = {__builtin_ctzl(gc[0] ˆ
prev_gc), 0, 1 , 0, __builtin_ctzl(gc[4] ˆ
gc[3]), 0 , 1, 0};
// Get CZ parity for the prev. set of 1-bit indices
idx size gate_count = 0;
for (idx size i = 0; i < num_qubits; ++i)
if (((prev_gc & (1ull << i)) == (1ull << i))
&& (prev_gc & CZ_bitmasks[i]))
gate_count += __builtin_popcountll((prev_gc
& CZ_bitmasks[i]));
if (gate_count & 2) z_phase_result |= 1 ;
/* Update CZ gate state by calculating
the new parity in the current block */
for (int i = 0; i < 8; ++i)
if (__builtin_parityl(CZ_bitmasks[bit_idx[i]] &
gc[i]) == 1)
z_phase_result |= z_phase_result & (1 << i)
? 0: 1 << i;
return z_phase_result
}
Figure 3: Our optimized algorithm for simulating a bitmask-
encoded cluster ofCZ gates on a block of consecutive ampli-
tudes. It performs a loop-unrolled Gray-code traversal on a
block of 8 consecutive indices. e 8 returned phase values
determine whether or not to negate each amplitude in the
wave function. Compiler intrinsics are explained in Table 1.
4.3 Optimizations for diagonal gates
For a cluster of T and CZ gates, we traverse the entire state vector
once. For each amplitude α j , we calculate the new value aer all
gates in the cluster are applied. For each gate in the cluster, the task
is simple. For example, a T gate acting on qubit i0 leaves unchanged
those α j values where the binary form of j has 0 at bit position i0,
and multiplies the remaining α j by exp(pii/4). A CZ gate acting on
qubits i0 and i1 negates α j when j has bits 1 at positions i0 and i1.
e handling of diagonal clusters can be optimized further. We
form n-qubit layers of T gates, where each layer has at most one
gate on any given qubit and can thus be encoded by a bitmaskm,
where each gate location is represented by a 1 bit. Bitmasks are
formed before the memory pass. For each amplitude α j , each bit
of each bitmask may contribute a factor of exp(pii/4) or a factor of
1. e nontrivial contribution occurs when a 1-bit in bitmaskm
matches a 1-bit in index j.
Example 4.3. To count pairs of matching 1-bits between an am-
plitude index j and a T -gate bitmaskm, two single-cycle CPU in-
structions suce: popcount(m & j). See Table 1 for more details.
Since T 8=I, the number of matching bits above can be taken
mod-8. Applied as a power to exp(pii/4), this integer yields eight
possible values: ±1, ±i and (±1±i)/√2. Wemultiply by these values
using increments of the p counter, oating-point negations, and
swaps of real and imaginary parts (Section 4.1).
To leverage fast bit-based CPU instructions, bitmasks are stored
in 64-bit integers when simulating ≤ 64 qubits. Processing dozens
ofT gates in a cluster by several bit-based operations per amplitude
is much more ecient than simulating gates one by one.
To simulate CZ gates, each CZ gate can be encoded by a bitmask
m with two nonzeros, such bitmasks stored in a list (as long as the
number of CZ gates). en for each α j and each CZ gate, we can
check if m & j == m bitwise, in which case we increment a counter
of contributions. Since each CZ gate can contribute only a factor
of 1 or -1, contributions can be aggregated and then we can either
apply the resulting -1 or do nothing (saving a memory write).
When dealing with large clusters of CZ gates on n qubits, a more
ecient approach is to use (up to n − 1) bitmasks that can capture
multiple gates each. For qubit k > 0, the bitmaskmk represents
(qubits l < k of) CZ gates that also act on qubit k . To each α j ,
these gates can cumulatively contribute phase 1 or -1, which we
determine by aggregating parityll(mk & j) over all k such that
j & (1ull << k) , 0.
Example 4.4. Consider a cluster of six CZ gates that couple all
pairs of four qubits. is cluster is encoded by the following set of
bitmasks (one per qubit): m1 =1000,m2 =1100,m3 =1110. Note
that there are exactly six nonzero bits total across these bitmasks.
A further optimization uses Gray codes. Specically, we traverse
α j in a Gray code order, so that j changes one bit at a time to mini-
mize necessary updates. In this work, we use the more-common
reected Gray code that can be produced from a regular counting
sequence k = 0, 1, 2, 3, . . . with bit operations j = k ⊕ ( k >> 1
).
Example 4.5. e three-bit reected Gray code uses codewords
000-001-011-010-110-111-101-100 or 0-1-3-2-6-7-5-4. Note that
this code is cyclic — the rst and the last values dier in one bit. It
can be obtained from its rst half by seing the most signicant
bit to 1 and reecting the rst half.
Given a paern of CZ gates in a bitmask-encoded cluster, we
precompute which CZ gates become active (-1) or inactive (1) when
each bit switches. When processing blocks of indices, instead of
index calculations from scratch, we incrementally update the “state”
from the previous block. is technique reduces the complexity of
amplitude updates from O(n) to O(1) time, aer initialization.
Figure 3 implements the ideas above, using advanced CPU in-
structions via compiler intrinsics (Table 1). e code works with
5
bitmask-encoded CZ gates and nds the implied Z phase changes
for a block of 8 amplitude indices. Returned as a byte, these 8 bits
determine if respective amplitudes must be negated. e function
can be used in a thread-parallel traversal of the wave function in
conjunction with aligned memory reads (see Section 5).
Other common diagonal gates can be simulated natively or by
expressing them via supported gates, e.g., P = T 2 and Z = T 4.
4.4 Optimizations for non-diagonal gates
Recall that all non-diagonal gates in our gate library are one-qubit
gates. If one wanted to simulate a CNOT gate, it can be re-expressed
using a CZ gate and two H gates on the sides. us, we are now
simulating the following gates (leading factors extracted):
H ′ =
[
1 1
1 −1
]
, X
1/2′ =
[
1 + i 1 − i
1 − i 1 + i
]
, Y
1/2′ =
[
1 + i 1 + i
−1 − i 1 + i
]
When dierent gate types are applied on the same qubit, their
order maers. However, gates applied on dierent qubits can be
reordered. us, we cluster gates of each kind into layers, and
represent each layer by a bitmaskm. Since applying gates one at a
time is inecient, we apply them two at a time. While this requires
/* Writes to idxs the 1st set of N=2ˆnum_gate_qubits
indices of amplitudes on which the gate can be
applied.*/
void GetFirstSetOfIndicesInWToApplyGate(
int num_qubits // qubits in wave function,
idx size* idxs, // starting idxs are written here
idx size gate_bitmask)
{
const idx size num_gate_qubits =
__builtin_popcountll(gate_bitmask);
idx size num_idxs = 1,
stride = 1ull << (num_gate_qubits - 1);
idx size prev_stride = stride;
for (idx size i = idxs[0]; i < num_gate_qubits; ++i)
{
idx size q = __builtin_ctzl(gate_bitmask),
shift = (num_circuit_qubits - 1) - q;
for (idx size n = 0; num_idxs < (1ull << (i +
1)); n += prev_stride) {
idxs[n + stride] = idxs[n] + (1ull << shift);
++num_idxs;
}
gate_bitmask ˆ= 1ull << q;
prev_stride = stride;
stride /= 2;
}
}
Figure 4: Our algorithm for extracting the rst set of in-
dices of the wave function to apply a generic k-qubit gate
on qubits specied by the bitmask. Other sets of indices are
obtained by shiing the rst set, as shown in Figure 5. See
Table 1 for compiler intrinsics.
fetching more data at a time, the resulting matrices
H ′⊗H ′ =

1 1 1 1
1 −1 1 −1
1 1 −1 −1
1 −1 −1 1
 ,Y
1/2′⊗Y 1/2′ = i

1 1 1 1
−1 1 −1 1
−1 −1 1 1
1 −1 −1 1
 ,
X
1/2′ ⊗ X 1/2′ =

i 1 1 1 − i
1 i 1 − i 1
1 1 − i i 1
1 − i 1 1 i

use mostly ±1 and ±i as their entries and can be multiplied by
without oating-point multiply andMAC instructions. For example,
template<typename function>
void Apply2QGates(cmplx* __restrict w, //wave function
idx size gate_qubits, int num_qubits,
function& gate_func, idx size collected_amps /* = 4
when gate_func is in AVX-256 */)
{
const idx size num_idxs = 4, w_size = 1ull <<
num_qubits,
gate_bitmask = (1ull << ((num_qubits - 1) -
__builtin_ctzl(gate_qubits))) |
(1ull << ((num_qubits - 1) - (63 -
__builtin_clzl(gate_qubits))));
array<idx size, num_idxs> starting_idxs;
/* Get starting indices into the wave function to
start applying the gates on */
GetFirstSetOfIndicesInWToApplyGate(num_qubits,
starting_idxs.data(), gate_qubits);
w = (cmplx*)__builtin_assume_aligned(w, 64);
idx size iters = 0, idx = 0;
array<idx size, num_idxs> temp_idxs;
const idx size num_iters = w_size / num_idxs;
while (iters < num_iters) {
/* Skip block where the gate has already been
applied. This is where the bits in the wave
function index are 1 at the position of the
qubit value. */
if (!(idx & gate_bitmask)) {
iters += collected_amps;
/* Increase the starting indices by idx to
progress through the wave function */
for (idx size i = 0; i < num_idxs; ++i)
temp_idxs[i] = starting_idxs[i] + idx;
gate_func(w, temp_idxs.data());
idx += collected_amps;
}
else idx += (idx & gate_bitmask);
}
}
Figure 5: Simulating a two-qubit gate on a q. state using
index extraction (Figure 4) and compiler intrinsics (Table 1).
6
0 |0〉
1 |0〉
2 |0〉
3 |0〉
H T X H
H Z T X Y H
H T X Y H
H Z T Y H
7−→
|0〉
|0〉
|0〉
|0〉
H T X H
H Z T X Y H
H T X Y H
H Z T Y H
Figure 6: Clustering gates of a kind by reordering, with the algorithm from Section 4.5. In the gure, gates are ordered top
down and le to right, to make clusters contiguous. e X and Y boxes above represent X 1/2and Y 1/2gates, respectively. Circuit
depth is 1+4+1, and the circuit (not including the initial and nalHadamard gates) can be encoded using the following bitmasks
as follows. CZ gates:m1=b0001,m3=b0100; T gates: b1111; X
1/2gates: b0111; Y 1/2gates: b1110.
multiplying a complex value αk by the imaginary i entails a swap
of the real and imaginary parts and one negation. Using such
observations, we have developed several algorithms in the spirit
of Figure 2. A common paern in these algorithms is that the two-
qubit gate combination acts each time on four amplitudes whose
indices dier by two bits. To extract such indices, we nd the rst
set using the code in Figure 4 and obtain the remaining sets by
shiing the indices as shown in Figure 5.
Our rst implementation uses a generic gate_func as in Figure
2, but is not limited to one-qubit gates. As shown in Figure 5, it
extracts sets of amplitude indices, then for each set loads the am-
plitudes from the wave function, applies gate_func to them and
saves the result back into the wave function. Our second implemen-
tation in Section 5.3 uses the same index-extraction mechanism,
but increases instruction-level parallelism via custom code for each
pair of one-qubit gates (hence, no generic gate function is passed).
4.5 Gate clustering by reordering
To perform clustering outlined in Section 4.2, we assume a quantum
circuit specied by a list of gates given by their matrices and qubits
on which they act (Section 2). e gates are ordered so that gate дa
whose output qubit acts as input to gate дb appears before дb , but
parallel gates can be ordered either way. Such topologically sorted
orders are generally not unique. Moreover, diagonal gates can
always be reordered without aecting circuit functionality. Google
benchmarks [3, 24] additionally pack parallel gates into numbered
“cycles”, but we ignore this additional structure.
When gates of a kind are adjacent in the given gate ordering,
they form a natural cluster. However, we also nd non-adjacent
gates of a kind that can be reordered to form larger clusters. e
search proceeds from the beginning of the circuit in topological
order. Start with a gate дk that cannot be in a cluster formed before
(such as the rst gate in the circuit) and assume that no gates aer
it can be in a cluster formed before (or else they would have been
reordered). e inner loop of the algorithm (for a xed дk ) nds
gates that cluster with дk and reorders them accordingly. e outer
loop goes over all yet-unclustered gates дk . e state of the inner
loop designates each qubit as unobstructed (initially) or obstructed.
An obstructed qubit prevents a gate of the same type as дk from
being reordered to be adjacent to дk .
e inner loop scans (traverses) gates aerдk , classies each gate
дl in one of three categories and performs the following actions:
(1) a gate of the same kind as дk — if all of the gate’s qubits
are unobstructed, then reorder the gate toward дk to form a
larger cluster, else mark all of the gate’s qubits obstructed;
(2) a gate дl of a dierent kind that can be reordered with дk
— do not change qubit designations because gates of the
same kind as дk can be reordered past дl towards дk ;
(3) a gate that cannot be reordered with a дk — mark all of the
gate’s qubits as obstructed.
e scan from дk continues until all qubits are marked as obstructed
or all gates have been scanned. Upon the completion of the scan, all
qubits are reset to unobstructed, and the next iteration of the outer
loop focuses on to the next gate not in a cluster. is algorithm is
guaranteed to put each gate in the circuit into a cluster (single-gate
clusters are allowed).
Example 4.6. e circuit in Figure 6 starts (and ends) with a
cluster of Hadamards, unchanged during reordering. When the
outer loop focuses on the CZ gate on qubits 0-1, it reorders the
other CZ gate to be adjacent. is step relies on the fact that CZ
and T gates are both diagonal, and so can always be swapped. e
second iteration clusters T gates. Subsequent iterations cluster
X 1/2and Y 1/2gates.
e reordering-based clustering is performed once at the begin-
ning of simulation with negligible impact on runtime. Striving for
larger clusters, we put X 1/2 and Y 1/2 gates into the same clusters
(consider them gates of a kind), especially thatX 1/2⊗Y 1/2 is as simple
to multiply by as other Kronecker products above.
5 LEVERAGING THE CPU ARCHITECTURE
Additional eciencies are available by enhancing memory locality,
optimizing algorithms for the CPU cache size (“blocking”), leverag-
ing instruction-level parallelism, and using multiple CPU threads.
5.1 Memory locality and CPU cache blocking
Given that we have eliminated most oating-point multiplies and
simulate large groups of diagonal gates with fast CPU instructions,
performance bolenecks shi towards memory operations. No-
table performance losses are due to cache misses, especially when
7
Instruction Compiler Use in
type intrinsics simulation
Aligned read/write mm256_load_ps
mm256_store_ps
For all gates, loads/-
stores amplitudes be-
tween RAM& registers.
Packed 32-bit
float arithmetics
mm256_add_ps
mm256_sub_ps
mm256_mul_ps
Used with X 1/2, Y 1/2, and
H gates. AVX multipli-
cation — with T gates.
Multiply 32-bit
floats, then
add/subtract
mm256_fmadd_ps
mm256_fmsub_ps
Optional with T gates.
Bitwise opd on
acked 32-bit
floats
mm256_xor_ps
mm256_or_ps
With X 1/2, Y 1/2, H gates.
Packed 32-bit
floats swizzle
mm256_shuffle_ps
mm256_permute_ps
Used to rearrange re
and im parts of complex
amplitudes for arith-
metic optimizations.
Assume aligned builtin_
assume_aligned
In most kernels lets the
compiler optimize for
aligned vectors.
Count trailing,
leading 0-bits
builtin_ctzl
builtin_clzl
Used to extract indices
and apply bitmasks.
Count 1-bits builtin_
popcountll
Used with CZ and T
gates, also to extract
data from bitmasks.
Parity of 1-bit
count
builtin_
parityll
Used with Z and CZ
gates, with Gray codes.
Table 1: Compiler intrinsics used to accelerate simulation
of specic gate types.
applying one-qubit gates (diagonal gates require a small number
of linear memory passes). To improve memory locality, we take
special care when forming pairs of one-qubit gates. First, the gates
of each kind in a layer are sorted by the qubits they act upon. en,
we form pairs in order, so that paired up gates act on as close qubits
as possible. is reduces memory strides when simulating gate
pairs acting on less signicant bits.
A more sophisticated optimization is blocking. Rather than apply
pairs of one-qubit gates in separate passes, such pairs acting on
dierent qubits can be reordered and even applied partially in
dierent orders. Simulating gates that act on more signicant bits
still suer from long memory strides and frequent cache misses.
To also address those gates, we developed a recursive FFT-like
algorithm for simulating layers of one-qubit (non-diagonal) gates
of a kind. Shown in in Figure 7, this algorithm starts with the most
signicant qubits and simulates gates acting on those qubits (if they
exist), aer which it partitions the state vector into equal-sized
chunks and recurses to individual chunks. Chunks are chosen to
have the smallest size such that the most signicant one-qubit gate
const int kRT = 8 * sizeof(idx size) + 1;
inline int GetNextUsedQubitIndex (const idx size bitmask) {
return bitmask ? __builtin_ctzl(bitmask) : kRT;
}
/* XX, XY, YX gates modify global phase. We accumulate
these phases in an i-counter to avoid multiplies.*/
idx size XYFastTransform (
cmplx* __restrict w, //wave function
idx size X_bitmask, idx size Y_bitmask,
int num_qubits, int num_threads)
{
const int idx_increment = 4; idx size i_phase = 0;
if (X_bitmask & 1 || Y_bitmask & 1)
{
idx size gates_bitmask = 0;
/* Move the bitmask so the next two qubits are the
least significant and find whether to apply XX,
XY, YX, YY.*/
int gate_type = UpdateXYBitmask(
X_bitmask, Y_bitmask, gates_bitmask, i_phase);
Apply2QGates(w, gate_type, gates_bitmask,
num_qubits, idx_increment);
}
const int k = min(GetNextUsedQubitIndex(Y_bitmask),
GetNextUsedQubitIndex(X_bitmask));
// Base condition : end-case when qubit index == 65.
// RT only supported for up to 64 bits indices here.
if (k != kRT) {
const idx size iters = 1ull << k,
stride = 1ull << (num_qubits - k);
X_bitmask >>= k; Y_bitmask >>= k;
idx size temp_i = 0;
for (idx size i = 0; i < iters ; ++i)
temp_i += XYFastTransform( w + (i * stride),
X_bitmask, Y_bitmask, num_qubits - k,
num_threads);
i_phase += temp_i / iters;
}
return i_phase % 4;
}
Figure 7: Our recursive transform (RT) algorithm illus-
trated by applying combinations of X 1/2and Y 1/2gates.
le can be applied within a chunk. Upon recursion, the algorithm
applies multiple non-overlapping pairs of one-qubit gates to each
chunk and moves on to the next chunk. When each chunk ts in
L2 cache, cache misses are reduced and performance is improved.
5.2 read-level parallelism
e recursive FFT-like algorithm has the added benet of exposing
data parallelism aer simulating gates that act on the most sig-
nicant qubits. erefore, dierent branches of recursion can be
processed by dierent CPU threads. Additionally, memory traver-
sals used when simulating diagonal gates expose signicant data
parallelism. We invoke parallel threads using OpenMP pragmas
8
with our C++ code, so that the code runs sequentially when pra-
grmas are ignored. In practice, this brings a 3-4× speedup with 8
threads, aer which threads become memory-starved.
5.3 Instruction-level parallelism
Major performance gains are achieved in our work with aligned
memory reads and writes. Such CPU instructions fetch 256 bits
of data (four complex numbers), but require data alignment to
sucient powers of two. To unlock these optimizations, we had
to give up the use of C++ STL vector classes and instead resorted
to C-style arrays whose memory positioning can be controlled
more precisely. Fortunately, all large arrays in quantum circuit
simulation are sized at large powers of two, and can therefore be
perfectly aligned using a small amount of address arithmetic. When
simulating diagonal gates, each pass becomes faster because the
number of reads and writes is reduced (here we use Gray codes
only within each 256-bit block). When simulating pairs of one-qubit
gates, our memory traversal paern is enhanced by cache blocking
via the recursive FFT-like algorithm in Figure 7. Recall that pairs of
one-qubit gates are applied to four amplitudes at a time, but those
four amplitudes are not contiguous in general. erefore, we load
an entire cache line for each, so that four aligned reads provide data
for applying two gates to four sets of amplitudes.
With data loaded in chunks (typically cache lines), we made a
concerted eort to leverage wide arithmetic operations from the
AVX-2 instruction set. Relevant CPU instructions accessed through
compiler intrinsics are listed in Table 1. In order to prepare registers
for AVX-2 arithmetics, 32-bit values may need to be shued using
several types of bit permutations. CPU cycles can be reduced by
optimizing data shues and by reducing the number of arithmetic
operations. e laer is facilitated by common sub-expression
elimination and matrix factorizations.
Example 5.1. Our AVX-2 kernel used to simulate paired X 1/2gates
is illustrated in Figure 8. e diagonal of the X 1/2⊗ X 1/2matrix (Sec-
tion 4.4) implies multiplication by 1 − i . Figure 8 shows how to
implement such multiplies with fast permutation and XOR instruc-
tions. In particular, _mm256_permute_ps and _mm256_xor_ps
execute in a single cycle on Intel CPUs, whereas multiplication
takes 3-5 cycles depending on the CPU. Without our method, the
four amplitudes would have to be multiplied by (1 − i) each and
then four more loads would be required to complete the add and
subtract operations with the original amplitudes. Our approach
saves at least twelve CPU cycles for a pair of X 1/2gates. e add and
subtract AVX-2 instructions in Figure 8 complete in 24 cycles.
Customizing (to each gate type) such permutations, arithmetic
instructions and read-write instructions is a very laborious process
that requires careful optimization and testing. In particular, we
have studied assembly code generated from our compiler intrinsics
to understand how to perform the same work with fewer CPU
cycles. ese eorts are justied by serious performance benets.
In addition to beer arithmetic performance, aligned memory reads
and writes improve memory bandwidth and help keep more CPU
threads supplied with data.
//-0.0f is needed to switch real and imaginary signs in the
xor operations due to multiplication by (1 - i) .
const m256 kneg1 = {-0.0f, 0.0f, -0.0f, 0.0f, -0.0f,
0.0f, -0.0f, 0.0f};
void ApplyXX12GateAVX(cmplx* __restrict w, // wave function
idx size target[4])
{
// temp wave function
float* __restrict t_w =
(float*)__builtin_assume_aligned(w, 64);
m256 a0 = _mm256_load_ps (t_w + (2*target[0])),
a1 = _mm256_load_ps (t_w + 2*target[1]),
a2 = _mm256_load_ps (t_w + 2*target[2]),
a3 = _mm256_load_ps (t_w + 2*target[3]);
const m256 t0 = _mm256_add_ps(a0, a3);
const m256 t1 = _mm256_add_ps(a1, a2);
m256 t2 = _mm256_sub_ps(a0, a3);
m256 t3 = _mm256_sub_ps(a1, a2);
// Permute real and imaginary numbers
t2 = _mm256_permute_ps(t2, 0b10110001);
t2 = _mm256_xor_ps(t2, kneg1);
t3 = _mm256_permute_ps(t3, 0b10110001);
t3 = _mm256_xor_ps(t3, kneg1);
a0 = _mm256_add_ps(t1, t2); a1 = _mm256_add_ps(t0, t3);
a2 = _mm256_sub_ps(t0, t3); a3 = _mm256_sub_ps(t1, t2);
_mm256_store_ps(t_w + 2*target[0], a0);
_mm256_store_ps(t_w + 2*target[1], a1);
_mm256_store_ps(t_w + 2*target[2], a2);
_mm256_store_ps(t_w + 2*target[3], a3);
}
Figure 8: Optimized AVX-2 code to apply the X 1/2 ⊗ X 1/2 gate
on four amplitudes loaded onto CPU registers. Code for the
Y
1/2 ⊗Y 1/2 gate is simpler due to its simpler matrix, as seen in
Section 4.4. Compiler intrinsics are explained in Table 1.
6 SIMULATION COMPARISONS
is work targets circuits that run on NISQ computers [2] and
are therefore limited in the number of qubits. Among such cir-
cuits, many well-known examples are fairly easy to simulate [8].
erefore, we focus on recent quantum-supremacy circuits from
Google [24] that were designed to ensure diculty of simulation
[25]. e average-case diculty of their simulation has since been
proven analytically [6], and the benchmarks have been revised [21,
arxiv:1807.10749] to remove unintended simulation shortcuts. On
the other hand, our Schro¨dinger simulator does not exploit some of
the well-known features of these benchmarks that simplify simula-
tion, such as their planar-grid qubit layout with nearest-neighbor
qubit couplings, or the knowledge of specic gate paerns. Table
2 shows characteristics of the benchmarks, including their large
T-gate counts, which defeat stabilizer-based simulation techniques.
9
Circuit Gates Microso QDK QISkit-Terra QASM Rollright Ratios QDK/RR Ratios QISkit/RR
depth all 2-q T time mem time mem mode time mem time mem time mem
1+26+1 s MiB s MiB s MiB
MacBook Pro 2017 — MacOS High Sierra: 16 GiB, Intel Core i7-7700HQ (2.80GHz) 4 cores 8 threads
16q 274 78 68 2.34 — 1.59 — S < 0.1 — — — — —
24q 417 123 99 16.64 463 9.87 176.34 S 0.88 128 18.91 3.63 11.22 1.38
25q 435 130 105 28.72 972 18.3 432.36 S 1.45 256 19.81 3.80 12.62 1.69
30q 524 161 119 — OOM — OOM S 58.8 8192 — — — —
Server — Ubuntu Linux: 144 GiB, Intel Xeon Platinum 8124M (3.00GHz) 18 cores, 72 threads
30q 524 161 119 1213.16 23959 212.25 15925 S 23.2 8192 52.29 2.92 9.15 1.94
30q 524 161 119 1213.16 23959 212.25 15925 S-F 0.301 27 4030.43 887.38 705.15 589.83
32q 560 168 168 — OOM 503.8 63922 S 139 32000 — — 3.62 2.00
32q 560 168 168 — OOM 503.8 63922 S-F 0.594 48 — — 917.67 1331.70
36q 633 195 144 — OOM — OOM S-F 109.37 192 — — — —
Table 2: Comparisons of our simulator Rollright to the simulator from Microso QDK v0.11.2006.403 and IBM QISkit-Terra
v0.5.7 benchmarks from Google (v2) [3, 24] with up to 36 qubits, performed on a laptop and a mid-range server. Memory usage
is the increase in max resident memory versus the baseline 16-qubit simulation, to exclude code-segment size.
6.1 Validation and basic performance
We implemented proposed algorithms in C++17 in a package called
Rollright and compiled the codeswith Clang v11.0.3. As described in
Section 5.3 we use AVX-2 instructions that are commonly available
on commodity and server CPUs. Results reported in this paper do
not use GPUs, but use SIMD and thread parallelism.
To ensure correctness of simulation results, we saved all am-
plitudes of nal states (for several circuits with 20-25 qubits) and
compared them to amplitudes produced by several industry and
academic simulators. For circuits with 30-36 qubits, we saved a
small number of amplitudes and cross-checked them against results
obtained by the authors of Google benchmarks independently.
Soware development and some of the experiments were per-
formed on a MacBook Pro 2017 laptop with 16 GiB RAM, where
our baseline Schro¨dinger simulation completes a 30-qubit circuit
with depth 1+26+1 in 72 s using a lile over 8 GiB of RAM (1+ and
+1 denote initial and nal layers of Hadamard gates as in Figure
6).6 On the laptop, we use a single CPU process with eight threads.
For simulations on a mid-range server with ample memory (Tables
2 and 3) we use multiple CPU processes (with eight threads each)
to leverage available hardware threads.
In terms of actual performance, Rollright spends negligible time
on I/O and circuit pre-processing, i.e., gate reordering and forming
clusters. Most of the runtime is spent simulating clusters of diagonal
and non-diagonal gates. For runtime proling, we distinguish non-
diagonal gates acting on the more and the less signicant qubits. To
illustrate, consider a 5 × 6-qubit Google circuit of depth 1 + 26 + 1,
where simulating X 1/2 and Y 1/2 gates on the more signicant qubits
took > 75% of runtime. CZ , T gates, along with remaining X 1/2, Y 1/2,
and H gates took only 14.4% runtime.
In addition to pure Schro¨dinger simulation, our techniques can
be used to accelerate layered simulation algorithms [18, 20, 21]
6e runtime of simulation with Rollright is consistent among dierent circuits of
similar size, which makes concerns about benchmark post-selection moot.
that handle a greater variety of circuits. For example, divide-and-
conquer algorithms leverage Schro¨dinger simulation of n = 32-
qubit blocks to simulate 2n = 64-qubit circuits [21]. Tables 2 and 3
show results for both pure Schro¨dinger and Schro¨dinger-Feynman
simulation [21] with our optimizations included.
6.2 Comparisons to Microso, IBM and Google
We compared our simulator to theMicrosoantumDevelopment
Kit (QDK) v0.2.1806.3001 and IBM QISKit-Terra v0.5.7. Microso
QDK includes a back-end circuit simulator and front-end support
for the Q# language, integrated with Microso Visual Studio and
available on Windows, MacOS and Linux. IBM QISkit includes
several simulators, and we found the QASM simulator to be faster
than the state-vector simulator on quantum-supremacy circuits
[3, 24]. We performed empirical comparisons to our simulator
Rollright on circuits of depth 1+26+1 with up to 36 qubits, as shown
in Table 2. To exclude code segments from memory comparisons,
we rst measured max resident memory for each simulator on the
16-qubit benchmark and then used thosemeasurements as baselines.
Dierences in memory usage are on the order of 2× and are mostly
explained by our use of single-precision oats. Rollright’s runtime
advantage is much greater and grows with the number of qubits.
Given that for 30-qubit circuits Microso and IBM simulators
required > 16 GiB memory available (Rollright used a lile over 8
GiB), we also used a multicore Linux server with sucient mem-
ory and observed that the Microso and IBM simulators used all
available threads. e optimizations proposed in this work apply to
both Schro¨dinger and Schro¨dinger-Feynman simulation, therefore
we evaluated Rollright in both modes. Clearly, the Schro¨dinger-
Feynman simulation oers a much greater advantage in both run-
time and memory on circuits of depth 1+26+1. e 32-qubit circuit
uses the oblong 4 × 8 qubit array, and the Schro¨dinger-Feynman
mode of our simulator is able to exploit this shape. erefore, we
also show results for Schro¨dinger-Feynman simulation on an even-
sided 6 × 6 qubit array. Our comparisons to soware from IBM and
Microso have been presented in person at these companies.
10
-1000
9000
19000
29000
24 26 28 30 32 34 36
M
EM
OR
Y 
(M
IB
)
QUBITS
RR S
RR S-F
MSFT QDK
IBM QASM
-20
30
80
130
24 26 28 30 32 34 36
TI
M
E 
(S
)
QUBITS
RR S
RR S-F
MSFT QDK
IBM QASM
-50
150
350
0 10 20 30 40 50 60 70 80
TI
M
E 
(S
)
DEPTH
RR S
RR S-F
MSFT QDK
Figure 9: Scalability simulations: Microso QDK (solid green line), IBM QISkit-Terra/QASM (black dashed line), as well
as our simulator Rollright in Schro¨dinger (red dot-dashed line) and Schro¨dinger-Feynman (solid blue line) modes. We
plot runtime and memory usage against qubit count and circuit depth. Circuit depth was varied for 30-qubit circuits.
We also compared our simulator to the Qsim simulator under de-
velopment at Google (https://github.com/quantumlib/qsim),
which we compiled using the same compiler we used for Rollright.
According to the authors, Qsim clusters one-qubit gates to nearby
two-qubit gates and uses AVX-2 instructions to simulate resulting
generic two-qubit gates one by one. Qsim lacks our optimizations
for diagonal and one-qubit gates, as well as the FFT-like algorithm
that optimizes memory access. Following our prior collaboration
with Google [21], Qsim supports the same simulation modes as Roll-
right — Schro¨dinger and Schro¨dinger-Feynman, — which facilitates
more detailed apples-to-apples comparisons. Runtimes in Table
3 (shared with Qsim authors) were collected on the same server
as for the lower half of Table 2. While Google Qsim outperforms
IBM QISkit and Microso QDK on Google benchmarks, Rollright
remains ahead, conrming the impact of our proposed methods.
6.3 Scalability studies and use models
e results in Table 2 show massive advantage of Schro¨dinger-
Feynman simulation, but pure Scro¨dinger simulation remains at-
tractive for deep quantum circuits, e.g., in quantum chemistry appli-
cations [10] and/or when supercomputing resources are available
[11–16, 18, 19]. 2num qubits and runtime as depth×2num qubits. Sim-
ulating 40-qubit circuits this way would require servers with >8
TiB (now available from Microso Azure and Amazon AWS). In
the Schro¨dinger-Feynman mode, memory usage can be kept low
(see details in [21]) by serializing the computation, but if massive
parallel resources are available, using greater peak memory can
decrease the latency of simulation. In the meantime, runtime grows
as 2num qubits/2+depth/C for a large C > 0. Figure 9 uses larger
supremacy benchmarks from Google to illustrate these dierences
5 × 5 q 6 × 5 q 8 × 4 q 6 × 6 q
S S-F S S-F S S-F S-F
Qsim 1.1 1.64 24.25 0.47 152 0.75 194.12
RR 0.77 1.26 23.20 0.30 139 0.59 109.37
ratio 1.43 1.30 1.05 1.23 1.09 1.27 1.77
Table 3: Server runtimes (s) of the Google QSim simulator
on Google v2 benchmarks [3, 24] used in Table 2, compared
to runtimes of our simulator Rollright (RR). RR runtimes
for 30-36 qubits match those in the lower half of Table 2.
between Schro¨dinger and Schro¨dinger-Feynman simulation by plot-
ting memory usage and runtime for varied qubit counts and circuit
depth. e linear runtime scaling of Schro¨dinger simulation vs.
circuit depth (regardless of gate typess) contrasts with the semi-
exponential scaling of Schro¨dinger-Feynman simulation.
While this paper focuses on single-server experiments, distributed
Schro¨dinger-Feynman simulations with Rollright using Google
Cloud [21] show that our methods are directly useful to simulate
boundeded-depth 56- and 64-qubit circuits. Alternatively, our tech-
niques can be directly used in supercomputer-based Schro¨dinger
simulations [19] that are not limited by circuit depth. To this, end
results in [20] show how to reduce the simulation of wide but shal-
low circuits (those with many qubits but low depth) to narrow but
deep circuits, where pure Schro¨dinger simulation does well.
7 CONCLUSIONS
Many near-term intermediate-scale quantum (NISQ) computers [2]
are currently operating with less than 64 functional qubits, includ-
ing a recent demonstration of quantum-computational supremacy
by Google [4] with 53 qubits. antum circuits that run on such
computers support a plethora of science experiments [10] and mo-
tivate circuit optimization tasks, most of which require support in
the form of simulation on conventional computers. For example,
recent developments in quantum chemistry oer synthesis methods
for quantum circuits that model molecular congurations and help
computing their energy levels. While these circuits are intended
for NISQ computers, quantum-circuit simulation on conventional
computers is crucial to develop and validate advanced quantum
technologies, such as quantum-on-quantum simulators [10].
Among the simulation algorithms available, this work focuses on
Schro¨dinger simulation that can be used independently or within
layered simulation algorithms [18, 20, 21] that handle a greater
variety of circuits. Our algorithmic optimizations collectively pro-
vide a hey speed-up over quantum simulators from Microso and
IBM on hard circuits from Google. is speedup is not limited by a
constant factor, but grows with the number of qubits.
Some of our optimizations are generic and some are tuned to
gates used by Google benchmarks [3, 24]. It should be clear that
other gates can be natively supported using our techniques and/or
by expressing them in terms of the gates we explored. While
our evaluation focuses on medium-size circuits which industry
11
simulators can handle, our contributions are directly useful when
simulating circuits with many more qubits as shown in [20, 21].
Our techniques can directly benet supercomputing simulations,
especially those using the same types of circuits and simulation
frameworks [19]. However, supercomputers do not support the
needs of frequent on-demand simulation during practical work with
quantum algorithms, quantum error-correcting code, and/or NISQ
hardware. For such uses, beer approaches include smaller-scale
Schro¨dinger simulation on single high-mem servers and distributed
Schro¨dinger-Feynman simulation in commercial clouds [21].
ACKNOWLEDGMENTS
We are thankful to Dmitri Maslov, Sergio Boixo and Sergei Isakov
for insightful comments.
REFERENCES
[1] Nielsen, M. A. & Chuang, I. L. antum Computation andantum Information
(10th Anniversary edition). Cambridge . Press (2016), 978-1-10-700217-3, 1-676.
[2] Preskill, J. antum computing and the entanglement frontier (2012). 25th
Solvay Conf.
[3] Boixo, S. et al. Characterizing quantum supremacy in near-term devices. Nat.
Phys. 14, 595 (2018). arXiv:1608.00263.
[4] Arute, F. & Arya, K. & Babbush, R. & et al antum supremacy using a
programmable superconducting processor. Nature 574, 505510 (2019) Nature
575, 505-510 (2019). arXiv:1910.11333.
[5] Aaronson, S. & Chen, L. Complexity-theoretic foundations of quantum
supremacy experiments. arXiv:1612.05903 (2016).
[6] Bouland, A., Feerman, B., Nirkhe, C. & Vazirani, U. On the complexity and
verication of quantum random circuit sampling. Nature Phys 15, 159163
(2019).
[7] Aaronson, S. & Goesman, D. Improved Simulation of Stabilizer Circuits. Phys.
Rev. Le. 70, 052328 (2004).
[8] Viamontes, G. F., Markov, I. L. & Hayes, J. P. antum circuit simulation. Springer
Science & Business Media, 2009.
[9] Markov, I.L. & Shi, Y. Simulatingantum Computation by Contracting Tensor
Networks.. SIAM J. Comput., 38(3), 963–981 (2008).
[10] Altman, E. et al. antum Simulators: Architectures and Opportunities.
arXiv:1912.06938 (2019).
[11] De Raedt, H. et al. Massively parallel quantum computer simulator. Computer
Physics Communications 176, 121–136 (2007).
[12] De Raedt, H. et al. Massively parallel quantum computer simulator, eleven years
later. arXiv:1805.04708 (2018).
[13] Wecker, D. & Svore, K., M. LIQUi—¿: A Soware Design Architecture and
Domain-Specic Language for antum Computing. arXiv:1402.4467 (2014).
[14] Ha¨ner, T. & Steiger, D. S. 0.5 Petabyte Simulation of a 45-bitantum Circuit.
arXiv:1704.01127 (2017).
[15] Smelyanskiy, M., Sawaya, N. P. D., qHiPSTER: eantum High Performance
Soware Testing Environment arXiv:1601.07195 2017
[16] Pednault, E. et al. Breaking the 49-qubit barrier in the simulation of quantum
circuits. arXiv:1710.05867 (2017).
[17] Villalonga, B. et al. Establishing theantum Supremacy Frontier with a 281
Pop/s Simulation. antum Science and Technology 5, 034003 (2020).
[18] Pednault, E et al. Leveraging Secondary Storage to Simulate Deep 54-qubit
Sycamore Circuits. arXiv:1910.09534 (2019).
[19] Li, R., Wu, B., Ying, M., Sun, X. & Yang, G. antum supremacy circuit simulation
on Sunway TaihuLight. arXiv:804.04797 (2018).
[20] Chen, M.-C. et al. antum-Teleportation-Inspired Algorithm for Sampling
Large Randomantum Circuits. arXiv:1901.05003. Phys. Rev. Le. 128, 080502
(2020).
[21] Markov, I.L. & Fatima, A. & Isakov, S. & Boixo, S. Massively parallel simulation
of hard quantum circuits. DAC 2020 arXiv:1807.10749 (2018).
[22] Ekert, A., Hayden, P. & Inamori, H. Basic concepts in quantum computation.
arXiv:quant-ph/00110137 (2000).
[23] Shi, Y. Both Tooli and Controlled-NOT need lile help to do universal quantum
computation. arXiv:quant-ph/0205115 (2002).
[24] Boixo, S. & Neill, C. e question of quantum supremacy. Google AI Blog (2018).
Benchmarks available on GitHub at hps://github.com/sboixo/GRCS.
[25] A Preview of Bristlecone, Googles Newantum Processor, Google AI Blog
(2018).
[26] Ogita, T. & Rump, S. M. & Oishi, S. Accurate sum and dot product. SIAM J. on
Scientic Computing 26(6), 1955-1988 (2005).
12
