QuEST and High Performance Simulation of Quantum Computers
Tyson Jones,1 Anna Brown,2 Ian Bush,2 and Simon Benjamin1
1Department of Materials, University of Oxford, Parks Road, Oxford OX1 3PH, United Kingdom
2Oxford e-Research Centre, Department of Engineering Science,
University of Oxford, Keble Road, Oxford OX1 3PH, United Kingdom
(Dated: December 5, 2018)
We introduce QuEST, the Quantum Exact Simulation Toolkit, and compare it to ProjectQ [1],
qHipster [2] and a recent distributed implementation [3] of Quantum++ [4]. QuEST is the first open
source, OpenMP and MPI hybridised, GPU accelerated simulator of universal quantum circuits.
Embodied as a C library, it is designed so that a user’s code can be deployed seamlessly to any
platform from a laptop to a supercomputer. QuEST is capable of simulating generic quantum
circuits of general single-qubit gates and multi-qubit controlled gates, on pure and mixed states,
represented as state-vectors and density matrices, and under the presence of decoherence. Using
the ARCUS Phase-B and ARCHER supercomputers, we benchmark QuEST’s simulation of random
circuits of up to 38 qubits, distributed over up to 2048 compute nodes, each with up to 24 cores. We
directly compare QuEST’s performance to ProjectQ’s on single machines, and discuss the differences
in distribution strategies of QuEST, qHipster and Quantum++. QuEST shows excellent scaling,
both strong and weak, on multicore and distributed architectures.
I. INTRODUCTION
Classical simulation of quantum computation is vital
for the study of new algorithms and architectures. As
experimental researchers move closer to realising quan-
tum computers of sufficient complexity to be useful, their
work must be guided by an understanding of what tasks
we can hope to perform. This in turn means we must
explore an algorithm’s scaling, its robustness versus er-
rors and imperfections, and the relevance of limitations
of the underlying hardware. Because of these requirements,
simulation tools are needed on many different classical
architectures; while a workstation may be sufficient
for the initial stages of examining an algorithm, further
study of scaling and robustness may require more powerful
computational resources. Flexible simulators of quantum
computers which support multiple platforms are therefore
essential.
Furthermore, it is important that these simulations are very
efficient, since they are often repeated many times, for example
to study the influence of many parameters, or the behaviour
of circuits under noise. But it is expensive to exactly
simulate a quantum system using a classical system,
since a high-dimensional complex vector must be main-
tained with high fidelity. Both the memory requirements,
and the time required to simulate an elementary circuit
operation, grow exponentially with the number of qubits.
A quantum computer of only 50 qubits is already too
large to be comprehensively simulated by our best classi-
cal computers [5], and is barely larger than the 49 qubit
computers in development by Intel and Google [6, 7].
To simulate quantum computers even of the size already
experimentally realised, it is necessary that a classical
simulator take full advantage of the performance optimisations
offered by high performance classical computing.
It is equally important that the research community
have access to an ecosystem of simulators. Verification
of complex simulations is a non-trivial task, one
that is much eased by having the facility to compare the
results of simulations performed by multiple packages.
The number of single compute node generic [8–10]
and specialised [11–14] simulators is rapidly growing.
However despite many reported distributed simulators
[2, 3, 15–20] and proposals for GPU accelerated simula-
tors [18, 21–24], QuEST is the first open source simulator
available to offer both facilities, and the only simulator to
offer support on all hardware platforms commonly used in
the classical simulation of quantum computation.
II. BACKGROUND
A. Target Platforms and Users
Simulations of quantum computation are performed on
a wide variety of classical computational platforms, from
standard laptops to the most powerful supercomputers in
the world, and on standard CPUs or on accelerators such
as GPUs. Which is most suitable for the simulation of a
given circuit will depend upon the algorithm being stud-
ied and the size of the quantum computer being modelled.
To date this has resulted in a number of simulators which
typically target one, or a small number, of these architec-
tures. While this leads to a very efficient exploitation of
a given architecture, it does mean that should a research
project need to move from one architecture to another,
for instance due to the need to simulate more qubits, a
different simulation tool is required. This may require a
complete rewrite of the simulation code, which is time
consuming and makes verification across platforms diffi-
cult. In this article we describe QuEST which runs ef-
ficiently on all architectures typically available to a re-
searcher, thus facilitating the seamless deployment of the
researcher’s code. This universal support also allows the
researcher to easily compare the performance of the dif-
ferent architectures available to them, and so pick that
most suitable for their needed simulations.
In the rest of this section we shall examine the nature
of the architectures that are available, cover briefly how
codes exploit them efficiently, and show how QuEST, the
universal simulator, compares with the more platform
specific implementations.
B. Simulator Optimisations
Classical simulators of quantum computation can make
good use of several performance optimisations.
For instance, the data parallel task of modifying the
state vector under a quantum operation can be sped up
with single-instruction-multiple-data (SIMD) execution.
SIMD instructions, like Intel’s advanced vector exten-
sions (AVX), operate on multiple operands held in vec-
tor registers to concurrently modify multiple array ele-
ments [25], like state vector amplitudes.
Task parallelism can be achieved through multithread-
ing, taking advantage of the multiple cores found in
modern CPUs. Multiple CPUs can cooperate through
a shared NUMA memory space, which simulators can
interface with through OpenMP [26].
Simulators can defer the expensive exchange of data
in a CPU’s last level cache (LLC) with main memory
through careful data access; a technique known as cache
blocking [27]. Quantum computing simulators can cache
block by combining sequential operations on adjacent
qubits before applying them, a technique referred to as
gate fusion [2, 18]. For instance, gates represented as ma-
trices can be fused by computing their tensor product.
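As a minimal sketch of the fusion idea (not ProjectQ's or qHipster's implementation), the example below fuses a Hadamard on qubit 1 and a π/8 gate on qubit 2 into one 4 × 4 matrix via the Kronecker product, and checks that a single sweep with the fused gate matches two separate sweeps. The helpers `kron` and `apply_gate` are hypothetical names written for this illustration, and qubit 0 is assumed to be the least significant bit of the amplitude index.

```python
import cmath
import math

def kron(a, b):
    """Kronecker (tensor) product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def apply_gate(state, gate, low_qubit, width):
    """Apply a (2**width)-square gate to `width` adjacent qubits starting at
    `low_qubit`, returning a new amplitude list."""
    dim = 1 << width
    mask = (dim - 1) << low_qubit
    out = list(state)
    for base in range(len(state)):
        if base & mask:
            continue  # visit each group once, via its index with target bits 0
        idx = [base | (t << low_qubit) for t in range(dim)]
        old = [state[i] for i in idx]
        for r in range(dim):
            out[idx[r]] = sum(gate[r][c] * old[c] for c in range(dim))
    return out

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]                           # Hadamard
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]  # pi/8 gate

num_qubits = 4
# Arbitrary (unnormalised) test amplitudes; linearity makes normalisation irrelevant.
state = [complex(k + 1, -k) for k in range(1 << num_qubits)]

# Unfused: two sweeps over the state vector.
unfused = apply_gate(apply_gate(state, H, 1, 1), T, 2, 1)

# Fused: one sweep. kron(T, H) puts T on the higher qubit (2) and H on the lower (1).
fused = apply_gate(state, kron(T, H), 1, 2)

fusion_error = max(abs(x - y) for x, y in zip(fused, unfused))
print(fusion_error)
```

The fused gate touches each amplitude once rather than twice, which is the source of the cache benefit; the price is a larger dense matrix per fused block.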
Machines on a network can communicate and coop-
erate through message passing. Simulators can parti-
tion the state vector and operations upon it between dis-
tributed machines, for example through MPI, to achieve
both parallelisation and greater aggregate memory. Such
networks are readily scalable, and are necessary for sim-
ulating many qubit circuits [18].
With the advent of general-purpose graphical process-
ing units (GPGPUs), the thousands of linked cores of a
GPU can work to parallelise scientific code. Simulators
can make use of NVIDIA’s compute unified device archi-
tecture (CUDA) to achieve massive speedup on cheap,
discrete hardware, when simulating circuits of a limited
size [23]. We mention too a recent proposal to utilise
multi-GPU nodes for highly parallel simulation of many
qubit quantum circuits [22].
1. Single node
ProjectQ is an open-source quantum computing frame-
work featuring a compiler targeting quantum hardware
and a C++ quantum computer simulator behind a
Python interface [1]. In this text, we review the performance
of its simulator, which supports AVX instructions,
employs OpenMP and cache blocking for efficient
parallelisation on single-node shared-memory systems, and
uses emulation to take computational shortcuts [28].
QuEST is a new open source simulator developed in
ISO standard conformant C [29], and released under the
open source MIT license. Both OpenMP and MPI based
parallelisation strategies are supported, and they may
be used together in a so-called hybrid strategy. This
provides seamless support for both single-node shared-
memory and distributed systems. QuEST also employs
CUDA for GPU acceleration, and offers the same in-
terface on single-node, distributed and GPU platforms.
Though QuEST does not use cache blocking or emulation,
we find QuEST performs as well as or better than
ProjectQ on multicore systems, and can use its additional
message-passing facilities for faster and bigger
simulations on distributed memory architectures.
ProjectQ offers a high-level Python interface, but
can therefore be difficult to install and run on super-
computing architectures, though containerisation may
make this process easier in future [30, 31]. Conversely,
QuEST is light-weight, stand-alone, and tailored for
high-performance resources; its low-level C interface can be
compiled directly to a native executable and run on
personal laptops and supercomputers.
Both QuEST and ProjectQ maintain a pure state in $2^n$
complex floating point numbers for a system of $n$ qubits,
with (by default) double precision in each real and imaginary
component; QuEST can otherwise be configured to
use single or quad precision. Both simulators store the
state in C/C++ primitives, and so (by default) consume
$16 \times 2^n$ B [32] in the state vector alone. However ProjectQ
incurs a ×1.5 memory overhead during state allocation,
and QuEST clones the state vector in distributed appli-
cations. Typical memory costs of both simulators on a
single thread are shown in Figure 1, which vary insignifi-
cantly from their multithreaded costs. While QuEST al-
lows direct read and write access to the state-vector, Pro-
jectQ’s single amplitude fetching has a Python overhead,
and writing is only supported in batch which is mem-
ory expensive due to Python objects consuming more
memory than a comparable C primitive - as much as
3× [33, 34]. Iterating the state-vector in ProjectQ is
therefore either very slow, or comes with an appreciable
memory cost.
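The memory figures quoted above follow from simple arithmetic, sketched below. The function name `state_vector_bytes` is hypothetical, and the sizes of 4, 8 and 16 bytes per real number for single, double and quad precision are an assumption about the platform rather than something stated in the text.

```python
GIB = 2 ** 30

def state_vector_bytes(num_qubits, bytes_per_real=8):
    """Bytes needed for 2**n complex amplitudes, each made of two reals."""
    return 2 * bytes_per_real * 2 ** num_qubits

for n in (20, 28, 30, 33):
    print(n, "qubits:", state_vector_bytes(n) / GIB, "GiB")

# The multiplicative overheads mentioned in the text, for a 30 qubit state:
projectq_allocation_bytes = 3 * state_vector_bytes(30) // 2  # x1.5 during allocation
quest_distributed_bytes = 2 * state_vector_bytes(30)         # cloned when distributed
```

Each additional qubit doubles the requirement, so a 30 qubit double-precision state vector occupies 16 GiB before any overhead.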
[Figure 1 plot: memory (1 MiB to 64 GiB) against number of qubits for ProjectQ, QuEST and the bare state vector; inset: factor of total memory over state-vector memory.]
FIG. 1. Memory consumption of QuEST’s C and ProjectQ’s
Python processes, as reported by Linux’s /proc/self/status
during random circuit simulation on a single 256 GiB ARCUS
Phase-B compute node. Full and dashed lines show the typical
and maximum usage respectively, while the gray dashed
line marks the memory required to store only the state-vector
(in double precision). The subplot shows the ratio of total
memory consumed to that by only the state-vector.
QuEST applies a single-qubit gate (a $2 \times 2$ matrix
$G$) on qubit $q$ of an $N$-qubit pure state-vector
$|\psi\rangle = \sum_{n=0}^{2^N-1} \alpha_n |n\rangle$,
represented as the complex vector $\vec{\alpha}$, by updating
vector elements
$$\begin{pmatrix} \alpha_{n_i} \\ \alpha_{n_i + 2^q} \end{pmatrix} \mapsto G \begin{pmatrix} \alpha_{n_i} \\ \alpha_{n_i + 2^q} \end{pmatrix} \qquad (1)$$
where $n_i = \lfloor i/2^q \rfloor \, 2^{q+1} + (i \bmod 2^q)$ for every integer
$i \in [0, 2^{N-1} - 1]$. This applies $G$ via $2^N$ computations of
$ab + cd$ for complex $a, b, c, d$ and avoids having to compute
and matrix-multiply a full $2^N \times 2^N$ unitary on the
state-vector. This has a straight-forward generalisation
to multi-control single-target gates, and lends itself to
parallelisation.
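The update rule of Eq. (1) can be sketched in a few lines. This is an illustrative serial Python version, not QuEST's C implementation; the function name `apply_single_qubit_gate` is hypothetical, and qubit 0 is assumed to be the least significant bit of the amplitude index.

```python
import math

def apply_single_qubit_gate(alpha, gate, q):
    """Apply the 2x2 matrix `gate` to qubit q of the amplitude list `alpha`, in place."""
    stride = 1 << q                       # 2^q, the distance between paired amplitudes
    for i in range(len(alpha) // 2):      # i runs over [0, 2^(N-1) - 1]
        n_i = (i // stride) * (2 * stride) + (i % stride)
        lo, hi = alpha[n_i], alpha[n_i + stride]
        alpha[n_i] = gate[0][0] * lo + gate[0][1] * hi
        alpha[n_i + stride] = gate[1][0] * lo + gate[1][1] * hi
    return alpha

s = 1 / math.sqrt(2)
hadamard = [[s, s], [s, -s]]

# Start in |000> and apply H to qubit 1: the amplitude splits between indices 0 and 2.
state = [0j] * 8
state[0] = 1 + 0j
apply_single_qubit_gate(state, hadamard, 1)
print(state)
```

Every iteration reads and writes a disjoint pair of amplitudes, so the loop over `i` can be divided between threads or processes without synchronisation inside a single gate.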
We leverage the same hardware-optimised code to enact
gates on $N$-qubit density matrices, by storing them
as $2N$-qubit state-vectors,
$$\rho = \sum_{j=0}^{2^N-1} \sum_{k=0}^{2^N-1} \alpha_{j,k} |j\rangle\langle k| \;\rightarrow\; \rho' = \sum_{n=0}^{2^{2N}-1} \alpha'_n |n\rangle . \qquad (2)$$
Here the object $\rho'$ does not, in general, respect the
constraint $\sum_n |\alpha'_n|^2 = 1$. An operation $G_q \rho G_q^\dagger$, that is a gate
on qubit $q$, can then be effected on $\rho'$ as $G^*_{q+N} G_q \rho'$, by
exploiting the Choi–Jamiolkowski isomorphism [35]. This
holds also for multi-qubit gates. The distribution of the
density matrix in this form lends itself well to the parallel
simulation of dephasing and depolarising noise channels.
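The identity behind Eq. (2) can be checked numerically on a small case. The sketch below assumes the index convention $n = j + k\,2^N$ (row index in the lower $N$ qubits), which is one consistent reading of the text rather than a documented detail of QuEST; all function names are hypothetical.

```python
def apply_single_qubit_gate(alpha, gate, q):
    """Apply a 2x2 `gate` to qubit q of the amplitude list `alpha`, in place."""
    stride = 1 << q
    for i in range(len(alpha) // 2):
        n_i = (i // stride) * (2 * stride) + (i % stride)
        lo, hi = alpha[n_i], alpha[n_i + stride]
        alpha[n_i] = gate[0][0] * lo + gate[0][1] * hi
        alpha[n_i + stride] = gate[1][0] * lo + gate[1][1] * hi
    return alpha

def outer(psi):
    """Density matrix |psi><psi| as a nested list."""
    return [[a * b.conjugate() for b in psi] for a in psi]

def flatten(rho):
    """Vectorise rho so that element (j, k) sits at index n = j + k * 2^N."""
    dim = len(rho)
    return [rho[j][k] for k in range(dim) for j in range(dim)]

N, q = 2, 1
G = [[0.5 + 0.5j, 0.5 - 0.5j], [0.5 - 0.5j, 0.5 + 0.5j]]  # a non-real unitary
G_conj = [[x.conjugate() for x in row] for row in G]

psi = [0.5 + 0j, 0.5j, -0.5 + 0j, 0.5 + 0j]  # normalised 2-qubit pure state

# Treat the N-qubit density matrix as a 2N-qubit vector and apply G_q, then G*_{q+N}.
rho_vec = flatten(outer(psi))
apply_single_qubit_gate(rho_vec, G, q)
apply_single_qubit_gate(rho_vec, G_conj, q + N)

# Reference: apply G to the pure state first, then form the density matrix.
expected = flatten(outer(apply_single_qubit_gate(list(psi), G, q)))
error = max(abs(a - b) for a, b in zip(rho_vec, expected))
trace = sum(rho_vec[j + j * 2 ** N] for j in range(2 ** N))
print(error, trace)
```

The two gate applications reuse the state-vector routine unchanged, which is the point of the representation: no separate density-matrix kernel is needed for unitary gates.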
2. Distributed
How simulators partition the state vector between pro-
cesses and communicate over the network is key to their
performance on distributed memory architectures. All
simulators we have found so far employ a simple parti-
tioning scheme; the memory to represent a state vector
is split equally between all processes holding that vector.
A common strategy to then evaluate a circuit is to pair
nodes such that upon applying a single qubit gate, every
process must send and receive the entirety of its portion
of the state vector to its paired process [2, 3, 17].
The number of communications between paired processes,
the amount of data sent in each and the additional
memory incurred on the compute nodes form a
tradeoff. A small number of long messages will ensure
that the communications are bandwidth limited, which
leads to best performance in the communications layer.
However this results in a significant memory overhead,
due to the process having to store buffers for both the
data it is sending and receiving, and in an application
area so memory hungry as quantum circuit simulation
this may limit the size of circuit that can be studied.
On the other hand many short messages will minimise
the memory overhead as the message buffers are small,
but will lead to message latency limited performance as
the bandwidth of the network fabric will not be saturated.
This in turn leads to poor parallel scaling, and
hence again limits the size of the circuit under consideration,
but now due to time limitations. Note that the
memory overhead is at most a factor 2, which due to the
exponential scaling of the memory requirements, means
only 1 less qubit may be studied. Some communication
strategies and their memory overheads are visualised in
Figure 2.
[Figure 2 plots: (top) cloned, overhead and free memory for ×2, ×1.5 and ×1.25 strategies, fitting 31 or 32 qubits; (bottom) number of qubits − k against memory (GiB) on each of $2^k$ nodes, for factors 1.25, 1.5 and 2.]
FIG. 2. An illustration of strategies to distribute the state
vector between two 64 GiB nodes. A complete cloning (×2
memory) of the partition on each node is wasteful. Half the
partition can be cloned, at the cost of twice as many MPI messages,
to fit another qubit into memory [17]. Further division
requires more communication for less memory overhead [2].
The bottom plot shows the maximum number of qubits which
can fit on $2^k$ nodes of varying memory, assuming a 50 MiB
overhead per node.
QuEST partitions the state vector equally between the
processes within the job, and the message passing be-
tween the process pairs is so organised as to absolutely
minimise the number of communications during the oper-
ation of a single gate. Thus parallel performance should
be good, but there will be a significant memory overhead;
in practice a factor of 2 as described above. For $n$ qubits
distributed over $2^k$ nodes, these communications occur
when operating on qubits with index $\geq n - k$, indexing
from 0.
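The condition above can be illustrated with a short sketch. It assumes the equal partition is contiguous, so that the node of rank r holds amplitude indices r·2^(n−k) to (r+1)·2^(n−k) − 1; under that assumption the partner amplitude at distance 2^q lives on the rank obtained by flipping one bit of r. The helper names are hypothetical and this is not QuEST's source code.

```python
def needs_communication(qubit, num_qubits, log2_nodes):
    """True when paired amplitudes for `qubit` lie on different nodes."""
    return qubit >= num_qubits - log2_nodes

def pair_rank(rank, qubit, num_qubits, log2_nodes):
    """Rank holding the partner amplitudes, assuming a contiguous equal partition."""
    local_qubits = num_qubits - log2_nodes
    if qubit < local_qubits:
        return rank  # both amplitudes of every pair are local
    return rank ^ (1 << (qubit - local_qubits))

# 34 qubits on 16 = 2^4 nodes: qubits 0-29 are local, qubits 30-33 need messages.
n, k = 34, 4
for qubit in (29, 30, 33):
    print(qubit, needs_communication(qubit, n, k), pair_rank(5, qubit, n, k))
```

With this layout each node exchanges data with exactly one other node per gate, which matches the pairwise exchange pattern described in the text.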
An alternative strategy is to clone, send and receive
only half of each node’s data in two exchanges [17], in-
curring instead a 1.5× memory cost. This often leaves
room to simulate an additional qubit, made clear in Fig-
ure 2. This strategy can be recursed further to reduce the
memory overhead even more, and negligible additional
memory cost can be achieved by communicating every
amplitude separately as in [3], though this comes at a
significant communication cost, since a message passing
pattern is latency dominated and will exhibit poor scal-
ing with process count. However an improvement made
possible by having two exchanges is to overlap the com-
munication of the first message with the computation on
the second half of the state vector, an optimisation im-
plemented in qHipster [2]. This depends on the network
effectively supporting asynchronous communications.
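The effect of the memory factor on the largest simulable system can be estimated as below. This is a rough sketch under stated assumptions: 16 bytes per amplitude (double precision), the 50 MiB per-node overhead used in Figure 2, and a node count that is a power of two. The function name `max_qubits` is hypothetical.

```python
GIB, MIB = 2 ** 30, 2 ** 20

def max_qubits(node_mem_bytes, log2_nodes, memory_factor,
               overhead_bytes=50 * MIB, bytes_per_amp=16):
    """Largest qubit count whose partition, scaled by `memory_factor`
    to account for communication buffers, fits on each of 2**log2_nodes nodes."""
    usable = node_mem_bytes - overhead_bytes
    max_local_amps = int(usable / (memory_factor * bytes_per_amp))
    local_qubits = max_local_amps.bit_length() - 1  # floor(log2(max_local_amps))
    return local_qubits + log2_nodes

for factor in (2, 1.5, 1.25):
    print(factor, max_qubits(64 * GIB, 1, factor), max_qubits(64 * GIB, 8, factor))
```

Because the partition size must be a power of two, reducing the factor from 2 to 1.5 is already enough to gain the extra qubit on 64 GiB nodes; going lower only helps for node memories that sit just below a threshold.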
We also mention recent strategies for further reduc-
ing network traffic by optimising the simulated cir-
cuit through gate fusion, state reordering [2, 18] and
rescheduling operations [18], though opportunities for
such optimisations may be limited.
In terms of the functionality implemented in the simu-
lation packages we note that while qHipster is limited to
single and two-qubit controlled gates, QuEST addition-
ally allows the distributed operation of any-qubit con-
trolled gates.
3. GPU
Though MPI distribution can be used for scalable
parallelisation, networks are expensive and are overkill
for deep circuits of few qubits. Simulations limited
to 29 qubits can fit into a 12 GB GPU which of-
fers high parallelisation at low cost. In our testing,
QuEST running on a single Tesla K40m GPU (retailing currently
for ∼3.6 k USD) outperforms 8 distributed 12-core
Xeon E5-2697 v2 series processors, currently retailing at
∼21 k USD total, ignoring the cost of the network.
QuEST is the first available simulator of both state-
vectors and density matrices which can run on a CUDA
enabled GPU, offering speedups of ∼5× over already
highly-parallelised 24-threaded single-node simulation.
We mention QCGPU [36] which is a recent GPU-
accelerated single-node simulator being developed with
Python and OpenCL, and Quantumsim, a CUDA-based
simulator of density matrices [37].
4. Multi-platform
QuEST is the only simulator which supports all of
the above classical architectures. A simulation written
in QuEST can be immediately deployed to all environ-
ments, from a laptop to a national-grade supercomputer,
performing well at all simulation scales. We list the
facilities supported by other state-of-the-art simulators in
Table I.
[Figure 3: circuit diagram on five qubits, each initialised in |0〉 and acted on by H, followed by T, X^{1/2} and Y^{1/2} gates.]
FIG. 3. An example of a depth 10 random circuit on 5 qubits,
of the linear topology described in [38]. This diagram was
generated using ProjectQ’s circuit drawer.
C. Algorithm
We compare QuEST and ProjectQ performing simulations
of universal pseudo-random quantum circuits of
varying depth and number of qubits. A random circuit
contains a random sequence of gates, in our case with
gates from the universal set $\{H, T, C(Z), X^{1/2}, Y^{1/2}\}$.
These are the Hadamard, π/8, controlled-phase and root
Pauli X and Y gates. Being computationally hard to sim-
ulate, random circuits are a natural algorithm for bench-
marking simulators [38]. We generate our random cir-
cuits by the algorithm in [38], which fixes the topology for
a given depth and number of qubits, though randomises
the sequence of single qubit gates. An example is shown
in Figure 3. The total number of gates (single plus con-
trol) goes like O(nd) for an n qubit, depth d random
circuit, and the ratio of single to control gates is mostly
fixed at 1.2± 0.2, so we treat these gates as equal in our
runtime averaging. For measure, a depth 100 circuit of 30
qubits features 1020 single qubit gates and 967 controlled
phase gates.
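For concreteness, the gate set can be written out as matrices. The text does not give explicit forms, so the definitions below are an assumption: in particular the root gates are taken as the principal square roots of Pauli X and Y, and other references may differ by a global phase. The helper names are hypothetical.

```python
import cmath
import math

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def close(a, b, tol=1e-12):
    n = len(a)
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(n) for j in range(n))

def is_unitary(a):
    n = len(a)
    identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return close(matmul(a, dagger(a)), identity)

s = 1 / math.sqrt(2)
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
H = [[s, s], [s, -s]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]
SQRT_X = [[0.5 + 0.5j, 0.5 - 0.5j], [0.5 - 0.5j, 0.5 + 0.5j]]
SQRT_Y = [[0.5 + 0.5j, -0.5 - 0.5j], [0.5 + 0.5j, 0.5 + 0.5j]]

GATE_SET = {"H": H, "T": T, "C(Z)": CZ, "X^1/2": SQRT_X, "Y^1/2": SQRT_Y}
for name, gate in GATE_SET.items():
    print(name, "unitary:", is_unitary(gate))
```

Squaring `SQRT_X` and `SQRT_Y` recovers X and Y, which is a quick sanity check on whichever phase convention is adopted.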
Though we here treat performance in simulating a ran-
dom circuit as an indication of the general performance
of the simulator, we acknowledge that specialised simula-
tors may achieve better performance on particular classes
of circuits. For example, ProjectQ can utilise topologi-
cal optimisation or classical emulation to shortcut the
operation of particular subcircuits, such as the quantum
Fourier transform [28].
We additionally study QuEST’s communication effi-
ciency by measuring the time to perform single qubit
rotations on distributed hardware.
III. SETUP
A. Hardware
We evaluate the performance of QuEST and ProjectQ
using Oxford’s computing facilities, specifically the AR-
CUS Phase-B supercomputer, and the UK National Su-
percomputing facility ARCHER.
QuEST and ProjectQ are compared on single nodes
with 1-16 threads, on ARCUS Phase-B with nodes of
64, 128 and 256 GiB memory (simulating 1-31, 32 and
33 qubits respectively), each with two 8-core Intel Xeon
E5-2640 V3 processors and a collective last level cache
(LLC) size of 41 MB between two NUMA banks. We furthermore
benchmark QuEST on ARCUS Phase-B Tesla
K40m GPU nodes, which with 12 GB global memory over
2880 CUDA cores, can simulate up to 29 qubit circuits.
Simulator multithreaded distributed GPU accelerated stand-alone density matrices
QuEST X X X X X
qHipster X X X
Quantum++ X ?
Quantumsim X X
QCGPU X X
ProjectQ X
TABLE I. A comparison of the facilities offered by some publicly available, state-of-the-art simulators. Note the distributed
adaptation of Quantum++ [3] is not currently publicly available. Here, density matrices refers to the ability to precisely
represent mixed states.
QuEST and ProjectQ are also compared on ARCHER,
a CRAY XC30 supercomputer. ARCHER contains both
64 and 128 GiB compute nodes, each with two 12-core
Intel Xeon E5-2697 v2 series processors linked by two
QuickPath Interconnects, and a collective LLC of 61 MB
between two NUMA banks. Thus a single node is ca-
pable of simulating up to 32 qubits with 24 threads.
We furthermore evaluate the scalability of QuEST when
distributed over up to 2048 ARCHER compute nodes,
linked by a Cray Aries interconnect, which supports an
MPI latency of ∼ 1.4± 0.1µs and a bisection bandwidth
of 19 TB/s.
B. Software
1. Installation
On ARCUS Phase-B, we compile both single-node
QuEST v0.10.0 and ProjectQ’s C++ backend with GCC
5.3.0, which supports OpenMP 4.0 [26] for parallelisation
among threads. For GPU use, QuEST-GPU v0.6.0 is
compiled with NVIDIA CUDA 8.0. ProjectQ v0.3.5
is run with Python 3.5.4, inside an Anaconda 4.3.8
environment.
On ARCHER, ProjectQ v0.3.6 is compiled with GCC
5.3.0, and run in Python 3.5.3 inside an Anaconda 4.0.6
environment.
QuEST is compiled with ICC 17.0.0 which supports
OpenMP 4.5 [26], and is distributed with the MPICH3
implementation of the MPI 3.0 standard, optimised for
the Aries interconnect.
2. Configuration
We attempt to optimise ProjectQ when simulating
many qubits by enabling gate fusion only for multi-
threaded simulations [31].
from projectq import MainEngine
from projectq.backends import Simulator
MainEngine(
    backend=Simulator(
        gate_fusion=(threads > 1)))
We found that ProjectQ’s multithreaded simulation of
few qubit random circuits can be improved by disabling
all compiler engines, to reduce futile time spent optimis-
ing the circuit in Python.
MainEngine(
    backend=Simulator(gate_fusion=True),
    engine_list=[])
However, this disables ProjectQ’s ability to perform
classical emulation and gate decomposition, and so is not
explored in our benchmarking. We studied ProjectQ’s
performance for different combinations of compiler en-
gines, number of gates considered in local optimisation
and having gate fusion enabled, and found the above con-
figurations gave the best performance for random circuits
on our tested hardware.
Our benchmarking measures the runtime of strictly
the code responsible for simulating the sequence of gates,
and excludes the time spent allocating the state vector,
instantiating or freeing objects or other one-time over-
heads.
In ProjectQ, this looks like:
# prepare the simulator
sim = Simulator(gate_fusion=(threads > 1))
engine = MainEngine(backend=sim)
qubits = engine.allocate_qureg(num_qubits)
engine.flush()
# ensure we're in the 0 state
sim.set_wavefunction([1], qubits)
sim.collapse_wavefunction(
    qubits, [0]*num_qubits)
engine.flush()
# start timing, perform circuit
# ensure cache is empty
engine.flush()
sim._simulator.run()
# stop timing
and in QuEST:
// prepare the simulator
QuESTEnv env = createQuESTEnv();
Qureg qubits = createQureg(num_qubits, env);
// ensure we're in the 0 state
initZeroState(&qubits);
// start timing, perform circuit
// ensure distributed work finishes
syncQuESTEnv(env);
// stop timing
IV. RESULTS
A. Single Node Performance
The runtime performance of QuEST and ProjectQ,
presented in Figure 4, varies with the architecture on
which they are run, and the system size they simulate.
Anomalous slowdown of ProjectQ at 22 qubits may be
explained by the LLC becoming full, due to its use of
cache blocking through gate fusion [31].
For fewer than ∼22 qubits, ProjectQ’s Python over-
head is several orders of magnitude slower than QuEST’s
C overhead, independent of circuit depth. The Python
overhead can be reduced by disabling some simulation fa-
cilities - see Section III B 2. For larger systems, the time
spent in ProjectQ’s C++ backend operating on the state
vector dominates total runtime, and the time per gate
of both simulators grows exponentially with increasing
number of qubits.
On a single ARCUS-B thread, ProjectQ becomes twice
as fast as QuEST, attributable to its sophisticated circuit
evaluation. However, these optimisations appear to scale
poorly; QuEST outperforms ProjectQ on 16 threads on
ARCUS-B, and on ARCHER both simulation packages
are equally fast on 24 threads. This is made explicit
in the strong scaling over threads shown in Figure 5,
which reveals ProjectQ’s scaling is not monotonic. Performance
suffers with the introduction of more than 8
threads, though is restored at 16.
[Figure 4 plots: time per gate (s) against number of qubits for ProjectQ and QuEST, in three panels (ARCUS, ARCUS, ARCHER), each with a speedup inset.]
FIG. 4. Comparison of QuEST and ProjectQ when simulating
random circuits over 1, 16 (on ARCUS Phase-B) and 24 (on
ARCHER) threads (top to bottom). Coloured lines indicate
the mean, with shaded regions indicating a standard deviation
either side, over a total of ∼77 k simulations of varying
depth. Vertical dashed lines indicate the maximum number
of qubits for which the entire state vector fits into the LLC.
The speedup subgraphs show the ratio of ProjectQ to QuEST
runtime.
[Figure 5 plot: speedup against number of threads (1 to 16) for ProjectQ and QuEST, with optimal scaling shown for reference.]
FIG. 5. Single-node strong scaling achieved when parallelising
(through OpenMP) 30 qubit random circuits across
a varying number of threads on a 16-CPU ARCUS Phase-B
compute node. Solid lines and shaded regions indicate the
mean and a standard deviation either side (respectively) of
∼7 k simulations of circuit depths between 10 and 100.
[Figure 6 plot: time per gate (s) against number of qubits for 24 threads and ≈3k CUDA cores, with a speedup inset.]
FIG. 6. QuEST’s single-node performance using multithreading
and GPU acceleration to parallelise random circuit simulations.
The subplot shows the speedup (ratio of runtimes)
that a GPU of 2880 CUDA cores on ARCUS Phase-B achieves
against 24 threads on ARCHER.
We demonstrate QuEST’s utilisation of a GPU for
highly parallelised simulation in Figure 6, achieving
a speedup of ∼5× over QuEST and ProjectQ on 24
threads.
B. Distributed Performance
Strong scaling of QuEST simulating a 30 and 38 qubit
random circuit, distributed over 1 to 2048 ARCHER
nodes, is shown in Figure 7. In all cases one MPI process
per node was employed, each with 24 threads. Recall that
QuEST’s communication strategy involves cloning the
state vector partition stored on each node. The 30 qubit
(38 qubit) simulations therefore demand 32 GiB (8 TiB)
memory (excluding overhead), and require at least 1 node
(256 nodes), whereas qHipster’s strategy would fit a 31
qubit (39 qubit) simulation on the same hardware [2].
[Figure 7 plot: speedup against number of compute nodes (1 to $2^{11}$) for 30 and 38 qubit circuits, with optimal scaling shown for reference.]
FIG. 7. QuEST multinode strong scaling when distributing
(through MPI) a depth 100 (depth 10) random circuit simulation
of 30 qubits (38 qubits) across many 24-thread 64 GiB
ARCHER nodes.
[Figure 8 plot: time (s) against qubit position (26 to 37) for 34 to 38 simulated qubits, with two slowdown subplots.]
FIG. 8. QuEST multinode weak scaling of a single qubit rotation,
distributed on {16, 32, 64, 128, 256} ARCHER nodes
respectively, each with 24 threads between two sockets and
64 GiB of memory. Communication occurs for qubits at positions
≥ 30, indexing from 0. Time to rotate qubits at positions
0-25 are similar to those of 26-29 and are omitted. The
bottom subplot shows the slowdown caused by communication,
while the top subplot shows the slowdown of rotating
the final (communicated) qubit as the total number of qubits
simulated increases from 34.
Communication cost is shown in Figure 8 as the time
to rotate a single qubit when using just enough nodes to
store the state-vector; the size of the partition on each
node is constant for increasing nodes and qubits. QuEST
shows excellent weak scaling, and moving from 34 to 37
simulated qubits slows QuEST by a mere ≈ 9%. It is in-
teresting to note that the equivalent results for qHipster
show a slowdown of ≈ 148% [2], but this is almost cer-
tainly a reflection of the different network used in gener-
ating those results, rather than in any inherent weakness
in qHipster itself. QuEST and qHipster show comparable
$\sim 10^{1}$ slowdown when operating on qubits which require
communication against operations on qubits which do
not (shown in the bottom subplot of Figure 8). Though
such slowdown is also network dependent, it is significantly
smaller than the $\sim 10^{6}$ slowdown reported by the
Quantum++ adaptation on smaller systems [3], and re-
flects a more efficient communication strategy. We will
discuss these network and other hardware dependencies
further in future work, and also intend to examine qHip-
ster on ARCHER so a true like with like comparison with
QuEST can be made.
V. SUMMARY
This paper introduced QuEST, a new high perfor-
mance open source framework for simulating universal
quantum computers. We demonstrated QuEST shows
good strong scaling over OpenMP threads, competitive
with a state of the art single-node simulator ProjectQ
when performing multithreaded simulations of random
circuits. We furthermore parallelised QuEST on a GPU
for a 5× speedup over a 24 threaded simulation, and a
40× speedup over single threaded simulation. QuEST
also supports distributed memory architectures via message
passing with MPI, and we have shown QuEST to have
excellent strong and weak scaling over multiple nodes.
This behaviour has been demonstrated for up to 2048
nodes and has been used to simulate a 38 qubit random
circuit. Despite its relative simplicity, we found QuEST’s
communication strategy yields comparable performance
to qHipster’s, and strongly outperforms the distributed
adaptation of Quantum++. QuEST can be downloaded
from Reference [39].
VI. ACKNOWLEDGEMENTS
The authors would like to thank Mihai Duta as a con-
tributor to QuEST, and the ProjectQ team for helpful
advice in configuring ProjectQ on ARCUS Phase-B and
ARCHER. The authors are grateful to the NVIDIA corporation
for their donation of a Quadro P6000 to further
the development of QuEST. The authors also acknowledge
the use of the University of Oxford Advanced
Research Computing (ARC) facility
(http://dx.doi.org/10.5281/zenodo.22558) and the ARCHER UK
National Supercomputing Service (http://www.archer.ac.uk)
in carrying out this work. TJ thanks the Clarendon
Fund for their support. SCB acknowledges EPSRC
grant EP/M013243/1, and further acknowledges
US funding with the following statement: The research
is based upon work supported by the Office of the Di-
rector of National Intelligence (ODNI), Intelligence Ad-
vanced Research Projects Activity (IARPA), via the U.S.
Army Research Office Grant No. W911NF-16-1-0070.
The views and conclusions contained herein are those of
the authors and should not be interpreted as necessar-
ily representing the official policies or endorsements, ei-
ther expressed or implied, of the ODNI, IARPA, or the
U.S. Government. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental pur-
poses notwithstanding any copyright annotation thereon.
Any opinions, findings, and conclusions or recommenda-
tions expressed in this material are those of the author(s)
and do not necessarily reflect the view of the U.S. Army
Research Office.
[1] Damian S. Steiger, Thomas Häner, and Matthias Troyer,
“ProjectQ: an open source software framework for
quantum computing,” Quantum 2, 49 (2018).
[2] Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and
Alán Aspuru-Guzik, “qHiPSTER: The quantum high
performance software testing environment,” (2016),
arXiv:1601.07195.
[3] Ryan LaRose, “Distributed memory techniques for
classical simulation of quantum circuits,” (2018),
arXiv:1801.01037.
[4] Vlad Gheorghiu, “Quantum++ - a C++11 quantum
computing library,” (2014), arXiv:1412.4704.
[5] Edwin Pednault, John A. Gunnels, Giacomo Nannicini,
Lior Horesh, Thomas Magerlein, Edgar Solomonik,
and Robert Wisnieff, “Breaking the 49-qubit bar-
rier in the simulation of quantum circuits,” (2017),
arXiv:1710.05867.
[6] Brian Krzanich, “How data is shaping innovation of the
future,” (Presented at the 2018 Consumer Electronics
Show, Consumer Technology Association, Las Vegas,
2018).
[7] R. Courtland, “Google aims for quantum computing
supremacy,” IEEE Spectrum 54, 9–10 (2017).
[8] Dave Wecker and Krysta M. Svore, “LIQUi|⟩: a software
design architecture and domain-specific language
for quantum computing,” (2014).
[9] Robert S. Smith, Michael J. Curtis, and William J.
Zeng, “A practical quantum instruction set architecture,”
(2016), arXiv:1608.03355.
[10] Keith Heston, Den Delimarsky, Alan Geller, and Dave
Wecker, “The Q# programming language,” (2017).
[11] Alwin Zulehner and Robert Wille, “Advanced simulation
of quantum computations,” (2017), arXiv:1707.00865.
[12] E. Schuyler Fried, Nicolas P. D. Sawaya, Yudong Cao,
Ian D. Kivlichan, Jhonathan Romero, and Alán
Aspuru-Guzik, “qTorch: The quantum tensor contraction
handler,” (2017), arXiv:1709.03636.
[13] Sergey Bravyi and David Gosset, “Improved classical
simulation of quantum circuits dominated by Clifford
gates,” Phys. Rev. Lett. 116, 250501 (2016).
[14] Axel Dahlberg and Stephanie Wehner, “SimulaQron -
a simulator for developing quantum internet software,”
Quantum Science and Technology 4, 015001 (2019).
[15] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu,
G. Arnold, M. Richter, Th. Lippert, H. Watanabe, and
N. Ito, “Massively parallel quantum computer simula-
tor,” Computer Physics Communications 176, 121–136
(2007).
[16] Jumpei Niwa, Keiji Matsumoto, and Hiroshi Imai,
“General-purpose parallel simulator for quantum com-
puting,” Phys. Rev. A 66, 062317 (2002).
[17] D. B. Trieu, Large-scale simulations of error-prone
quantum computation devices, Dr. (Univ.), Univ. Diss.
Wuppertal, Jülich (2009).
[18] Thomas Häner and Damian S. Steiger, “0.5 petabyte
simulation of a 45-qubit quantum circuit,” in Proceedings of
the International Conference for High Performance
Computing, Networking, Storage and Analysis, SC ’17 (ACM,
New York, NY, USA, 2017) pp. 33:1–33:10.
[19] N. Khammassi, I. Ashraf, X. Fu, C. G. Almudever, and
K. Bertels, “QX: A high-performance quantum computer
simulation platform,” in Design, Automation & Test in
Europe Conference & Exhibition (DATE), 2017 (2017) pp.
464–469.
[20] Zhaoyun Chen, Qi Zhou, Cheng Xue, Xia Yang, Guang-
can Guo, and Guoping Guo, “64-qubit quantum circuit
simulation,” (2018), arXiv:1802.06952.
[21] A. Amariutei and S. Caraiman, “Parallel quantum com-
puter simulation on the GPU,” in 15th International
Conference on System Theory, Control and Computing
(2011) pp. 1–6.
[22] Pei Zhang, Jiabin Yuan, and Xiangwen Lu, “Quan-
tum computer simulation on multi-GPU incorporating
data locality,” in Algorithms and Architectures for Par-
allel Processing, edited by Guojun Wang, Albert Zomaya,
Gregorio Martinez, and Kenli Li (Springer International
Publishing, Cham, 2015) pp. 241–256.
[23] Eladio Gutiérrez, Sergio Romero, María A. Trenas, and
Emilio L. Zapata, “Quantum computer simulation using
the CUDA programming model,” Computer Physics
Communications 181, 283–300 (2010).
[24] I. Savran, M. Demirci, and A. H. Yilmaz, “Accelerating
Shor’s factorization algorithm on GPUs,” (2018),
arXiv:1801.01434.
[25] Chris Lomont, “Introduction to Intel advanced vector ex-
tensions. Intel white paper,” (2011).
[26] “OpenMP compilers & tools,”
http://www.openmp.org/resources/openmp-compilers/ (2016),
accessed: 2018-02-14.
[27] Monica D. Lam, Edward E. Rothberg, and Michael E.
Wolf, “The cache performance and optimizations of
blocked algorithms,” SIGARCH Comput. Archit. News
19, 63–74 (1991).
[28] Thomas Häner, Damian S Steiger, Krysta Svore, and
Matthias Troyer, “A software methodology for compiling
quantum programs,” Quantum Science and Technology
(2018).
[29] “International standard - programming languages - C
ISO/IEC 9899:1999,”
http://www.open-std.org/jtc1/sc22/wg14/www/standards (1999).
[30] Tyson Jones, “Installing ProjectQ on supercomputers,”
https://qtechtheory.org/resources/installing_projectq_on_supercomputers
(2018), accessed 30-5-2018.
[31] Thomas Häner, private communication (2017).
[32] “Fundamental types, C++ language reference,
Microsoft Developer Network,”
https://msdn.microsoft.com/en-us/library/cc953fe1.aspx,
accessed: 2018-2-05.
[33] Python Software Foundation, “Data model, the
Python language reference,”
https://docs.python.org/3/reference/datamodel.html (2018),
accessed 20-5-2018.
[34] Laboratoire d’Informatique des Systèmes Adaptatifs,
“Python memory management,”
http://deeplearning.net/software/theano/tutorial/python-memory-management.html
(2017), accessed 20-5-2018.
[35] Man-Duen Choi, “Completely positive linear maps on
complex matrices,” Linear algebra and its applications
10, 285–290 (1975).
[36] Adam Kelly, “Simulating quantum computers using
OpenCL,” (2018), arXiv:1805.00988.
[37] Brian Tarasinski, Viacheslav Ostroukh, and Thomas
O’Brien, “Project title,”
https://github.com/charlespwd/project-title (2013).
[38] Sergio Boixo, Sergei V. Isakov, Vadim N. Smelyanskiy,
Ryan Babbush, Nan Ding, Zhang Jiang, Michael J.
Bremner, John M. Martinis, and Hartmut Neven, “Char-
acterizing quantum supremacy in near-term devices,”
(2016), arXiv:1608.00263.
[39] “QuEST: The quantum exact simulation toolkit,”
(2018).
