Practical Quantum Computing: The value of local computation by Cruise, James R. et al.
Practical Quantum Computing:
The value of local computation∗
J.R. Cruise,† N.I. Gillespie,‡ and B. Reid§
Riverlane, St Andrews House, 59 St Andrews Street, Cambridge CB2 3BZ, UK
(Dated: September 21, 2020)
As we enter the era of useful quantum computers we need to better understand the limitations of classical
support hardware, and develop mitigation techniques to ensure effective qubit utilisation. In this paper we
discuss three key bottlenecks in near-term quantum computers: bandwidth restrictions arising from data transfer
between central processing units (CPUs) and quantum processing units (QPUs), latency delays in the hardware
for round-trip communication, and timing restrictions driven by high error rates. In each case we consider a
near-term quantum algorithm to highlight the bottleneck: randomised benchmarking to showcase bandwidth
limitations, adaptive noisy, intermediate scale quantum (NISQ)-era algorithms for the latency bottleneck and
quantum error correction techniques to highlight the restrictions imposed by qubit error rates. In all three cases
we discuss how these bottlenecks arise in the current paradigm of executing all the classical computation on
the CPU, and how these can be mitigated by providing access to local classical computational resources in the
QPU.
I. INTRODUCTION
Over the past two decades quantum computing has gained
momentum as an area of theoretical research and experimen-
tal implementation. The first major milestone was reached
late in 2019 with the report of so-called quantum supremacy
by a research group at Google [1]. The prevailing expecta-
tion is that within the next five to ten years a quantum device
will exist that can perform a useful computational task that
is intractable for state-of-the-art classical high performance
computers (HPCs). However, this device is likely to be highly
sensitive to noise and of limited scope in its applications.
Current incarnations of quantum computers, like those of
IBM [2] and Rigetti [3], employ a ‘black-box’ model: quan-
tum programs are written on a central processing unit (CPU)
and forwarded to the quantum hardware to be implemented
blindly. This construction prevents developers having direct
access to the internal components of the computer, and as
such the only location for intermediate classical computation
is on the user’s CPU. Physically there are a number of inter-
mediate platforms between the CPU and qubits that can com-
municate classical information with minimal disturbance, as
displayed in Fig. 1. Most widespread in current hardware is
the use of field programmable gate arrays (FPGAs), which
sit physically close to the qubits and provide commands lo-
cally. This device’s role is to control the analog hardware that
directly interfaces with the qubits, for example, the lasers re-
sponsible for gate implementation. These components are un-
available to algorithm developers. We believe that providing
access to these computational units will allow the maximum
utilisation of noisy, intermediate scale quantum (NISQ) devices.
NISQ-era hardware is characterised by relatively short coherence times
∗ This work is licensed under a Creative Commons Attribution 4.0 In-
ternational license (CC BY 4.0) https://creativecommons.org/
licenses/by/4.0/
† james.cruise@riverlane.com
‡ neil.gillespie@riverlane.com
§ brendan.reid@riverlane.com
FIG. 1: Pictorial representation of the computational stack of
a quantum computer.
and, consequently, the need for classical hardware to supple-
ment calculations. This hybrid quantum-classical interplay
is necessary to demonstrate a computational advantage over
purely classical means. However, the latency between hard-
ware components causes qubit utilisation to suffer and a corre-
sponding degradation of wall-clock time. If it were possible to
program directly onto every component in the computational
hierarchy certain latency dividends would not need to be paid.
We call this ‘local computation’: employing control protocols
on local components relative to the quantum hardware. Practi-
cally, this involves identifying aspects of the desired algorithm
which could be moved away from the CPU to lower levels in
the stack.
In this paper we highlight three challenges in near-term hy-
brid quantum technology which could benefit from employing
local control protocols. These are: a bandwidth bottleneck
due to the limited communication capacity between the CPU
and quantum processing unit (QPU); a latency bottleneck due
arXiv:2009.08513v1 [quant-ph] 17 Sep 2020
to the round trip delay between the CPU and QPU; and the
qubit error rate due to the short coherence time of qubits. We
propose that one possible method for overcoming these chal-
lenges is to utilise latent computational power. By identifying
simple calculations that need to be performed regularly (such
as a parameter update or logical while loop) these can be as-
signed to hardware components physically close to the qubits:
the remaining difficulty is that this task is non-trivial and be-
yond the skill set of a typical programmer. While we will
not propose a concrete pathway to solving these issues, we
aim to convince the reader that developing a more hardware-
conscious methodology is a priority for the immediate future
of quantum computation.
A. Hardware Stack
Whilst quantum computers utilise a variety of different
technologies and implementations, they follow a similar struc-
tural design. This begins with the actual qubits at the bottom
of a stack, continues through the hardware used to physically
interact with the qubits (qubit I/O), onto the computational
units used to control the physical hardware, and on top a clas-
sical computer that we refer to as the CPU (see Fig. 1). This
stack can be seen when visiting hardware laboratories work-
ing on a diverse range of qubit technologies, including trapped
ions, superconducting qubits, and semiconductor qubits.
We will focus on the intermediate hardware between the
QPU and CPU. As mentioned, the most common hardware
used to interface between the CPU and QPU is the FPGA. However,
there are alternative programmable platforms that may
become more popular in the future. While the important fea-
ture of this hardware is its programmability, equally important
is its physical connection to both the CPU and QPU. Currently
there are a number of technologies that fill this role within the
hardware stack, including Ethernet, universal serial bus (USB)
and peripheral component interconnect express (PCIe). Each
technology has benefits and limitations both in regards to ca-
pacity and ease of use. For example, Google has demonstrated
the use of Ethernet for this connection within a cloud comput-
ing setting in their recent work on optimising variational quan-
tum eigensolver (VQE) implementations [4]. In comparison
many laboratories are using the PCIe lanes in commercially
available motherboards.
The two metrics we will be concerned with in this paper are
the communication latency (the delay between sending and
receiving a message across the link) and the available band-
width (the rate at which data can be sent between the two
devices). For communication latency, we will report results
based on a timing of 100µs for one-way communication – we
consider this a realistically achievable value without substan-
tial engineering overhead, though we also include plots which
highlight the effect of latency so other values can be easily
considered. For bandwidth, we make no assumptions but will
report requirements for the communication channel to prevent
a degradation in performance. This will more clearly allow
practitioners to understand where the bottlenecks will form
within their system.
In terms of physical qubits there are a number of competing
technologies; here we will focus on timings for superconducting
qubits and trapped ions. The reason for this is that superconducting
qubits demonstrate relatively fast gate times but
short-lived coherence whereas, in comparison, ion traps have
substantially longer coherence accompanied with very slow
gate implementation. The majority of other qubit technologies
lie between these two extremes. As we will focus on quanti-
fying overall performance we will consider only gate times
rather than the details of qubit implementation. Throughout
this work we consider gate and measurement times to be 10µs
and 750µs respectively for trapped ion qubits [5], and 120ns
and 120ns for superconducting qubits [6]. Note that these are
relatively long and we expect these times to reduce as further
research is conducted.
B. Computational Challenges
To highlight the issues of carrying out all classical com-
putation on the CPU we consider a toy example – a while
loop. The loop runs over a circuit execution and measurement,
counting the number of times the value 0 is returned and stop-
ping when we have reached a certain count (see Fig. 2). This
example might seem overly simple, but quantum expectation
estimation to a high confidence will be similar to this setup.
In this toy example a message will have to be passed
to the CPU between each circuit execution to decide if an-
other run is required or not. As mentioned above this will
add a delay of approximately 200µs for each circuit evalua-
tion. The time required for a measurement to be generated
will be O(1)µs, resulting in the qubits sitting idle for 99% of
the computation time waiting for the CPU to issue commands.
Even for trapped ion qubits, where the circuit execution time
is likely to be on the order of 800µs, this will still lead to
the qubits sitting idle for approximately 20% of time. If the
update and checking process were completed locally on an
FPGA or similar device, qubit utilisation would be increased
substantially.
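The idle fractions quoted above can be reproduced with a small simulation of the toy loop. This is a minimal sketch under stated assumptions: the function name `run_while_loop`, the pseudo-random stand-in for a measurement, and the choice to charge one full round trip per shot are illustrative and not taken from any hardware API. Setting the round trip to zero models a decision taken locally, ignoring the (small) time of the local update itself.

```python
import random


def run_while_loop(target_zeros, p_zero, circuit_us, round_trip_us, seed=0):
    """Simulate the toy loop: run circuit + measurement until `target_zeros`
    outcomes of 0 have been seen.

    Returns (shots, total_time_us, idle_fraction), where the qubits are
    counted as idle while waiting for the continue/stop decision.
    """
    rng = random.Random(seed)
    zeros = 0
    shots = 0
    while zeros < target_zeros:
        shots += 1
        if rng.random() < p_zero:  # stand-in for a real measurement outcome
            zeros += 1
    busy_us = shots * circuit_us
    idle_us = shots * round_trip_us  # one decision (round trip) per shot
    total_us = busy_us + idle_us
    return shots, total_us, idle_us / total_us


# Superconducting-like timing: ~1 us per shot, 200 us round trip to the CPU.
print(run_while_loop(100, 0.5, circuit_us=1.0, round_trip_us=200.0))
# Trapped-ion-like timing: ~800 us per shot, 200 us round trip to the CPU.
print(run_while_loop(100, 0.5, circuit_us=800.0, round_trip_us=200.0))
# Decision taken locally: no round trip is charged in this sketch.
print(run_while_loop(100, 0.5, circuit_us=1.0, round_trip_us=0.0))
```

The idle fraction depends only on the ratio of round-trip time to circuit time, so it is the same for any stopping count.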
Randomised benchmarking [7–10] is a method to provide
summary statistics about inherent noise levels in quantum de-
vices, for example, average gate fidelity, which is key for
understanding device performance. Zero noise extrapolation
[11–14] is an error mitigation scheme. Both algorithms need
to run a large number of random circuits and record the re-
sults. If circuit generation occurs on the CPU, rather than lo-
cally relative to the qubits, a substantial amount of data must
be transferred to prevent low qubit utilisation from lack of cir-
cuits to execute. We discuss this in greater detail in Section
II.
There has been a growing interest in adaptive refinements
to quantum algorithms in order to better utilise quantum de-
vices. In these procedures the circuit to be implemented and
relevant statistics are updated between circuit executions as
in the while loop example. For instance, accelerated VQE
(AVQE) [15] introduces an adaptive quantum subroutine to
outperform VQE in energy minimisation. Section III goes into
FIG. 2: Simple while loop which stops only when n measurements of 0 have been completed. Under the black-box model, the
green boxes will be completed on the QPU and the orange on the CPU.
detail of how this subroutine operates. We note that AVQE is
not unique in modifying VQE for improved performance us-
ing adaptive routines; work in the same spirit was recently
released by Zapata Computing [16].
Implementing error correcting strategies in quantum de-
vices, both in the near term [17] and when lower physical er-
ror rates have been achieved [18], is critical. In order to avoid
an exponential growth of noise, errors must be dealt with fre-
quently. Any correction protocol needs to be fast enough to
keep up with the error rate otherwise a data bottleneck is cre-
ated, which can result in the loss of any quantum advantage
[17]. See Section IV for further details on quantum error cor-
rection.
Finally, in Section V we provide some concluding remarks
and comments on providing developers with access to local
computation.
II. RANDOMISED BENCHMARKING AND ZERO NOISE
EXTRAPOLATION
In the NISQ-era of quantum computing one of the key chal-
lenges will come from noise inherent in the quantum com-
puter. To maximise the utility of these computers we will need
to both quantify the level of noise within the device and de-
velop noise mitigation strategies. Both of these are key tasks
but they are difficult to achieve in a scalable manner, often
growing exponentially in computational complexity with the
number of qubits [19]. In this section we will discuss two pro-
tocols which have gained substantial traction within the quan-
tum computing community as their computational complex-
ity scales more favourably with the number of qubits: ran-
domised benchmarking [7, 8, 10, 20] for noise quantification
and zero noise extrapolation [11–14] for noise mitigation. In
both cases we highlight how the algorithms naturally lead to a
bandwidth bottleneck which can be removed through the use
of local computation.
Randomised benchmarking was initially proposed as an al-
ternative to quantum tomography [19] as a methodology to
quantify the quality of the computational capabilities of quan-
tum computers. Unlike quantum tomography, randomised
benchmarking does not provide a full specification of the ac-
tion of various gates but instead obtains estimates of sum-
mary statistics, for example, average gate fidelity. In exchange
for obtaining lower quality information, we obtain a proto-
col with substantially improved scaling as we increase the
number of qubits and that is robust to state preparation and
measurement (SPAM) errors. Randomised benchmarking has
gained traction within the experimental quantum computing
community and has been the focus of a concerted research
effort. This has included exploring adaptive algorithmic improvements
[9, 21, 22], different gate sets [23, 24] and error
models [25, 26], generalisations to capture further details of
quantum computing performance [27, 28], and further theo-
retical understanding of algorithmic performance [9, 10, 22].
Zero noise extrapolation [11, 14] is a class of algorithms
and techniques to mitigate noise in near term quantum com-
puters. For details of alternative strategies see [29–31]. In
zero noise extrapolation we take the counter-intuitive step of
actively increasing the noise in the calculation and record-
ing the outcome over a large range of noise levels. This
data is then used to extrapolate back to the noiseless situa-
tion which is, in general, unobtainable from direct calculation.
This works under the assumption that the calculation degrades
smoothly and constantly as the noise level increases, so the
degradation at higher noise levels can shed light on the per-
formance at low noise levels. This was initially proposed by
[11] using Richardson extrapolation. Later work has explored
alternative methods for increasing the noise level [11, 13, 32],
alternative extrapolation methods [12, 14, 31] and the devel-
opment of an adaptive framework [14].
A key step in both algorithms is data collection, see Sect.
II A, which involves running a large number of different cir-
cuits on the qubits in question and recording the results. If
the circuit generation occurs on a CPU, rather than local to
the qubits, a substantial amount of data must be transferred
FIG. 3: Diagrammatic representation of algorithms (Randomised Benchmarking or Zero Noise Extrapolation)
between the two devices and at speed to prevent low qubit
utilisation, see Sect. II B 1. This transfer causes a substan-
tial bandwidth bottleneck between the CPU and the QPU.
This bottleneck can be removed by moving the circuit gen-
eration process to the QPU’s local computation hardware, see
Sect. II C 1, therefore substantially reducing the communica-
tion overhead between the CPU and QPU and increasing qubit
utilisation.
Furthermore, recent developments have introduced adap-
tive alternatives for both randomised benchmarking [9, 21]
and zero noise extrapolation [14]. Adaptive algorithms in gen-
eral suffer from poor performance due to high latency between
the CPU and QPU. In this section we will briefly touch on the
use of local computation to mitigate this poor performance,
Sect. II C 2, but this will be dealt with more fully in the section
discussing the Accelerated Variational Quantum Eigensolver,
Sect. III.
A. Algorithm Description
Both algorithms in their simplest form follow two steps:
1. Data collection: Independent identical realizations of
random circuits, indexed by a parameter m, are applied
to a prepared input state and the measurement outcome
recorded. Data collection is carried out over a range of
values of m.
2. Parameter estimation: A function of the control pa-
rameter m is fitted to the estimated proportion of mea-
surements returning a given state from the data collec-
tion. The parameters of the fitted function provide esti-
mates for the key statistics of interest. For randomised
benchmarking we estimate the average gate error rate
and for zero noise extrapolation we estimate the true
value of the function.
See Fig. 3 for a diagrammatic representation of the pro-
cess: the data collection including random circuit generation
and evaluation on qubits, then parameter estimation includ-
ing function fitting and parameter extraction. Initially, we will
consider these two steps run in a sequential fashion where the
parameter estimation step does not influence the data collec-
tion, i.e. the number of samples at each m level is predeter-
mined. In Sect. II A 3 we will discuss recent developments of
adaptive parameter estimation schemes.
1. Randomised benchmarking
Randomised benchmarking is used to estimate the average
gate fidelity for a given set of qubits through the application of
a sequence of random gates designed to implement the iden-
tity operation if evaluated on an error free set of qubits.
Data Collection: For a given circuit depth $m$ we generate
a sequence of $m$ random Clifford gates, $C_1, C_2, \ldots, C_m$, on the
$n$ qubits we wish to benchmark. The random Clifford gates
are sampled independently from the set of all Clifford gates
on $n$ qubits, see [33, 34] for an example scheme to do this.
The inverse gate of the composition of these $m$ gates is then
calculated: $C_0 = (C_1 C_2 \cdots C_m)^{-1}$. The circuit to be evaluated
is then the sequential application of these $m + 1$ gates to
an input state and then measurement in the computational basis,
with the goal of comparing the initial and final state. In
most circumstances the input state will be the all zero state
$|0\rangle^{\otimes n} \equiv |00\ldots0\rangle$. See Fig. 4 for a diagrammatic example of
such a circuit. This process is repeated for each $m$ value a
large number of times, producing a number of independent
samples, and then across a large range of $m$ values.
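The data collection step can be sketched for the single-qubit case, where the Clifford group (up to global phase) is small enough to enumerate. This is an illustration under assumptions: the group is built by closure over the H and S gates, the helper names (`canonical`, `clifford_group`, `rb_sequence`) are hypothetical, and brute-force enumeration does not scale to $n$ qubits, for which the sampling schemes cited in the text would be needed. Gates are stored in the order they are applied, so the inverse is taken of the matrix product with the first gate right-most.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)


def canonical(u):
    """Strip the global phase: make the first non-zero entry real and positive."""
    flat = u.flatten()
    first = flat[np.argmax(np.abs(flat) > 1e-9)]
    return u * (abs(first) / first)


def key(u):
    """Hashable fingerprint of a gate, identical for gates equal up to phase."""
    rounded = np.round(canonical(u), 6) + 0.0  # "+ 0.0" normalises -0.0
    return tuple(rounded.flatten().tolist())


def clifford_group():
    """Enumerate the single-qubit Clifford group by closure over H and S."""
    identity = np.eye(2, dtype=complex)
    group = {key(identity): identity}
    frontier = [identity]
    while frontier:
        u = frontier.pop()
        for g in (H, S):
            v = canonical(g @ u)
            k = key(v)
            if k not in group:
                group[k] = v
                frontier.append(v)
    return list(group.values())


def rb_sequence(m, rng, group):
    """Return m uniformly sampled Cliffords followed by their cumulative inverse."""
    gates = [group[rng.integers(len(group))] for _ in range(m)]
    total = np.eye(2, dtype=complex)
    for g in gates:  # first gate applied first, so it ends up right-most
        total = g @ total
    return gates + [total.conj().T]  # inverse of a unitary is its adjoint


group = clifford_group()
sequence = rb_sequence(10, np.random.default_rng(1), group)
state = np.array([1, 0], dtype=complex)  # |0>
for gate in sequence:
    state = gate @ state
print(len(group), abs(state[0]) ** 2)  # group size and noiseless survival probability
```

On an error-free qubit the survival probability is one for every depth; on hardware, the shortfall from one is what the parameter estimation step fits.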
Initial descriptions of randomised benchmarking were
based on samples from the full unitary group rather than re-
stricting to only the Clifford group [35, 36], as the aim was to
estimate the average gate fidelity over the entire gate set rather
than just the Clifford subset. The discovery that the Clifford
group forms a 2-design for the unitary group [35] means that
restricting to the Clifford gate set still provides an estimate of
the average fidelity over the whole unitary group. As such,
we will discuss randomised benchmarking protocols that con-
sider the Clifford group only.
Parameter estimation: Now we want to use the acquired
data to estimate the average gate error rate and by extension
the average gate fidelity, a key indicator of qubit performance.
We will assume that gate errors are independent and homoge-
neous in time, which is often referred to as the 0th order model.
More complicated models have been developed, for details see
FIG. 4: Example Circuit for Randomised benchmarking,
$C_1, C_2, \ldots, C_{m-1}, C_m$ are random Clifford gates and $C_0$ is the
cumulative inverse of $C_1 C_2 \cdots C_{m-1} C_m$.
FIG. 5: Example of Randomised benchmarking data and
function fitting taken from [7]
[36]. Using this assumption, we want to fit a model to the estimated
survival probabilities, $\bar{p}_m$, the proportion of circuit runs
returning the originally prepared state after $m$ random Clifford
gates and the associated cumulative inverse.
Under the time and gate independence assumption it can be
shown that the application of $m$ independent random Clifford
gates and their cumulative inverse is equivalent to applying $m$
independent copies of the depolarizing channel to the initial
state, each with depolarizing parameter $p$. Therefore the probability
of recovering the original state is $A + B p^m$, a geometric decay in $m$,
where $A$ and $B$ allow for state preparation and measurement errors.
The average gate error rate follows from the decay parameter as
$r = (1 - p)(d - 1)/d$, with $d = 2^n$. Hence we fit this model to the estimated
survival probabilities, normally by fitting a linear equation to
$\log(\bar{p}_m)$ (valid once the offset $A$ is negligible or has been subtracted)
and using the estimated slope to calculate the average
error probability. An example of this for real data can be seen
in Fig. 5 from [7].
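A minimal sketch of the fitting step, assuming the offset $A$ is known in advance (here 0.5, an illustrative value) so that a straight line can be fitted to $\log(\bar{p}_m - A)$. The data are synthetic and noiseless, and the function names are hypothetical. The conversion from the fitted decay parameter $p$ to an error rate, $r = (1 - p)(d - 1)/d$ with $d = 2^n$, is the convention commonly used in the randomised benchmarking literature and is treated as an assumption of this sketch.

```python
import numpy as np


def fit_rb_decay(depths, survival, offset):
    """Fit survival ~ offset + B * p**m by linear regression on log(survival - offset).

    Returns the decay parameter p and the amplitude B.
    """
    depths = np.asarray(depths, dtype=float)
    y = np.log(np.asarray(survival, dtype=float) - offset)
    slope, intercept = np.polyfit(depths, y, 1)  # y = m * log(p) + log(B)
    return np.exp(slope), np.exp(intercept)


def error_rate(p, n_qubits=1):
    """Average error rate from the decay parameter (assumed convention)."""
    d = 2 ** n_qubits
    return (1 - p) * (d - 1) / d


# Synthetic survival probabilities with A = 0.5, B = 0.5, p = 0.99.
depths = [1, 5, 10, 20, 50, 100]
survival = [0.5 + 0.5 * 0.99 ** m for m in depths]
p_fit, b_fit = fit_rb_decay(depths, survival, offset=0.5)
print(p_fit, b_fit, error_rate(p_fit))
```

With real data the offset is usually a fit parameter as well, in which case a non-linear least squares fit of all three parameters would replace the log-linear regression used here.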
2. Zero noise extrapolation
In zero noise extrapolation, we carry out the calculation of
interest across a wide range of noise levels by artificially in-
creasing the noise and using the estimated values to extrapo-
late back to the case with zero noise.
Data Collection: Given a circuit to estimate an expectation
value E, we want to implement this circuit at a wide range
of noise levels. There are a number of ways to artificially increase
the noise level including unitary folding and parameter
noise scaling [14]. For unitary folding, see Fig. 6, we consider
the circuit to be a series of $n$ unitary layers $L_1, L_2, \ldots, L_n$,
and between each layer there are identity operations consisting
of a random unitary gate and its inverse, $I_i = U_i U_i^\dagger$. The
noise level is increased through the addition of extra identity
blocks between layers. In parameter noise scaling, for every
gate parameterized by $\theta$, we now implement the gate using a
perturbed value, $\theta' = \theta + X$ where $X$ is randomly generated
and $X \sim N(\mu = 0, \sigma^2)$. Here the level of artificial noise is
controlled by the variance of the perturbation.
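Both noise-scaling methods can be sketched on plain matrices. The code below is an illustration under assumptions: layers are represented as NumPy unitaries, random unitaries are drawn with the commonly used QR-based recipe, one group of identity blocks is placed after every layer, and all function names are hypothetical. On a noiseless simulator the folded circuit implements the same unitary as the original; on hardware the extra gates add noise.

```python
import numpy as np


def random_unitary(dim, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))  # rescale each column by a unit-modulus phase


def fold(layers, blocks_per_gap, rng):
    """Insert `blocks_per_gap` identity blocks (U then its adjoint) after each layer."""
    dim = layers[0].shape[0]
    folded = []
    for layer in layers:
        folded.append(layer)
        for _ in range(blocks_per_gap):
            u = random_unitary(dim, rng)
            folded.extend([u, u.conj().T])
    return folded


def compose(gates):
    """Overall unitary of a gate list, with the first gate applied first."""
    total = np.eye(gates[0].shape[0], dtype=complex)
    for g in gates:
        total = g @ total
    return total


def perturb(theta, sigma, rng):
    """Parameter noise scaling: theta' = theta + X with X ~ N(0, sigma^2)."""
    return theta + rng.normal(0.0, sigma)


rng = np.random.default_rng(0)
layers = [random_unitary(2, rng) for _ in range(3)]
folded = fold(layers, blocks_per_gap=2, rng=rng)
print(len(layers), len(folded), np.allclose(compose(folded), compose(layers)))
print(perturb(0.3, sigma=0.05, rng=rng))
```

Each call to `fold` or `perturb` draws fresh random values, which matches the requirement that every sample at a given noise level uses newly sampled unitaries or perturbations.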
In both scenarios we carry out the expectation estimation
across a range of noise levels, treating the inherent noise of
the quantum computer as noise level one. For each noise
level we execute a large number of independent identical sam-
ples of the circuit, i.e. for unitary folding this would be with
newly sampled independent unitary gates or for parameter
noise scaling resampling the perturbation for each parameter
θ. The results of the samples are then used to produce empirical
estimates of the expectation values, $\bar{E}(\lambda)$, where $\lambda$ is the
noise level, which are then used in the parameter estimation
step.
FIG. 6: Example Unitary Folding Circuit, $L_1, L_2, \ldots, L_n$ are
layers of gates implementing the original circuit and
$I_1, I_2, \ldots, I_n$ are identity blocks such that $I_i = U_i U_i^\dagger$ where $U_i$
are randomly sampled unitary gates.
Parameter estimation: Parameter estimation in zero noise
extrapolation aims to fit a function to the expectation estimates,
$\bar{E}(\lambda)$, across a range of noise levels $\lambda \geq 1$. This fitted
function can then be used to extrapolate to $\lambda = 0$ and therefore
estimate $E(\lambda = 0)$, the value of the expectation without noise. The
hope is that the true expectation value will be smooth as we
vary the noise parameter and there will not be a substantial
change in behaviour for $\lambda < 1$.
In general, there is no a priori class of functions to fit to
the data as it is unknown how the noise will affect the expectation
value. A number of general schemes have been used,
including Richardson extrapolation [11], polynomial, exponential
and linear regression [14, 31].
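The extrapolation step can be sketched with a polynomial fit evaluated at $\lambda = 0$. This is an illustration on synthetic, noiseless values; the function name and the quadratic test curve are assumptions. When the polynomial degree equals the number of noise levels minus one, the fit interpolates the data exactly, which is the polynomial-interpolation form in which Richardson extrapolation is usually stated for this setting.

```python
import numpy as np


def extrapolate_to_zero(noise_levels, estimates, degree):
    """Fit a polynomial of the given degree and evaluate it at zero noise."""
    coeffs = np.polyfit(noise_levels, estimates, degree)
    return np.polyval(coeffs, 0.0)


# Synthetic expectation values following E(lambda) = 1 - 0.1*lambda + 0.01*lambda^2.
levels = np.array([1.0, 2.0, 3.0])
values = 1.0 - 0.1 * levels + 0.01 * levels ** 2

print(extrapolate_to_zero(levels, values, degree=2))  # interpolating quadratic
print(extrapolate_to_zero(levels, values, degree=1))  # linear regression
```

The two results differ because the linear model cannot capture the curvature of the synthetic data, which illustrates why the choice of fitting function matters when the true noise dependence is unknown.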
3. Adaptive modifications
In the discussion of the algorithms, we split them into two
distinct steps: data collection and parameter estimation. We
complete the data collection process before carrying out the
parameter estimations. This means that we use a pre-defined
data collection strategy: specifying the parameter range to
consider and the number of samples to be collected for each
value in that range.
Unlike most statistical settings, there is no need for this artificial
separation; instead, the data collection strategy can be
updated in an adaptive manner to maximise the obtained in-
formation given the current parameter estimates. In this case,
we would determine at each circuit execution what value of
parameter to use for sampling to maximise the acquired information
or minimise the total computational resources required
to obtain a given accuracy. This leads us to develop
the sampling schedule in an online fashion, as we collect the
data and carry out an update step between shots. For example,
such adaptive algorithms can be seen in [9, 21] for randomised
benchmarking and [14] for zero noise extrapolation.
4. Repeated circuit evaluations
In the described data collection steps for both randomised
benchmarking, Sect. II A 1, and zero noise extrapolation,
Sect. II A 2, for each circuit execution a new random circuit
was generated. However, the noise in the estimation comes
from two sources: sampling from random circuits and the in-
herent shot noise. Therefore, an alternative strategy is to ex-
ecute each random circuit a number of times rather than just
once, hopefully lowering the number of random circuits that
need to be created. Unfortunately, this increases the total samples
(hence number of circuit evaluations) required to obtain
the same accuracy. See Fig. 7 for the number of samples required
to obtain the same accuracy in the toy example described below.
Consider the following toy model: let $X \sim \mathrm{Bernoulli}(P)$,
where $P$ is a random variable such that $E(P) = \mu$ and $\mathrm{Var}(P) = \sigma^2$.
Here $E(X) = \mu$ is the expectation we want to estimate,
and $P$ represents the noise between circuits. We acquire samples
of $X$ and we want to estimate the true value of $\mu$. We
now consider two schemes. In the first scheme, we produce $k$
independent samples of $P$, $\{P_i\}_{i=1,\ldots,k}$, and then for each sample
$P_i$ we produce $l$ independent samples of $X|P_i$, $\{X_{i,j}\}_{j=1,\ldots,l}$. In
the second scheme, we instead produce $n = l \times k$ independent
samples of $X$, the same number of samples as in the previous
scheme, but for each sample of $X$ we use a new sample of $P$.
It should be noted that the first scheme reuses the random circuits,
while the second scheme produces a new random circuit
for each shot. In both cases we estimate $\mu$ by the empirical
mean of the $lk$ samples of $X$, $\bar{\mu} = \frac{1}{lk} \sum_i X_i$.
To compare the two schemes we are interested in the variance
of the estimator. For the first sampling scheme we have
that
$$\mathrm{Var}(\bar{\mu}_1) = \frac{1}{lk}\left[\mu - \left(\mu^2 + \sigma^2\right)\right] + \frac{\sigma^2}{k}.$$
For the second sampling scheme we have
$$\mathrm{Var}(\bar{\mu}_2) = \frac{1}{lk}\left[\mu - \left(\mu^2 + \sigma^2\right)\right] + \frac{\sigma^2}{lk} = \frac{\mu - \mu^2}{n}.$$
Therefore, we have $\mathrm{Var}(\bar{\mu}_1) \geq \mathrm{Var}(\bar{\mu}_2)$ with equality only if $\sigma = 0$
or $l = 1$. We achieve a smaller variance if a new random
circuit is obtained for each shot rather than reusing circuits.
FIG. 7: Number of samples required to obtain an accuracy of
0.01 for µ = 0.5, the true value, across a range of σ2 values,
the between circuit noise.
To further highlight this, in Fig. 7 we plot the total number of
samples required to obtain a given accuracy as a function of
$l$, the number of times we reuse the circuit.
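The two variance expressions from the toy model can be evaluated directly to see how the sample requirement grows with circuit reuse. This is a sketch under assumptions: "accuracy" is interpreted as the standard deviation of the estimator, the number of circuits $k$ is treated as a real number rather than rounded up, the function names are hypothetical, and $\sigma^2 \le \mu(1 - \mu)$ is required for $P$ to be a valid probability.

```python
def var_reuse(mu, sigma2, l, k):
    """Variance of the estimator when each of k circuits is run l times."""
    return (mu - (mu ** 2 + sigma2)) / (l * k) + sigma2 / k


def var_fresh(mu, n):
    """Variance of the estimator when every one of n shots uses a new circuit."""
    return (mu - mu ** 2) / n


def total_samples_reuse(mu, sigma2, l, target_std):
    """Total shots n = l * k needed so that the standard deviation is target_std."""
    k = ((mu - (mu ** 2 + sigma2)) / l + sigma2) / target_std ** 2
    return l * k


mu, sigma2, target = 0.5, 0.01, 0.01
for reuse in (1, 2, 5, 10):
    print(reuse, total_samples_reuse(mu, sigma2, reuse, target))
```

For these illustrative values, running each circuit once needs about 2500 shots, while reusing each circuit ten times needs about 3400 shots for the same standard deviation.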
B. CPU to QPU bottlenecks
Many implementations of these algorithms utilise a CPU
and QPU model for their execution i.e. where all the clas-
sical computation is carried out on the CPU and the QPU is
used purely to execute the prepared circuits. As discussed in
the introduction this model fails to take full advantage of the
computational resources available much closer to the qubits,
often located on FPGAs tasked with hardware control. Here
we will highlight two key bottlenecks in the performance of
both randomised benchmarking and zero noise extrapolation
which come about because of this design decision. Then, in
the next section, Sect. II C, we explain how utilisation of the
local computation can improve performance through the re-
moval of these bottlenecks.
The first of these bottlenecks is due to the limited band-
width available between the CPU and QPU control hardware,
see Fig. 8. This comes from the need to provide a constant
stream of different circuits to be executed on the QPU within
the data collection phases of the algorithms. The second of
these is due to the high latency involved in the round trip
of carrying out an update within an adaptive algorithm, i.e.
the circuit to be run next depends on the outcome of the cur-
rent circuit. The relatively high communication lag will of-
ten be an order of magnitude longer than the circuit execution
and computational update step times leading to a substantial
degradation in performance, see Fig. 8.
1. Bandwidth Bottleneck
In both randomised benchmarking and zero noise extrapo-
lation one of the key challenges is to execute a large number
of different circuits. This contrasts with VQE [37], which executes
the same circuit a number of times. As highlighted in
FIG. 8: Bottlenecks considered in this section. The orange arrows represent the bottlenecks, Bandwidth bottleneck between
circuit generation (left) and execution and Latency bottleneck between circuit execution and adaptive update (right).
Sect. II A 4, to minimise the total number of circuit executions
and therefore total run time of the algorithms, each generated
circuit should be run only once on the qubits. To understand
the bandwidth requirement of delivering such a stream of cir-
cuits from the CPU to QPU, such that this does not create
a bottleneck, we consider a slightly simpler case to analyse.
Namely, the problem of providing a constant stream of gates
to all qubits within the system.
We assume a data requirement of 2 bytes of information to
specify a gate per qubit. For a 2-qubit gate this means 2 bytes
to specify the qubits the gate acts upon and 2 bytes to specify
the gate action, a total of 4 bytes. For the 1-qubit gates
this is a byte to specify the qubit the gate acts upon and a
byte to specify the gate action. In Fig. 9 we plot the bandwidth
width requirement to deliver such a gate stream against av-
erage gate time for a range of device sizes assuming a 50%
qubit utilisation, meaning that a qubit will only be involved
in 50% of the layers of gate execution. From this we can see
that for a superconducting system with an average gate time
of 120ns with 150 qubits this will require a bandwidth of at
least 1.2GB/s. Note that this is under a relatively low qubit
utilisation assumption and a gate time which is only likely
to improve substantially in the coming years, both of which
will further increase this already high communication require-
ment. Furthermore, this calculation ignores the communica-
tions overheads required to transfer data between the CPU and
the QPU control hardware; these will further inflate the com-
munication requirements. Maintaining data transfer rates of
this level is a substantial engineering challenge and in many
cases will be unattainable. Failure to obtain these data rates
will lead to under-utilisation of qubits and increased running
time for the various algorithms.
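The bandwidth figure quoted above follows from a one-line calculation: the number of qubits, times the fraction of layers in which each qubit is active, times 2 bytes per gate per qubit, divided by the gate time. The sketch below reproduces it; the function name is hypothetical, and the trapped-ion line simply reuses the 10µs gate time from Sect. I A with the same device size for comparison.

```python
def required_bandwidth(n_qubits, gate_time_s, utilisation=0.5, bytes_per_qubit_gate=2):
    """Bytes per second needed to keep a constant gate stream flowing."""
    return n_qubits * utilisation * bytes_per_qubit_gate / gate_time_s


# Superconducting example from the text: 150 qubits, 120 ns gates, 50% utilisation.
print(required_bandwidth(150, 120e-9) / 1e9, "GB/s")
# Same device size with a 10 us trapped-ion gate time.
print(required_bandwidth(150, 10e-6) / 1e6, "MB/s")
```

The requirement scales linearly with qubit count and utilisation and inversely with gate time, so faster gates or busier qubits raise it proportionally.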
2. Latency Bottleneck
As described in Sect. II A 3 there is a move to improve performance,
accuracy and run time, through the use of adaptive
algorithms. In adaptive algorithms the circuit specification,
either depth for randomised benchmarking or noise level for
zero noise extrapolation, is updated at every shot. This re-
quires a round-trip to the CPU between each shot within this
FIG. 9: Bandwidth requirements for a constant gate stream at
50% utilisation.
computational model, see Fig. 8. The associated communication
latency of such a round trip is often on the order of 200µs
(see Sect. I A for a discussion of CPU and QPU communications),
which is 40 times longer than the 5µs required to
execute a circuit and update the circuit specification for superconducting
qubits. Therefore the communication latency will
lead to a substantial degradation in algorithmic performance
and low qubit utilisation. See Sect. III and specifically Sect.
III D for a more detailed discussion of adaptive algorithms and
the associated bottleneck.
C. Local computation mitigation
Having discussed the bottlenecks, bandwidth and latency,
which arise from the traditional model of carrying out all
classical computation on the CPU rather than locally to the
qubits, we now highlight how the use of computation local
to the qubits can mitigate or remove these bottlenecks.
1. Local circuit generation
As described in Sect. II A for both randomised benchmark-
ing and zero noise extrapolation, the random circuits are sam-
pled from a predefined template. Using these templates, the
creation of the next circuit to be run at each time step is a rel-
atively low weight computational task. Rather than complet-
ing this task using the CPU, this can instead be completed,
with a low overhead, on the simple computational units mak-
ing up the available local computation. For example, within
near term quantum computers, the FPGA used for hardware
control can be programmed to carry out this task. The use of
the available local computation in this fashion completely
removes the bandwidth bottleneck. As the local computation
directly interfaces with the qubit hardware, the only data
transferred is the circuit results and, depending on the set-up,
a stream of random numbers.
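The idea of local circuit generation can be sketched as follows. This is a software illustration only, not FPGA code: the gate labels in the template and the function name are hypothetical placeholders, and a real randomised benchmarking template would sample from the Clifford group. The point is that only a seed and a depth need to cross the CPU-QPU link, while the gate stream itself is produced next to the qubits.

```python
import random

# Hypothetical template: placeholder gate labels, not a real Clifford set.
TEMPLATE_GATES = ("I", "X", "Y", "Z", "H", "S")


def generate_circuit(seed, depth, n_qubits, gates=TEMPLATE_GATES):
    """Sample a random circuit from the template.

    Returns a list of layers, each layer holding one gate label per qubit.
    The same (seed, depth, n_qubits) always yields the same circuit, so the
    CPU can reconstruct it later without receiving the gate stream.
    """
    rng = random.Random(seed)
    return [[rng.choice(gates) for _ in range(n_qubits)]
            for _ in range(depth)]


circuit = generate_circuit(seed=7, depth=4, n_qubits=3)
for layer in circuit:
    print(layer)
```

Seeding the generator is one possible design choice; an alternative, mentioned in the text, is to stream random numbers to the local hardware instead.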
2. Local update
As will be discussed in detail for the AVQE in the next
section, for adaptive algorithms in general we can carry out
the required update calculation and decision on the local com-
putation hardware rather than the CPU. There is no need to
communicate back to the CPU and wait for a response be-
tween circuit evaluations. This removes the communication
delay completely, hence reducing the length of time required
to complete each iteration of the algorithm and therefore the
total run time.
It is worth noting that the computational resources available
on local computation hardware are often limited. For exam-
ple, in near term devices this will generally be the number of
FPGAs. Often this means that developers will implement ap-
proximations of the full update process that would be carried
out on the CPU or develop lighter weight algorithms that give
close to optimal performance. This may lead to a slight degra-
dation in algorithmic performance but, in general, this will be
easily compensated for by the improved performance due to
removal of the latency bottleneck.
III. ACCELERATED VARIATIONAL QUANTUM
EIGENSOLVER
A. Introduction
Variational algorithms provide a promising pathway for in-
vestigating quantum effects on near-term (NISQ) quantum
hardware. The hybrid quantum-classical nature of these al-
gorithms means that required coherence times on a QPU are
minimised and a classical component can perform the bulk
of the necessary work. The VQE is the flag-bearer for this
class of algorithms and has found several applications in quan-
tum chemistry [38–40], which is widely expected to be one of
the first use-cases for quantum computing machines. While
QPU time is minimised by using only short quantum circuits,
there is a penalty in the number of times those circuits
actually need to be implemented. Specifically, the circuit being
implemented has depth $d = O(1)$ but to achieve a required
precision p the total number of circuit executions (equivalently,
measurements) is $O(p^{-2})$ [37]. VQE was developed for use in
estimating ground state energies of quantum Hamiltonians as
an alternative to Kitaev’s quantum phase estimation (QPE)
algorithm. While Kitaev’s algorithm has advantages in that the
number of measurements required is $O(1)$, the required circuit
depth, $O(p^{-1})$ for precision p, is beyond the scope of NISQ
devices [41, 42].
FIG. 10: Measurements for the AVQE algorithm N(p, α)
plotted against α for various values of the precision p.
These two algorithms provide recourse in two different
hardware regimes; the gap between $O(1)$ and $O(p^{-1})$ coherent
circuit depth is vast and the technological bridge between
them will be made incrementally. This realisation has spurred
development into algorithms that allow circuit depth as a (di-
rectly or indirectly) controllable parameter [16, 43]. In con-
trast to running VQE (whose circuit depth is fixed) until the
requisite hardware is available to run QPE, the ability to vary
the circuit depth your algorithm requires will allow more util-
ity from successive quantum hardware generations. Here we
focus on one such algorithm: the accelerated variational quan-
tum eigensolver (AVQE) [15]. Through the introduction of a
parameter α ∈ [0, 1] it is possible to interpolate between scal-
ing regimes: decreasing the number of measurements in ex-
change for an increased circuit depth. The required number of
measurements in AVQE can be given by:
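$$N(p, \alpha) = \begin{cases} \dfrac{2}{1-\alpha}\left(p^{-2(1-\alpha)} - 1\right) & \text{if } \alpha \in [0, 1) \\ 4 \log\left(p^{-1}\right) & \text{if } \alpha = 1 \end{cases} \tag{1}$$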
Note that $N(p, 0) = O(p^{-2})$ is the required number of
measurements in VQE; $N(p, 1) = O[\log(p^{-1})]$ is the required
number of measurements in quantum phase estimation (QPE)
up to further log factors. This is an up-to-exponential
improvement over VQE in terms of measurements required for
all α > 0. As stated above this decrease in measurements
is paired with a commensurate increase in circuit depth of
$O(p^{-\alpha})$, offering an improvement over QPE for all α < 1.
Equation (1) is a decreasing function of α; its behaviour is
shown for various values of p in Fig. 10.
FIG. 11: Flow chart of the AVQE algorithm.
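Equation (1) is straightforward to evaluate numerically. The sketch below assumes natural logarithms (this makes the α = 1 case the continuous limit of the α < 1 branch); the function name is chosen here for illustration.

```python
import math


def n_measurements(p, alpha):
    """Number of measurements N(p, alpha) from Eq. (1).

    p is the target precision (0 < p < 1), alpha the acceleration
    parameter in [0, 1]. Natural logarithms are assumed.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return 4.0 * math.log(1.0 / p)
    return 2.0 / (1.0 - alpha) * (p ** (-2.0 * (1.0 - alpha)) - 1.0)


# Interpolating between the VQE (alpha = 0) and QPE (alpha = 1) regimes.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, n_measurements(1e-3, alpha))
```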
The task in question is minimisation of the energy of a given
Hamiltonian, $H = \sum_i a_i P_i$, where $a_i$ are (known) complex
coefficients and $P_i$ are Pauli matrices across subsystems of the
Hamiltonian. Given an ansatz wavefunction $|\psi(\lambda)\rangle := R(\lambda)|0\rangle$
parameterised by a real valued variable λ and generated by a
preparation circuit R(λ), the energy of this Hamiltonian can be
written:
$$E(\lambda) = \sum_i a_i \langle\psi(\lambda)|P_i|\psi(\lambda)\rangle. \tag{2}$$
The process is to compile R(λ) and pass $\{|\psi(\lambda)\rangle, \{P_i\}\}$ to the
QPU for each expectation value in Eq. (2) to be estimated.
These are then collated and, using a CPU, the energy E(λ)
is calculated. This information is passed to a classical opti-
miser; the value of λ is updated and the process begins again.
Whereas VQE operates an algorithm known as quantum ex-
pectation estimation (QEE), AVQE replaces this with a mod-
ified Bayesian QPE algorithm known as AQPE. The gener-
alisation from VQE to AVQE is then simply replacing this
quantum subroutine. A flow-chart diagram of the algorithm is
shown in Fig. 11.
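To make the structure of Eq. (2) concrete, the following toy example evaluates the energy for a single qubit. Everything specific in it is an assumption made for illustration: the ansatz is taken to be a rotation about the Y axis, the Hamiltonian has two terms with real coefficients, and the expectation values are computed exactly from the state vector, whereas on hardware each term would be estimated by QEE (in VQE) or AQPE (in AVQE).

```python
import math

# Single-qubit Pauli matrices needed for this toy Hamiltonian.
PAULI = {
    "X": ((0.0, 1.0), (1.0, 0.0)),
    "Z": ((1.0, 0.0), (0.0, -1.0)),
}


def ansatz_state(lam):
    """Hypothetical ansatz |psi(lam)> = RY(lam)|0>, with real amplitudes."""
    return (math.cos(lam / 2.0), math.sin(lam / 2.0))


def expectation(state, label):
    """Exact <psi|P|psi> for a real two-component state."""
    matrix = PAULI[label]
    applied = [sum(matrix[r][c] * state[c] for c in range(2)) for r in range(2)]
    return sum(state[r] * applied[r] for r in range(2))


def energy(lam, terms):
    """E(lam) = sum_i a_i <psi(lam)|P_i|psi(lam)>, as in Eq. (2)."""
    state = ansatz_state(lam)
    return sum(a * expectation(state, label) for label, a in terms)


terms = [("X", 0.5), ("Z", -1.0)]   # illustrative coefficients a_i
print(energy(0.0, terms), energy(math.pi / 2.0, terms))
```

A classical optimiser would then update λ based on the returned energy, which is the outer loop left to the CPU in the scheme described above.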
This modification has a consequence in the computational
resources required. In contrast to VQE, the quantum subrou-
tine in AVQE is adaptive, such that each circuit depends on
the outcome of the previous. Consequently, a number of cal-
culations are required between circuit evaluations. In the next
section we will detail the action of this subroutine, before ex-
panding on the precise steps needed to evaluate a single ex-
pectation estimation and exactly how this process could bene-
fit from local computation.
B. Quantum Subroutine
The quantum subroutine in AVQE is the key driving factor
behind the advantages the algorithm exhibits over its
competitors. It is a minor modification of an algorithm known as
Rejection Filtering Phase Estimation [43]. Note that where VQE
uses an algorithm to estimate expectations, which are the
quantities needed to calculate the energy, we are instead
implementing a phase estimation algorithm to indirectly obtain
the value of the expectation. The justification for this is that
through a careful choice of unitary applied to the target qubit,
the output phase is directly related to $\langle\psi|P|\psi\rangle$.
FIG. 12: Phase estimation circuit. M, θ are free parameters.
The circuit we need to implement is shown in Fig. 12, with
the goal to estimate the value of φ where $U|\phi\rangle = e^{i\phi}|\phi\rangle$,
$\phi \in [-\pi, \pi]$. Note this circuit has depth O(M) and measurement
outcomes are only recorded from the ancilla (control) qubit.
The outcome probability P(E|φ) of E ∈ {0, 1} is what gives
us information on the eigenphase φ. The rotation gate on the
control qubit is $Z(M, \theta) = \mathrm{diag}(1, e^{-iM\theta})$ where parameters M,
θ are free to be chosen.
With the addition of a reflection operation $\Pi = 1 - 2|0\rangle\langle 0|$
we can specify the unitary $U = R\Pi R^{\dagger} P R\Pi R^{\dagger} P^{\dagger}$ applied to
input state $|\phi\rangle$. This operator defines a rotation by an angle
φ in the plane spanned by $\{|\psi\rangle, P|\psi\rangle\}$. It can be shown that
$$\phi = 2\arccos\left(|\langle\psi|P|\psi\rangle|\right),$$
which allows us to approximate
$$|\langle\psi|P|\psi\rangle| = \cos(\phi/2).$$
While this process does not offer the sign of the expectation,
this can be readily found through other methods.
This estimation of phase is statistical in nature, in that we
assume the phase value φ follows a certain prior distribution
with mean µ and standard deviation σ. Our goal, through re-
peated implementations of this circuit, is to update the dis-
tribution and reduce σ below a certain threshold p, at which
point we can take φ ≈ µ as our result. Assume initially that the
phase follows a normal distribution f ∼ N(µ, σ). Given our
knowledge of the outcome probability P(E|φ) we can produce
the posterior distribution f (φ|E) using Bayes’ rule:
$$f(\phi|E) = \frac{f(\phi)\,P(E|\phi)}{\int d\varphi\, f(\varphi)\,P(E|\varphi)}.$$
To avoid directly calculating the posterior distribution,
which is computationally hard, we instead use a method
known as rejection sampling [44]. We begin by collecting
a measurement outcome E ∈ {0, 1} and sampled data
$\Phi = \{\phi_j\}_{j=1}^{k}$ from the prior. By performing an acceptance test
on each $\phi_j \in \Phi$ according to $P(E|\phi_j)$ we are able to ‘reject’
certain values and the remaining (accepted) values are guaran-
teed to have been sampled from the true posterior. These ac-
cepted values then offer a new mean µ and standard deviation
σ for a new prior distribution; the update steps then continue
until the requirement on σ has been met. The cardinality k of
Φ is directly related to the reduction of σ at each step; natu-
rally for larger k the approximation to the posterior mean and
standard deviation is improved, but at the cost of performing
more tests.
FIG. 13: Minimum number of measurements required Nmin,
Eq. (4), against the target precision p plotted here for various
coherent circuit depths d. Setting d = 1 recovers the scaling
of VQE, and d = 1/p is the scaling regime of QPE.
The next point to address is the specific form of outcome
probability P(E|φ), which is related to the input state |φ〉. We
noted before that the unitary U gives a direct relationship be-
tween the eigenphase and the desired expectation value. It can
also be shown that |ψ〉, our prepared ansatz state, is in an equal
superposition of the eigenstates of U:
$$|\psi\rangle = \frac{1}{\sqrt{2}}\left(|\phi\rangle + |-\phi\rangle\right),$$
meaning that |ψ〉 must be collapsed into one of |±φ〉 prior to
running the circuit. Collapsing into one of |±φ〉 gives an out-
come probability:
$$P(E|\pm\phi) = \frac{1}{2}\left(1 + (1 - 2E)\cos\left[M(\theta \mp \phi)\right]\right). \tag{3}$$
For details on the collapse process please see Ref. [15], Ap-
pendix C.
The parameters M, θ each play an important role in the al-
gorithm. The value of M sets the overall depth of the circuit
but also enters as a multiplicative factor in Eq. (3); the param-
eter θ acts as a rotation that, with the correct setting, can offset
the rapid oscillations induced by M. The authors of Ref. [43]
chose $M = \lceil 1.25/\sigma \rceil$, meaning that upon each update of the
posterior distribution (i.e. defining a new µ, σ) the circuit
depth changes. For a precision p, the final circuit depth would
be $O(p^{-1})$, the circuit depth regime of QPE. We maintain
this dynamic behaviour of circuit depth but instead set
$$M = \max\left(1, \left\lfloor \frac{1}{\sigma^{\alpha}} + \frac{1}{2} \right\rfloor\right)$$
for parameter α ∈ [0, 1]. By pre-setting the value of α we
are able to control how quickly the circuit depth grows and to
what maximum it reaches, $O(p^{-\alpha})$. Finally, we set the
parameter θ = µ − σ as in Ref. [15].
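The update loop described in this section can be sketched as a purely classical simulation. This is an illustration under several assumptions, not the implementation of Refs. [15, 43]: the QPU is replaced by a random draw from Eq. (3) for the |+φ⟩ branch, the initial prior N(0, 1), the sample count k and the iteration cap are arbitrary choices, and no wrapping of the phase onto [−π, π] is attempted. All function names are hypothetical.

```python
import math
import random


def circuit_depth(sigma, alpha):
    """M = max(1, floor(1 / sigma**alpha + 1/2)), as set in the text."""
    return max(1, math.floor(1.0 / sigma ** alpha + 0.5))


def outcome_probability(outcome, depth, theta, phi):
    """Eq. (3) for the |+phi> branch: P(E | phi)."""
    return 0.5 * (1.0 + (1 - 2 * outcome) * math.cos(depth * (theta - phi)))


def rejection_update(outcome, depth, theta, mu, sigma, k, rng):
    """Draw k prior samples, accept each with probability P(E | phi_j),
    and return the mean and standard deviation of the accepted set."""
    accepted = []
    for _ in range(k):
        phi = rng.gauss(mu, sigma)
        if rng.random() < outcome_probability(outcome, depth, theta, phi):
            accepted.append(phi)
    if len(accepted) < 2:          # too few samples: keep the old prior
        return mu, sigma
    mean = sum(accepted) / len(accepted)
    var = sum((x - mean) ** 2 for x in accepted) / (len(accepted) - 1)
    return mean, math.sqrt(var)


def estimate_phase(true_phi, precision, alpha, k=500, max_iter=5000, seed=0):
    """Repeat measure-and-update until sigma < precision (or max_iter)."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0           # assumed initial prior
    n_meas = 0
    while sigma >= precision and n_meas < max_iter:
        depth = circuit_depth(sigma, alpha)
        theta = mu - sigma
        # Stand-in for running the circuit of Fig. 12 on the QPU.
        p_zero = outcome_probability(0, depth, theta, true_phi)
        outcome = 0 if rng.random() < p_zero else 1
        mu, sigma = rejection_update(outcome, depth, theta, mu, sigma, k, rng)
        n_meas += 1
    return mu, sigma, n_meas


mu, sigma, n_meas = estimate_phase(0.8, precision=0.05, alpha=0.5,
                                   k=200, max_iter=2000)
print(mu, sigma, n_meas)
```

Every pass through the loop uses the previous outcome to choose M and θ, which is why the circuits cannot be batched in advance.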
FIG. 14: Total number of gates required to reach precision
$p = 10^{-3}$ shown for VQE (solid) and AVQE (dashed). Two
preparation circuit lengths are reported, $n_P = 10$ (blue,
orange) and $n_P = 10^{3}$ (green, red).
C. Performance & Applications for Local Computation
As the behaviour of the classical outer-loop in AVQE
is problem-dependent and benefits from various optimisa-
tion protocols, the performance of AVQE is largely centered
around the action of the quantum subroutine. Put another way,
the accuracy in estimating the ground state energy of a Hamil-
tonian by AVQE is dependent on the accuracy of AQPE in
measuring the eigenphase φ.
The adaptive nature of the subroutine means it requires a
large number of classical calculations to be completed before
a single eigenphase can be estimated. This is in contrast to
VQE which benefits from batching, i.e. pre-computing a se-
ries of circuits and passing these to the QPU to be executed
in order. The individual steps for an eigenphase calculation in
AVQE are: sampling k values of φ from a normal distribution
with mean µ and standard deviation σ, testing each sampled
φ against Eq. (3) and storing accepted values, and finally cal-
culating a new µ and σ to inform a new prior. These steps
need to be repeated multiple times to achieve a value of φ.
The exact number of iterations (equivalently, measurements)
is primarily related to the desired precision p and acceleration
parameter α, but we are further restricted through hardware
specifications. For a given coherent circuit depth d the optimal
value of α is found by minimising the number of
measurements, Eq. (1). For simplicity let us set $d = p^{-\alpha}$, $p \ll 1$,
such that the maximum value of α is given by
$$\alpha_{\max} = \min\left\{-\frac{\log(d)}{\log(p)},\, 1\right\}.$$
The minimum number of measurements required to achieve
a precision p is therefore
$$N_{\min}(p, d) = \begin{cases} \dfrac{2\log(p)}{\log(pd)}\left(\dfrac{1}{p^{2}d^{2}} - 1\right) & pd < 1 \\ 4\log(p^{-1}) & pd \geq 1 \end{cases} \tag{4}$$
Figure 13 shows the behaviour of Eq. (4) for decreasing values
of the precision p and various maximum circuit depths d. For
comparison purposes the curves corresponding to the circuit
depths of VQE (blue) and QPE (purple) are shown. The strength
of AVQE lies in this range, with the ‘acceleration’ epitomised by
the up-to-exponential decrease in required measurements. We
propose that AVQE is a candidate for utilising the full potential
of NISQ devices and, in particular, that its quantum phase
estimation subroutine can make effective use of local computation.
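The relationship between the available coherent depth d, the largest usable α and the resulting number of measurements can be checked numerically. The sketch below assumes natural logarithms and d ≥ 1, and treats the boundary pd = 1 as belonging to the QPE branch; the function names are illustrative.

```python
import math


def alpha_max(p, d):
    """Largest alpha compatible with coherent depth d, using d = p**(-alpha)."""
    return min(-math.log(d) / math.log(p), 1.0)


def n_min(p, d):
    """Minimum number of measurements, Eq. (4), for precision p and depth d."""
    if p * d >= 1.0:
        return 4.0 * math.log(1.0 / p)
    return (2.0 * math.log(p) / math.log(p * d)
            * (1.0 / (p * p * d * d) - 1.0))


p = 1e-3
for d in (1, 10, 100, 2000):
    print(d, alpha_max(p, d), n_min(p, d))
```

Substituting α = α_max into Eq. (1) gives the same value as Eq. (4), which is a useful consistency check on both expressions.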
While overall AVQE requires more computational effort, a
further advantage over traditional VQE lies in reducing the re-
quired quantum resources, specifically in the number of gates
that need to be implemented. VQE operates with a fixed
circuit depth, that of nP + 1 where nP is the gate length of
the preparation circuit. Comparatively, AVQE requires much
deeper circuits and these continue to increase as the algorithm
runs. With this in mind, for α = 0 (M ≡ 1) there are limited
benefits to running AVQE in comparison to VQE. For α > 0
however, the decrease in number of measurements Eq. (1) be-
gins to outweigh this increased computational cost. To illus-
trate this trade-off we can consider the number of gates in
the preparation circuit, the common factor between both al-
gorithms.
To achieve a precision p in VQE, the total number of gates
required is (approximately) $(n_P + 1)p^{-2}$. For AVQE a single
execution of the circuit in Fig. 12, ignoring possible parallelisation,
requires $4M(n_P + 1) + n_P + 3$ gates. The dynamic nature of M
makes the calculation of total gates required to achieve a
precision p difficult; however, we can numerically approximate
its schedule. The comparison between these two is shown in
Fig. 14 for a precision $p = 10^{-3}$, equivalently a fidelity of
99.9%. We report two extreme values of $n_P$: 10 and $10^{3}$, with
VQE shown in solid lines (blue, green for $n_P = 10, 10^{3}$
respectively) and AVQE shown in dashed (orange, red). For
both values of $n_P$ the intersection happens at α ≈ 0.23. That is
to say, beyond this value AVQE requires fewer quantum
computational resources than VQE. The plateau at the beginning
of the AVQE lines is a result of the low values of α: the
value of M never increases beyond 1. The second plateau is
the case in which M never increases beyond 2; as α increases
the acceleration of M causes these plateaus to disappear.
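The two closed-form gate counts quoted above are easy to tabulate. The sketch below covers only those expressions; it does not reproduce the AVQE curves of Fig. 14, which additionally require the schedule of M obtained numerically. Function names are illustrative.

```python
def vqe_total_gates(n_prep, p):
    """Approximate total gate count for VQE at precision p: (n_P + 1) / p**2."""
    return (n_prep + 1) * p ** -2


def avqe_gates_per_execution(depth_m, n_prep):
    """Gates in one execution of the circuit in Fig. 12 (no parallelisation):
    4 M (n_P + 1) + n_P + 3."""
    return 4 * depth_m * (n_prep + 1) + n_prep + 3


for n_prep in (10, 1000):
    print(n_prep, vqe_total_gates(n_prep, 1e-3),
          [avqe_gates_per_execution(m, n_prep) for m in (1, 2, 4)])
```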
AVQE exhibits clear advantages over VQE, however, the
intermediate calculations can still incur a runtime penalty. In
an effort to reduce the overall wall-clock time as many as pos-
sible of these calculations can be done locally. By relegating
the CPU so that it only operates on the classical outer-loop
of the algorithm we can utilise the abilities of the lower lev-
els in the computational stack for subroutine calculations. If
we were to continually communicate back to the CPU after
each recorded measurement outcome, the latency between the
QPU and CPU would create an untenable runtime. In VQE (and
others) the relevant sequence of circuits to be run can be pre-
computed. However, it is not possible in this case, as circuit
parameters M, θ depend on the outcome of the previous cir-
cuit.
FIG. 15: Algorithm iteration of AVQE for a generic M-depth
circuit consisting of 2-qubit gates only using approximate
hardware times for superconducting qubits implementations.
D. Costing the subroutine
To emphasise the benefits of implementing this algorithm
in a low-latency environment we here provide a comprehensive
view of the computational cost of AVQE. For ease of
discussion we introduce the following parameters: $t_{\mathrm{lat}}$ denotes
the latency between QPU and the classical computation block
(either CPU or alternative); $t_c(M)$ is the time taken to
implement the circuit in Fig. 12 for a certain depth d ≈ M, including
the circuit to prepare the input state $|\phi\rangle$. The qubit reset and
measurement times are denoted $t_r$, $t_{\mathrm{meas}}$ respectively; as these
can be implemented simultaneously we need only consider
$\max(t_r, t_{\mathrm{meas}})$. Finally, the time taken to perform the Bayesian
update on the probability distribution is $t_B$. Note that the
latency time $t_{\mathrm{lat}}$ needs to be included twice: once for
communicating the circuit instructions to the QPU and again for
the measurement outcome being sent back. One iteration of
AQPE then has the following cost:
$$T(M) = 2t_{\mathrm{lat}} + t_c(M) + \max(t_r, t_{\mathrm{meas}}) + t_B, \tag{5}$$
and the cost of estimating the eigenphase is
$$\tau = \sum_{M} a_M(p, \alpha)\, T(M).$$
The real-valued coefficients $a_M(p, \alpha)$, which determine how
many circuit evaluations take place for a value M, represent
the only unknowns in this costing. They depend on α and p
because both play a key role in the schedule of M; the
process is probabilistic due to the nature of the Bayesian
update but will manifest as a step function over $\mathbb{Z}^{+}$. We will not
address methods for investigating the behaviour of $a_M$ here as
we believe it is outside the scope of this document.
We will evaluate the time taken for a single iteration, Eq. (5),
in both a superconducting qubit and a trapped ion implementation.
As the circuit being implemented is dominated by M,
for simplicity we will consider circuits of depth M consisting
only of 2-qubit gates for each hardware specification. The
Bayesian update is performed outside of the quantum
hardware; an optimised and hardware-conscious rejection
sampling calculation can be performed in approximately 5µs.
For superconducting qubit hardware the readout and reset
of qubits can be done in 120ns, with a similar 2-qubit gate
time. For trapped ions the times are much longer, with a
2-qubit gate time of 10µs and readout and reset time of
approximately 750µs. We can see the result of the superconducting
simulation in Fig. 15 with a theoretical latency range of
[1, 100]µs. Due to the very low gate times of this technology
the benefit of a lower latency in the calculation is stark; up to a
20x improvement in runtime can be achieved. Comparatively,
the improvement for trapped ion technology, Fig. 16, is much
smaller due to the long gate and readout times. Nevertheless,
we argue that an improvement in runtime, however small, is
advantageous in NISQ hardware.
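Equation (5) can be evaluated directly with the hardware times quoted above. The sketch below assumes the simplest possible circuit model, $t_c(M) = M \times$ (2-qubit gate time), and ignores the state preparation circuit, so the resulting ratios are indicative only and need not match Figs. 15 and 16 exactly. All times are in microseconds; the dictionary and function names are illustrative.

```python
def iteration_time(depth_m, t_lat, t_gate, t_reset, t_meas, t_bayes):
    """T(M) = 2 t_lat + t_c(M) + max(t_r, t_meas) + t_B, Eq. (5).

    Assumes t_c(M) = M * t_gate, i.e. a depth-M circuit of 2-qubit gates.
    """
    t_circuit = depth_m * t_gate
    return 2.0 * t_lat + t_circuit + max(t_reset, t_meas) + t_bayes


# Approximate hardware times from the text, in microseconds.
SUPERCONDUCTING = dict(t_gate=0.12, t_reset=0.12, t_meas=0.12, t_bayes=5.0)
TRAPPED_ION = dict(t_gate=10.0, t_reset=750.0, t_meas=750.0, t_bayes=5.0)


def latency_speedup(depth_m, hardware, slow_lat=100.0, fast_lat=1.0):
    """Ratio of iteration times at the two ends of the latency range."""
    slow = iteration_time(depth_m, slow_lat, **hardware)
    fast = iteration_time(depth_m, fast_lat, **hardware)
    return slow / fast


for name, hw in (("superconducting", SUPERCONDUCTING),
                 ("trapped ion", TRAPPED_ION)):
    print(name, [round(latency_speedup(m, hw), 2) for m in (1, 10, 100)])
```

In this model the benefit of low latency shrinks as M grows, since the circuit time eventually dominates the fixed communication cost.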
E. Closing remarks
The accelerated variational quantum eigensolver algorithm
is able to outperform traditional VQE in terms of measure-
ments, utilisation of quantum coherence in hardware and the
overall required quantum resource budget. These advantages
are coupled with the requirement for regular, generally non-
trivial, classical calculations and low batch size. With this in
mind, operating AVQE in a low latency environment can ac-
centuate its performance and allow for more efficient imple-
mentations.
The future of NISQ hardware and showcasing of quantum
advantage on these devices is, we believe, rooted in these
kinds of adaptive algorithms. While AVQE is not unique in
incorporating coherent circuit depth into algorithm function-
ality (references passim), it is a prime example of the types
of intermediary calculations adaptive algorithms will require.
Efficient implementation of these algorithms can be achieved
through the use of a degree of local computation.
IV. QUANTUM ERROR CORRECTION
A. Introduction to QEC
An area where local computation is likely to prove advan-
tageous is error correction. In the fault tolerant regime, error
correcting protocols are going to be necessary to achieve the
required coherence times. However, even in the NISQ era,
some low level error correction is likely to be useful [17].
Error correction on a quantum computer is performed by a
repetitive sequence of events. First, a bit string called a syn-
drome is created. This is a record of the error that has occurred
on the data, and on a quantum computer the syndrome is cre-
ated by entangling ancilla qubits to the data qubits and mea-
suring certain Pauli operators corresponding to the code being
used. The syndrome then needs to be decoded by a classi-
cal decoder, which outputs the best guess of the error that has
occurred. This information then needs to be relayed back to
the QPU in order for the error to be taken into account, either
via a correction or an update to the set of gates that are
subsequently going to be applied.
FIG. 16: Algorithm iteration of AVQE for a generic M-depth
circuit consisting of 2-qubit gates only using approximate
hardware times for trapped ion implementations.
To prevent the build up of errors this error correction cycle
needs to be performed at regular intervals. If the decoder is
situated on the CPU, latency will become an issue given the
frequency of this task. Moreover, given the number of oper-
ations required to perform an error correction cycle, and the
frequency with which they need to be transmitted, bandwidth
will also be an issue. However, we show in this section that if
the decoder can be implemented using local computation then
the latency and bandwidth bottlenecks can be overcome.
B. The Surface Code
The surface code is the most promising family of quantum
error correcting codes due to its high noise threshold, which
means that it can tolerate a comparatively high physical error rate. The surface
code considered in [18] is laid out on a (2d − 1) × (2d − 1)
square grid of qubits (see Figure 17). The qubits consist of
data qubits, which store the logical information, and ancilla
qubits, which are used to measure the syndrome. Such a code
has distance d, which is a measure of its error tolerance.
The fact that the surface code can be laid out on a two-
dimensional grid means that the ancilla qubits only need to
interact with their nearest neighbour data qubits. After a syn-
drome measurement, an ancilla qubit “lights up” if the parity
of the measurements with its neighbour data qubits is odd. We
shall call such a qubit a hot syndrome qubit, or just a hot qubit.
The syndrome produced by a syndrome measurement is a bit
string that records the set of hot syndrome qubits.
The circuit used to measure the syndrome is susceptible to
the same noise that affects the data qubits. A consequence
of this is that measurement errors can cause an incorrect syn-
drome to be produced, as well as introduce errors to the data
qubits. In order to overcome this problem, several rounds of
syndrome extraction are performed; in particular, d rounds are
performed in the case of a distance d surface code. From
the accumulated information contained in these multiple
syndromes, a good decoder will output an error which best
explains them.
FIG. 17: The grid of qubits for the distance 3 surface code,
taken from [18]. The X (Z) ancilla qubits are used to measure
X-errors (Z-errors). The red edges represent X errors on the
corresponding data qubits.
1. The Union-Find decoder
Many decoding algorithms for the surface code use the de-
coding graph as their input. The decoding graph captures the
results from the d rounds of syndrome measurement and it is
best represented as a three-dimensional graph on a lattice of
vertices consisting of d layers with edges connecting nearest
neighbours. Each layer of the lattice records the results of a
single round of syndrome extraction; the vertices represent the
ancilla qubits and each horizontal edge in a single layer rep-
resents a possible error that can occur on the corresponding
qubit. Each vertical edge between layers represents a possi-
ble syndrome bit flip error (see Figure 18). The vertices cor-
responding to the hot syndrome qubits after the d-rounds of
measurement are highlighted in the graph (coloured red). The
goal of any decoding algorithm is to come up with an error
pattern consisting of both data qubit errors and syndrome bit
errors that best describes the set of hot syndrome qubits in the
decoding graph.
The Union-Find algorithm [45, 46] is an algorithm that per-
forms this task in almost linear time. We illustrate how the
algorithm works in Figure 19, for only one level of the decod-
ing graph, that is, assuming there are no measurement errors.
First in (a), the syndrome is measured giving a set of hot syn-
drome qubits (these are the red qubits). Once the syndrome
has been measured, clusters of edges are grown around the hot
syndrome qubits by expanding along half edges of the lattice
in all directions, which is shown in (b). During this process,
clusters that overlap are combined and if at any point a clus-
ter contains an even number of hot syndrome qubits, it stops
growing. The cluster growth process continues until all clusters
contain an even number of hot syndrome qubits. Once the
cluster growth process ceases, a spanning tree for each cluster
is created, shown in (c). Finally in (d), given the spanning
trees for the clusters, the Peeling decoder is applied to each
spanning tree to decode the error.
FIG. 18: The graph in (a) is a decoder graph for a single
round of syndrome measurements. The graph in (b) is a
three-dimensional decoder graph for three rounds of
syndrome measurement. The horizontal red edges represent
X errors on qubits; the vertical red edges represent syndrome
bit errors. Figure taken from [18].
FIG. 19: The four stages of the Union-Find decoder, taken
from [18].
C. Decoding must be fast
Any quantum circuit that cannot be simulated classically
must contain non-Clifford gates [47], and in order to apply a
non-Clifford gate it is necessary to know the current state of
the errors. This is because, assuming stochastic Pauli noise,
unlike for Clifford gates, an error cannot be handled by
commuting it through a non-Clifford gate and dealing with it
at a later time. Therefore a decoder must keep up
with the rate at which syndromes are being generated, other-
wise a data backlog will be created.
To clarify this, consider the ratio $f = r_{\mathrm{gen}}/r_{\mathrm{proc}}$ of the rate of
syndrome generation ($r_{\mathrm{gen}}$) to the rate of decoder processing
($r_{\mathrm{proc}}$). In [17] it is shown that the data backlog caused by a
slow decoder will lead to a latency overhead that grows
exponentially in the number of non-Clifford gates: if a circuit
contains k non-Clifford gates then the latency scales as $f^{k}$.
So any protocol with f > 1 is going to be problematic.
To see the impact of this, consider the circuit given in [48]
designed to perform a multiply-controlled CNOT gate on 100
logical qubits. It consists of ∼2356 gates, of which 686 are
T-gates (which are non-Clifford). Assuming that the
syndrome generation cycle time is approximately 400 ns [49]
and that the decoder requires 800 ns to execute [50], the ratio
f = 2 leads to an execution time of approximately $10^{196}$ seconds
(orders of magnitude longer than the age of the Universe). This
motivates us to consider fast decoding protocols.
D. A micro-architecture for the surface code
As commented on above, any classical decoder must finish
before the following rounds of syndrome measurement have
completed. Until recently it was not known whether such a
decoder for the surface code existed [51]. However Das et
al. [18] have proposed a micro-architecture to implement the
Union-Find decoder for the surface code that overcomes this
hurdle. This micro-architecture deals with the three stages of
the Union-Find decoder that can be implemented classically.
These are the growing of the clusters, the creation of the span-
ning trees, and the application of the Peeling decoder.
1. Growing the clusters
In the micro-architecture the Graph-Generator (Gr-Gen)
engine creates the clusters around the syndrome vertices. It
consists of a Spanning Tree Memory (STM), a Zero Data Reg-
ister (ZDR), a Fusion Edge Stack (FES), a Root Table, a Size
Table, a Parity Register and a Tree Traversal Register.
The STM keeps track of the clusters: it stores one bit for
each vertex and two for each edge. The ZDR is used to quickly
look up the STM; each entry of this register corresponds to a
row of the STM and if a row in the STM contains a non-zero
bit, then the corresponding register bit is 1, otherwise it is
equal to 0. Newly grown edges are stored in the FES, from
where they can be added to the existing clusters in the STM.
The root and size tables respectively keep track of the roots
and sizes of each cluster. The parity register keeps track of
the parity of each cluster; it stores a 1 if the corresponding
cluster is odd, that is, the cluster contains an odd number of
hot syndrome qubits.
The Gr-Gen grows clusters by first reading the parity reg-
ister to identify odd clusters. Using the information stored in
the root and size tables, as well as the FES, it then grows the
odd clusters by reading and writing to the STM. Newly added
edges are checked to see if they connect two clusters. This
is done by checking the root of each vertex incident with an
edge (note, this process is aided by the Tree Traversal Regis-
ter). If the two roots are different, the corresponding clusters
are subsequently merged. This involves updating the root and
size tables.
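The bookkeeping performed by the root table, size table and parity register corresponds closely to a standard union-find (disjoint-set) data structure. The following is a software sketch of that bookkeeping only, not a model of the hardware in [18]: the class and method names are hypothetical, and the STM, ZDR, FES and the cluster growth schedule are omitted.

```python
class ClusterTables:
    """Root, size and parity tables for clusters of syndrome vertices."""

    def __init__(self, n_vertices, hot_vertices):
        hot = set(hot_vertices)
        self.root = list(range(n_vertices))       # each vertex starts as its own root
        self.size = [1] * n_vertices
        # Parity 1 means the cluster holds an odd number of hot syndrome qubits.
        self.parity = [1 if v in hot else 0 for v in range(n_vertices)]

    def find(self, v):
        """Return the root of v's cluster, shortening the path as we go."""
        while self.root[v] != v:
            self.root[v] = self.root[self.root[v]]
            v = self.root[v]
        return v

    def add_edge(self, u, v):
        """Process a newly grown edge; merge clusters if the roots differ."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False                          # edge lies inside one cluster
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru                       # attach the smaller cluster
        self.root[rv] = ru
        self.size[ru] += self.size[rv]
        self.parity[ru] ^= self.parity[rv]
        return True

    def odd_roots(self):
        """Roots of clusters that still need to grow."""
        return [v for v in range(len(self.root))
                if self.root[v] == v and self.parity[v] == 1]


tables = ClusterTables(4, hot_vertices=[0, 2])
tables.add_edge(0, 1)
tables.add_edge(1, 2)
print(tables.odd_roots())
```

Merging by size and shortening paths during lookups is what gives the Union-Find decoder its almost linear running time.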
2. Creating the Spanning Trees
The spanning trees, which are used as the input to the peel-
ing decoder, are created by the Depth First Search (DFS) En-
gine. In the DFS engine a depth first search algorithm is ap-
plied to the clusters stored in the STM, and it is implemented
using a finite state machine and two edge stacks. The reason
edge stacks are used is that the implementation of the peeling
decoder requires the edges in a spanning tree to be traversed in
reverse order. Moreover, two stacks are used to enhance per-
formance; while a spanning tree is created in one edge stack
by the DFS engine, the peeling decoder can be applied to a
spanning tree for a cluster stored in the other edge stack.
3. Implementing the Peeling Decoder
The Correction engine (Corr engine) implements the peel-
ing decoder described in [46]. It needs access to the spanning
trees held on the edge stacks in the DFS engine as well as
the bits corresponding to the hot syndrome qubits in the STM.
To reduce latency, these syndrome bits are saved to the edge
stack when the spanning trees are created so that the Corr en-
gine only needs to read the edge stack. As part of the peeling
decoder, the syndrome is dynamically changed. A temporary
syndrome register keeps track of this in the Corr engine. Fi-
nally, the result of the peeling decoder is recorded by updating
the Error Log. If the error to be recorded cancels out an er-
ror held in the Error Log from a previous correction cycle, the
Error Log is updated to reflect this.
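The peeling step and the Error Log update can be sketched as follows. This is a simplified software model under our own assumptions: the function names are hypothetical, edges are (parent, child) pairs stored in creation order, boundary vertices are ignored, and the Error Log is modelled as a set of edges.

```python
def peel(edge_stack, hot):
    """Peel a spanning tree from the leaves inwards.

    edge_stack: list of (parent, child) tree edges in creation order; it is
                consumed by this function.
    hot: set of vertices whose syndrome bit is set (left unmodified).
    Returns the set of edges that make up the correction.
    """
    syndrome = set(hot)                    # temporary syndrome register
    correction = set()
    while edge_stack:
        parent, child = edge_stack.pop()   # reverse order: child is currently a leaf
        if child in syndrome:
            correction.add((parent, child))
            syndrome.discard(child)
            syndrome ^= {parent}           # flip the syndrome bit of the parent
    return correction


def update_error_log(error_log, correction):
    # An error recorded twice cancels out, so the log changes by symmetric difference.
    return error_log ^ correction
```

For a tree with edges (0, 2) and (2, 1) and hot vertices 0 and 1, the correction is the whole path between them. If the log already held the edge (0, 2), only (2, 1) remains after the update.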
4. Performance
By running simulations, it is observed in [18] that the Gr-Gen
engine takes roughly twice as long as either the DFS engine
or the Corr engine. If one decoding block consists of
one of each type of engine, then the DFS engine and Corr en-
gine will spend a lot of time waiting for the Gr-Gen engine to
finish. This suggests that a more efficient configuration can be
achieved. A decoding block consisting of 2 Gr-Gen engines,
one DFS and one Corr engine is proposed in [18], which pro-
cesses 2 logical qubits. Such a decoding block is labelled
a (4, 2, 1, 1)-decoding block, where the 4 corresponds to the
number of logical X and Z operators that are being protected.
Using these decoding blocks, an algorithm that requires L log-
ical qubits will require L Gr-Gen engines, L/2 DFS engines
and L/2 Corr engines.
Now recall that when using a distance d surface code, d
rounds of syndrome extraction are performed in one correc-
tion cycle. We call this a complete measurement cycle. In
order for the decoder to keep up with error correction, it must
finish its decoding before the subsequent complete measure-
ment cycle is finished, because otherwise errors might accu-
mulate and spread in an uncontrollable manner. If the decoder
does not finish before the subsequent complete measurement
cycle, we call this a timeout failure. One way that the im-
pact of timeout failures can be reduced is to ensure that they
are less likely than logical errors. That is, the probability of
a timeout failure of a decoding block per logical qubit must
be no greater than the probability of a logical failure. Using the
(4, 2, 1, 1)-decoding block, this implies that
$p_{\mathrm{ToE}}(d, p)/2 \le p_{\mathrm{Log}}(d, p)$,
where $p_{\mathrm{ToE}}(d, p)$ is the probability of a timeout failure in a
decoding block and $p_{\mathrm{Log}}(d, p)$ is the logical failure rate.
Implementing the Union-Find decoder on the micro-architecture
outlined above helps to meet this requirement as it reduces latency.
Specifically, assuming a surface code with distance
d = 11 is being used on a (4, 2, 1, 1)-decoding block with a
physical error rate of $10^{-3}$, a block of 2 logical qubits
can be decoded in 325 ns (see Figure 20). This is well below
the time it takes to perform a complete measurement cycle for
a surface code with these parameters, which is approximately
11 µs [52].
FIG. 20: A Monte-Carlo simulation of the micro-architecture
implementing the (4, 2, 1, 1)-decoding block, taken from [18].
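The bookkeeping in this subsection can be checked with a few lines of Python. The function names are ours, and rounding up to a whole block for an odd number of logical qubits is an assumption not stated in [18]. The numerical inputs are the 325 ns decoding time and the approximately 11 µs measurement cycle quoted above.

```python
import math


def engines_for(logical_qubits):
    """Engine counts when (4, 2, 1, 1)-decoding blocks each serve 2 logical qubits."""
    blocks = math.ceil(logical_qubits / 2)      # assumption: round up for odd L
    return {"gr_gen": 2 * blocks, "dfs": blocks, "corr": blocks}


def timeout_condition_holds(p_timeout_block, p_logical):
    # One block serves 2 logical qubits, hence the division by 2.
    return p_timeout_block / 2 <= p_logical


decode_ns = 325        # decoding time for a block of 2 logical qubits at d = 11
cycle_ns = 11_000      # approximate duration of a complete measurement cycle
margin = cycle_ns / decode_ns   # how many times faster the decoder is than the cycle
```

With these figures the decoder finishes roughly 34 times faster than one complete measurement cycle.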
E. NISQ+ Regime
The micro-architecture described in the previous section
is designed for fault tolerant machines. In contrast, we now
show the benefits of local computation on near term devices.
In [17] the authors propose an approximate decoding algo-
rithm to overcome the latency overhead problem. The impor-
tant point to make here is that they sacrifice some accuracy
in their decoder in order to be efficient. The classical decoder
is implemented using single flux quantum (SFQ) logic, which
is a form of classical logic implemented in superconducting hardware.
The decoder design is built out of a 2D array of modules
implemented in SFQ logic circuits; the 2D array represents the
qubits of the 2D surface code. Specifically, each module
represents either a data qubit or an ancilla qubit.
The decoding algorithm proceeds by first identifying the
two modules representing the two hot syndrome qubits that
are closest together. Next, a chain of modules representing
a path connecting these two qubits is recorded. Finally, the
two chosen modules are reset and the algorithm iterates on to
the next closest pair of hot qubits. The first part of the algo-
rithm that identifies the two closest hot qubits is performed
by implementing the cluster growth stage of the Union-Find
decoder. Moreover, the modules are hardwired so that only
certain paths between qubits can be recorded, leading to an
approximation of an ideal decoder.
One obvious problem with the algorithm as outlined above
occurs when there are multiple hot qubits that are equidistant.
To overcome this, the authors introduce a request-grant pol-
icy that allows the hardware to select a subset of pairs of the
equidistant hot qubits. Similarly, a hardware solution is pro-
vided for hot qubits that are close to the boundary of the sur-
face code lattice.
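The greedy structure of this algorithm can be sketched in software as follows. This is only an illustration under our own assumptions and not the SFQ design of [17]: hot qubits are given as (row, column) coordinates, distance is Manhattan distance, the recorded chain is a fixed L-shaped path standing in for the hardwired paths, ties are broken by sorted order rather than by a request-grant policy, and the lattice boundary is ignored.

```python
from itertools import combinations


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def l_shaped_path(a, b):
    """Cells visited going from a to b, first along rows and then along columns."""
    (r, c), (r_end, c_end) = a, b
    path = [(r, c)]
    while r != r_end:
        r += 1 if r_end > r else -1
        path.append((r, c))
    while c != c_end:
        c += 1 if c_end > c else -1
        path.append((r, c))
    return path


def greedy_pairing(hot):
    """Repeatedly pair the two closest hot qubits and record a chain between them."""
    remaining = sorted(hot)
    chains = []
    while len(remaining) >= 2:
        a, b = min(combinations(remaining, 2), key=lambda pair: manhattan(*pair))
        chains.append(l_shaped_path(a, b))
        remaining.remove(a)                # reset the two chosen modules
        remaining.remove(b)
    return chains
```

Restricting the chain to one fixed path shape mirrors, in a crude way, the approximation introduced by the hardwired modules: the pairing is fast to compute but is not guaranteed to match the output of an ideal decoder.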
F. Performance of the SFQ logic decoder
One of the metrics used in [17] to evaluate their decoding
protocol is simple quantum volume (SQV). This can be defined
as the product of the number of computational qubits
of a machine and the number of gates it is expected to be able to
perform without error. It is shown that in near term machines
(termed NISQ+ machines in [17]), their proposed decoding
protocol can increase the SQV. In particular, the authors
consider a device of 1000 physical qubits with an error rate of
$10^{-5}$, an extension of a machine that is predicted to exist in
the near term [53]. Using a surface code with distance 3 which
encodes 78 logical qubits, the authors show that their protocol
can increase the SQV from $10^5$ to $3.4 \times 10^8$. Similarly, a code
of distance 5 that encodes 40 logical qubits increases the SQV
to $1.12 \times 10^9$ (see Figure 21).
FIG. 21: Simple quantum volume of a near term device with
1000 physical qubits and an error rate of $10^{-5}$, taken from
[17].
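To make the SQV figures concrete, the short Python sketch below evaluates the product definition. The function name is ours. The model for the unencoded baseline, in which a machine of n qubits with error rate p runs for about 1/(np) error-free steps, is our assumption: it is one reading that reproduces the quoted baseline of about $10^5$, and is not taken from [17].

```python
def simple_quantum_volume(computational_qubits, error_free_steps):
    """SQV as the product of qubit count and error-free gate count."""
    return computational_qubits * error_free_steps


# Assumed baseline model: with n qubits each failing with probability p per
# step, an error is expected after about 1 / (n * p) steps.
n, p = 1000, 1e-5
baseline = simple_quantum_volume(n, 1 / (n * p))    # about 1e5

# Improvement factor implied by the distance 3 figure quoted from [17].
improvement = 3.4e8 / baseline                      # about 3400
```

Under this reading, the distance 3 encoding trades 1000 physical qubits for 78 logical qubits but gains enough error-free steps to raise the SQV by a factor of about 3400.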
Despite the approximate nature of the SFQ logic decoder,
the reduction in resources due to its speed compares
favourably to other decoding protocols, as can be seen in Figure 22.
This is in part because, if the ratio f of the syndrome generation
rate to the syndrome processing rate exceeds 1, the backlog
in the bottleneck increases the effective logical error rate,
as many more syndrome cycles are needed to process one logical
gate. Therefore it is clear that one needs to take into account
both the speed of the decoder and the latency overhead
when assessing the efficiency of a decoding protocol.
FIG. 22: Comparison of required code distances of different
decoders to execute an algorithm consisting of 100 T-gates,
taken from [17]. Compared are the SFQ Decoder, minimum
weight perfect matching decoder (MWPM) [54], neural
network decoder [55], union find decoder [45], and a
theoretical MWPM decoder with no backlog, across both
code distances and physical error rates.
G. The instruction bandwidth problem
In [56] a bottleneck caused by the instruction bandwidth
for quantum error correction (QEC) is considered. Any fault
tolerant device will require QEC, and it has been proposed that
this should be managed at the software level. The reason for
this is twofold. First, the optimum quantum error correcting
protocol for a quantum device is dependent on the hardware
qubit properties as well as the algorithm that is going to be
implemented. Therefore it is suggested that a QEC library
will be necessary in order to minimise the resources to run a
particular algorithm [57]. Second, quantum error correction is
an active research area, and so one would want to add to the
QEC library as more efficient codes are discovered.
However, controlling the QEC at the software level requires
the QEC instructions to be transmitted on the same channel as
the program instructions. This leads to an instruction bottle-
neck. Specifically, in [56], the quantum resource estimator
toolbox (QuRE) [58] is used to cost the instruction bandwidth
for error correction for seven quantum algorithms. They find
that at least 99.99% of the total instruction bandwidth is taken
up by the QEC instructions when the error correction is man-
aged at the software level (see Figure 23). To demonstrate
the problems that this can cause, consider a device composed
of superconducting qubits that operate at 100 MHz with
byte-sized quantum instructions. In order for a qubit to maintain its
integrity it must receive QEC instructions at roughly its operating
rate. Therefore each qubit requires 100 MHz × 1 byte = 100 MB/s of QEC
instructions. This implies that a quantum computer
with 100,000 qubits would require 10 TB/s of QEC instruction
bandwidth.
FIG. 23: Ratio of QEC operations to non-QEC operations for
seven quantum algorithms, taken from [56].
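The bandwidth estimate above is a simple product, reproduced here as a Python check. The function name is ours; the inputs (100 MHz operating rate, one-byte instructions, 100,000 qubits) are those quoted in the text, and decimal units (1 MB = $10^6$ bytes) are assumed.

```python
def qec_bandwidth_bytes_per_s(num_qubits, op_rate_hz, instruction_bytes=1):
    """Instruction bandwidth if every qubit receives one instruction per cycle."""
    return num_qubits * op_rate_hz * instruction_bytes


per_qubit = qec_bandwidth_bytes_per_s(1, 100_000_000)        # 1e8 B/s = 100 MB/s
total = qec_bandwidth_bytes_per_s(100_000, 100_000_000)      # 1e13 B/s = 10 TB/s
```

The estimate scales linearly in both the qubit count and the operating rate, which is why the requirement grows so quickly with device size.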
Large instruction bandwidth can be handled in traditional
computing systems by caching instructions. However, this can
lead to small instruction delays which are not acceptable for
QEC instructions. Even small delays (∼100 ns) in the
implementation of quantum error correction can lead to a build-up
of errors that renders any computation useless. Therefore
alternative solutions are necessary.
To overcome this instruction bandwidth bottleneck, an ar-
chitecture that delegates the task of QEC from the software to
the hardware has been proposed [56]. The control processor
in this architecture consists of a master controller and an array
of dedicated Micro-coded Control Engines (MCEs), which are
connected by a global data and instruction bus (see Figure 24).
This architecture is designed to distribute the instruction de-
livery for QEC across the MCEs. Each MCE manages a dedi-
cated “tile” of a small number of qubits and executes the quan-
tum error correction instructions without any software coordi-
nation.
FIG. 24: The control processor consisting of the master
controller and an array of MCEs, connected by a bus. Figure
taken from [56].
An MCE consists of an instruction pipeline, a microcode
pipeline, a quantum execution unit, and an error decoder
pipeline. The instruction pipeline delivers logical instructions
and translates them into physical instructions. The microcode
pipeline feeds these instructions to the quantum execution
unit. The microcode pipeline also stores the QEC instructions
in memory and feeds these to the quantum execution unit as
well. The quantum execution unit executes the instructions
that it receives. The error decoder pipeline is part of a two-
stage decoding process. It collects the results of syndrome
measurements and implements a simple look up table to cor-
rect single qubit errors. More complicated errors are dealt
with by a global decoder located in the master controller.
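The two-stage idea can be illustrated with a small Python sketch. As a stand-in for a tile of the code used in [56], we assume a five-qubit repetition code with parity checks between neighbouring qubits; this choice, the function names and the return values are all hypothetical. Syndromes of single bit-flips are resolved from the look up table, and anything else is passed on to the global decoder.

```python
def build_single_error_lut(num_data):
    """Map the syndrome of each single bit-flip to the affected data qubit.

    Check c compares data qubits c and c + 1, so a flip on qubit q
    triggers every check that touches q.
    """
    lut = {}
    for q in range(num_data):
        syndrome = tuple(int(q in (c, c + 1)) for c in range(num_data - 1))
        lut[syndrome] = q
    return lut


def local_decode(syndrome, lut):
    """First decoding stage, run locally in the error decoder pipeline."""
    if not any(syndrome):
        return ("none", None)              # nothing to correct
    if syndrome in lut:
        return ("local", lut[syndrome])    # single-qubit error, corrected locally
    return ("escalate", syndrome)          # left for the global decoder


lut = build_single_error_lut(5)
```

Here `local_decode((1, 1, 0, 0), lut)` identifies a flip on qubit 1, whereas the syndrome `(1, 0, 1, 0)` matches no single-qubit error and is escalated.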
The microcode pipeline consists of a microcode memory,
which stores the instructions for the QEC, and an address de-
coder. A micro-operation (µop) corresponding to a QEC in-
struction moves from the microcode memory to the address
decoder, which in turn delivers the µop to the quantum execu-
tion unit. Importantly, the capacity of the microcode memory
affects the number of qubits that an MCE can manage. The
necessary memory capacity of the MCE can be reduced by, in
part, taking advantage of the repetitive nature of the instruc-
tions necessary for the QEC [56]. This increases the number
of qubits managed by a single MCE by a factor of 90.
A comparison between the two architectures is given in
[56], where QuRE is used to calculate the global instruc-
tion bandwidth requirements for seven quantum algorithms
on each architecture. The baseline calculation represents the
QEC being managed by software. In this case the compiler
(or programmer) generates the physical instruction stream for
the QEC as well as the logical instructions. The architecture
with dedicated MCEs to manage the instruction stream for the
QEC is also costed; its global bandwidth comprises
the algorithm’s logical instructions as well as the master
controller’s synchronisation tokens. Across the algorithms
considered, a global bandwidth reduction of at least four orders
of magnitude is observed (see Figure 25). One can also al-
low for the MCEs to cache the logical instructions; software
managed instruction caches facilitate this process. This gives
extra savings of three times an order of magnitude, which are
also given in Figure 25.
FIG. 25: Global bandwidth savings using local computation,
taken from [56].
It is worth observing that the architectures were costed us-
ing several different syndrome extraction methods and differ-
ent technologies. The values given in Figure 25 are for Steane
style extraction using the projected gate latencies proposed by
DiVincenzo [59], which are often used in the QEC literature.
The other configurations of technology and extraction method
produce very similar numbers (the coefficient of variation
between the different configurations is 0.0002%).
V. PROVIDING ACCESS
In this paper we have considered three bottlenecks that will
have a substantial impact on near term quantum computing
performance due to the divide between the QPU and CPU. In
all three cases the link between the QPU and the CPU causes
limitations, namely latency introduced by the link or bandwidth
restrictions. A possible way to mitigate these bottlenecks is
to decrease the load on the link between the CPU and QPU
by removing the restriction that classical computation has to
be carried out on the CPU. We have shown that by moving
some of the light-weight, regularly executed classical com-
putation to the local computation unit within the QPU, these
bottlenecks can be eliminated.
Moving away from carrying out classical computation
solely on the CPU will require access to local computation
within the QPU for algorithm and software developers. This
will require a substantially different model for the execution
of hybrid programs. In many near term quantum computers
the local computation will be provided by FPGAs rather than
the more traditional CPU. This is due to the strict timing re-
quirements for operating the qubit I/O interfaces, for example,
the triggering of a given pulse sequence. If we want to use lo-
cal computation to mitigate bottleneck issues and to improve
qubit utilisation, allowing developers to deploy gateware to
these local computation units is vital albeit challenging. A
number of proposals exist to provide the required access, see
[60, 61].
[1] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon,
Joseph C. Bardin, Rami Barends, Rupak Biswas, Sergio Boixo,
Fernando G. S. L. Brandao, David A. Buell, Brian Burkett,
Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William
Courtney, Andrew Dunsworth, Edward Farhi, Brooks Foxen,
Austin Fowler, Craig Gidney, Marissa Giustina, Rob Graff,
Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J.
Hartmann, Alan Ho, Markus Hoffmann, Trent Huang, Travis S.
Humble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir
Kafri, Kostyantyn Kechedzhi, Julian Kelly, Paul V. Klimov,
Sergey Knysh, Alexander Korotkov, Fedor Kostritsa, David
Landhuis, Mike Lindmark, Erik Lucero, Dmitry Lyakh, Salvatore
Mandrà, Jarrod R. McClean, Matthew McEwen, Anthony
Megrant, Xiao Mi, Kristel Michielsen, Masoud Mohseni,
Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill,
Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John C.
Platt, Chris Quintana, Eleanor G. Rieffel, Pedram Roushan,
Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger, Vadim
Smelyanskiy, Kevin J. Sung, Matthew D. Trevithick, Amit
Vainsencher, Benjamin Villalonga, Theodore White, Z. Jamie
Yao, Ping Yeh, Adam Zalcman, Hartmut Neven, and John M.
Martinis. Quantum supremacy using a programmable super-
conducting processor. Nature, 574(7779):505–510, 2019.
[2] IBM. IBM Quantum experience. https://www.ibm.com/
quantum-computing/technology/experience. Accessed:
09/09/20.
[3] Rigetti. Rigetti. https://www.rigetti.com. Accessed:
09/09/20.
[4] Kevin J. Sung, Matthew P. Harrigan, Nicholas C. Rubin, Zhang
Jiang, Ryan Babbush, and Jarrod R. McClean. An exploration
of practical optimizers for variational quantum algorithms on
superconducting qubit processors, 2020, arXiv:2005.11011.
[5] Colin D. Bruzewicz, John Chiaverini, Robert McConnell, and
Jeremy M. Sage. Trapped-ion quantum computing: Progress
and challenges. Applied Physics Reviews, 6(2):021314, Jun
2019.
[6] Morten Kjaergaard, Mollie E. Schwartz, Jochen Braumüller,
Philip Krantz, Joel I.-J. Wang, Simon Gustavsson, and
William D. Oliver. Superconducting qubits: Current state of
play. Annual Review of Condensed Matter Physics, 11(1):369–
395, 2020, https://doi.org/10.1146/annurev-conmatphys-
031119-050605.
[7] E. Knill, D. Leibfried, R. Reichle, J. Britton, R. B. Blakestad,
J. D. Jost, C. Langer, R. Ozeri, S. Seidelin, and D. J. Wineland.
Randomized benchmarking of quantum gates. Physical Review
A, 77(1), Jan 2008.
[8] Benjamin Lévi, Cecilia C. López, Joseph Emerson, and D. G.
Cory. Efficient error characterization in quantum information
processing. Phys. Rev. A, 75:022314, Feb 2007.
[9] Jonas Helsen, Joel J. Wallman, Steven T. Flammia, and
Stephanie Wehner. Multiqubit randomized benchmarking us-
ing few samples. Phys. Rev. A, 100:032304, Sep 2019.
[10] Timothy Proctor, Kenneth Rudinger, Kevin Young, Mohan
Sarovar, and Robin Blume-Kohout. What randomized bench-
marking actually measures. Phys. Rev. Lett., 119:130502, Sep
2017.
[11] Kristan Temme, Sergey Bravyi, and Jay M. Gambetta. Error
mitigation for short-depth quantum circuits. Phys. Rev. Lett.,
119:180509, Nov 2017.
[12] Ying Li and Simon C. Benjamin. Efficient variational quantum
simulator incorporating active error minimization. Phys. Rev.
X, 7:021050, Jun 2017.
[13] Abhinav Kandala, Kristan Temme, Antonio D. Córcoles, Antonio
Mezzacapo, Jerry M. Chow, and Jay M. Gambetta. Error
mitigation extends the computational reach of a noisy quantum
processor. Nature, 567(7749):491–495, March 2019.
[14] Tudor Giurgica-Tiron, Yousef Hindy, Ryan LaRose, Andrea
Mari, and William J. Zeng. Digital zero noise extrapolation
for quantum error mitigation, 2020, arXiv:2005.10921.
[15] Daochen Wang, Oscar Higgott, and Stephen Brierley. Ac-
celerated variational quantum eigensolver. Phys. Rev. Lett.,
122:140504, Apr 2019.
[16] Dax Enshan Koh, Guoming Wang, Peter D. Johnson, and
Yudong Cao. A framework for engineering quantum likelihood
functions for expectation estimation. 2020, arXiv:2006.09349.
[17] Adam Holmes, Mohammad Reza Jokar, Ghasem Pasandi,
Yongshan Ding, Massoud Pedram, and Frederic T. Chong.
Nisq+: Boosting quantum computing power by approximating
quantum error correction. 2020, arXiv:2004.04794.
[18] Poulami Das, Christopher A. Pattison, Srilatha Manne, Dou-
glas Carmean, Krysta Svore, Moinuddin Qureshi, and Nico-
las Delfosse. A scalable decoder micro-architecture for fault-
tolerant quantum computing. 2020, arXiv:2001.06598.
[19] M. Mohseni, A. T. Rezakhani, and D. A. Lidar. Quantum-
process tomography: Resource analysis of different strategies.
Phys. Rev. A, 77:032322, Mar 2008.
[20] Joseph Emerson, Robert Alicki, and Karol Życzkowski. Scalable
noise estimation with random unitary operators. Journal of
Optics B: Quantum and Semiclassical Optics, 7(10):S347–S352,
Sep 2005.
[21] Christopher Granade, Christopher Ferrie, and D G Cory. Ac-
celerated randomized benchmarking. New Journal of Physics,
17(1):013042, Jan 2015.
[22] Robin Harper, Ian Hincks, Chris Ferrie, Steven T. Flammia, and
Joel J. Wallman. Statistical analysis of randomized benchmark-
ing. Physical Review A, 99(5), May 2019.
[23] Jonas Helsen, Xiao Xue, Lieven M. K. Vandersypen, and
Stephanie Wehner. A new class of efficient randomized bench-
marking protocols. npj Quantum Information, 5(1):71, Aug
2019.
[24] Andrew W Cross, Easwar Magesan, Lev S Bishop, John A
Smolin, and Jay M Gambetta. Scalable randomised benchmark-
ing of non-clifford gates. npj Quantum Information, 2(1), Apr
2016.
[25] Arnaud Carignan-Dugas, Kristine Boone, Joel J Wallman, and
Joseph Emerson. From randomized benchmarking experi-
ments to gate-set circuit fidelity: how to interpret random-
ized benchmarking decay parameters. New Journal of Physics,
20(9):092001, Sep 2018.
[26] Joel J. Wallman. Randomized benchmarking with gate-
dependent noise. Quantum, 2:47, Jan 2018.
[27] Alexander Erhard, Joel J. Wallman, Lukas Postler, Michael
Meth, Roman Stricker, Esteban A. Martinez, Philipp Schindler,
Thomas Monz, Joseph Emerson, and Rainer Blatt. Character-
izing large-scale quantum computers via cycle benchmarking.
Nature Communications, 10(1):5347, Nov 2019.
[28] A. K. Hashagen, S. T. Flammia, D. Gross, and J. J. Wallman.
Real randomized benchmarking. Quantum, 2:85, Aug 2018.
[29] Zhenyu Cai. Multi-exponential error extrapolation and combin-
ing error mitigation techniques for nisq applications, 2020.
[30] Sam McArdle, Xiao Yuan, and Simon Benjamin. Error-
mitigated digital quantum simulation. Phys. Rev. Lett.,
122:180501, May 2019.
[31] Suguru Endo, Simon C. Benjamin, and Ying Li. Practical quan-
tum error mitigation for near-future applications. Phys. Rev. X,
8:031027, Jul 2018.
[32] Andre He, Benjamin Nachman, Wibe A. de Jong, and Chris-
tian W. Bauer. Resource efficient zero noise extrapolation with
identity insertions, 2020, arXiv:2003.04941.
[33] Sergey Bravyi and Dmitri Maslov. Hadamard-free cir-
cuits expose the structure of the clifford group, 2020,
arXiv:2003.09412.
[34] Robert Koenig and John A. Smolin. How to efficiently
select an arbitrary clifford group element. Journal
of Mathematical Physics, 55(12):122202, 2014,
https://doi.org/10.1063/1.4903507.
[35] Christoph Dankert, Richard Cleve, Joseph Emerson, and Etera
Livine. Exact and approximate unitary 2-designs and their ap-
plication to fidelity estimation. Phys. Rev. A, 80:012304, Jul
2009.
[36] Easwar Magesan, J. M. Gambetta, and Joseph Emerson. Scal-
able and robust randomized benchmarking of quantum pro-
cesses. Physical Review Letters, 106(18), May 2011.
[37] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong
Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and
Jeremy L. O’Brien. A variational eigenvalue solver on a pho-
tonic quantum processor. Nature Communications, 5(1):4213,
2014.
[38] P. J. J. O’Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R.
McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding,
B. Campbell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth, A. G.
Fowler, E. Jeffrey, E. Lucero, A. Megrant, J. Y. Mutus, M. Nee-
ley, C. Neill, C. Quintana, D. Sank, A. Vainsencher, J. Wenner,
T. C. White, P. V. Coveney, P. J. Love, H. Neven, A. Aspuru-
Guzik, and J. M. Martinis. Scalable quantum simulation of
molecular energies. Phys. Rev. X, 6:031007, Jul 2016.
[39] Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, Maika
Takita, Markus Brink, Jerry M. Chow, and Jay M. Gambetta.
Hardware-efficient variational quantum eigensolver for small
molecules and quantum magnets. Nature, 549(7671):242–246,
2017.
[40] Robert M. Parrish, Edward G. Hohenstein, Peter L. McMahon,
and Todd J. Martínez. Quantum computation of electronic
transitions using a variational quantum eigensolver. Phys. Rev.
Lett., 122:230401, Jun 2019.
[41] Alán Aspuru-Guzik, Anthony D. Dutoi, Peter J. Love, and
Martin Head-Gordon. Simulated quantum computation of
molecular energies. Science, 309(5741):1704–1707, 2005,
https://science.sciencemag.org/content/309/5741/1704.full.pdf.
[42] Seth Lloyd. Universal quantum simula-
tors. Science, 273(5278):1073–1078, 1996,
https://science.sciencemag.org/content/273/5278/1073.full.pdf.
[43] Nathan Wiebe and Chris Granade. Efficient bayesian phase es-
timation. Phys. Rev. Lett., 117:010503, Jun 2016.
[44] Nathan Wiebe, Christopher Granade, Ashish Kapoor, and
Krysta M. Svore. Bayesian inference via rejection filtering.
June 2015.
[45] Nicolas Delfosse and Naomi H. Nickerson. Almost-linear
time decoding algorithm for topological codes. 2017,
arXiv:1709.06218.
[46] Nicolas Delfosse and Gilles Zémor. Linear-time maximum
likelihood decoding of surface codes over the quantum erasure
channel. 2017, arXiv:1703.01517.
[47] Daniel Gottesman. The Heisenberg representation of quantum
computers. 1998, arXiv:quant-ph/9807006.
[48] Adam Holmes, Sonika Johri, Gian Giacomo Guerreschi,
James S Clarke, and A Y Matsuura. Impact of qubit connec-
tivity on quantum algorithm performance. Quantum Science
and Technology, 5(2):025009, mar 2020.
[49] Joydip Ghosh, Austin G. Fowler, and Michael R. Geller. Sur-
face code with decoherence: An analysis of three superconduct-
ing architectures. Phys. Rev. A, 86:062318, Dec 2012.
[50] Christopher Chamberland and Pooya Ronagh. Deep neural de-
coders for near term fault-tolerant experiments. Quantum Sci-
ence and Technology, 3(4):044002, jul 2018.
[51] Austin Fowler. Towards sufficiently fast quantum error correc-
tion. Conference QEC, 2017.
[52] Craig Gidney and Martin Ekerå. How to factor 2048 bit rsa
integers in 8 hours using 20 million noisy qubits. 2019,
arXiv:1905.09749.
[53] Lev S Bishop, Sergey Bravyi, Andrew Cross, Jay M Gambetta,
and John Smolin. Quantum volume. Technical report, IBM,
2017.
[54] Austin G. Fowler, Matteo Mariantoni, John M. Martinis, and
A N Cleland. Surface codes: Towards practical large-scale
quantum computation. Physical Review A, 86:032324, 2012.
[55] Paul Baireuther, M. D. Caio, Ben Criger, C. W. J. Beenakker,
and Timothy E. O’Brien. Neural network decoder for topologi-
cal color codes with circuit level noise. New Journal of Physics,
21:013003, 2019.
[56] S. S. Tannu, Z. A. Myers, P. J. Nair, D. M. Carmean, and M. K.
Qureshi. Taming the instruction bandwidth of quantum com-
puters via hardware-managed error correction. In 2017 50th
Annual IEEE/ACM International Symposium on Microarchitec-
ture (MICRO), pages 679–691, 2017.
[57] Thomas Häner, Damian S Steiger, Krysta Svore, and Matthias
Troyer. A software methodology for compiling quantum
programs. Quantum Science and Technology, 3(2):020501,
Feb 2018.
[58] M. Suchara, J. Kubiatowicz, A. Faruque, F. T. Chong, C. Lai,
and G. Paz. Qure: The quantum resource estimator toolbox. In
2013 IEEE 31st International Conference on Computer Design
(ICCD), pages 419–426, 2013.
[59] David P. DiVincenzo. The physical implementation of quantum
computation. Fortschritte der Physik, 48(9–11):771–783, 2000.
[60] Riverlane. Deltaflow®. https://www.riverlane.com/
products/. Accessed: 07/08/20.
[61] Quantum Machines. Quantum Machines. https://www.
quantum-machines.co. Accessed: 07/08/20.
