A Scalable Decoder Micro-architecture for Fault-Tolerant Quantum
  Computing by Das, Poulami et al.
A Scalable Decoder Micro-architecture for Fault-Tolerant Quantum Computing
Poulami Das,1 Christopher A. Pattison,2 Srilatha Manne,3 Douglas
Carmean,3 Krysta Svore,3 Moinuddin Qureshi,1 and Nicolas Delfosse∗3
1Georgia Institute of Technology, Atlanta, GA, USA
2California Institute of Technology, Pasadena, CA, USA
3Microsoft Quantum Systems and Microsoft Research, Redmond, WA, USA
Quantum computation promises significant computational advantages over classical computation
for some problems. However, quantum hardware suffers from much higher error rates than in classical
hardware. As a result, extensive quantum error correction is required to execute a useful quantum
algorithm. The decoder is a key component of the error correction scheme whose role is to identify
errors faster than they accumulate in the quantum computer and that must be implemented with
minimum hardware resources in order to scale to the regime of practical applications. In this work,
we consider surface code error correction, which is the most popular family of error correcting codes
for quantum computing, and we design a decoder micro-architecture for the Union-Find decoding
algorithm. We propose a three-stage fully pipelined hardware implementation of the decoder that
significantly speeds up the decoder. Then, we optimize the amount of decoding hardware required
to perform error correction simultaneously over all the logical qubits of the quantum computer. By
sharing resources between logical qubits, we obtain a 67% reduction of the number of hardware
units and the memory capacity is reduced by 70%. Moreover, we reduce the bandwidth required for
the decoding process by a factor of at least 30× using low-overhead compression algorithms. Finally,
we provide numerical evidence that our optimized micro-architecture can be executed fast enough
to correct errors in a quantum computer.
Quantum computing promises significant speed-up
over conventional computers for specific applications such
as integer factorization [1], physics and chemistry simu-
lations [2–4] and database search [5].
The primary obstacle to the implementation of quan-
tum algorithms solving industrial size problems is the
high noise rate in any quantum device that makes the
output of a quantum computation indistinguishable from
a random output. A fault-tolerant quantum computer,
in which quantum bits, or qubits, are regularly refreshed
by quantum error correction [6], is necessary to perform
a useful computation on noisy quantum hardware.
Most classical error correction schemes can be adapted
to the quantum setting thanks to the CSS construc-
tion [7, 8] and the stabilizer formalism [9], providing a
quantum version of standard families of classical error-
correcting codes [10] such as repetition codes, Hamming
codes, Reed-Muller codes, BCH codes, LDPC codes and
polar codes.
The main difference with the classical setting is the
very high noise rate that affects current quantum hard-
ware, often of the order of 1% per quantum gate, which
makes quantum error correction much more challenging
to implement. This is because it relies on the measure-
ment of quantum parity checks that are likely to intro-
duce additional noise to the qubits. Moreover, one must
be able to implement a universal set of quantum gates on
encoded qubits, or logical qubits. In a fault-tolerant quantum
computer (FTQC), the computation is performed by alternating
logical quantum gates and quantum error correction cycles that
remove the noise injected by logical gates. Both quantum error
correction and logical quantum gates must be implemented through
fault-tolerant gadgets [11–13] to avoid the injection of
pathological error configurations that would be uncorrectable by
the subsequent correction cycle.
∗nidelfos@microsoft.com
In this work, we consider a fault-tolerant quantum
computer based on surface codes [14, 15], which form the
most promising family of error-correcting codes for very
noisy quantum hardware. Surface codes can correct up to
1% of noise on all the basic components of the quantum
computer and they can be implemented on a square grid
of qubits using exclusively local quantum gates acting on
nearest neighbor qubits. For comparison, most quantum
error correction codes, such as the quantum Hamming code
(Steane code), require an error rate below 10^{-5} [16].
In this paper, we focus on the design of an error de-
coder, or simply decoder, which is the primary building
block in charge of error correction. The decoder takes as
an input the syndrome, which is the data extracted from
quantum parity check measurements and it returns an
estimation of the error. Given this estimation, the effect
of the error can be easily reversed. In a fault-tolerant
quantum computer, the decoder must satisfy the three
following design constraints.
1. Accuracy Constraint: The decoder must cor-
rectly identify the error with high probability.
2. Latency Constraint: The decoder must be fast
enough to correct errors within one error correction
cycle.
3. Scalability Constraint: The decoder design
must be feasible to implement in the regime of practical
applications that may require millions of physical qubits.
arXiv:2001.06598v1 [quant-ph] 18 Jan 2020
A number of surface code decoding algorithms have been
proposed in the past 20 years [14, 17–68] and many satisfy
the Accuracy Constraint which is often the main motiva-
tion. However, it remains unclear whether the decoding
can be made fast enough to satisfy the Latency Con-
straint without degrading the decoder performance [69].
Moreover, the decoding problem is generally studied for
a single logical qubit, ignoring the Scalability Constraint,
whereas practical applications require hundreds or thousands
of logical qubits encoded into millions of physical
qubits [3]. A substantial amount of decoding hardware
is required to decode simultaneously all the qubits of
the quantum computer. Most of the work on quantum error
correction decoders focuses on algorithmic aspects of
the problem. Here, we consider this problem through the
lens of computer architecture and we propose a decoder
micro-architecture that satisfies simultaneously our three
design constraints.
[Figure 1 diagram. Panel "Individual Error Decoders": one decoder attached to each logical qubit. Panel "Error Decoding Architecture (EDA)": N-qubit decoder blocks, each shared by several logical qubits, reached through a high bandwidth interface with syndrome compression and syndrome decompression.]
Figure 1: A naive decoding architecture with a decoder associated
with each logical qubit and our Error Decoding Architecture
based on optimized decoder blocks shared across multiple
logical qubits. A low-overhead compression algorithm is
used to reduce the bandwidth cost. Our optimized decoder
block is described in Fig. 12.
We propose a decoder micro-architecture based on the
Union-Find decoding algorithm [30, 43]. We choose the
Union-Find decoder for its accuracy and its simplicity. It
is proven to achieve good decoding performance. Moreover,
it comes with an almost-linear time complexity and it
is also very fast in practice because it requires no floating-
point arithmetic and no matrix operations. Finally, the
simplicity of the decoder allows us to design a special-
ized hardware implementation that leads to a significant
speed-up. Our main contributions are the following.
1. We propose a hardware implementation of the
Union-Find decoder based on three hardware units
corresponding to each step of the algorithm.
2. We design a three-stage pipeline based on our
three hardware units that brings an important
speed-up to the Union-Find decoder by paralleliz-
ing the implementation of the decoding stages.
3. We observe that the utilization of different com-
ponents of the Union-Find decoder pipeline varies
with the unit type and across the logical qubits
and hence propose an efficient time-division multi-
plexing that allows logical qubits to share decoding
resources within a decoder block without compro-
mising the error correction capabilities.
4. We propose different compression algorithms
adapted to a cryogenic setting in order to reduce
the bandwidth consumed to send the syndrome
data to the decoder.
5. Combining all the previous results, we describe an
Error Decoding Architecture (EDA) represented in
Fig. 1 that scales the decoder design for a large
FTQC while reducing hardware costs. The number
of hardware units is reduced by 67% and we obtain
a 30× bandwidth reduction while preserving the
decoder accuracy. Moreover, we demonstrate by
numerical simulations that our architecture leads to
a decoder that is fast enough to satisfy the Latency
Constraint, despite the high noise rate of quantum
hardware.
Items 1 and 2 provide a hardware acceleration of the
decoder, and items 3 and 4 lead to a satisfying solution to the
Scalability Constraint. Our EDA is optimized carefully
in order to guarantee that the error correction capability
of the initial UF decoder is preserved, which guarantees
that the Accuracy Constraint is satisfied. The numerical
simulation of our optimized micro-architecture (item 5)
accounts for the limitation imposed by shared hardware
resources, which demonstrates that our decoder satisfies
simultaneously the three design constraints.
This article exploits a number of ideas from computer
architecture such as pipelining and resource optimiza-
tion. We believe this approach is necessary in order to
scale quantum machines. Even though we provide a de-
tailed micro-architecture for the Union-Find decoder, the
general principles of our design apply to any decoding al-
gorithm. A key ingredient is the simplicity of the decoder
and the decomposition in independent steps, which leads
to a natural pipeline and a speed-up by instruction-level
parallelism.
I. BACKGROUND AND MOTIVATION
A. Qubits and decoherence
We refer to Preskill’s lecture notes [70] and Nielsen
and Chuang’s book [71] for a great overview of the field
of quantum computation and quantum information theory.
A qubit is the basic unit of information in a quantum
computer. A qubit is described by a complex vector
|ψ〉 = α|0〉 + β|1〉, which represents the superposition
of two basis states |0〉 and |1〉, with α, β ∈ C such that
|α|^2 + |β|^2 = 1. Without error correction, a quantum
state rapidly decoheres due to the accumulation of tiny
rotations. By constantly measuring the system, one can
project these tiny rotations onto three types of errors
denoted X, Y and Z and called Pauli errors. The bit-flip
error X swaps the basis states |0〉 and |1〉 and maps the
qubit |ψ〉 into β|0〉 + α|1〉. The phase-flip error Z is
defined by Z|ψ〉 = α|0〉 − β|1〉. It introduces a relative phase
between the two basis states. The error Y corresponds to
a simultaneous bit-flip and phase-flip error, i.e., Y = XZ
up to an overall phase which does not affect the result of
the computation. We use the notation I for the identity
operation that corresponds to the error-free case.
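As a minimal illustration of these definitions, the following Python sketch applies the Pauli errors to a qubit stored as a pair of amplitudes (α, β). The function names and the tuple representation are assumptions made for this example only. The phase convention Y = iXZ used here is the standard one.

```python
def apply_x(state):
    """Bit-flip error X: swap the amplitudes of |0> and |1>."""
    alpha, beta = state
    return (beta, alpha)


def apply_z(state):
    """Phase-flip error Z: negate the amplitude of |1>."""
    alpha, beta = state
    return (alpha, -beta)


def apply_xz(state):
    """Apply Z first and then X, i.e. the operator product XZ."""
    return apply_x(apply_z(state))


def apply_y(state):
    """Pauli Y with the usual convention Y|0> = i|1>, Y|1> = -i|0>."""
    alpha, beta = state
    return (-1j * beta, 1j * alpha)


# Example state with |alpha|^2 + |beta|^2 = 1.
psi = (0.6, 0.8)
y_psi = apply_y(psi)
xz_psi = apply_xz(psi)
# Y and XZ agree up to the overall phase i.
same_up_to_phase = all(abs(a - 1j * b) < 1e-12 for a, b in zip(y_psi, xz_psi))
```

Since the overall phase does not affect measurement outcomes, correcting a Y error amounts to correcting an X error and a Z error.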
By definition of Y, it is enough to correct X-type and
Z-type errors. In this work, we focus on the correction
of X-type errors. By symmetry of the error correction
scheme considered in this article, Z-type errors can be
corrected using the exact same mechanisms by swapping
the roles of X and Z. For simplicity, we assume that the
correction of X-type and Z-type errors is performed
independently, although a performance gain is possible by
taking into account the correlations between X and Z
[72–74].
B. Surface codes
In order to combat decoherence, we must perform the
computation on encoded data, also called logical data,
which is corrected at regular time intervals using a quan-
tum error correcting code [6–9]. In this work, we focus
on the surface code [15, 75] which is the most promising
quantum error correcting code for a quantum computing
architecture due to its high error threshold, which means
that the error correction protocol can be implemented
with very noisy quantum hardware. An error rate below
1% for all the components of the quantum computer is
sufficient in order to obtain encoded qubits with better
quality than our initial physical qubits [76–78]. In or-
der to scale to the massive sizes required for practical
applications, it is necessary to build quantum hardware
whose fault rate is far below the 1%-threshold. This is
because error correction does not decrease the error rate
sufficiently if the initial qubit error rate is too close to
the threshold. In this work, we consider a noise rate of
10^{-3}.
The family of surface codes is the most widely considered
candidate for designing a fault-tolerant quantum
computer. The distance-d surface code, which we denote
SC(d), encodes a logical qubit into a square grid of
(2d − 1) × (2d − 1) qubits, alternating data qubits, which
store the logical information, and ancilla qubits, used to
detect errors, as shown in Figure 2 with the distance-three
surface code SC(3). The main reason for the success of
the surface code is its locality, which significantly simplifies
the quantum chip design. Error correction with surface
codes only requires local interactions between ancilla
qubits and their nearest neighbor data qubits, that is at
most four qubits. The minimum distance d of the code
measures the error correction capability. A larger minimum
distance d results in an increased error tolerance at
the price of an increased qubit overhead.
[Figure 2 diagram. Legend: data qubit, Z-type ancilla qubit, X-type ancilla qubit. Panel (b) marks X errors on data qubits.]
Figure 2: (a) Distance-three surface code SC(3). Data qubits
store the logical information and X-type (resp. Z-type) ancilla
qubits are used to extract the syndrome of X-type (resp.
Z-type) errors. (b) A set of X-errors detected by non-trivial
syndrome values on red nodes.
Encoding physical qubits that suffer from an error rate
p using a distance-d surface code, we obtain a logical
qubit with error rate
p_Log(d, p) = 0.15 · (40 · p)^{(d+1)/2}    (1)
that we call the logical error rate. This heuristic formula,
derived from the numerical results of [79], provides a good
approximation of the logical error rate in the regime of
low error rate (p << 10^{-2}). It is valid in the context
of the current work, that is when using the Union-Find
decoder and for the phenomenological noise model introduced
in Section I C.
Throughout this article, we illustrate our design with
numerical results for distance-11 surface codes, which is a
reasonable distance for a first generation of fault-tolerant
quantum computers since it allows one to implement non-trivial
quantum algorithms on logical qubits while keeping
the qubit overhead to a few hundred qubits per logical
qubit. Assuming a physical error rate of p = 10^{-3},
the logical qubit error rate drops to p_Log ≈ 6 · 10^{-10},
allowing for the implementation of large depth quantum
algorithms.
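The numbers quoted above can be checked with a few lines of Python. This is only a sketch of the heuristic formula (1) and of the (2d − 1) × (2d − 1) qubit count. The function names are chosen for this example and are not part of the design.

```python
def logical_error_rate(d, p):
    """Heuristic logical error rate of Eq. (1): 0.15 * (40 p)^((d + 1) / 2)."""
    return 0.15 * (40 * p) ** ((d + 1) / 2)


def physical_qubits(d):
    """Data plus ancilla qubits in the (2d - 1) x (2d - 1) grid of SC(d)."""
    return (2 * d - 1) ** 2


p_log = logical_error_rate(11, 1e-3)  # about 6.1e-10
overhead = physical_qubits(11)        # 441 qubits per logical qubit
```

For d = 11 and p = 10^{-3}, the base 40 · p equals 0.04 and the exponent equals 6, which gives 0.15 · 0.04^6 ≈ 6.1 · 10^{-10}, in agreement with the value stated in the text.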
C. The decoding problem
In this section, we review the decoding problem and
the graphical formalism introduced in [75].
Quantum error correction is a two-step process. First,
a measurement circuit is executed producing a syndrome
bit for each ancilla qubit. All syndrome bits can be ex-
tracted simultaneously. Then, a decoding algorithm is
used to identify errors on data qubits based on the syn-
drome information. To avoid any confusion with other
decoding operations used in this architecture, we some-
times refer to the decoder as the error decoder. The
decoding subroutine is a purely classical operation that
can be performed on highly reliable hardware. On the
other hand, the syndrome extraction is implemented on
noisy quantum hardware. In order to obtain sufficiently
accurate information about the errors occurring on data
qubits despite measurement errors, multiple rounds of
syndrome extraction are performed. The decoder ana-
lyzes d consecutive rounds of syndrome data to produce
an estimation of the error induced on the data qubits.
Figure 3: (a) The planar decoding graph corresponding to
a single round of syndrome extraction without measurement
error. Qubits are supported on edges and syndrome bits cor-
respond to vertices. Red edges mark the presence of X-type
errors on data qubits and red nodes show the non-trivial syn-
drome bits extracted. (b) To fight against measurement er-
rors, three rounds of syndrome extraction are performed, rep-
resented by three layers of the graph (a) connected by vertical
edges corresponding to measurement errors. Errors are represented
by red edges and the endpoints of an error path are
detected by a non-trivial syndrome (red node). No syndrome
bit is extracted on the left and right boundaries of the lattice
nor on the bottom and top boundary.
In the absence of measurement errors, the decoding
problem can be mapped onto a matching problem in a
square grid as shown in Fig. 3. A bit-flip error X act-
ing on a single data qubit is detected by a non-trivial
syndrome bit on the incident X-type ancilla qubits as
Fig. 2(b) shows. More generally, a chain of X-errors is
detected by its endpoints. In order to recover the chain
of X-errors given the syndrome values (its endpoints),
the basic idea of the decoding algorithm is to build a
short chain of X-type errors that matches the detected
endpoints. Errors on the boundary of the lattice are de-
tected only on one of their endpoints.
To address the issue of measurement errors, d rounds of
syndrome extraction are performed, resulting in a matching
problem in a three-dimensional graph [75]. Fig. 3(b)
shows the cubic graph representing errors and syndrome
bits for the distance-3 surface code. In what follows, we
refer to this graph as the decoding graph.
We simulate errors occurring on data qubits and incor-
rect syndrome values using the phenomenological noise
model [75]. Each edge of the decoding graph corresponds
to a potential error, with horizontal edges representing
X-errors on data qubits and vertical edges correspond-
ing to syndrome bit-flips. We assume that an error oc-
curs on each edge, independently, with probability p. For
each vertex v in the bulk of the lattice, a syndrome value
s(v) ∈ {0, 1} is extracted, which is the parity of the number
of errors incident to v. Just like in the planar case,
the goal of the decoder is to estimate the error by matching
together the vertices v supporting a non-trivial syndrome
s(v) = 1. This formalism allows us to handle both
qubit errors and measurement errors in the same way.
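The noise model and the syndrome rule described above can be sketched in Python as follows. The graph used here is a toy one-dimensional chain with two boundary vertices, chosen only to keep the example short. The names `sample_errors`, `syndrome`, `edges` and `bulk` are assumptions of this illustration, not part of the decoder design.

```python
import random


def sample_errors(edges, p, rng):
    """Flip each edge of the decoding graph independently with probability p."""
    return [edge for edge in edges if rng.random() < p]


def syndrome(errors, bulk_vertices):
    """Return the bulk vertices incident to an odd number of error edges."""
    parity = {v: 0 for v in bulk_vertices}
    for u, v in errors:
        for endpoint in (u, v):
            if endpoint in parity:  # boundary vertices carry no syndrome bit
                parity[endpoint] ^= 1
    return {v for v, bit in parity.items() if bit}


# Toy decoding graph: boundary "L", bulk vertices 0..3, boundary "R".
edges = [("L", 0), (0, 1), (1, 2), (2, 3), (3, "R")]
bulk = [0, 1, 2, 3]

chain = syndrome([(0, 1), (1, 2)], bulk)   # a chain is seen at its endpoints
at_boundary = syndrome([("L", 0)], bulk)   # only one endpoint is detected
random_errors = sample_errors(edges, 1e-3, random.Random(0))
```

The same two functions apply unchanged to the three-dimensional decoding graph, where vertical edges stand for syndrome bit-flips and horizontal edges for X-errors on data qubits.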
The relevance of the phenomenological noise model is
justified in [75]. For further study of our decoding ar-
chitecture tailored to a specific type of qubit, one may
consider a more precise noise model such as the circuit-
level noise model [75]. In this work, we chose the phe-
nomenological noise model because it is simple enough to
develop a good intuition about the decoder and it cap-
tures the essential properties of the quantum hardware
which guarantees that all the ideas proposed in the cur-
rent work generalize to a more precise model and remain
relevant for a practical device.
D. Existing Error Decoding Algorithms
In this section, we discuss different decoding strategies.
The noise model describes the probability of all possible
errors. Given this information, we can derive the proba-
bility of each error when a given syndrome is observed.
The ultimate goal of the decoder is to return an error
whose probability is maximal given the syndrome mea-
sured, i.e., a most likely error. A decoder that achieves
this performance is said to be a maximum likelihood de-
coder [75]. For an arbitrary error-correcting code, it is
generally not possible to implement efficiently a maxi-
mum likelihood decoder [80]. However in the case of the
surface code, several algorithms achieve a good error cor-
rection performance. In what follows, we discuss several
promising decoding strategies.
Lookup Table (LUT) decoder [81]: This de-
coder implements a maximum likelihood decoder using
a lookup table indexed by the syndrome bits. The cor-
responding LUT entry stores the correction to be ap-
plied to the data qubits. The LUT size grows exponen-
tially with code distance making this design unsuitable
for large FTQCs.
Minimum Weight Perfect Matching (MWPM)
decoder [14]: The MWPM decoder provides an estima-
tion of a most likely error based on a graph pairing algo-
rithm, the MWPM algorithm, that can be implemented
in polynomial time [82]. This decoder is one of the most
effective in terms of error correction capabilities, even
though its worst-case time complexity, O(|V|^3) = O(d^9)
where |V| is the size of the decoding graph, makes it too
slow for large-distance surface codes. Fowler suggested
a parallel implementation of this algorithm that reduces
the average time complexity to O(1), although the worst-case
complexity remains significant [20]. This decoder relies
on large amounts of parallelism from several ASICs
for each logical qubit, but this study does not discuss the
system architecture or the number of ASICs needed.
Machine Learning (ML) decoder: ML decoders
train neural networks with the underlying error probability
distribution, and decoding is then treated as an
inference problem where the syndrome data is an input
to the neural network, which infers the correction [50–68].
ML decoders require substantial computational resources
and the size of the training data grows quickly
with the code distance. They also require long training
times and are primarily studied for small code distances.
There exist some preliminary studies for larger
code distances [56, 59] and proposals to obtain better
performance through distributed neural networks [64]
and hardware platforms [56] such as GPUs, FPGAs, and
TPUs [83].
Tensor Network (TN) decoder [46–49]: The probability
of each possible error can be represented as a tensor
network, which leads to a decoding algorithm that contracts
this tensor network. The contraction of the tensor
network requires extensive matrix operations which
may be hard to scale. However, the algorithm achieves a
very good error-correction performance that is optimal
[47] or quasi-optimal. Although it has not been studied
precisely, tensor network decoders could benefit from
a hardware speed-up from neural accelerators such as a
TPU [83].
Union-Find (UF) decoder [79, 84]: This is a recently
proposed algorithm that offers a correction in
almost-linear time O(nα(n)), where α(n) is ≤ 5 for all
practical purposes. It uses the Union-Find data structure in
order to achieve a performance similar to that of the MWPM
decoder without using computationally intensive matching
algorithms.
Table I: Abstract comparison of decoder accuracy, latency and
scalability (adapted from [64])
Decoder   Accuracy            Latency    Scalability
LUT       Very high           Low        Poor
TN        Very high           Moderate   Moderate
MWPM      High to very high   Moderate   Moderate
ML        High                Low        Moderate
UF        High                Low        High
Table I provides an abstract comparison of prominent
decoding algorithms in terms of the three key design
constraints highlighted in the introduction: accuracy, latency
and scalability.
Given the simplicity of the algorithm and low time-
complexity, we use the UF decoder as the default algo-
rithm for our studies. However, the design principles, op-
timizations, and scalability analysis of the present work
will hold true for other decoders as well. Similarly, the
syndrome compression analysis in Section IV applies to
any decoder-quantum substrate interface irrespective of
the decoder and qubit technology in use.
E. Union-Find decoder
In this section, we review the strategy of the Union-
Find decoder [79, 84]. As explained in the introduction,
our principal motivation for choosing this decoding algo-
rithm is its rapidity and its simplicity.
Figure 4: The three main steps of the Union-Find decoder
for a 2d decoding graph with the surface code SC(7). (a) An
X-type error and its syndrome in the decoding graph. The
goal is to recover the error given the red syndrome nodes. To
mark half-edges we will add a vertex in the middle of each
edge of the decoding graph. (b) Cluster Growth: We keep
growing clusters around the red nodes by adding half-edges in
all directions. The growth of a cluster stops when it contains
an even number of red nodes or if it meets the boundary. The
top cluster grows only one step. The bottom cluster requires
two growth steps. (c) Spanning Tree: Build a spanning
tree for each grown cluster in the decoding graph. Ignore
half-edges. (d) Peeling: Identify the error on the edges of
the spanning trees from the leaves to the root.
The Union-Find decoder operates in three steps that
we review in Figure 4. We illustrate the decoding procedure
using a two-dimensional decoding graph since there
is no major difference with the cubic case that is relevant
in practice. The algorithm takes as an input a set
of nodes supporting non-trivial syndrome values and its
goal is to recover the error living on the edges of the decoding
graph. (1) During the first step, even clusters are
grown around non-trivial syndrome nodes. (2) Then, a
spanning tree is built for each cluster and is oriented from
a root to the leaves. (3) Finally, we build an estimation
of the error using the syndrome by traversing each spanning
tree in reverse order, from the leaves to the root.
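The last two steps can be sketched in Python under simplifying assumptions: the cluster is already grown, it contains an even number of syndrome nodes, and it does not touch the boundary. The spanning tree is built with a stack-based traversal in the spirit of depth first search. The function names and the edge-list representation are hypothetical and do not describe the hardware units presented later.

```python
def spanning_tree(cluster_edges, root):
    """Return tree edges (parent, child), listed from the root outwards."""
    adjacency = {}
    for u, v in cluster_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    visited = {root}
    stack = [root]
    tree = []
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                tree.append((node, neighbor))
                stack.append(neighbor)
    return tree


def peel(tree, syndrome_nodes):
    """Traverse the tree from the leaves to the root and collect error edges."""
    pending = set(syndrome_nodes)
    correction = []
    # In reversed order, every edge below a child is handled before the
    # edge that links this child to its parent.
    for parent, child in reversed(tree):
        if child in pending:
            correction.append((parent, child))
            pending.discard(child)
            pending ^= {parent}  # flip the syndrome bit of the parent
    return correction


# A cluster forming a path 0-1-2-3 with non-trivial syndrome on nodes 1 and 3.
tree = spanning_tree([(0, 1), (1, 2), (2, 3)], root=0)
correction = peel(tree, {1, 3})
```

In this example the estimated error consists of the edges (1, 2) and (2, 3), whose endpoints are exactly the two syndrome nodes.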
II. HARDWARE DESIGN AND PIPELINING
In this section, we design three hardware units imple-
menting the three stages of the Union-Find decoder as
Figure 5 shows. The Graph Generator (Gr-Gen) produces
the grown clusters obtained in Figure 4(b). The
Depth First Search (DFS) engine generates the spanning
forest from Figure 4(c). The correction is implemented
as in Figure 4(d) by the Correction (Corr) engine.
To reduce the bandwidth and latency issues it is more
favorable to operate the decoding circuitry close to the
quantum substrate in a cryogenic regime. Our main de-
sign constraint is the limited hardware resources available
in a cold environment. We optimize our design to mini-
mize the memory requirement and the number of memory
reads.
Our implementation of the decoding algorithm benefits
from hardware acceleration in two ways. First, a
fully pipelined design allows performance improvement
through enhanced parallelism across the different processing
units. While the correction engine works on cluster
i, the DFS engine can build the spanning tree for
cluster i + 1. Second, in a general purpose processor, the
read latency depends upon where the data is present and
it can range up to several hundreds of cycles if it needs to
be fetched from the off-chip main memory to the on-chip
caches. For our specialized hardware, the processing elements
can directly access the data stored on-chip, which
requires far fewer cycles. In this work, we assume a
readout time of four cycles to read 32-bit data.
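To give a rough feeling for the benefit of pipelining, the sketch below compares a sequential schedule with a three-stage pipeline in Python. The cycle counts are hypothetical placeholders, not measurements of our design, and the model assumes that a stage can start a work item as soon as the previous stage has finished it and the stage itself is free.

```python
def sequential_finish_time(stage_cycles):
    """Total cycles when every work item passes through all stages alone."""
    return sum(sum(item) for item in stage_cycles)


def pipelined_finish_time(stage_cycles):
    """Total cycles when the stages overlap on consecutive work items.

    stage_cycles[i] holds the (Gr-Gen, DFS, Corr) cycle counts of item i.
    """
    stage_free_at = [0, 0, 0]
    for item in stage_cycles:
        ready = 0  # time at which the previous stage released this item
        for stage, cycles in enumerate(item):
            start = max(ready, stage_free_at[stage])
            stage_free_at[stage] = start + cycles
            ready = stage_free_at[stage]
    return stage_free_at[-1]


# Three work items with hypothetical per-stage costs of 4, 3 and 2 cycles.
workload = [(4, 3, 2)] * 3
sequential = sequential_finish_time(workload)  # 27 cycles
pipelined = pipelined_finish_time(workload)    # 17 cycles
```

In steady state the throughput of the pipeline is set by its slowest stage, which is one motivation for balancing the load of the three units.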
[Figure 5 diagram: Syndrome → Graph Generator (Gr-Gen) → Depth First Search Engine (DFS) → Correction Engine (Corr) → Error Log.]
Figure 5: Block Diagram of UF decoder pipeline with three
units corresponding to the three steps of the UF decoding
algorithm.
A. Graph Generator
The Graph Generator module takes the syndrome as
an input and generates a spanning forest by growing
clusters around non-trivial syndrome bits (non zero syn-
drome bits). Instead of growing all surrounding edges
as in Figure 4(b) we only add the edges that reach new
vertices. This directly produces a spanning forest with-
out extra cost. The spanning forest is built using two
fundamental graph operations: Union() and Find() [85].
Figure 6 shows the design of the three modules that
implement the decoding algorithm. The Gr-Gen module
consists of the Spanning Tree Memory (STM), a Zero
Data Register (ZDR), a root table, a size table, parity
registers, and a fusion edge stack (FES)¹. The size of
each component is a function of the code distance d. The
STM stores one bit for each vertex, and two bits per
edge. Two bits per edge are required since clusters grow
around a vertex or existing cluster boundary by half-edge
width as per the original decoding algorithm. The ZDR
indicates whether a row of the STM contains all zeros
by storing a "0" for rows where all bits are 0, and a
"1" for rows where there are one or more non-zero bits.
Since the total number of edges in the spanning forest is
generally small, the ZDR accelerates the STM traversal.
The fusion edge stack (FES) stores the newly grown edges
so that they can be added to existing clusters. The root
table and size table store the root and size of a cluster,
respectively.
[Figure 6 diagram. Modules: Graph-Generator (Gr-Gen), Depth-First Search (DFS) Engine, Correction (Corr) Engine. Component labels shown: Spanning Tree Memory (STM) with vertex (V) and edge (E) bits, Zero Data Register (ZDR), Root Table, Size Table, Parity and Traversal Registers, Fusion Edge Stack (FES), Control Logic, Read/Write Interface, Finite State Machine, Edge Stack (S0), Alternate Edge Stack (S1), Pending Edge Stack, Syndrome Hold Registers, Error Log.]
Figure 6: Block diagram for the whole decoding pipeline
including the three decoding modules.
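The role of the ZDR can be illustrated with a short Python sketch. Each STM row is modeled as an integer bit mask, which is an assumption of this example, and the helper names are hypothetical. The point is only that rows flagged with a 0 are never read.

```python
def build_zdr(stm_rows):
    """One flag per STM row: 1 if the row holds a non-zero bit, else 0."""
    return [1 if row != 0 else 0 for row in stm_rows]


def read_nonzero_rows(stm_rows, zdr):
    """Read only the rows flagged by the ZDR and count the memory reads."""
    reads = 0
    rows = []
    for index, flag in enumerate(zdr):
        if flag:
            rows.append((index, stm_rows[index]))
            reads += 1
    return rows, reads


# A sparse STM: only rows 1 and 4 contain vertex or edge bits.
stm = [0b0000, 0b0110, 0b0000, 0b0000, 0b1000, 0b0000]
zdr = build_zdr(stm)
rows, reads = read_nonzero_rows(stm, zdr)  # 2 reads instead of 6
```

When the spanning forest is small, most flags are 0 and the traversal touches only a small fraction of the memory.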
The root table entries are initialized to the indices
RootTable[i] = i, as shown in Figure 7. The size table
entries for the non-trivial syndrome bits are initialized to
1 as shown in Figure 7. These tables aid the Union()
and Find() operations to merge clusters after the growth
phase. They are indexed by cluster indices. The tables
are sized for the maximum number of clusters possible
which is equal to the total number of vertices in the sur-
face code lattice. The tree traversal registers store the
vertices of each cluster visited in the Find() operation.
Since the decoding algorithm grows all odd clusters until
the parity is even, odd clusters must be detected quickly.
To do so, we use parity registers as shown in Figure 6.
The parity registers store a 1-bit parity per cluster
depending upon whether it is odd or even. For a reasonable
code distance of 11, seven 32-bit registers are
enough. For larger code distances, we store the additional
parity information in the memory and read it
in advance when required to hide the memory latency.
The control logic reads the parity registers and grows
clusters with odd parity (called the growth phase) by
writing to the STM and ZDR, and by adding newly grown edges
that touch other cluster boundaries to the FES. The
STM is not updated for edges that connect to other
clusters to prevent double growth. It is updated when
clusters are merged by reading from the FES. The logic
checks if a newly added edge connects two clusters by
reading the root table entries of the vertices connected
by the edge (call these the primary vertices). This is
equivalent to the Find() operation. The vertices visited
on the path to find the root of each primary vertex are
stored in the tree traversal registers as shown in Figure 8(a).
The root table entries for these vertices are
updated to directly point to the root of the cluster to
minimize the depth of the tree for future traversals. This
operation, called path compression, is a key feature of the
Union-Find algorithm and enables the reduction of the
tree depth, amortizing the cost of the Find() operation. For
example, Figure 8(a) shows the state of two clusters and
the root table at an instant in time. Let us assume that after
a growth step, vertices 0 and 6 are connected and the
two clusters must be merged. The tree traversal registers
are used to update the root of vertex 0 as shown in
Figure 8(b). Since the depth of the tree is compressed
during every Find() operation, only a few 32-bit registers
are sufficient. The proposed design uses 5 registers per
primary vertex. If the primary vertices belong to different
clusters, the root of the smaller cluster is updated to
point to the root of the larger cluster.
¹ Our proposed design is slightly different from the actual UF algorithm
since our objective is not to achieve an optimal asymptotic
complexity but to reduce the cost of hardware resources for
a given system size.
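The Find() and Union() operations on the root table and size table can be sketched in Python as follows. This is the textbook Union-Find with path compression and union by size, written as a software illustration and not as a description of the control logic. The initial tables are an illustrative example with two clusters in which vertices 0 and 6 become connected, and the local list `path` plays the role of the tree traversal registers.

```python
def find(root_table, vertex):
    """Return the root of `vertex` and compress the path that was walked."""
    path = []  # vertices visited on the way to the root
    while root_table[vertex] != vertex:
        path.append(vertex)
        vertex = root_table[vertex]
    for visited in path:
        root_table[visited] = vertex  # point directly to the root
    return vertex


def union(root_table, size_table, u, v):
    """Merge the clusters of u and v, attaching the smaller to the larger."""
    root_u = find(root_table, u)
    root_v = find(root_table, v)
    if root_u == root_v:
        return root_u  # already in the same cluster
    if size_table[root_u] < size_table[root_v]:
        root_u, root_v = root_v, root_u
    root_table[root_v] = root_u
    size_table[root_u] += size_table[root_v]
    size_table[root_v] = 0
    return root_u


# Two clusters: {0, 1, 2, 3, 4} with root 4 and {5, 6, 7} with root 7.
root_table = [2, 2, 4, 4, 4, 7, 7, 7]
size_table = [0, 0, 0, 0, 5, 0, 0, 3]
merged_root = union(root_table, size_table, 0, 6)
```

After the call, vertex 0 points directly to root 4 thanks to path compression, and the root of the smaller cluster (vertex 7) points to the root of the larger one.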
Delfosse et al. store the boundary of each cluster
in their algorithm [79]. Based on a Monte-Carlo simulation
showing that the average cluster diameter remains very
small in the noise regime that is relevant for practical
applications, we decided to compute the cluster boundary
when it is needed in the growth phase, instead of
consuming extra memory to store it (see footnote 2).
To summarize, the Gr-Gen module detects odd parity
clusters using the parity registers and grows them by
reading and writing to the STM. The cluster growth is
aided by the information stored on the root table, size
table and FES.
[Figure 7 image: contents of the ZDR, Parity Reg, Size Table,
Root Table and STM for vertices 0 to 8 in three panels:
Initial State, Growth Step, Final State.]
Figure 7: State of the major components in the Gr-Gen module
during graph generation.
2 By computing the boundary indices during cluster growth, we
reduce the required memory capacity by 10% at the cost of 4
adders in the Gr-Gen.
[Figure 8 image: panels (a) and (b), each showing Cluster 0
(left), Cluster 1 (right), the Root Table and the Tree
Traversal Registers. Root table for vertices 0 to 7:
(a) 2 2 4 4 4 7 7 7; (b) 4 2 4 4 4 7 4 4.]
Figure 8: (a) State of two clusters and root table entries.
Primary vertices 0 and 6 are now connected by an edge and
the two clusters need to be merged. Vertices encountered on
the Find() path are saved on the tree traversal registers. (b)
Root table entries updated for vertices encountered on the
Find() path.
B. Depth First Search Engine
The DFS engine processes the STM data produced by
the Gr-Gen that stores the set of grown even clusters. It
uses the depth first search algorithm to generate the list
of edges that form a spanning tree for each cluster in
the STM (see footnote 3). The logic is implemented using a
finite state machine and two stacks as shown in Figure 6.
Stacks are used since the order in which edges are visited
in the spanning tree must be reversed to perform correction
by peeling [84]. The edge stack stores the list of visited
edges while the pending edge stack is used to queue the next
edges to explore in the on-going DFS.
To enable pipelining and improve performance, the
micro-architecture includes an alternate edge stack
(Edge Stack 1 as shown in Figure 6). When there
is more than one cluster, the correction engine works on
the edge list of an already traversed cluster while the
DFS engine traverses the next one. The DFS Engine
generates the list of edges visited to traverse a cluster
using the DFS algorithm and hence the number of memory
reads required is directly proportional to the size of the
clusters. By going over the STM row-wise and using the
ZDR to visit only non-zero rows, the effective cost of
generating clusters is reduced.
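The role of the two stacks can be illustrated with a short software sketch. It is only a model under assumptions: the cluster is given as an adjacency dictionary (a stand-in for reading the STM rows), and the name `spanning_tree_edges` is hypothetical.

```python
# Sketch of a DFS that fills an edge stack with the spanning-tree edges of a cluster.
# The adjacency dictionary replaces the STM reads of the hardware design (assumption).

def spanning_tree_edges(adjacency, start):
    """Return the spanning-tree edges of the cluster containing `start`, in visit order."""
    visited = {start}
    edge_stack = []                                    # visited tree edges
    pending = [(start, n) for n in adjacency[start]]   # pending edge stack
    while pending:
        u, v = pending.pop()
        if v in visited:
            continue                                   # edge would close a cycle
        visited.add(v)
        edge_stack.append((u, v))
        pending.extend((v, n) for n in adjacency[v] if n not in visited)
    return edge_stack


# A cluster whose edges form the 4-cycle 0-1-2-3-0.
cluster = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
edges = spanning_tree_edges(cluster, 0)
print(edges)   # [(0, 3), (3, 2), (2, 1)]
```

The list has |V| - 1 = 3 edges, and popping it from the end returns the edge attached to a leaf first, which is the order needed for peeling.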
3 A breadth first search exploration works too but we prefer DFS
since it is generally more memory efficient
[Figure 9 image: an error graph with vertices v0, v1, v2
(syndrome bits 1, 0, 1) joined by edges e0 and e1; panels
Step 1, Step 2 and Step 3 show the contents of the Edge Stack,
the Syndrome Hold Registers and the Error Log.]
Figure 9: Peeling for an example error graph performed in
the Correction Engine. The status of the edge stack, error
log, and syndrome hold register are shown for each step.
C. Correction Engine
The correction engine performs the peeling process of
the decoder [84] and identifies the Pauli correction to ap-
ply. This requires access to the edge list (which is stored
on the stack) and syndrome bits corresponding to the
vertices along the edge list. The syndrome bits can be ac-
cessed by accessing the STM. However, this increases the
logical complexity, latency, and the number of memory
requests that the STM is required to handle. To reduce
the incoming memory traffic for the STM and eliminate
the need for additional logic, the syndrome information is
saved on the stack along with the edge index information
by the DFS Engine. The temporary syndrome changes
caused by peeling are saved on local registers (Syndrome
Hold Registers shown in Figure 9). The Corr Engine also
reads the last surface code cycle error log and updates the
Pauli correction for the current edge. For example, if the
error on an edge e0 was Z in the previous logical cycle and
it encounters a Z error in the current cycle too, the Pauli
error for e0 is updated to I as shown in Figure 9.
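The peeling step and the error log update can be sketched as follows, using the three-vertex example of Figure 9. This is a simplified model: the tuple layout (parent, child, edge) of the stack entries and the function names are assumptions made for illustration, and only Z-type corrections are tracked.

```python
# Sketch of peeling a spanning tree leaf-first and updating the error log.
# Stack entry layout and function names are illustrative assumptions.

def peel(edge_stack, syndrome):
    """Pop tree edges (parent, child, edge_id) leaf-first; return the edges to correct."""
    hold = dict(syndrome)            # local copy, like the syndrome hold registers
    correction = set()
    while edge_stack:
        parent, child, edge_id = edge_stack.pop()
        if hold[child]:              # the leaf carries a syndrome bit
            correction.add(edge_id)
            hold[child] = 0
            hold[parent] ^= 1        # the syndrome moves to the parent vertex
    return correction


def update_error_log(error_log, correction):
    """Multiply the stored Pauli of each corrected edge by Z (Z times Z gives I)."""
    for edge_id in correction:
        error_log[edge_id] = "I" if error_log.get(edge_id, "I") == "Z" else "Z"
    return error_log


stack = [("v0", "v1", "e0"), ("v1", "v2", "e1")]   # e1 is popped first
syndrome = {"v0": 1, "v1": 0, "v2": 1}
correction = peel(stack, syndrome)
log = update_error_log({"e0": "Z", "e1": "I"}, correction)
print(sorted(correction), log)
```

Both edges are selected for correction, and an edge e0 that already carried Z in the previous log is reset to I, as in the example above.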
D. Hardware cost
We measure the hardware cost by estimating the
amount of memory required. Table II shows the different
contributions to the memory requirement.
The spanning tree memory (STM) used by the Gr-Gen
and DFS engine accounts for most of the storage costs.
It contains 1 bit per node of the decoding graph and at
most 2 bits per edge (only 1 bit on the boundary). The
decoding graph is a 3D cubic lattice with about d^3 vertices
and about 3d^3 edges, which leads to a total of about 7d^3
bits for the STM.
The root table and the size table used in the Gr-Gen
module each contain d^3 entries, and each entry is
an integer index that can be uniquely identified using
log2(d^3) = 3 log2(d) bits. Thus, the total sizes of the root
and size tables are 3d^3 log2(d) bits each.
The size of the edge stacks S0 and S1 used by the DFS
and the Corr engine is given by the maximum number
of edges of a spanning tree of a cluster. The size of the
spanning tree of a cluster Ci is given by |V (Ci)|−1 where
|V (Ci)| denotes the number of vertices of the cluster Ci.
To fit any possible cluster, one could pick a stack that
can store d^3 edges, which requires about d^3 log2(d^3) bits.
For simplicity, we ignore the Fusion Edge Stack and
the Pending Edge Stack that are in general significantly
smaller than the two edge stacks S0 and S1. This is because
these stacks contain only a small subset of edges of a
cluster.
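The worst-case sizes derived above are easy to evaluate numerically. The following Python sketch (the function name `memory_bits` is hypothetical) covers the three components discussed in the text and converts the result to bytes.

```python
import math

# Worst-case memory of the STM, of one root/size table and of one edge stack,
# using the formulas of the text: 7 d^3, 3 d^3 log2(d) and d^3 log2(d^3) bits.

def memory_bits(d):
    """Return the worst-case size in bits of each component for code distance d."""
    n = d ** 3                                   # approximate number of vertices
    return {
        "STM": 7 * n,
        "Table (each)": 3 * n * math.log2(d),
        "Edge stack (each)": n * math.log2(n),   # equals 3 d^3 log2(d)
    }


for d in (5, 15, 25):
    in_bytes = {name: round(bits / 8) for name, bits in memory_bits(d).items()}
    print(d, in_bytes)
```

For d = 5, 15 and 25 the STM comes out at about 109, 2953 and 13672 bytes, close to the rounded values of 100 Bytes, 3 KBytes and 14 KBytes listed in Table II.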
Table II: Memory requirement for our hardware design of the
Union-Find decoder as a function of the minimum distance
d of the surface code. Note that we consider the worst case
but the two Edge Stacks can be made significantly smaller as
explained in Section IID.
Component          Size in bits    d = 5       d = 15       d = 25
STM                7d^3            100 Bytes   3 KBytes     14 KBytes
Tables (×2)        3d^3 log2(d)    100 Bytes   5 KBytes     27 KBytes
Parity Reg.        d^3             15 Bytes    400 Bytes    2 KBytes
ZDR                3d^3            46 Bytes    1.3 KBytes   6 KBytes
Edge stacks (×2)   3d^3 log2(d)    100 Bytes   5 KBytes     27 KBytes
[Figure 10 image: probability (log scale, 10^-8 to 10^0)
versus number of vertices per cluster (0 to 40).]
Figure 10: Distribution of the number of vertices per cluster
after the growth step of the UF decoder for code distance
d = 11 and physical error rate p = 10^-3. This distribution is
estimated by a Monte-Carlo simulation using 10^7 samples.
The size of the edge stacks can in principle reach the
previous upper bound d^3 log2(d^3), in the case of a cluster
that covers the whole decoding graph. This makes it
the most expensive element of the design. However, the
probability that such a large cluster is reached remains
extremely small in a practical noise regime (see footnote 4).
By analyzing the maximum size of a cluster after the growth
step using Monte Carlo simulations, we optimize the stack
size for a chosen code distance d and a given physical error
rate p. Figure 10 shows the cluster size distribution for
code distance d = 11 and physical error rate p = 10^-3.
For these parameters, the probability of a cluster of size
larger than 80 is smaller than the logical error rate for
this code. One can thus ignore the clusters of size larger
than 80 without significantly affecting error correction
performance. This drops the stack memory requirement
by roughly a factor of 10, from 1.7 KBytes to 0.13 KBytes.
4 This is because a large cluster generally corresponds to an
uncorrectable error configuration and by definition, the
probability of such an error is at most pLog.
To further reduce the memory requirement, the edge
stacks can be sized to half the maximum size of a cluster
spanning tree (derived from simulation), optimizing
for the common case. In the rare event of an overflow, the
alternate stack is used.
In general, the memory required for each of the
two DFS stacks is approximately 3S(d, p) log2(d) bits, where
S(d, p) is the minimum integer s such that the probability
of having a cluster with more than s edges at the
output of the Gr-Gen is below pLog(d, p). We say that
a stack overflow failure occurs if the DFS engine encounters
a cluster that does not fit in a stack, that is, one with
more than S(d, p) edges. By construction, the stack sizes
are optimized in such a way that
pSof(d, p) ≤ pLog(d, p) (2)
where pSof denotes the probability of a decoding failure
due to a stack overflow error.
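The definition of S(d, p) translates directly into a tail estimate on sampled cluster sizes. The sketch below is an assumption-laden illustration: it uses a plain empirical tail over a list of samples, whereas a realistic estimate of probabilities as small as pLog needs many samples or importance sampling. The names `stack_size` and `stack_bits` are hypothetical.

```python
import math

# Sketch: pick the smallest stack depth s whose empirical overflow probability
# falls below the target logical error rate p_log.

def stack_size(cluster_edge_counts, p_log):
    """Smallest s such that the sampled fraction of clusters with more than s edges is below p_log."""
    total = len(cluster_edge_counts)
    largest = max(cluster_edge_counts)
    for s in range(largest + 1):
        tail = sum(1 for c in cluster_edge_counts if c > s) / total
        if tail < p_log:
            return s
    return largest                  # only reached if p_log is not positive


def stack_bits(s, d):
    """Memory of one stack holding s edges, following 3 S(d, p) log2(d)."""
    return 3 * s * math.log2(d)


samples = [1] * 90 + [3] * 9 + [10]   # toy data, not simulation results
print(stack_size(samples, 0.05))      # 3
print(stack_size(samples, 0.005))     # 10
```

In this toy data set one cluster in a hundred exceeds 3 edges, so a depth of 3 suffices for a 5% target while a 0.5% target requires the full depth of 10.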
E. Comparison with other decoders
For comparison, we provide a rough estimate of the
memory capacity required for the MWPM decoder. The
average number of faults in the decoding graph for a given
noise rate is p|E| where |E| is the number of edges of the
decoding graph. For low values of p (such as p = 10^-3),
many of these configurations of p|E| faulty edges are
non-overlapping, and most of the edges are in the bulk of the
lattice, which results in about w = ⌈2p|E|⌉ non-zero syndrome
bits. Given a set of w non-zero syndrome nodes,
the MWPM decoder generates a complete graph with w + 1
vertices (and w(w+1)/2 edges) and it performs a Minimum
Weight Matching algorithm in this graph. A state of the
art implementation of this algorithm due to Kolmogorov
[82] consumes about 161 bits per edge (4 pointers, 1 integer
and 1 bit per edge), which brings the memory capacity
for the MWPM decoder to at least
161 \cdot \lceil 2p|E| \rceil \cdot (\lceil 2p|E| \rceil + 1)/2 \approx 2900 \cdot p^2 d^6    (3)
bits. Therein, we use |E| ≈ 3d^3. In order to provide
a non-trivial lower bound when d is small, we use the
fact that the MWPM decoder must correct any set of
(d − 1)/2 faults. Such a set may lead to w = d − 1
non-zero syndrome bits, resulting in a lower bound of
161(d − 1)d/2 ≈ 80d^2 bits.
Taking the larger of the two lower bounds, we obtain the
result of Figure 11, which shows that our UF decoder design
requires slightly more memory than a MWPM decoder
for low code distances and less memory for larger
distances, making it more scalable. Given that
we only consider average weight fault configurations and
not the worst case for the MWPM decoder, we believe
that our lower bound on the MWPM memory capacity
is very optimistic and that the UF decoder actually surpasses
the MWPM decoder in terms of memory capacity much
before distance 20 as observed in the figure.
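The two lower bounds can be combined in a few lines of Python. This sketch only restates Eq. (3) and the small-distance bound; the function name `mwpm_memory_bits` and its default argument are illustrative assumptions.

```python
import math

# Sketch of the MWPM memory lower bound: the larger of the average-case bound
# of Eq. (3) and the small-distance bound 161 (d - 1) d / 2.

def mwpm_memory_bits(d, p, bits_per_edge=161):
    """Lower bound (in bits) on the MWPM decoder memory for distance d and noise rate p."""
    num_edges = 3 * d ** 3                          # |E| is about 3 d^3
    w = math.ceil(2 * p * num_edges)                # non-zero syndrome bits
    average_case = bits_per_edge * w * (w + 1) // 2
    small_distance = bits_per_edge * (d - 1) * d // 2
    return max(average_case, small_distance)


for d in (11, 25):
    print(d, mwpm_memory_bits(d, 1e-3))
```

At p = 10^-3 the small-distance bound dominates for d = 11 (8855 bits), while the average-case bound dominates for d = 25 (718865 bits), which reflects the d^6 growth of Eq. (3).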
[Figure 11 image: memory capacity in bytes (log scale, 10^2 to
10^6) versus code distance d (3 to 35) for the MWPM decoder
and the proposed UFD.]
Figure 11: Comparison of memory capacity required for the
MWPM decoder and for our implementation of the UF de-
coder.
Other decoders are much more memory intensive than
our design. The LUT decoder requires the storage of
more than 2^1000 correction bit-strings for d = 11 and ML
decoders cost several MBs to GBs of memory depending
on the implementation and the code distance.
III. RESOURCE OPTIMIZATION
For a system with a large number L of logical qubits,
the most straightforward implementation allocates two
decoders per logical qubit, one for each type of error,
X and Z. Thus, for the baseline design, the decoding
logic uses 2L decoders. However, the utilization of each
pipeline stage varies, causing under-utilization of certain
stages. Ideally, we want to reduce the number of decoders
required for the overall system. In this section, we
optimize not only the total number of decoders but also the
exact number of modules of each type: Gr-Gen, DFS engine
and Corr engine.
[Figure 12 image: N logical qubits share a decoder block.
N Gr-Gen modules feed αN DFS Engines (α < 1) through
multiplexers driven by Select Logic; the DFS Engines feed
βN Correction Engines (β < α < 1), whose outputs go to the
Error Log.]
Figure 12: Design of a decoder block that contains N Gr-Gen
modules, αN DFS engines and βN Corr engines.
A. Sharing hardware modules
The core component of our optimized architecture is a
decoder block as shown in Figure 12 which uses a reduced
number of pipeline units to perform the decoding of N
logical qubits. Our Error Decoding Architecture (EDA)
shown in Figure 1 uses L/N decoder blocks to perform
error correction over the L logical qubits of the computer.
The motivation for our design is that the growth stage
implemented by the N Gr-Gen modules is significantly
more complex than the DFS stage. Therefore, we expect
a dedicated DFS engine to sit idle for a fraction of the
time while its Gr-Gen module finishes. Instead of waiting, we
prefer to use a smaller number of DFS engines (αL < L)
that share the work of L Gr-Gen modules. The value of α
will be optimized to keep the waiting time of the DFS engines
minimal, saving a fraction (1−α) of the DFS hardware.
We proceed in the same way to optimize the number βL
of Corr engines.
The hardware overhead of this optimization is multi-
plexors and demultiplexors on the datapath as shown in
Figure 12. Memory requests generated by the Corr En-
gine are routed to the correct memory locations using a
demultiplexor. The select logic prioritizes the first ready
component and uses round robin arbitration to generate
appropriate select signals for the multiplexors. For exam-
ple, if four Gr-Gen units share a DFS Engine, and the sec-
ond Gr-Gen finishes cluster formation earlier than other
units, it receives access to the DFS Engine. The round
robin policy ensures fairness while sharing resources.
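The arbitration policy can be modelled in software as a rotating-priority arbiter. The class below is a behavioural sketch under assumptions (the class name, the `ready` list interface, and granting at most one requester per call are all hypothetical); it is not a description of the actual select logic.

```python
# Behavioural sketch of round-robin arbitration between Gr-Gen units
# that compete for one shared DFS Engine.

class RoundRobinArbiter:
    """Grant one ready requester per call, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.n = num_requesters
        self.next_priority = 0           # index that is checked first

    def grant(self, ready):
        """ready[i] is True when Gr-Gen unit i has finished cluster formation."""
        for offset in range(self.n):
            i = (self.next_priority + offset) % self.n
            if ready[i]:
                self.next_priority = (i + 1) % self.n
                return i
        return None                      # nobody is ready; the DFS Engine idles


arbiter = RoundRobinArbiter(4)
print(arbiter.grant([False, True, False, False]))   # 1: only the second unit is ready
print(arbiter.grant([True, True, False, True]))     # 3: priority now starts at unit 2
print(arbiter.grant([True, True, False, False]))    # 0: priority wrapped around
```

Because the priority pointer moves past the last granted unit, a unit that was just served cannot starve the others when several requests are pending.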
B. Decoder block design constraint
As explained previously, in order to correct circuit
faults and measurement errors, the decoder needs d
consecutive rounds of syndrome data, which form a logical
cycle. To prevent backlog problems, the decoder must
provide a correction before the end of the next logical
cycle, when a new decoding request arrives. If a decoder
block fails to terminate its work within a logical cycle,
errors start accumulating and spreading over the quantum
computer. We refer to this type of failure as a timeout
failure. In order to ensure that timeout failures do not
significantly degrade the decoding performance, we impose
the following constraint on the decoder block design:
pTof(d, p)/N ≤ pLog(d, p) (4)
where pTof(d, p) is the timeout failure probability for the
decoder block. In other words, the timeout failure
probability per logical qubit must be lower than the
probability of a logical error. This condition ensures that
the fault rate on the output of the whole quantum
computation is at most doubled due to timeout failures. In
what follows, we propose a fast and hardware efficient
decoder block design that respects the constraint (4).
C. Modules runtime simulation
We model the decoder performance by studying the
number of reads. The write operations performed are
read-modify-write, and the writeback is not on the criti-
cal path. We assume 4 cycles latency for memory accesses
and a 4 GHz clock frequency [86, 87].
Denote by C_1, ..., C_m the set of clusters generated by
the Gr-Gen module. A single growth step for a cluster C
requires reading a set of rows of the STM that cover the
cluster. We estimate this number by diam(C)^2, assuming
that the cluster spreads roughly uniformly in all
directions. Summing over all clusters and growth steps, the
total number of memory requests generated in Gr-Gen is
approximated by the sum
\tau_{GG} = \sum_{i=1}^{m} \sum_{j=1}^{\mathrm{diam}(C_i)} j^2    (5)
because each growth step increases the diameter by 1.
The number of memory requests in the DFS engine and
the Corr engine to treat a cluster C_i are both given by
the size of its spanning tree, which is |V(C_i)| − 1
where |V(C_i)| is the number of vertices of C_i. Including
all clusters, we obtain the estimate
\tau_{DFS} = \tau_{CE} = \sum_{i=1}^{m} |V(C_i)|    (6)
for the total number of reads in the DFS engine or in the
Corr engine.
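Eq. (5) and (6) are simple enough to evaluate directly. In the Python sketch below the function names are hypothetical, and the conversion to nanoseconds assumes that reads are issued one after another, each costing the full 4-cycle latency at 4 GHz; overlapping or pipelined accesses would lower this figure.

```python
# Sketch of the read-count model of Eq. (5) and Eq. (6).

def gr_gen_reads(diameters):
    """Eq. (5): sum over clusters of 1^2 + 2^2 + ... + diam(C_i)^2."""
    return sum(j * j for diam in diameters for j in range(1, diam + 1))


def dfs_reads(vertex_counts):
    """Eq. (6): sum of |V(C_i)| over clusters (same estimate for the Corr engine)."""
    return sum(vertex_counts)


def reads_to_ns(reads, latency_cycles=4, clock_ghz=4.0):
    """Time in ns if every read costs latency_cycles cycles (serial-access assumption)."""
    return reads * latency_cycles / clock_ghz


# Two clusters with diameters 2 and 3 and with 3 and 4 vertices.
print(gr_gen_reads([2, 3]), dfs_reads([3, 4]))   # 19 7
print(reads_to_ns(19), reads_to_ns(7))           # 19.0 7.0
```

With the stated latency and clock frequency each read costs 1 ns in this model, so the read counts can be read directly as nanoseconds.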
[Figure 13 image: scatter plot of DFS Engine execution cost
(0 to 60) versus Gr-Gen Engine execution cost (0 to 80).]
Figure 13: Correlation between Gr-Gen and DFS Engine
execution time for code distance d = 11 and physical error rate
p = 10^-3. Each dot corresponds to a random error
configuration and the runtimes of the two modules are estimated
using Eq. (5) and (6). Duplicate data points cannot be observed
on this plot.
In order to select the DFS engine ratio α, we estimate
the ratio between the execution time of the Gr-Gen
and the DFS engine by a Monte-Carlo simulation of τGG
and τDFS for the distance-11 surface code with an error
rate p = 10^-3, as shown in Figure 13. To estimate these
runtimes, we sample families of clusters by generating
random errors according to the phenomenological noise
model described in Section IC, and by simulating the
growth step of the decoder for these errors. This provides
us with samples of cluster families from the output
of the Gr-Gen module. Figure 13 shows the resulting
correlation between the execution times in the Gr-Gen and
DFS engine. As expected, more time is spent in the Gr-Gen
unit. We observe roughly a factor of two between the
execution times of the two units, which suggests that one
can eliminate half of the DFS units.
D. Optimized decoder block
The results of Section III C encourage us to consider
a decoder block with parameters α = 0.5 and β = 1.
However, nothing guarantees that this choice will lead
to a decoder block that is fast enough to satisfy
condition (4). In this section, we design an optimized decoder
block for the surface code with distance 11 that satisfies
(4) and that completes decoding within 325 ns in
the noise regime p = 10^-3, under the memory frequency
and latency assumptions above. This indicates that
our decoder block is fast enough to perform surface code
decoding: it is about 30 times faster than the logical
cycle time of the distance-11 surface code, which is
about 11 µs [88].
We consider the smallest decoder block with α = 0.5
and β = 1. It includes two logical qubits, that is
four error configurations to correct (two X-type and two
Z-type), two Gr-Gen units, one DFS engine and one
Corr engine. We refer to this optimized design as the
(4, 2, 1, 1)-decoder block. Figure 14 shows our estimation
of the execution time of the whole block to decode
the two logical qubits, obtained by simulating the whole
pipeline of the (4, 2, 1, 1)-block with a Monte-Carlo
simulation with importance sampling. We observe that by
interrupting the decoding after 325 ns, we obtain a block
that satisfies (4).
For L logical qubits, the number of Gr-Gen units, DFS
engines, and Corr engines required are L, L/2, and L/2
respectively in the optimized architecture. Thus, the to-
tal number of Gr-Gen units, DFS engines, and Corr en-
gines are reduced by 2×, 4×,and 4× respectively.
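The unit counts above follow from tiling the system with (4, 2, 1, 1)-blocks. As a quick check, the Python sketch below (function names are hypothetical) compares the baseline of two full decoders per logical qubit with blocks that serve two logical qubits using two Gr-Gen units, one DFS engine and one Corr engine.

```python
# Sketch: hardware unit counts for the baseline and for the optimized blocks.

def baseline_units(num_logical_qubits):
    """Two full decoders (X and Z) per logical qubit."""
    n = 2 * num_logical_qubits
    return {"Gr-Gen": n, "DFS": n, "Corr": n}


def optimized_units(num_logical_qubits, qubits_per_block=2, block=(2, 1, 1)):
    """Tile the system with blocks holding (Gr-Gen, DFS, Corr) units each."""
    blocks = num_logical_qubits // qubits_per_block
    gr_gen, dfs, corr = block
    return {"Gr-Gen": gr_gen * blocks, "DFS": dfs * blocks, "Corr": corr * blocks}


base = baseline_units(1000)
opt = optimized_units(1000)
reduction = 1 - sum(opt.values()) / sum(base.values())
print(base, opt, round(100 * reduction))
```

For 1000 logical qubits this gives 1000, 500 and 500 units instead of 2000 of each type, that is about 67% fewer hardware units overall.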
The total memory capacity required for the baseline
design and for our optimized decoder block is summarized
in Table III for 1000 qubits encoded with the
distance-11 surface code. The previous (4, 2, 1, 1)-block
leads to a saving of about 50% of the memory capacity.
In order to further reduce the memory requirement,
we can use a shared root table and a shared size table
between the two Gr-Gen modules of the decoder block.
This leads to a slight slowdown of the decoder because
the two Gr-Gen modules cannot perform the growth step
simultaneously, but the two STMs can be used in parallel.
While the first STM is used by a DFS engine, the second
one can be used by a Gr-Gen module to grow clusters. A
(4, 2, 1, 1)-block with a single root table and a single size
table achieves a 70% (3.5×) memory reduction compared
to the naive design.
[Figure 14 image: probability (log scale, 10^-14 to 10^0)
versus estimated execution time in ns (0 to 400).]
Figure 14: Distribution of execution time of the (4, 2, 1, 1)-
block with code distance 11 and error rate 10^-3. The shaded
region has probability smaller than N pLog (for N = 2 logical
qubits), which means that by interrupting the decoder
block after 325 ns, we obtain a timeout failure probability
that satisfies condition (4).
Table III: Reduction in the total memory capacity required
to correct both X-type and Z-type errors for 1000 logical
qubits with code distance 11 and error rate 10^-3.
Design Component       Baseline    Optimized Design   Savings
STM (Gr-Gen)           1.97 MB     0.99 MB            (2X)
Root Table (Gr-Gen)    3.17 MB     0.79 MB            (4X)
Size Table (Gr-Gen)    3.46 MB     0.87 MB            (4X)
Stacks (DFS Engine)    1.35 MB     0.34 MB            (4X)
Total                  9.96 MB     2.81 MB            (3.5X)
IV. SYNDROME DATA COMPRESSION
A major challenge in designing any error decoding ar-
chitecture is the large bandwidth required between the
quantum substrate and the decoding logic. In order to
perform error decoding, the syndrome measurement data
must be transported from the quantum substrate to the
decoding units. For a given qubit plane with L logical
qubits, each encoded using a surface code of
distance d, about 2d^2 L bits must be sent at the end of
each syndrome measurement cycle, which requires significant
bandwidth, on the order of several Gb/s for
a reasonable number of logical qubits and code distance.
Data transmission at a lower bandwidth on the other
hand reduces the effective time left for error decoding
since a decoder must provide an estimation of the error
within d syndrome measurement cycles. In this section,
we consider three compression techniques to reduce the
bandwidth needs and we analyze their performance in
different noise regimes. We focus on compression schemes
that use simple encoding and do not require large
hardware complexity.
[Figure 15 image: (a) syndrome data 0 0 0 0 1 0, split into
blocks of the compression width W (three bits here), gives
zero indicator bits (ZIB) 1 0 and data to be sent 1 0 0 1 0,
that is the ZIB followed by the non-zero data. (b) The same
syndrome data with bit indices 5 4 3 2 1 0 gives data to be
sent 1 0 0 1, that is one bit indicating that at least one
syndrome bit is non-zero, followed by the index of the
non-zero syndrome bit.]
Figure 15: (a) Dynamic Zero Compression (DZC). The
compressed data consists of the zero indicator bits, which
provide the locations of the '000' blocks, followed by the
content of the non-zero blocks. (b) Sparse Representation.
We can store sparse data efficiently by providing the number
of non-zero bits and their indices.
A. Can data compression work?
An approach to reduce the memory bandwidth require-
ments in conventional memory systems is data compres-
sion [89–94]. Data compression works well when data is
sparse and has low entropy. To justify the potential of
syndrome data compression, we estimate the sparsity of
the syndrome data analytically for realistic noise regimes.
For a noise strength p in the phenomenological noise
model introduced in Section IC, the expected number of
faults in the 3D decoding graph, which contains about 3d^3
fault locations, is about 3d^3 p. In the case of a distance-11
surface code with p = 10^-3, we expect only 4 errors on
average, resulting in a non-trivial syndrome vector of
length ≈ 1,000 with Hamming weight ≤ 8, since each error
is detected by at
most two non-trivial syndrome bits. The expected Ham-
ming weight of the syndrome vectors drops even further
for lower noise strength p.
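The sparsity estimate, together with the raw volume of about 2d^2 L bits per measurement round quoted at the start of this section, can be reproduced with a few lines of Python. The function names are hypothetical and the formulas are the approximate ones used in the text.

```python
# Sketch: expected number of faults, bound on the syndrome weight, and raw
# syndrome volume per measurement round, using the estimates of the text.

def expected_faults(d, p):
    """About 3 d^3 p faults in the 3D decoding graph."""
    return 3 * d ** 3 * p


def max_expected_weight(d, p):
    """Each fault flips at most two syndrome bits."""
    return 2 * expected_faults(d, p)


def raw_bits_per_round(d, num_logical_qubits):
    """About 2 d^2 L uncompressed syndrome bits per measurement round."""
    return 2 * d ** 2 * num_logical_qubits


print(expected_faults(11, 1e-3))       # close to 4
print(max_expected_weight(11, 1e-3))   # close to 8
print(raw_bits_per_round(11, 1000))    # 242000
```

For d = 11 and p = 10^-3 fewer than one percent of the roughly 1,000 syndrome bits are expected to be non-zero, which is what makes simple compression schemes attractive.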
B. Three low-overhead compression techniques
We consider the three following compression meth-
ods that allow for a simple hardware implementation.
Figure 15 provides the basic idea of the Sparse Rep-
resentation and the Dynamic Zero Compression. The
Geometry-based compression is a variant of the Dynamic
Zero Compression that exploits the geometry of the lat-
tice of qubits.
1. Sparse Representation: This is similar to the
traditional technique of storing only the indices of
non-zero elements of sparse matrices. Instead of
sending a sparse syndrome vector, we send a bit
that indicates if all the syndrome bits are zero,
followed by the indices of non-trivial bits in the case
of a non-zero syndrome. With this method, a syndrome
vector with length ℓ and Hamming weight
w is compressed into 1 + w log2(ℓ) bits.
2. Dynamic Zero Compression (DZC): We adopt
a DZC technique [90] as shown in Figure 15(a) to
compress syndrome data. A syndrome vector of
length ℓ is grouped into b blocks of m bits each. We
store the indicator vector of non-trivial blocks
concatenated with the exact value of non-trivial blocks.
A syndrome with b blocks has length bm. However,
if it contains only w non-trivial blocks, it can be
transmitted with the DZC technique using only b
bits for the non-trivial block indicator vector and
wm bits to send the non-trivial blocks, that is a
total of b + wm bits.
3. Geometry-based compression (Geo-Comp):
This is an adaptation of DZC that also accounts
for the geometry of the surface code lattice. The
basic idea is that non-trivial syndrome values
generally appear in pairs of neighboring bits. We can
therefore increase the compression rate of the DZC
technique by using square blocks that respect the
structure of the lattice. With this block decomposition,
two neighboring bits are more likely to fall in the
same block, reducing the number w of non-trivial
blocks to send.
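The first two schemes can be sketched as small encoders and checked against the example of Figure 15. This is an illustrative Python sketch under assumptions: function names are hypothetical, index 0 denotes the right-most bit and an indicator bit of 1 marks an all-zero block (both following the figure), indices are written with ceil(log2(ℓ)) bits, and the receiver is assumed to know the frame length or the number of indices.

```python
# Sketch of the Sparse Representation and DZC encoders on bit lists.

def sparse_encode(bits):
    """Flag bit, then the index of every non-zero bit on ceil(log2(len)) bits."""
    width = max(1, (len(bits) - 1).bit_length())
    # Index 0 is the right-most bit, as in the labelling of Figure 15(b).
    indices = [len(bits) - 1 - i for i, b in enumerate(bits) if b]
    if not indices:
        return [0]                                 # trivial syndrome: a single bit
    out = [1]
    for idx in indices:
        out += [int(c) for c in format(idx, f"0{width}b")]
    return out


def dzc_encode(bits, m):
    """One indicator bit per block of m bits, followed by the non-zero blocks."""
    blocks = [bits[i:i + m] for i in range(0, len(bits), m)]
    # Polarity of Figure 15(a): indicator 1 marks an all-zero block.
    indicators = [0 if any(block) else 1 for block in blocks]
    payload = [b for block in blocks if any(block) for b in block]
    return indicators + payload


syndrome = [0, 0, 0, 0, 1, 0]
print(sparse_encode(syndrome))   # [1, 0, 0, 1]
print(dzc_encode(syndrome, 3))   # [1, 0, 0, 1, 0]
```

The output lengths are 4 = 1 + w log2(ℓ) rounded up with w = 1 and ℓ = 6, and 5 = b + wm with b = 2, w = 1 and m = 3, in line with the two length formulas above.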
In this work, we treat X-type and Z-type errors sep-
arately. However, for any of the three compression tech-
niques described above, a slightly better compression rate
can be obtained by compressing together the syndrome
data corresponding to both types of errors.
In general, the number and size of the DZC blocks can
be adjusted for a given noise model by computing the
expected number of non-trivial syndrome blocks. Larger
blocks improve the compression ratio but also
lead to more complex hardware by adding to the logic depth.
With this in mind, we analyze small block sizes even for
very low error rates.
[Figure 16 image: mean compression ratio (log scale) versus
code distance (3 to 21), one curve per physical error rate
p = 10^-5, 10^-4, 10^-3 and 10^-2.]
Figure 16: Mean syndrome compression ratio as a function
of the code distance d for different physical error rates. We
plot the best compression ratio between the three techniques
considered in the present work: Sparse Representation, DZC
and Geometry-based compression.
We determine the effectiveness of a compression
scheme by analyzing the compression ratio:
\text{Compression Ratio} = \frac{\text{Actual Syndrome Length}}{\text{Compressed Syndrome Length}}    (7)
The best compression scheme depends on the noise
model. Sparse Representation appears to be a good choice in
the regime of very low error rates because the syndrome
vector is often trivial and it is then sent as a single bit.
In a noisy regime and for small distance codes, we prefer
the other compression schemes, DZC and Geometry-based
compression. For any noise rate below 10^-3, we
achieve a compression ratio which varies between 20× for
d = 3 and 400× for distance-21 surface codes.
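To make the trade-off between the schemes concrete, the compression ratio of Eq. (7) can be evaluated from the length formulas of Section IV B. The sketch below uses hypothetical function names and a toy syndrome of 120 bits; the numbers are illustrative and are not taken from the simulations reported here.

```python
# Sketch: compressed lengths and compression ratio (Eq. (7)) for a toy syndrome.

def sparse_length(length, weight):
    """1 bit for a trivial syndrome, else 1 + w * ceil(log2(length)) bits."""
    if weight == 0:
        return 1
    return 1 + weight * (length - 1).bit_length()


def dzc_length(num_blocks, block_size, nonzero_blocks):
    """b indicator bits plus w non-zero blocks of m bits each."""
    return num_blocks + nonzero_blocks * block_size


def compression_ratio(actual_length, compressed_length):
    """Eq. (7): actual syndrome length over compressed syndrome length."""
    return actual_length / compressed_length


length = 120
print(compression_ratio(length, sparse_length(length, 0)))    # 120.0 for a trivial syndrome
print(compression_ratio(length, sparse_length(length, 2)))    # 8.0 for two non-zero bits
print(compression_ratio(length, dzc_length(30, 4, 2)))        # two bits in different blocks
print(compression_ratio(length, dzc_length(30, 4, 1)))        # two bits in the same block
```

The last two lines show why blocks that follow the lattice geometry help: when the two neighboring non-zero bits share a block, the DZC length drops from 38 to 34 bits.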
V. DISCUSSION
A. Scalability
In this paper, we use the Union-Find decoder to anal-
yse the scope of architectural optimizations in design-
ing high performance and scalable decoders. However,
the design principles apply to other decoders in general
such as machine learning or graph algorithm-based de-
coders. For example, machine learning decoders require
several network layers. Overall resources can be reduced
by pipelining layers in inference and by sharing these
layers between logical qubits. Similarly, while our study
only focuses on regular surface code, the same analysis
holds true for other types of QEC codes based on Eu-
clidean lattices [95, 96] or color codes [97]. However, it
seems non-trivial to adapt the STM used in our design to
the non-trivial lattice topology of hyperbolic codes [98–
100] and thus a different micro-architecture is needed.
Lastly, the syndrome compression is valid for an arbi-
trary decoder-quantum substrate interface, independent
of their types, although it can be improved by exploit-
ing the code structure as we propose with the geometry
based compression.
B. Assumptions
We assume all syndrome extraction circuits can be
executed in parallel, an assumption used by most de-
coders [101]. However, the amount of parallelism depends
on the qubit technology and a large amount of paral-
lelism is achievable on modern superconducting qubits,
although other types of qubits may offer less parallelism.
C. Noise Model
We use the phenomenological noise model for our study
and there is scope to further optimize the design for
enhanced noise models and to account for correlations in
errors. We consider a physical error rate of 10^-3 because
QEC codes cannot lower the logical error rate substantially
unless the initial physical error rate is lower than
the threshold, which is about 1%. For the system sizes
we have considered in this paper, with about 100 to 1000
qubits, error rates of about 10^-3 are required to run
practical applications of scientific and commercial value.
VI. RELATED WORK
Designing a quantum computer requires full-stack
solutions [102] and interdisciplinary research [103].
This has led to developments in programming languages
(QPL) [104–129], compilers [116, 130–136],
micro-architecture [137–140], control circuits [141–146], and
quantum devices. Although existing Noisy Intermediate
Scale Quantum (NISQ) computers [147–151] are expected
to scale up to hundreds of qubits and may outperform
classical computers for some problems [152–154],
they will still be too small to achieve fault-tolerance. In
contrast, the scope of a quantum computer greatly
broadens in the presence of fault-tolerance and therefore,
designing FTQCs is an important area of research.
QEC plays a seminal role in FTQCs. In addition to
the standard [70, 71], we suggest the following recent
reviews to learn more about QEC codes and fault toler-
ance [155–157]. Recent experiment results suggest that
quantum error correction is reaching an inflection point
where the quality of encoded qubits is better than the
quality of raw qubits [158, 159]. Few articles consider the
hardware aspects of decoder designs are necessary such
as discussing the potential of GPUs and ASICs [20, 56],
and high-speed circuits [19] or describing the architec-
ture of the neural networks in the case of ML-Decoders.
These studies focus primarily on achieving higher perfor-
mance and accuracy for a single logical qubit. In order to
support the design of large scale error-corrected quantum
devices, more studies on the hardware aspects of decoder
designs and their scalability to many logical blocks are
necessary.
In a system level study [138], Tannu et. al. identified
the requirement of large bandwidth in sending control in-
structions from the control processor to qubits and pro-
posed sharing of micro-code between neighboring qubits.
Using micro-code to deliver control pulses [139] only fo-
cuses on communication from the control processor to
the qubits whereas, availability of large bandwidth is
also essential to transmit syndrome data back from the
qubits to the decoders. Since syndromes differ across
logical qubits, depending on the error, it is not possible
to send one syndrome for multiple qubits (sharing). So,
we explore the possibility of using syndrome compression
through low-overhead compression schemes.
VII. CONCLUSION
The decoder is a key component of a fault tolerant
quantum computer, which is in charge of translating the
output of the syndrome measurement circuits into error
types and locations. Many decoding algorithms have
been studied in the past 20 years [14, 17–68]. They
generally focus on improving either the decoder accuracy or
the decoder runtime, or providing a good compromise
between them. In this work, we introduce a third
constraint, the scalability constraint, which states that the
decoder must scale to the regime of practical applications
for which thousands of logical qubits must be decoded
simultaneously, and we design a decoder that satisfies the
three design constraints: accuracy, latency and scalability.
Namely, the decoder properly identifies the error
which occurs (accuracy), it is fast enough to avoid
accumulation of errors during the computation (latency) and
we propose a resource-efficient hardware implementation
that scales to the massive size required for industrial
applications (scalability).
To satisfy the scalability constraint, we study
the scope and impact of micro-architectural optimizations
in designing decoders for QEC, with an in-depth study of
the Union-Find decoder as a case study. We also
investigate a system-level framework for the Error Decoding
Architecture in which, instead of dedicating decoding
units to each logical qubit, we multiplex the decoding
resources across neighbouring logical qubits while minimizing
the timeout errors due to lack of decoding resources
and limiting the possibility of system failure. Finally,
we investigate the feasibility of low-cost syndrome
compression to reduce the memory bandwidth required for
transmitting the syndrome information from the quantum
substrate to the decoding hardware. Our solutions
reduce the number of decoders by more than 50%, the
amount of memory required by 70%, and memory bandwidth
by more than 30x for large FTQCs. Although we
use the Union-Find decoder and surface codes for our
study, the design principles, optimizations, and results
from our study apply to other types of decoders and certain
other QEC codes as well. The compression schemes
discussed apply to any qubit technology and decoder.
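The trade-off behind resource sharing can be illustrated with a toy estimate. The sketch below is not the model used in this work: it assumes, purely for illustration, that each of `num_qubits` logical qubits independently needs a decoding unit at a given time with probability `p_busy`, and it computes the probability that demand exceeds a shared pool of `num_decoders` units. The function name and all parameters are hypothetical.

```python
from math import comb


def overflow_probability(num_qubits: int, num_decoders: int, p_busy: float) -> float:
    """Binomial tail: P(more than num_decoders of num_qubits need a decoder at once).

    Assumes independent, identically distributed demand per logical qubit,
    which is an illustrative simplification.
    """
    return sum(
        comb(num_qubits, k) * p_busy**k * (1 - p_busy) ** (num_qubits - k)
        for k in range(num_decoders + 1, num_qubits + 1)
    )


# Shrinking the shared pool raises the chance that some request must wait.
for pool in (10, 5, 3):
    print(pool, overflow_probability(10, pool, 0.1))
```

With one unit per logical qubit the overflow probability is zero; a smaller pool trades hardware for a non-zero chance that a request has to wait, which is the kind of event that timeout errors capture.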
In addition to substantial hardware savings, our op-
timized decoder micro-architecture significantly speeds
up the Union-Find decoder. Our numerical simulations
suggest that our design provides a decoder that is fast
enough to perform error correction with superconducting
qubits, assuming a surface code syndrome round of 1 µs
[88]. Further study is necessary to confirm the validity
of our model in a real device. Ultimately, we would like
to consider an FPGA implementation or the fabrication of
an ASIC based on our micro-architecture.
[1] Peter W Shor. Polynomial-time algorithms for prime
factorization and discrete logarithms on a quantum
computer. SIAM review, 41(2):303–332, 1999.
[2] Richard P Feynman. Simulating physics with com-
puters. International journal of theoretical physics,
21(6):467–488, 1982.
[3] Markus Reiher, Nathan Wiebe, Krysta M Svore, Dave
Wecker, and Matthias Troyer. Elucidating reaction
mechanisms on quantum computers. Proceedings of
the National Academy of Sciences, 114(29):7555–7560,
2017.
[4] Katherine L Brown, William J Munro, and Vivien M
Kendon. Using quantum computers for quantum simu-
lation. Entropy, 12(11):2268–2307, 2010.
[5] Lov K Grover. A fast quantum mechanical algorithm
for database search. arXiv preprint quant-ph/9605043,
1996.
[6] Peter W Shor. Scheme for reducing decoherence
in quantum computer memory. Physical review A,
52(4):R2493, 1995.
[7] A Robert Calderbank and Peter W Shor. Good quan-
tum error-correcting codes exist. Physical Review A,
54(2):1098, 1996.
[8] Andrew Steane. Multiple-particle interference and
quantum error correction. Proceedings of the Royal So-
ciety of London. Series A: Mathematical, Physical and
Engineering Sciences, 452(1954):2551–2577, 1996.
[9] Daniel Gottesman. Stabilizer codes and quantum error
correction. arXiv preprint quant-ph/9705052, 1997.
[10] Florence Jessie MacWilliams and Neil James Alexander
Sloane. The theory of error-correcting codes, volume 16.
Elsevier, 1977.
[11] Peter W Shor. Fault-tolerant quantum computation.
In Proceedings of 37th Conference on Foundations of
Computer Science, pages 56–65. IEEE, 1996.
[12] Dorit Aharonov and Michael Ben-Or. Fault-tolerant
quantum computation with constant error rate. arXiv
preprint quant-ph/9906129, 1999.
[13] Panos Aliferis, Daniel Gottesman, and John Preskill.
Quantum accuracy threshold for concatenated distance-
3 codes. arXiv preprint quant-ph/0504218, 2005.
[14] Eric Dennis, Alexei Kitaev, Andrew Landahl, and John
Preskill. Topological quantum memory. Journal of
Mathematical Physics, 43(9):4452–4505, 2002.
[15] Austin G Fowler, Matteo Mariantoni, John M Martinis,
and Andrew N Cleland. Surface codes: Towards practi-
cal large-scale quantum computation. Physical Review
A, 86(3):032324, 2012.
[16] Krysta M Svore, David P Divincenzo, and Barbara M
Terhal. Noise threshold for a fault-tolerant two-
dimensional lattice architecture. Quantum Information
& Computation, 7(4):297–318, 2007.
[17] Austin G Fowler, Adam C Whiteside, Angus L McInnes,
and Alimohammad Rabbani. Topological code auto-
tune. Physical Review X, 2(4):041003, 2012.
[18] Austin G Fowler, Adam C Whiteside, and Lloyd CL
Hollenberg. Towards practical classical process-
ing for the surface code. Physical review letters,
108(18):180501, 2012.
[19] Austin G Fowler, Adam C Whiteside, and Lloyd CL
Hollenberg. Towards practical classical processing for
the surface code: Timing analysis. Physical Review A,
86(4):042313, 2012.
[20] Austin G Fowler. Minimum weight perfect matching of
fault-tolerant topological quantum error correction in
average o(1) parallel time. Quantum Information and
Computation, 15(1&2):0145–0158, 2015.
[21] Guillaume Duclos-Cianci and David Poulin. Kitaev’s
Z_d-code threshold estimates. Physical Review A,
87(6):062338, 2013.
[22] Hussain Anwar, Benjamin J Brown, Earl T Campbell,
and Dan E Browne. Fast decoders for qudit topological
codes. New Journal of Physics, 16(6):063038, 2014.
[23] Fern HE Watson, Hussain Anwar, and Dan E Browne.
Fast fault-tolerant decoder for qubit and qudit surface
codes. Physical Review A, 92(3):032309, 2015.
[24] Adrian Hutter, Daniel Loss, and James R Wootton.
Improved hdrg decoders for qudit and non-abelian
quantum error correction. New Journal of Physics,
17(3):035017, 2015.
[25] Guillaume Duclos-Cianci and David Poulin. Fast de-
coders for topological quantum codes. Physical review
letters, 104(5):050504, 2010.
[26] Sergey Bravyi and Jeongwan Haah. Quantum self-
correction in the 3d cubic code model. Physical review
letters, 111(20):200501, 2013.
[27] Guillaume Duclos-Cianci and David Poulin. Fault-
tolerant renormalization group decoder for abelian topo-
logical codes. arXiv preprint arXiv:1304.6100, 2013.
[28] Sean D Barrett and Thomas M Stace. Fault tolerant
quantum computation with very high threshold for loss
errors. Physical review letters, 105(20):200502, 2010.
[29] Thomas M Stace and Sean D Barrett. Error correction
and degeneracy in surface codes suffering loss. Physical
Review A, 81(2):022317, 2010.
[30] Nicolas Delfosse and Gilles Zémor. Linear-time maxi-
mum likelihood decoding of surface codes over the quan-
tum erasure channel. arXiv preprint arXiv:1703.01517,
2017.
[31] Austin G Fowler. Optimal complexity correction of
correlated errors in the surface code. arXiv preprint
arXiv:1310.0863, 2013.
[32] Nicolas Delfosse and Jean-Pierre Tillich. A decoding al-
gorithm for css codes using the x/z correlations. In 2014
IEEE International Symposium on Information Theory,
pages 1071–1075. IEEE, 2014.
[33] Ben Criger and Imran Ashraf. Multi-path summation
for decoding 2d topological codes. Quantum, 2:102,
2018.
[34] Yu Tomita and Krysta M Svore. Low-distance surface
codes under realistic quantum noise. Physical Review
A, 90(6):062320, 2014.
[35] Bettina Heim, Krysta M Svore, and Matthew B Hast-
ings. Optimal circuit-level decoding for surface codes.
arXiv preprint arXiv:1609.06373, 2016.
[36] Eric Dennis. Purifying quantum states: Quantum and
classical algorithms. arXiv preprint quant-ph/0503169,
2005.
[37] James William Harrington. Analysis of quantum error-
correcting codes: symplectic lattice codes and toric
codes. PhD thesis, California Institute of Technology,
2004.
[38] James R Wootton and Daniel Loss. High threshold error
correction for the surface code. Physical review letters,
109(16):160503, 2012.
[39] Adrian Hutter, James R Wootton, and Daniel Loss. Efficient
markov chain monte carlo algorithm for the surface
code. Physical Review A, 89(2):022326, 2014.
[40] James Wootton. A simple decoder for topological codes.
Entropy, 17(4):1946–1957, 2015.
[41] Michael Herold, Michael J Kastoryano, Earl T Camp-
bell, and Jens Eisert. Fault tolerant dynamical decoders
for topological quantum memories. arXiv preprint
arXiv:1511.05579, 2015.
[42] Nikolas P Breuckmann, Kasper Duivenvoorden, Do-
minik Michels, and Barbara M Terhal. Local de-
coders for the 2d and 4d toric code. arXiv preprint
arXiv:1609.00510, 2016.
[43] Nicolas Delfosse and Naomi H Nickerson. Almost-linear
time decoding algorithm for topological codes. arXiv
preprint arXiv:1709.06218, 2017.
[44] Aleksander Kubica and John Preskill. Cellular-
automaton decoders with provable thresholds for topo-
logical codes. Physical review letters, 123(2):020501,
2019.
[45] Naomi H Nickerson and Benjamin J Brown. Analysing
correlated noise on the surface code using adaptive de-
coding algorithms. Quantum, 3:131, 2019.
[46] Andrew J Ferris and David Poulin. Tensor networks
and quantum error correction. Physical review letters,
113(3):030501, 2014.
[47] Sergey Bravyi, Martin Suchara, and Alexander Vargo.
Efficient algorithms for maximum likelihood decoding
in the surface code. Physical Review A, 90(3):032326,
2014.
[48] Andrew S Darmawan and David Poulin. Linear-time
general decoding algorithm for the surface code. Phys-
ical Review E, 97(5):051302, 2018.
[49] David K Tuckett, Christopher T Chubb, Sergey Bravyi,
Stephen D Bartlett, and Steven T Flammia. Tailoring
surface codes for highly biased noise. arXiv preprint
arXiv:1812.08186, 2018.
[50] Giacomo Torlai and Roger G Melko. Neural de-
coder for topological codes. Physical review letters,
119(3):030501, 2017.
[51] Paul Baireuther, Thomas E O’Brien, Brian Tarasinski,
and Carlo WJ Beenakker. Machine-learning-assisted
correction of correlated qubit errors in a topological
code. Quantum, 2:48, 2018.
[52] Stefan Krastanov and Liang Jiang. Deep neural network
probabilistic decoder for stabilizer codes. Scientific re-
ports, 7(1):11003, 2017.
[53] Savvas Varsamopoulos, Ben Criger, and Koen Ber-
tels. Decoding small surface codes with feedforward
neural networks. Quantum Science and Technology,
3(1):015004, 2017.
[54] Christopher Chamberland and Pooya Ronagh. Deep
neural decoders for near term fault-tolerant experi-
ments. Quantum Science and Technology, 3(4):044002,
2018.
[55] Nishad Maskara, Aleksander Kubica, and Tomas
Jochym-O’Connor. Advantages of versatile neural-
network decoding for topological codes. Physical Review
A, 99(5):052351, 2019.
[56] Nikolas P Breuckmann and Xiaotong Ni. Scalable neu-
ral network decoders for higher dimensional quantum
codes. Quantum, 2:68, 2018.
[57] Ryan Sweke, Markus S Kesselring, Evert PL van
Nieuwenburg, and Jens Eisert. Reinforcement learning
decoders for fault-tolerant quantum computation. arXiv
preprint arXiv:1810.07207, 2018.
[58] Paul Baireuther, MD Caio, B Criger, Carlo WJ
Beenakker, and Thomas E O’Brien. Neural network de-
coder for topological color codes with circuit level noise.
New Journal of Physics, 21(1):013003, 2019.
[59] Xiaotong Ni. Neural network decoders for large-distance
2d toric codes. arXiv preprint arXiv:1809.06640, 2018.
[60] Philip Andreasson, Joel Johansson, Simon Liljestrand,
and Mats Granath. Quantum error correction for the
toric code using deep reinforcement learning. Quantum,
3:183, 2019.
[61] Amarsanaa Davaasuren, Yasunari Suzuki, Keisuke Fujii,
and Masato Koashi. General framework for construct-
ing fast and near-optimal machine-learning-based de-
coder of the topological stabilizer codes. arXiv preprint
arXiv:1801.04377, 2018.
[62] Ye-Hua Liu and David Poulin. Neural belief-
propagation decoders for quantum error-correcting
codes. Physical review letters, 122(20):200501, 2019.
[63] Savvas Varsamopoulos, Koen Bertels, and Carmen G
Almudever. Designing neural network based decoders
for surface codes. arXiv preprint arXiv:1811.12456,
2018.
[64] Savvas Varsamopoulos, Koen Bertels, and Carmen G
Almudever. Decoding surface code with a dis-
tributed neural network based decoder. arXiv preprint
arXiv:1901.10847, 2019.
[65] Thomas Wagner, Hermann Kampermann, and Dagmar
Bruß. Symmetries for a high level neural decoder on the
toric code. arXiv preprint arXiv:1910.01662, 2019.
[66] Chaitanya Chinni, Abhishek Kulkarni, and Dheeraj M
Pai. Neural decoder for topological codes using
pseudo-inverse of parity check matrix. arXiv preprint
arXiv:1901.07535, 2019.
[67] Milap Sheth, Sara Zafar Jafarzadeh, and Vlad Ghe-
orghiu. Neural ensemble decoding for topologi-
cal quantum error-correcting codes. arXiv preprint
arXiv:1905.02345, 2019.
[68] Laia Domingo Colomer, Michalis Skotiniotis, and Ra-
mon Muñoz-Tapia. Reinforcement learning for opti-
mal error correction of toric codes. arXiv preprint
arXiv:1911.02308, 2019.
[69] Austin Fowler. Towards sufficiently fast quantum error
correction. Conference QEC 2017, 2017.
[70] John Preskill. Lecture notes for physics 229: Quantum
information and computation. California Institute of
Technology, 16, 1998.
[71] Michael A Nielsen and Isaac Chuang. Quantum com-
putation and quantum information, 2002.
[72] Austin G Fowler. Optimal complexity correction of
correlated errors in the surface code. arXiv preprint
arXiv:1310.0863, 2013.
[73] Nicolas Delfosse and Jean-Pierre Tillich. A decoding al-
gorithm for css codes using the x/z correlations. In 2014
IEEE International Symposium on Information Theory,
pages 1071–1075. IEEE, 2014.
[74] David K Tuckett, Stephen D Bartlett, and Steven T
Flammia. Ultrahigh error threshold for surface
codes with biased noise. Physical review letters,
120(5):050505, 2018.
[75] Eric Dennis, Alexei Kitaev, Andrew Landahl, and John
Preskill. Topological quantum memory. Journal of
Mathematical Physics, 43(9):4452–4505, 2002.
[76] Robert Raussendorf and Jim Harrington. Fault-tolerant
quantum computation with high threshold in two di-
mensions. Physical review letters, 98(19):190504, 2007.
[77] Robert Raussendorf, Jim Harrington, and Kovid Goyal.
Topological fault-tolerance in cluster state quantum
computation. New Journal of Physics, 9(6):199, 2007.
[78] Austin G Fowler, Ashley M Stephens, and Peter
Groszkowski. High-threshold universal quantum com-
putation on the surface code. Physical Review A,
80(5):052312, 2009.
[79] Nicolas Delfosse and Naomi H Nickerson. Almost-linear
time decoding algorithm for topological codes. arXiv
preprint arXiv:1709.06218, 2017.
[80] Min-Hsiu Hsieh and François Le Gall. Np-hardness of
decoding quantum error-correction codes. Physical Re-
view A, 83(5):052331, 2011.
[81] Yu Tomita and Krysta M Svore. Low-distance surface
codes under realistic quantum noise. Physical Review
A, 90(6):062320, 2014.
[82] Vladimir Kolmogorov. Blossom v: a new implemen-
tation of a minimum cost perfect matching algorithm.
Mathematical Programming Computation, 1(1):43–67,
2009.
[83] Norman P Jouppi, Cliff Young, Nishant Patil, David
Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah
Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al.
In-datacenter performance analysis of a tensor process-
ing unit. In 2017 ACM/IEEE 44th Annual International
Symposium on Computer Architecture (ISCA), pages 1–
12. IEEE, 2017.
[84] Nicolas Delfosse and Gilles Zémor. Linear-time maxi-
mum likelihood decoding of surface codes over the quan-
tum erasure channel. arXiv preprint arXiv:1703.01517,
2017.
[85] Robert Endre Tarjan. Efficiency of a good but not lin-
ear set union algorithm. Journal of the ACM (JACM),
22(2):215–225, 1975.
[86] Naveen Muralimanohar, Rajeev Balasubramonian, and
Norman P Jouppi. Cacti 6.0: A tool to model large
caches. HP laboratories, 27:28, 2009.
[87] Gyu-hyeon Lee, Dongmoon Min, Ilkwon Byun, and
Jangwoo Kim. Cryogenic computer architecture mod-
eling with memory-side case studies. In Proceedings of
the 46th International Symposium on Computer Archi-
tecture, pages 774–787. ACM, 2019.
[88] Craig Gidney and Martin Ekerå. How to factor 2048
bit rsa integers in 8 hours using 20 million noisy qubits.
arXiv preprint arXiv:1905.09749, 2019.
[89] Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim,
Hongyi Xin, Onur Mutlu, Phillip B Gibbons, Michael A
Kozuch, and Todd C Mowry. Linearly compressed
pages: a low-complexity, low-latency main memory
compression framework. In Proceedings of the 46th An-
nual IEEE/ACM International Symposium on Microar-
chitecture, pages 172–184. ACM, 2013.
[90] Luis Villa, Michael Zhang, and Krste Asanovic. Dy-
namic zero compression for cache energy reduction.
In Proceedings 33rd Annual IEEE/ACM International
Symposium on Microarchitecture. MICRO-33 2000,
pages 214–220. IEEE, 2000.
[91] Jishen Zhao, Sheng Li, Jichuan Chang, John L Byrne,
Laura L Ramirez, Kevin Lim, Yuan Xie, and Paolo
Faraboschi. Buri: Scaling big-memory computing with
hardware-based memory expansion. ACM Transac-
tions on Architecture and Code Optimization (TACO),
12(3):31, 2015.
[92] Jungrae Kim, Michael Sullivan, Esha Choukse, and
Mattan Erez. Bit-plane compression: Transforming
data for better compression in many-core architectures.
In 2016 ACM/IEEE 43rd Annual International Sympo-
sium on Computer Architecture (ISCA), pages 329–340.
IEEE, 2016.
[93] Alaa R Alameldeen and David A Wood. Adap-
tive cache compression for high-performance proces-
sors. ACM SIGARCH Computer Architecture News,
32(2):212, 2004.
[94] Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu,
Phillip B Gibbons, Michael A Kozuch, and Todd C
Mowry. Base-delta-immediate compression: practical
data compression for on-chip caches. In Proceedings of
the 21st international conference on Parallel architec-
tures and compilation techniques, pages 377–388. ACM,
2012.
[95] Keisuke Fujii and Yuuki Tokunaga. Error and loss tol-
erances of surface codes with general lattice structures.
Physical Review A, 86(2):020303, 2012.
[96] Nicolas Delfosse, Pavithran Iyer, and David Poulin.
Generalized surface codes and packing of logical qubits.
arXiv preprint arXiv:1606.07116, 2016.
[97] Hector Bombin and Miguel Angel Martin-Delgado.
Topological quantum distillation. Physical review let-
ters, 97(18):180501, 2006.
[98] Gilles Zémor. On cayley graphs, surface codes, and the
limits of homological coding for quantum error correc-
tion. In International Conference on Coding and Cryp-
tology, pages 259–273. Springer, 2009.
[99] Nicolas Delfosse. Tradeoffs for reliable quantum infor-
mation storage in surface codes and color codes. In 2013
IEEE International Symposium on Information Theory,
pages 917–921. IEEE, 2013.
[100] Nikolas P Breuckmann and Barbara M Terhal. Con-
structions and noise threshold of hyperbolic surface
codes. IEEE transactions on Information Theory,
62(6):3731–3744, 2016.
[101] Richard Versluis, Stefano Poletto, Nader Khammassi,
Brian Tarasinski, Nadia Haider, David J Michalak,
Alessandro Bruno, Koen Bertels, and Leonardo Di-
Carlo. Scalable quantum circuit and control for a su-
perconducting surface code. Physical Review Applied,
8(3):034021, 2017.
[102] Frederic T Chong, Diana Franklin, and Margaret
Martonosi. Programming languages and compiler design
for realistic quantum hardware. Nature, 549(7671):180–
187, 2017.
[103] Margaret Martonosi and Martin Roetteler. Next steps
in quantum computing: Computer science’s role. arXiv
preprint arXiv:1903.10541, 2019.
[104] Dave Wecker and Krysta M Svore. Liqui|>: A software
design architecture and domain-specific language for
quantum computing. arXiv preprint arXiv:1402.4467,
2014.
[105] Krysta M Svore, Alan Geller, Matthias Troyer, John
Azariah, Christopher Granade, Bettina Heim, Vadym
Kliuchnikov, Mariia Mykhailova, Andres Paz, and Mar-
tin Roetteler. Q#: Enabling scalable quantum comput-
ing and development with a high-level domain-specific
language. arXiv preprint arXiv:1803.00652, 2018.
[106] Microsoft. Q#, Accessed: November 19, 2019.
https://docs.microsoft.com/en-us/quantum/?view=
qsharp-preview.
[107] Andrew W Cross, Lev S Bishop, John A Smolin, and
Jay M Gambetta. Open quantum assembly language.
arXiv preprint arXiv:1707.03429, 2017.
[108] IBM. Qiskit, Accessed: November 19, 2019. https:
//qiskit.org/.
[109] Robert S Smith, Michael J Curtis, and William J Zeng.
A practical quantum instruction set architecture. arXiv
preprint arXiv:1608.03355, 2016.
[110] Rigetti Computing. pyquil, Accessed: November 19,
2019. http://docs.rigetti.com/en/stable/.
[111] Google. Cirq, Accessed: November 19, 2019. https:
//github.com/quantumlib/Cirq.
[112] TU Delft. Qasm, Accessed: November 19, 2019. https:
//www.quantum-inspire.com/kbase/qasm/.
[113] N Khammassi, GG Guerreschi, I Ashraf,
JW Hogaboam, CG Almudever, and K Bertels.
cqasm v1.0: Towards a common quantum assembly
language. arXiv preprint arXiv:1805.09607, 2018.
[114] Alexander S Green, Peter LeFanu Lumsdaine, Neil J
Ross, Peter Selinger, and Benoît Valiron. Quipper: a
scalable quantum programming language. In ACM SIG-
PLAN Notices, volume 48, pages 333–342. ACM, 2013.
[115] Quipper, Accessed: November 19, 2019. https://www.
mathstat.dal.ca/~selinger/quipper/.
[116] Ali J Abhari, Arvin Faruque, Mohammad J Dousti,
Lukas Svec, Oana Catu, Amlan Chakrabati, Chen-Fu
Chiang, Seth Vanderwilt, John Black, and Fred Chong.
Scaffold: Quantum programming language. Technical
report, Princeton Univ. NJ Dept. of Computer Science,
2012.
[117] Scaffold, Accessed: November 19, 2019. https://
scaffcc.llvm.org.cn/.
[118] Jennifer Paykin, Robert Rand, and Steve Zdancewic.
Qwire: a core language for quantum circuits. In ACM
SIGPLAN Notices, volume 52, pages 846–858. ACM,
2017.
[119] Qwire, Accessed: November 19, 2019. https://github.
com/inQWIRE/QWIRE.
[120] Alexander J McCaskey, Dmitry I Lyakh, Eugene F
Dumitrescu, Sarah S Powers, and Travis S Humble.
Xacc: A system-level software infrastructure for hetero-
geneous quantum-classical computing. arXiv preprint
arXiv:1911.02452, 2019.
[121] Qcor, Accessed: November 19, 2019. https://github.
com/ORNL-QCI/qcor.
[122] Ville Bergholm, Josh Izaac, Maria Schuld, Christian
Gogolin, Carsten Blank, Keri McKiernan, and Nathan
Killoran. Pennylane: Automatic differentiation of hy-
brid quantum-classical computations. arXiv preprint
arXiv:1811.04968, 2018.
[123] Xanadu. Pennylane, Accessed: November 19, 2019.
https://pennylane.readthedocs.io/en/latest/.
[124] Nathan Killoran, Josh Izaac, Nicolás Quesada, Ville
Bergholm, Matthew Amy, and Christian Weedbrook.
Strawberry fields: A software platform for photonic
quantum computing. Quantum, 3:129, 2019.
[125] Xanadu. Strawberry fields, Accessed: November 19,
2019. https://strawberryfields.readthedocs.io/
en/latest/.
[126] Axel Dahlberg and Stephanie Wehner. Simulaqron—a
simulator for developing quantum internet software.
Quantum Science and Technology, 4(1):015001, 2018.
[127] Simulaqron, Accessed: November 19, 2019. http://
www.simulaqron.org/.
[128] Damian S Steiger, Thomas Häner, and Matthias Troyer.
Projectq: an open source software framework for quantum
computing. Quantum, 2:49, 2018.
[129] Projectq, Accessed: November 19, 2019. https://
projectq.ch/.
[130] Andrew Cross. The ibm q experience and qiskit open-
source quantum computing software. In APS Meeting
Abstracts, 2018.
[131] Prakash Murali, Jonathan M Baker, Ali Javadi Abhari,
Frederic T Chong, and Margaret Martonosi. Noise-
adaptive compiler mappings for noisy intermediate-scale
quantum computers. arXiv preprint arXiv:1901.11054,
2019.
[132] Swamit S Tannu and Moinuddin K Qureshi. A case for
variability-aware policies for nisq-era quantum comput-
ers. arXiv preprint arXiv:1805.10224, 2018.
[133] Gushu Li, Yufei Ding, and Yuan Xie. Tackling the qubit
mapping problem for nisq-era quantum devices. arXiv
preprint arXiv:1809.02573, 2018.
[134] Alwin Zulehner, Alexandru Paler, and Robert Wille.
Efficient mapping of quantum circuits to the ibm qx
architectures. In 2018 Design, Automation & Test in
Europe Conference & Exhibition (DATE), pages 1135–
1138. IEEE, 2018.
[135] Pranav Gokhale, Yongshan Ding, Thomas Propson,
Christopher Winkler, Nelson Leung, Yunong Shi,
David I Schuster, Henry Hoffmann, and Frederic T
Chong. Partial compilation of variational algorithms
for noisy intermediate-scale quantum machines. In Pro-
ceedings of the 52nd Annual IEEE/ACM International
Symposium on Microarchitecture, pages 266–278. ACM,
2019.
[136] Jeff Heckey, Shruti Patil, Ali JavadiAbhari, Adam
Holmes, Daniel Kudrow, Kenneth R Brown, Diana
Franklin, Frederic T Chong, and Margaret Martonosi.
Compiler management of communication and paral-
lelism for quantum computation. In ACM SIGARCH
Computer Architecture News, volume 43, pages 445–
456. ACM, 2015.
[137] Yongshan Ding, Adam Holmes, Ali Javadi-Abhari, Di-
ana Franklin, Margaret Martonosi, and Frederic Chong.
Magic-state functional units: Mapping and scheduling
multi-level distillation circuits for fault-tolerant quan-
tum architectures. In 2018 51st Annual IEEE/ACM In-
ternational Symposium on Microarchitecture (MICRO),
pages 828–840. IEEE, 2018.
[138] Swamit S Tannu, Zachary A Myers, Prashant J Nair,
Douglas M Carmean, and Moinuddin K Qureshi. Tam-
ing the instruction bandwidth of quantum computers
via hardware-managed error correction. In 2017 50th
Annual IEEE/ACM International Symposium on Mi-
croarchitecture (MICRO), pages 679–691. IEEE, 2017.
[139] Xiang Fu, MA Rol, CC Bultink, J Van Someren,
Nader Khammassi, Imran Ashraf, RFL Vermeulen,
JC De Sterke, WJ Vlothuizen, RN Schouten, et al. An
experimental microarchitecture for a superconducting
quantum processor. In Proceedings of the 50th Annual
IEEE/ACM International Symposium on Microarchi-
tecture, pages 813–825. ACM, 2017.
[140] Ali Javadi-Abhari, Pranav Gokhale, Adam Holmes, Di-
ana Franklin, Kenneth R Brown, Margaret Martonosi,
and Frederic T Chong. Optimized surface code com-
munication in superconducting quantum computers. In
Proceedings of the 50th Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture, pages 692–705.
ACM, 2017.
[141] DJ Reilly. Challenges in scaling-up the control in-
terface of a quantum computer. arXiv preprint
arXiv:1912.05114, 2019.
[142] R McDermott and MG Vavilov. Accurate qubit con-
trol with single flux quantum pulses. Physical Review
Applied, 2(1):014007, 2014.
[143] R McDermott, MG Vavilov, BLT Plourde, FK Wil-
helm, PJ Liebermann, OA Mukhanov, and TA Ohki.
Quantum–classical interface based on single flux quan-
tum digital logic. Quantum science and technology,
3(2):024004, 2018.
[144] Kangbo Li, R McDermott, and Maxim G Vav-
ilov. Hardware-efficient qubit control with single-flux-
quantum pulse sequences. Physical Review Applied,
12(1):014044, 2019.
[145] Joseph C Bardin, Evan Jeffrey, Erik Lucero, Trent
Huang, Ofer Naaman, Rami Barends, Ted White,
Marissa Giustina, Daniel Sank, Pedram Roushan, et al.
29.1 a 28nm bulk-cmos 4-to-8ghz <2mw cryogenic pulse
modulator for scalable quantum computing. In 2019
IEEE International Solid-State Circuits Conference-
(ISSCC), pages 456–458. IEEE, 2019.
[146] SJ Pauka, K Das, R Kalra, A Moini, Y Yang, M Trainer,
A Bousquet, C Cantaloube, N Dick, GC Gardner, et al.
A cryogenic interface for controlling many qubits. arXiv
preprint arXiv:1912.01299, 2019.
[147] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon,
Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio
Boixo, Fernando GSL Brandao, David A Buell, et al.
Quantum supremacy using a programmable supercon-
ducting processor. Nature, 574(7779):505–510, 2019.
[148] Jeremy Hsu. Ces 2018: Intel's 49-qubit chip shoots for
quantum supremacy. IEEE Spectrum Tech Talk, 2018.
[149] The International Business Machines Corpo-
ration. Ibm raises the bar with a 50-qubit
quantum computer. Sighted at Newsroom IBM:
https://newsroom.ibm.com/2019-09-18-IBM-Opens-
Quantum-Computation-Center-in-New-York-Brings-
Worlds-Largest-Fleet-of-Quantum-Computing-Systems-
Online-Unveils-New-53-Qubit-Quantum-System-for-
Broad-Use, 2019.
[150] IonQ. Ionq harnesses single-atom qubits to build the
world’s most powerful quantum computer. Sighted at
IonQ News: https://ionq.com/news/december-11-2018,
2019.
[151] Alpine quantum technologies, Accessed: November 19,
2019. https://www.aqt.eu/.
[152] Jarrod R McClean, Jonathan Romero, Ryan Babbush,
and Alán Aspuru-Guzik. The theory of variational
hybrid quantum-classical algorithms. New Journal of
Physics, 18(2):023023, 2016.
[153] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann.
A quantum approximate optimization algorithm. arXiv
preprint arXiv:1411.4028, 2014.
[154] Roman Orus, Samuel Mugel, and Enrique Lizaso. Quan-
tum computing for finance: overview and prospects. Re-
views in Physics, page 100028, 2019.
[155] Simon J Devitt, William J Munro, and Kae Nemoto.
Quantum error correction for beginners. Reports on
Progress in Physics, 76(7):076001, 2013.
[156] Barbara M Terhal. Quantum error correction for quan-
tum memories. Reviews of Modern Physics, 87(2):307,
2015.
[157] Earl T Campbell, Barbara M Terhal, and Christophe
Vuillot. Roads towards fault-tolerant universal quantum
computation. Nature, 549(7671):172, 2017.
[158] Christophe Vuillot. Is error detection helpful on ibm 5q
chips? arXiv preprint arXiv:1705.08957, 2017.
[159] Robin Harper and Steven T Flammia. Fault-tolerant
logical gates in the ibm quantum experience. Physical
review letters, 122(8):080504, 2019.
