Quantum computation promises significant computational advantages over classical computation for some problems. However, quantum hardware suffers from much higher error rates than classical hardware. As a result, extensive quantum error correction is required to execute a useful quantum algorithm. The decoder is a key component of the error correction scheme: its role is to identify errors faster than they accumulate in the quantum computer, and it must be implemented with minimal hardware resources in order to scale to the regime of practical applications. In this work, we consider surface code error correction, which is the most popular family of error-correcting codes for quantum computing, and we design a decoder micro-architecture for the Union-Find decoding algorithm. We propose a three-stage fully pipelined hardware implementation of the decoder that significantly speeds up the decoder. Then, we optimize the amount of decoding hardware required to perform error correction simultaneously over all the logical qubits of the quantum computer. By sharing resources between logical qubits, we obtain a 67% reduction in the number of hardware units and a 70% reduction in memory capacity. Moreover, we reduce the bandwidth required for the decoding process by a factor of at least 30× using low-overhead compression algorithms. Finally, we provide numerical evidence that our optimized micro-architecture can be executed fast enough to correct errors in a quantum computer.
Quantum computing promises significant speed-up over conventional computers for specific applications such as integer factorization [1], physics and chemistry simulations [2] [3] [4] and database search [5].
The primary obstacle to the implementation of quantum algorithms solving industrial-size problems is the high noise rate in any quantum device, which makes the output of a quantum computation indistinguishable from a random output. A fault-tolerant quantum computer, in which quantum bits, or qubits, are regularly refreshed by quantum error correction [6], is necessary to perform a useful computation on noisy quantum hardware.
Most classical error correction schemes can be adapted to the quantum setting thanks to the CSS construction [7, 8] and the stabilizer formalism [9], providing a quantum version of standard families of classical error-correcting codes [10] such as repetition codes, Hamming codes, Reed-Muller codes, BCH codes, LDPC codes and polar codes.
The main difference with the classical setting is the very high noise rate that affects current quantum hardware, often of the order of 1% per quantum gate, which makes quantum error correction much more challenging to implement. This is because it relies on the measurement of quantum parity checks that are likely to introduce additional noise to the qubits. Moreover, one must be able to implement a universal set of quantum gates on encoded qubits, or logical qubits. In a fault-tolerant quantum computer (FTQC), the computation is performed by alternating logical quantum gates and quantum error correction cycles that remove the noise injected by logical gates. Both quantum error correction and logical quantum gates must be implemented through fault-tolerant gadgets [11] [12] [13] to avoid the injection of pathological error configurations that would be uncorrectable by the subsequent correction cycle.
* nidelfos@microsoft.com
In this work, we consider a fault-tolerant quantum computer based on surface codes [14, 15], which are the most promising family of error-correcting codes for very noisy quantum hardware. Surface codes can correct up to 1% of noise on all the basic components of the quantum computer and they can be implemented on a square grid of qubits using exclusively local quantum gates acting on nearest-neighbor qubits. For comparison, most quantum error correction codes, such as the quantum Hamming code (Steane code), require an error rate below $10^{-5}$ [16].
In this paper, we focus on the design of an error decoder, or simply decoder, which is the primary building block in charge of error correction. The decoder takes the syndrome as input, which is the data extracted from quantum parity check measurements, and it returns an estimate of the error. Given this estimate, the effect of the error can be easily reversed. In a fault-tolerant quantum computer, the decoder must satisfy the following three design constraints. cal applications that may require millions of physical qubits.
A number of surface code decoding algorithms have been proposed in the past 20 years [14, and many satisfy the Accuracy Constraint, which is often the main motivation. However, it remains unclear whether the decoding can be made fast enough to satisfy the Latency Constraint without degrading the decoder performance [69]. Moreover, the decoding problem is generally studied for a single logical qubit, ignoring the Scalability Constraint, whereas practical applications require hundreds or thousands of logical qubits encoded into millions of physical qubits [3]. A substantial amount of decoding hardware is required to decode all the qubits of the quantum computer simultaneously. Most of the work on quantum error correction decoders focuses on algorithmic aspects of the problem. Here, we consider this problem through the lens of computer architecture and we propose a decoder micro-architecture that simultaneously satisfies our three design constraints.

Figure 1: A naive decoding architecture with a decoder associated with each logical qubit and our Error Decoding Architecture based on optimized decoder blocks shared across multiple logical qubits. A low-overhead compression algorithm is used to reduce the bandwidth cost. Our optimized decoder block is described in Fig. 12.
We propose a decoder micro-architecture based on the Union-Find decoding algorithm [30, 43]. We choose the Union-Find decoder for its accuracy and its simplicity. It is proven to achieve good decoding performance. Moreover, it comes with an almost-linear time complexity and it is also very fast in practice because it requires no floating-point arithmetic and no matrix operations. Finally, the simplicity of the decoder allows us to design a specialized hardware implementation that leads to a significant speed-up. Our main contributions are the following.
1. We propose a hardware implementation of the Union-Find decoder based on three hardware units corresponding to each step of the algorithm.
2. We design a three-stage pipeline based on our three hardware units that brings an important speed-up to the Union-Find decoder by parallelizing the implementation of the decoding stages.
3. We observe that the utilization of different components of the Union-Find decoder pipeline varies with the unit type and across the logical qubits and hence propose an efficient time-division multiplexing that allows logical qubits to share decoding resources within a decoder block without compromising the error correction capabilities.
4. We propose different compression algorithms adapted to a cryogenic setting in order to reduce the bandwidth consumed to send the syndrome data to the decoder.
5. Combining all the previous results, we describe an Error Decoding Architecture (EDA), represented in Fig. 1, that scales the decoder design for a large FTQC while reducing hardware costs. The number of hardware units is reduced by 67% and we obtain a 30× bandwidth reduction while preserving the decoder accuracy. Moreover, we demonstrate by numerical simulations that our architecture leads to a decoder that is fast enough to satisfy the Latency Constraint, despite the high noise rate of quantum hardware.
Items 1 and 2 provide a hardware acceleration of the decoder, and items 3 and 4 lead to a satisfying solution to the Scalability Constraint. Our EDA is optimized carefully in order to guarantee that the error correction capability of the initial UF decoder is preserved, which guarantees that the Accuracy Constraint is satisfied. The numerical simulation of our optimized micro-architecture (item 5) accounts for the limitation imposed by shared hardware resources, which demonstrates that our decoder simultaneously satisfies the three design constraints.
This article exploits a number of ideas from computer architecture such as pipelining and resource optimization. We believe this approach is necessary in order to scale quantum machines. Even though we provide a detailed micro-architecture for the Union-Find decoder, the general principles of our design apply to any decoding algorithm. A key ingredient is the simplicity of the decoder and its decomposition into independent steps, which leads to a natural pipeline and a speed-up by instruction-level parallelism.
I. BACKGROUND AND MOTIVATION

A. Qubits and decoherence
We refer to Preskill's lecture notes [70] and Nielsen and Chuang's book [71] for a great overview of the field of quantum computation and quantum information theory. A qubit is the basic unit of information in a quantum computer. A qubit is described by a complex vector $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, which represents the superposition of two basis states $|0\rangle$ and $|1\rangle$, with $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. Without error correction, a quantum state rapidly decoheres due to the accumulation of tiny rotations. By constantly measuring the system, one can project these tiny rotations onto three types of errors denoted X, Y and Z and called Pauli errors. The bit-flip error X swaps the basis states $|0\rangle$ and $|1\rangle$ and maps the qubit $|\psi\rangle$ into $\beta|0\rangle + \alpha|1\rangle$. The phase-flip error Z is defined by $Z|\psi\rangle = \alpha|0\rangle - \beta|1\rangle$. It introduces a relative phase between the two basis states. The error Y corresponds to a simultaneous bit-flip and phase-flip error, i.e., Y = XZ up to an overall phase which does not affect the result of the computation. We use the notation I for the identity operation that corresponds to the error-free case.
By definition of Y, it is enough to correct X-type and Z-type errors. In this work, we focus on the correction of X-type errors. By symmetry of the error correction scheme considered in this article, Z-type errors can be corrected using the exact same mechanisms by swapping the roles of X and Z. For simplicity, we assume that the correction of X-type and Z-type errors is performed independently, although a performance gain is possible by taking into account the correlations between X and Z [72] [73] [74].
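As an illustration of the Pauli errors described above, the following Python sketch applies X, Z and Y to a single-qubit state stored as the pair of amplitudes (α, β). The function names and the sample state are illustrative assumptions, not part of the decoder design; the matrix used for Y is the standard Pauli matrix, which equals XZ up to the phase i.

```python
# A single-qubit state alpha|0> + beta|1> is stored as the tuple (alpha, beta).

def apply_x(state):
    """Bit-flip error X: swaps the amplitudes of |0> and |1>."""
    alpha, beta = state
    return (beta, alpha)

def apply_z(state):
    """Phase-flip error Z: negates the amplitude of |1>."""
    alpha, beta = state
    return (alpha, -beta)

def apply_y(state):
    """Pauli Y, i.e. the matrix [[0, -i], [i, 0]]."""
    alpha, beta = state
    return (-1j * beta, 1j * alpha)

def equal_up_to_phase(s, t, tol=1e-12):
    """True if the normalized states s and t differ only by a global phase."""
    # For normalized states, |<s|t>| = 1 exactly when s = c * t with |c| = 1.
    overlap = s[0].conjugate() * t[0] + s[1].conjugate() * t[1]
    return abs(abs(overlap) - 1.0) < tol

# Hypothetical example state: |0.6|^2 + |0.8i|^2 = 1.
psi = (0.6 + 0j, 0.8j)

flipped = apply_x(psi)            # amplitudes swapped
phased = apply_z(psi)             # sign of the |1> amplitude flipped
xz = apply_x(apply_z(psi))        # Z followed by X
same_error = equal_up_to_phase(apply_y(psi), xz)
```

Here `same_error` evaluates to `True`, which is the statement that Y and XZ act identically up to an overall phase.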
B. Surface codes
In order to combat decoherence, we must perform the computation on encoded data, also called logical data, which is corrected at regular time intervals using a quantum error correcting code [6] [7] [8] [9]. In this work, we focus on the surface code [15, 75], which is the most promising quantum error correcting code for a quantum computing architecture due to its high error threshold, which means that the error correction protocol can be implemented with very noisy quantum hardware. An error rate below 1% for all the components of the quantum computer is sufficient in order to obtain encoded qubits with better quality than our initial physical qubits [76] [77] [78]. In order to scale to the massive sizes required for practical applications, it is necessary to build quantum hardware whose fault rate is far below the 1% threshold. This is because error correction does not decrease the error rate sufficiently if the initial qubit error rate is too close to the threshold. In this work, we consider a noise rate of $10^{-3}$.
The family of surface codes is the most widely considered candidate for designing a fault-tolerant quantum computer. The distance-d surface code, which we denote SC(d), encodes a logical qubit into a square grid of (2d − 1) × (2d − 1) qubits, alternating data qubits, which store the logical information, and ancilla qubits, used to detect errors, as shown in Figure 2 with the distance-three surface code SC(3). The main reason for the success of the surface code is its locality, which significantly simplifies the quantum chip design. Error correction with surface codes only requires local interactions between ancilla qubits and their nearest-neighbor data qubits, that is, at most four qubits. The minimum distance d of the code measures the error correction capability. A larger minimum distance d results in an increased error tolerance at the price of an increased qubit overhead. Encoding physical qubits that suffer from an error rate p using a distance-d surface code, we obtain a logical qubit with error rate

$$p_{\mathrm{Log}}(d, p) = 0.15 \cdot (40 \cdot p)^{(d+1)/2},$$

that we call the logical error rate. This heuristic formula, derived from the numerical results of [79], provides a good approximation of the logical error rate in the regime of low error rate ($p \ll 10^{-2}$). It is valid in the context of the current work, that is, when using the Union-Find decoder and for the phenomenological noise model introduced in Section I C. Throughout this article, we illustrate our design with numerical results for distance-11 surface codes, which is a reasonable distance for a first generation of fault-tolerant quantum computers since it allows nontrivial quantum algorithms to be implemented on logical qubits while keeping the qubit overhead to a few hundred qubits per logical qubit. Assuming a physical error rate of $p = 10^{-3}$, the logical qubit error rate drops to $p_{\mathrm{Log}} \approx 6 \cdot 10^{-10}$, allowing for the implementation of large-depth quantum algorithms.
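The overhead and error-rate figures quoted above can be checked with a short Python sketch. The qubit count follows directly from the (2d − 1) × (2d − 1) grid. For the logical error rate, the sketch assumes a heuristic of the form A · (p / p_th)^((d+1)/2); the constants A = 0.15 and 1/p_th = 40 are assumptions chosen because they reproduce the value of about 6 · 10^-10 quoted for d = 11 and p = 10^-3, and the function names are hypothetical.

```python
def physical_qubits(d):
    """Number of qubits in the (2d - 1) x (2d - 1) grid of SC(d)."""
    return (2 * d - 1) ** 2

def data_qubits(d):
    """Data qubits of the grid: d^2 + (d - 1)^2."""
    return d * d + (d - 1) * (d - 1)

def ancilla_qubits(d):
    """Ancilla qubits of the grid: 2 d (d - 1)."""
    return 2 * d * (d - 1)

def logical_error_rate(p, d, prefactor=0.15, inverse_threshold=40.0):
    """Heuristic estimate; prefactor and inverse_threshold are assumed constants."""
    return prefactor * (inverse_threshold * p) ** ((d + 1) / 2)

qubits_d11 = physical_qubits(11)            # a few hundred qubits per logical qubit
p_log_d11 = logical_error_rate(1e-3, 11)    # close to 6e-10 under the assumed constants
```

With these assumptions, increasing d by 2 multiplies the estimate by 40p, that is, by 0.04 at p = 10^-3.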
C. The decoding problem
In this section, we review the decoding problem and the graphical formalism introduced in [75] .
Quantum error correction is a two-step process. First, a measurement circuit is executed producing a syndrome bit for each ancilla qubit. All syndrome bits can be extracted simultaneously. Then, a decoding algorithm is used to identify errors on data qubits based on the syndrome information. To avoid any confusion with other decoding operations used in this architecture, we sometimes refer to the decoder as the error decoder. The decoding subroutine is a purely classical operation that can be performed on highly reliable hardware. On the other hand, the syndrome extraction is implemented on noisy quantum hardware. In order to obtain sufficiently accurate information about the errors occurring on data qubits despite measurement errors, multiple rounds of syndrome extraction are performed. The decoder analyzes d consecutive rounds of syndrome data to produce an estimation of the error induced on the data qubits. In the absence of measurement errors, the decoding problem can be mapped onto a matching problem in a square grid as shown in Fig. 3 . A bit-flip error X acting on a single data qubit is detected by a non-trivial syndrome bit on the incident X-type ancilla qubits as Fig. 2(b) shows. More generally, a chain of X-errors is detected by its endpoints. In order to recover the chain of X-errors given the syndrome values (its endpoints), the basic idea of the decoding algorithm is to build a short chain of X-type errors that matches the detected endpoints. Errors on the boundary of the lattice are detected only on one of their endpoints.
To address the issue of measurement errors, d rounds of syndrome extraction are performed, resulting in a matching problem in a three-dimensional graph [75]. Fig. 3 shows the cubic graph representing errors and syndrome bits for the distance-3 surface code. In what follows, we refer to this graph as the decoding graph.
We simulate errors occurring on data qubits and incorrect syndrome values using the phenomenological noise model [75]. Each edge of the decoding graph corresponds to a potential error, with horizontal edges representing X-errors on data qubits and vertical edges corresponding to syndrome bit-flips. We assume that an error occurs on each edge, independently, with probability p. For each vertex v in the bulk of the lattice, a syndrome value $s(v) \in \{0, 1\}$ is extracted, which is the parity of the number of errors incident to v. Just like in the planar case, the goal of the decoder is to estimate the error by matching together the vertices v supporting a non-trivial syndrome $s(v) = 1$. This formalism allows both qubit errors and measurement errors to be handled in the same way.
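The noise model above can be sketched in a few lines of Python. The graph representation (a list of vertex pairs), the boundary labels "L" and "R", and the function names are assumptions made for illustration; the sketch works for any decoding graph, not only the cubic one.

```python
import random

def sample_errors(edges, p, rng):
    """Each edge carries an error independently with probability p."""
    return [e for e in edges if rng.random() < p]

def syndrome(bulk_vertices, error_edges):
    """s(v) = parity of the number of error edges incident to the bulk vertex v."""
    s = {v: 0 for v in bulk_vertices}
    for u, v in error_edges:
        for w in (u, v):
            if w in s:          # boundary vertices carry no syndrome bit
                s[w] ^= 1
    return s

# Hypothetical toy graph: a path L - 0 - 1 - 2 - R with boundary vertices L and R.
bulk = [0, 1, 2]
edges = [("L", 0), (0, 1), (1, 2), (2, "R")]

chain = [(0, 1), (1, 2)]                    # a chain of two errors
s_chain = syndrome(bulk, chain)             # non-trivial only at the endpoints 0 and 2
s_boundary = syndrome(bulk, [("L", 0)])     # boundary error: detected at one endpoint

rng = random.Random(0)                      # fixed seed for reproducibility
sampled = sample_errors(edges, 1e-3, rng)
```

The two hand-written examples reproduce the properties stated in the text: a chain of errors is detected by its endpoints, and an error on the boundary is detected on only one of its endpoints.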
The relevance of the phenomenological noise model is justified in [75]. For further study of our decoding architecture tailored to a specific type of qubit, one may consider a more precise noise model such as the circuit-level noise model [75]. In this work, we chose the phenomenological noise model because it is simple enough to develop a good intuition about the decoder and it captures the essential properties of the quantum hardware, which guarantees that all the ideas proposed in the current work generalize to a more precise model and remain relevant for a practical device.
D. Existing Error Decoding Algorithms
In this section, we discuss different decoding strategies. The noise model describes the probability of all possible errors. Given this information, we can derive the probability of each error when a given syndrome is observed. The ultimate goal of the decoder is to return an error whose probability is maximal given the syndrome measured, i.e., a most likely error. A decoder that achieves this performance is said to be a maximum likelihood decoder [75] . For an arbitrary error-correcting code, it is generally not possible to implement efficiently a maximum likelihood decoder [80] . However in the case of the surface code, several algorithms achieve a good error correction performance. In what follows, we discuss several promising decoding strategies.
Lookup Table (LUT) decoder [81]: This decoder implements a maximum likelihood decoder using a lookup table indexed by the syndrome bits. The corresponding LUT entry stores the correction to be applied to the data qubits. The LUT size grows exponentially with the code distance, making this design unsuitable for large FTQCs.
Minimum Weight Perfect Matching (MWPM) decoder [14] : The MWPM decoder provides an estimation of a most likely error based on a graph pairing algorithm, the MWPM algorithm, that can be implemented in polynomial time [82] . This decoder is one of the most effective in terms of error correction capabilities, even though its worst-case time complexity,
where |V| is the size of the decoding graph, makes it too slow for large-distance surface codes. Fowler suggested a parallel implementation of this algorithm that reduces the average time complexity to O(1), although the worst-case complexity remains significant [20]. This decoder relies on large amounts of parallelism from several ASICs for each logical qubit, but the study does not discuss the system architecture or the number of ASICs needed.
Machine Learning (ML) decoder [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] [65] [66] [67] [68]: ML decoders train neural networks with the underlying error probability distribution. Decoding is then treated as an inference problem where the syndrome data is an input to the neural network, which infers the correction. ML decoders require substantial computational resources and the size of the training data grows quickly with the code distance. They also require long training times, and are primarily studied for small code distances. There exist some preliminary studies for larger code distances [56, 59] and proposals to obtain better performance through distributed neural networks [64] and hardware platforms [56] such as GPUs, FPGAs, and TPUs [83].
Tensor Network (TN) decoder [46] [47] [48] [49]: The probability of each possible error can be represented as a tensor network, which leads to a decoding algorithm that contracts this tensor network. The contraction of the tensor network requires extensive matrix operations which may be hard to scale. However, the algorithm achieves a very good error-correction performance that is optimal [47] or quasi-optimal. Although it has not been studied precisely, tensor network decoders could benefit from a hardware speed-up from neural accelerators such as a TPU [83].
Union-Find (UF) decoder [79, 84]: This is a recently proposed algorithm that offers a correction in almost-linear time $O(n\alpha(n))$, where $\alpha(n) \leq 5$ for all practical purposes. It uses the Union-Find data structure in order to achieve a performance similar to that of the MWPM decoder without using computationally intensive matching algorithms.

Table I provides an abstract comparison of prominent decoding algorithms in terms of the three key design constraints highlighted in the introduction: accuracy, runtime and scalability.
Given the simplicity of the algorithm and its low time complexity, we use the UF decoder as the default algorithm for our studies. However, the design principles, optimizations, and scalability analysis of the present work will hold true for other decoders as well. Similarly, the syndrome compression analysis in Section IV applies to any decoder-quantum substrate interface irrespective of the decoder and qubit technology in use.
E. Union-Find decoder
In this section, we review the strategy of the Union-Find decoder [79, 84]. As explained in the introduction, our principal motivation for choosing this decoding algorithm is its speed and its simplicity.

Figure 4: The three main steps of the Union-Find decoder for a 2d decoding graph with the surface code SC(7). (a) An X-type error and its syndrome in the decoding graph. The goal is to recover the error given the red syndrome nodes. To mark half-edges we will add a vertex in the middle of each edge of the decoding graph.

The Union-Find decoder operates in three steps that we review in Figure 4. We illustrate the decoding procedure using a two-dimensional decoding graph since there is no major difference with the cubic case that is relevant in practice. The algorithm takes as an input a set of nodes supporting non-trivial syndrome values and its goal is to recover the error living on the edges of the decoding graph. (a) During the first step, even clusters are grown around non-trivial syndrome nodes. (b) Then, a spanning tree is built for each cluster and is oriented from a root to the leaves. (c) Finally, we build an estimation of the error using the syndrome by traversing the clusters in reverse order.
II. HARDWARE DESIGN AND PIPELINING
In this section, we design three hardware units implementing the three stages of the Union-Find decoder, as Figure 5 shows. The Graph Generator (Gr-Gen) produces the grown clusters obtained in Figure 4. To reduce bandwidth and latency issues, it is more favorable to operate the decoding circuitry close to the quantum substrate in a cryogenic regime. Our main design constraint is the limited hardware resources available in a cold environment. We optimize our design to minimize the memory requirement and the number of memory reads.
Our implementation of the decoding algorithm benefits from hardware acceleration in two ways. First, a fully pipelined design allows performance improvement through enhanced parallelism across the different processing units. While the correction engine works on cluster i, the DFS engine can build the spanning tree for cluster i + 1. Second, in a general-purpose processor, the read latency depends upon where the data is located and it can range up to several hundreds of cycles if the data needs to be fetched from the off-chip main memory to the on-chip caches. For our specialized hardware, the processing elements can directly access the data stored on-chip, which requires far fewer cycles. In this work, we assume a readout time of four cycles to read 32-bit data.
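The benefit of overlapping the DFS and correction stages can be illustrated with a small timing model in Python. This is a sketch under stated assumptions, not a model of the actual hardware: the per-cluster cycle counts are hypothetical, and the model assumes two alternating edge stacks, so the DFS engine may start a cluster only once the stack it writes to has been drained by the correction engine.

```python
def sequential_cycles(dfs, corr):
    """Total cycles when each cluster is fully processed before the next one starts."""
    return sum(dfs) + sum(corr)

def pipelined_cycles(dfs, corr):
    """Total cycles when the DFS and correction stages overlap with two edge stacks.

    dfs[k] and corr[k] are the (hypothetical) cycle counts of the two stages
    for cluster k. Cluster k uses edge stack k % 2.
    """
    dfs_end = []
    corr_end = []
    for k, (t_dfs, t_corr) in enumerate(zip(dfs, corr)):
        prev_dfs = dfs_end[k - 1] if k >= 1 else 0
        # The stack used by cluster k was last used by cluster k - 2.
        stack_free = corr_end[k - 2] if k >= 2 else 0
        dfs_end.append(max(prev_dfs, stack_free) + t_dfs)
        prev_corr = corr_end[k - 1] if k >= 1 else 0
        # Correction needs both its input edge list and a free correction engine.
        corr_end.append(max(dfs_end[k], prev_corr) + t_corr)
    return corr_end[-1] if corr_end else 0

# Hypothetical cycle counts for three clusters.
dfs_cycles = [4, 4, 4]
corr_cycles = [3, 3, 3]
t_seq = sequential_cycles(dfs_cycles, corr_cycles)   # 21 cycles
t_pipe = pipelined_cycles(dfs_cycles, corr_cycles)   # 15 cycles
```

In this toy example the pipelined schedule hides all but the last correction step behind the DFS of the following cluster.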
A. Graph Generator
The Graph Generator module takes the syndrome as an input and generates a spanning forest by growing clusters around non-trivial syndrome bits (non-zero syndrome bits). Instead of growing all surrounding edges as in Figure 4(b), we only add the edges that reach new vertices. This directly produces a spanning forest without extra cost. The spanning forest is built using two fundamental graph operations: Union() and Find() [85]. Figure 6 shows the design of the three modules that implement the decoding algorithm. The Gr-Gen module Figure 7 . The size table entries for the non-trivial syndrome bits are initialized to 1 as shown in Figure 7. These tables aid the Union() and Find() operations to merge clusters after the growth phase. They are indexed by cluster indices. The tables are sized for the maximum number of clusters possible, which is equal to the total number of vertices in the surface code lattice. The tree traversal registers store the vertices of each cluster visited in the Find() operation. Since the decoding algorithm grows all odd clusters until the parity is even, odd clusters must be detected quickly. To this end, we use parity registers as shown in Figure 6. The parity registers store one parity bit per cluster, indicating whether the cluster is odd or even. For a reasonable code distance of 11, seven 32-bit registers are enough. For larger code distances, we store the additional parity information in the memory and read it in advance when required to hide the memory latency.
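The parity registers can be pictured as a packed bit array. The Python sketch below is only an illustration: the class name, the mapping from cluster index to bit position, and the `merge` helper are assumptions, with the default of seven 32-bit registers taken from the text.

```python
class ParityRegisters:
    """One parity bit per cluster, packed into fixed-width registers (illustrative)."""

    def __init__(self, num_registers=7, width=32):
        self.width = width
        self.regs = [0] * num_registers

    def _locate(self, cluster):
        # Assumed mapping: cluster index -> (register index, bit position).
        reg, bit = divmod(cluster, self.width)
        if reg >= len(self.regs):
            raise IndexError("cluster index exceeds register capacity")
        return reg, bit

    def toggle(self, cluster):
        reg, bit = self._locate(cluster)
        self.regs[reg] ^= 1 << bit

    def is_odd(self, cluster):
        reg, bit = self._locate(cluster)
        return ((self.regs[reg] >> bit) & 1) == 1

    def merge(self, kept, absorbed):
        """The parity of a merged cluster is the XOR of the two parities."""
        if self.is_odd(absorbed):
            self.toggle(kept)
            self.toggle(absorbed)   # the absorbed index no longer names an odd cluster

    def odd_clusters(self):
        """Indices of the clusters that still need to grow."""
        return [r * self.width + b
                for r, word in enumerate(self.regs)
                for b in range(self.width)
                if (word >> b) & 1]

parity = ParityRegisters()
parity.toggle(3)      # cluster 3 holds one non-trivial syndrome bit
parity.toggle(40)     # cluster 40 holds one non-trivial syndrome bit
before = parity.odd_clusters()
parity.merge(3, 40)   # two odd clusters merge into one even cluster
after = parity.odd_clusters()
```

Finding the odd clusters then amounts to scanning a handful of machine words rather than a table in memory.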
The control logic reads the parity registers and grows clusters with odd parity (called the growth phase) by writing to the STM and ZDR, and by adding newly added edges that touch other cluster boundaries to the FES. The STM is not updated for edges that connect to other clusters to prevent double growth. It is updated when clusters are merged by reading from the FES. The logic checks if a newly added edge connects two clusters by reading the root table entries of the vertices connected by the edge (call these the primary vertices). This is equivalent to the Find() operation. The vertices visited on the path to find the root of each primary vertex are stored in the tree traversal registers as shown in Figure 8(a). The root table entries for these vertices are updated to directly point to the root of the cluster to minimize the depth of the tree for future traversals. This operation, called path compression, is a key feature of the Union-Find algorithm and enables the reduction of the tree depth, amortizing the cost of the Find() operation. For example, Figure 8(a) shows the state of two clusters and the root table at an instant in time. Let us assume that after a growth step, vertices 0 and 6 are connected and the two clusters must be merged. The tree traversal registers are used to update the root of vertex 0 as shown in Figure 8(b). Since the depth of the tree is compressed during every Find() operation, only a few 32-bit registers are sufficient. The proposed design uses 5 registers per primary vertex. If the primary vertices belong to different clusters, the root of the smaller cluster is updated to point to the root of the larger cluster.
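The Find() and Union() operations described above correspond to the textbook Union-Find data structure with path compression and union by size. The following Python sketch is a software analogue, not the hardware design: the class and attribute names are hypothetical, every size entry starts at 1 for simplicity, and the local list `path` plays the role of the tree traversal registers.

```python
class ClusterTables:
    """Root table and size table of a Union-Find forest (software sketch)."""

    def __init__(self, num_vertices):
        self.root = list(range(num_vertices))   # root table: each vertex is its own root
        self.size = [1] * num_vertices          # size table (simplified initialization)

    def find(self, v):
        """Return the root of v's cluster and compress the traversed path."""
        path = []                               # analogue of the tree traversal registers
        while self.root[v] != v:
            path.append(v)
            v = self.root[v]
        for u in path:                          # path compression
            self.root[u] = v
        return v

    def union(self, a, b):
        """Merge the clusters of a and b; the smaller root points to the larger one."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra                           # already in the same cluster
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.root[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

tables = ClusterTables(8)
tables.union(0, 1)
tables.union(1, 2)          # cluster {0, 1, 2} with root 0
tables.union(5, 6)
tables.union(6, 7)          # cluster {5, 6, 7} with root 5
tables.union(2, 6)          # a new edge between vertices 2 and 6 merges the clusters
depth_two_before = tables.root[6]   # 6 -> 5 -> 0 before compression
merged_root = tables.find(6)        # traverses 6 and 5, then points both at the root
depth_one_after = tables.root[6]
```

After the last `find`, vertex 6 points directly at the root, which is why only a few traversal registers are needed in practice.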
Delfosse et al. store the boundary of each cluster in their algorithm [79]. Based on a Monte-Carlo simulation showing that the average cluster diameter remains very small in the noise regime that is relevant for practical applications, we decided to compute the cluster boundary when it is needed in the growth phase, instead of consuming extra memory to store it.$^2$
To summarize, the Gr-Gen module detects odd parity clusters using the parity registers and grows them by reading and writing to the STM. The cluster growth is aided by the information stored in the root table, size table and FES.
B. Depth First Search Engine
The DFS engine processes the STM data produced by the Gr-Gen, which stores the set of grown even clusters. It uses the depth-first search algorithm to generate the list of edges that forms a spanning tree for each cluster in the STM.$^3$ The logic is implemented using a finite state machine and two stacks as shown in Figure 6. Stacks are used since the order in which edges are visited in the spanning tree must be reversed to perform correction by peeling [84]. The edge stack stores the list of visited edges while the pending edge stack is used to queue the next edges to explore in the ongoing DFS.
To enable pipelining and improve performance, we design the micro-architecture to include an alternate edge stack (Edge Stack 1 as shown in Figure 6). When there is more than one cluster, the correction engine works on the edge list of one of the traversed clusters while the DFS engine traverses the other. The DFS engine generates the list of edges visited to traverse a cluster using the DFS algorithm, and hence the number of memory reads required is directly proportional to the size of the clusters. By going over the STM row-wise and using the ZDR to visit only non-zero rows, the effective cost of generating clusters is reduced.
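The role of the two stacks in the DFS engine can be sketched in Python as follows. The adjacency-dictionary representation of a cluster and the function name are assumptions made for illustration; the hardware reads this information from the STM instead.

```python
def dfs_edge_stack(adjacency, root):
    """Depth-first traversal of one cluster.

    Returns the edge stack: the spanning-tree edges as (parent, child) pairs,
    in the order in which they were visited.
    """
    visited = {root}
    edge_stack = []                                        # visited edges
    pending = [(root, n) for n in adjacency[root]]         # pending edge stack
    while pending:
        parent, child = pending.pop()
        if child in visited:
            continue                                       # would close a cycle: skip
        visited.add(child)
        edge_stack.append((parent, child))
        for nxt in adjacency[child]:
            if nxt not in visited:
                pending.append((child, nxt))
    return edge_stack

# Hypothetical cluster with four vertices: edges 0-1, 0-2 and 1-3.
cluster = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
stack = dfs_edge_stack(cluster, root=0)
```

Every edge is pushed after the edge that leads to its parent, so popping the edge stack always yields an edge whose child is a leaf of the remaining tree. This is the reversed order required by the peeling step.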
Figure 9: Peeling for an example error graph performed in the Correction Engine. The status of the edge stack, error log, and syndrome hold register are shown for each step (panels: Error Graph, Step 1, Step 2, Step 3).
C. Correction Engine
The correction engine performs the peeling process of the decoder [84] and identifies the Pauli correction to apply. This requires access to the edge list (which is stored on the stack) and to the syndrome bits corresponding to the vertices along the edge list. The syndrome bits can be obtained by accessing the STM. However, this increases the logical complexity, the latency, and the number of memory requests that the STM is required to handle. To reduce the incoming memory traffic for the STM and eliminate the need for additional logic, the syndrome information is saved on the stack along with the edge index information by the DFS Engine. The temporary syndrome changes caused by peeling are saved in local registers (Syndrome Hold Registers shown in Figure 9). The Corr Engine also reads the error log of the last surface code cycle and updates the Pauli correction for the current edge. For example, if the error on an edge $e_0$ was Z in the previous logical cycle and it encounters a Z error in the current cycle too, the Pauli error for $e_0$ is updated to I as shown in Figure 9.
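The peeling step and the error log update can be sketched in Python as below. This is an illustrative software version under simplifying assumptions: a single Pauli type is tracked, so applying the same error twice on an edge cancels (the analogue of Z followed by Z giving I), and the names `peel`, `update_error_log` and `hold` are hypothetical.

```python
def peel(edge_stack, syndrome):
    """Peel a spanning tree from the leaves towards the root.

    edge_stack: (parent, child) edges in DFS visit order.
    syndrome:   dict mapping a vertex to 1 if its syndrome bit is non-trivial.
    Returns the list of edges in the correction and the final syndrome state.
    """
    stack = list(edge_stack)       # leave the caller's stack untouched
    hold = dict(syndrome)          # analogue of the syndrome hold registers
    correction = []
    while stack:
        parent, leaf = stack.pop()             # the child of the top edge is a leaf
        if hold.get(leaf, 0) == 1:
            correction.append((parent, leaf))  # this edge carries an error
            hold[leaf] = 0
            hold[parent] = hold.get(parent, 0) ^ 1   # move the syndrome to the parent
    return correction, hold

def update_error_log(error_log, correction):
    """Toggle each corrected edge: an error applied twice on an edge cancels out."""
    updated = set(error_log)
    for edge in correction:
        updated ^= {edge}
    return updated

# Hypothetical tree with edges 0-2, 0-1, 1-3 and non-trivial syndrome on vertices 2 and 3.
tree = [(0, 2), (0, 1), (1, 3)]
correction, remaining = peel(tree, {2: 1, 3: 1})
log = update_error_log({(0, 1)}, correction)   # edge (0, 1) was already in the log
```

The correction is the path 3, 1, 0, 2 joining the two syndrome vertices, and the edge (0, 1), already present in the previous log, is removed from the updated log.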
D. Hardware cost
We measure the hardware cost by estimating the amount of memory required. Table II shows the different contributions to the memory requirement.
The spanning tree memory (STM) used by the Gr-Gen and DFS engine accounts for most of the storage costs. It contains 1 bit per node of the decoding graph and at most 2 bits per edge (only 1 bit on the boundary). The decoding graph is a 3d cubic lattice with about $d^3$ vertices, which leads to a total of $7d^3$ bits for the STM.
The root table and the size table used in the Gr-Gen module contain $d^3$ entries each, and each entry consists of an integer index which can be uniquely identified using $\log_2(d^3)$ bits. Thus, the total sizes of the root and size tables are $3d^3 \log_2(d)$ bits each.
The size of the edge stacks $S_0$ and $S_1$ used by the DFS and the Corr engine is given by the maximum number of edges of a spanning tree of a cluster. The size of the spanning tree of a cluster $C_i$ is given by $|V(C_i)| - 1$, where $|V(C_i)|$ denotes the number of vertices of the cluster $C_i$. To fit any possible cluster, one could pick a stack that can store $d^3$ edges, which requires about $d^3 \log_2(d^3)$ bits. For simplicity, we ignore the Fusion Edge Stack and the Pending Edge Stack, which are in general significantly smaller than the two edge stacks $S_0$ and $S_1$ because they contain only a small subset of the edges of a cluster.
Figure 10: Distribution of the number of vertices per cluster after the growth step of the UF decoder for code distance $d = 11$ and physical error rate $p = 10^{-3}$ (horizontal axis: vertices per cluster; vertical axis: probability, log scale). This distribution is estimated by a Monte-Carlo simulation using $10^7$ samples.
The size of the edge stacks can in principle reach the previous upper bound $d^3 \log_2(d^3)$ bits, in the case of a cluster that covers the whole decoding graph. This makes the stacks the most expensive element of the design. However, the probability that such a large cluster is reached remains extremely small in a practical noise regime. By analyzing the maximum size of a cluster after the growth step using Monte Carlo simulations, we optimize the stack size for a chosen code distance $d$ and a given physical error rate $p$. Figure 10 shows the cluster size distribution for code distance $d = 11$ and physical error rate $p = 10^{-3}$. For these parameters, the probability of a cluster of size larger than 80 is smaller than the logical error rate for this code. One can thus ignore the clusters of size larger than 80 without significantly affecting error correction performance. This reduces the stack memory requirement by a factor of about 10×, from 1.7 KBytes to 0.13 KBytes.
To reduce the memory requirement further, the edge stacks can be sized to half the maximum size of a cluster spanning tree (derived from simulation), optimizing for the common case. In the rare event of an overflow, the alternate stack is used.
In general, the memory required for each of the two DFS stacks is approximately $3 S(d, p) \log_2(d)$ bits, where $S(d, p)$ is the minimum integer $s$ such that the probability of having a cluster with more than $s$ edges at the output of the Gr-Gen is below $p_{Log}(d, p)$. We say that a stack overflow failure occurs if the DFS engine encounters a cluster that does not fit in a stack, that is, a cluster with more than $S(d, p)$ edges. By construction, the stack sizes are optimized in such a way that
$$p_{Sof}(d, p) \le p_{Log}(d, p),$$
where $p_{Sof}$ denotes the probability of a decoding failure due to a stack overflow error.
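The choice of $S(d, p)$ from simulation data can be illustrated with a short sketch. It assumes that the Monte-Carlo simulation yields, for each sample, the edge count of the largest cluster; the function names `min_stack_size` and `stack_bits` are hypothetical, and the empirical tail frequency stands in for the true overflow probability.

```python
import math


def min_stack_size(cluster_edge_counts, p_log):
    """Smallest s with empirical P(cluster has more than s edges) <= p_log.

    cluster_edge_counts: non-empty list with one entry per Monte-Carlo sample,
    the number of spanning-tree edges of the largest cluster in that sample.
    """
    n = len(cluster_edge_counts)
    # The tail probability only changes at observed values, so it is enough
    # to test s = 0 and every observed edge count, in increasing order.
    for s in [0] + sorted(set(cluster_edge_counts)):
        overflow = sum(1 for c in cluster_edge_counts if c > s) / n
        if overflow <= p_log:
            return s
    return max(cluster_edge_counts)  # not reached: the tail is 0 at the maximum


def stack_bits(s, d):
    """Approximate size of one DFS stack: s entries of 3*log2(d) bits each."""
    return 3 * s * math.log2(d)
```

Estimating a tail probability as small as a logical error rate needs many samples or importance sampling; the sketch only shows the selection rule.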
E. Comparison with other decoders
For comparison, we provide a rough estimate of the memory capacity required for the MWPM decoder. The average number of faults in the decoding graph for a given noise rate is $p|E|$, where $|E|$ is the number of edges of the decoding graph. For low values of $p$ (such as $p = 10^{-3}$), many of these configurations of $p|E|$ faulty edges are non-overlapping, and most of the edges are in the bulk of the lattice, which results in about $w = 2p|E|$ non-zero syndrome bits. Given a set of $w$ non-zero syndrome nodes, the MWPM decoder generates a complete graph with $w + 1$ vertices (and $w(w + 1)/2$ edges) and it performs a Minimum Weight Matching algorithm in this graph. A state-of-the-art implementation of this algorithm due to Kolmogorov [82] consumes about 161 bits per edge (4 pointers, 1 integer and 1 bit per edge), which brings the memory capacity for the MWPM decoder to at least
$$161 \cdot \frac{w(w + 1)}{2} = 161 \, p|E| \, (2p|E| + 1)$$
bits. Therein, we use $|E| \approx 3d^3$. In order to provide a non-trivial lower bound when $d$ is small, we use the fact that the MWPM decoder must correct any set of $(d - 1)/2$ faults. Such a set may lead to $w = d - 1$ non-zero syndrome bits, resulting in a lower bound of $161(d - 1)d/2 \approx 80d^2$ bits. Taking the best of both lower bounds, we obtain the result of Figure 11, which shows that our UF decoder design requires slightly more memory than a MWPM decoder for low code distances and outperforms the MWPM decoder for larger distances, making it more scalable. Given that we only consider average-weight fault configurations and not the worst case for the MWPM decoder, we believe that our lower bound on the MWPM memory capacity is very optimistic and that the UF decoder actually surpasses the MWPM decoder in terms of memory capacity well before distance 20, the crossover observed in the figure. Other decoders are much more memory intensive than our design. The LUT decoder requires the storage of more than $2^{1000}$ correction bit-strings for $d = 11$, and ML decoders cost several MBs to GBs of memory depending on the implementation and the code distance.
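The two MWPM lower bounds can be evaluated with a few lines of code. The sketch below follows the estimates in the text (161 bits per edge of a complete graph on $w + 1$ vertices, $|E| \approx 3d^3$, and $w = 2p|E|$ or $w = d - 1$); the function name and its default argument are hypothetical.

```python
def mwpm_memory_lower_bound_bits(d, p, bits_per_edge=161):
    """Best (largest) of the two memory lower bounds for the MWPM decoder."""
    num_edges = 3 * d ** 3                  # |E| ~ 3 d^3 edges in the decoding graph
    w_avg = 2 * p * num_edges               # average number of non-zero syndrome bits
    avg_bound = bits_per_edge * w_avg * (w_avg + 1) / 2
    w_min = d - 1                           # (d-1)/2 faults give up to d-1 syndrome bits
    small_d_bound = bits_per_edge * w_min * (w_min + 1) / 2
    return max(avg_bound, small_d_bound)
```

With $p = 10^{-3}$, the small-distance bound dominates at $d = 11$, while the average-weight bound takes over at larger distances such as $d = 21$.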
III. RESOURCE OPTIMIZATION
For a system with a large number $L$ of logical qubits, the most straightforward implementation allocates two decoders per logical qubit, one for each type of error, X and Z. Thus, for the baseline design, the decoding logic uses $2L$ decoders. However, the utilization of each pipeline stage varies, causing under-utilization of certain stages. Ideally, we want to reduce the number of decoders required for the overall system. In this section, we optimize not only the total number of decoders but also the exact number of modules of each type: Gr-Gen, DFS engine and Corr engine.
Figure 12: Design of a decoder block that contains $N$ Gr-Gen modules, $\alpha N$ DFS engines and $\beta N$ Corr engines. Select logic drives the multiplexers that connect the stages.
A. Sharing hardware modules
The core component of our optimized architecture is a decoder block as shown in Figure 12 which uses a reduced number of pipeline units to perform the decoding of N logical qubits. Our Error Decoding Architecture (EDA) shown in Figure 1 uses L/N decoder blocks to perform error correction over the L logical qubits of the computer.
The motivation for our design is that the growth stage implemented by the N Gr-Gen modules is significantly more complex than the DFS stage. Therefore, we expect the DFS engine to wait for a fraction of the time while the Gr-Gen module terminates. Instead of waiting, we prefer to use a smaller number of DFS engines (αL < L) that share the work of L Gr-Gen modules. The value α will be optimized to keep the waiting time of the DFS engines minimal, saving a fraction (1 − α) of the DFS hardware. We proceed in the same way to optimize the number βL of Corr engines.
The hardware overhead of this optimization is multiplexors and demultiplexors on the datapath as shown in Figure 12 . Memory requests generated by the Corr Engine are routed to the correct memory locations using a demultiplexor. The select logic prioritizes the first ready component and uses round robin arbitration to generate appropriate select signals for the multiplexors. For example, if four Gr-Gen units share a DFS Engine, and the second Gr-Gen finishes cluster formation earlier than other units, it receives access to the DFS Engine. The round robin policy ensures fairness while sharing resources.
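The round-robin arbitration performed by the select logic can be modeled in a few lines. This is a behavioral sketch only: the class name `RoundRobinArbiter` and its interface are hypothetical, and a hardware implementation would use a rotating priority encoder rather than a loop.

```python
class RoundRobinArbiter:
    """Grants a shared DFS Engine to one of several Gr-Gen units."""

    def __init__(self, num_requesters):
        self.n = num_requesters
        self.last = num_requesters - 1   # so that unit 0 has priority first

    def grant(self, ready):
        """ready: list of booleans, one per Gr-Gen unit.

        Returns the index of the granted unit, or None if no unit is ready.
        The search starts just after the previously granted unit, which
        prevents any single unit from monopolizing the shared engine.
        """
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if ready[idx]:
                self.last = idx
                return idx
        return None
```

With four Gr-Gen units sharing one DFS Engine, a lone ready unit is granted immediately; when several units are ready, the grant rotates among them.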
B. Decode block design constraint
As explained previously, in order to correct circuit faults and measurement errors, the decoder needs $d$ consecutive rounds of syndrome data, which form a logical cycle. To prevent backlog problems, the decoder must provide a correction before the end of the next logical cycle, when a new decoding request arrives. If a decoder block fails to terminate its work within a logical cycle, errors start accumulating and spreading over the quantum computer. We refer to this type of failure as a timeout failure. In order to ensure that timeout failures do not significantly degrade the decoding performance, we impose the following constraint for the design of a decoder block serving $N$ logical qubits:
$$p_{Tof}(d, p) \le N \, p_{Log}(d, p) \qquad (4)$$
where $p_{Tof}(d, p)$ is the timeout failure probability for the decoder block. In other words, the timeout failure probability per logical qubit must be lower than the probability of a logical error. This condition ensures that the fault rate at the output of the whole quantum computation is at most doubled due to timeout failures. In what follows, we propose a fast and hardware-efficient decoder block design that respects the constraint (4).
C. Modules runtime simulation
We model the decoder performance by studying the number of reads. The write operations performed are read-modify-write, and the writeback is not on the critical path. We assume 4 cycles latency for memory accesses and a 4 GHz clock frequency [86, 87] .
Denote by $C_1, \dots, C_m$ the set of clusters generated by the Gr-Gen module. A single growth step for a cluster $C$ requires reading a set of rows of the STM that cover the cluster. We estimate this number by $\mathrm{diam}(C)^2$, assuming that the cluster spreads roughly uniformly in all directions. Summing over all clusters and growth steps, the total number of memory requests generated in Gr-Gen is approximated by the sum
$$\tau_{GG} = \sum_{i=1}^{m} \sum_{j=1}^{\mathrm{diam}(C_i)} j^2 \qquad (5)$$
because each growth step increases the diameter by 1.
The number of memory requests in the DFS engine and in the Corr engine to treat a cluster $C_i$ are both given by the size of its spanning tree, which is $|V(C_i)| - 1$, where $|V(C_i)|$ is the number of vertices of $C_i$. Including all clusters, we obtain the estimate
$$\tau_{DFS} = \sum_{i=1}^{m} \left( |V(C_i)| - 1 \right) \qquad (6)$$
for the total number of reads in the DFS engine or in the Corr engine.
Figure 13: Correlation between Gr-Gen and DFS Engine execution time for code distance $d = 11$ and physical error rate $p = 10^{-3}$ (axis label: DFS Engine execution cost). Each dot corresponds to a random error configuration and the runtimes of the two modules are estimated using Eq. (5) and (6). Duplicate data points cannot be observed on this plot.
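The read-count model of this section can be written as a small cost function. The sketch assumes that the Gr-Gen cost of a cluster of final diameter $D$ is $1^2 + 2^2 + \dots + D^2$ (one growth step per unit of diameter, each reading about $\mathrm{diam}^2$ rows) and that the DFS and Corr cost of a cluster is $|V| - 1$. It further assumes, as a simplification not stated in the text, that reads are fully serialized, so that each read costs the full memory latency. All function names are hypothetical.

```python
def gr_gen_reads(diameters):
    """Estimated Gr-Gen reads: sum over clusters of 1^2 + 2^2 + ... + diam^2."""
    return sum(j * j for diam in diameters for j in range(1, diam + 1))


def dfs_reads(vertex_counts):
    """Estimated DFS (or Corr) reads: one per spanning-tree edge, |V| - 1 per cluster."""
    return sum(v - 1 for v in vertex_counts)


def runtime_ns(reads, latency_cycles=4, clock_ghz=4.0):
    """Runtime in ns if every read costs the full latency (serialized reads)."""
    return reads * latency_cycles / clock_ghz
```

With the 4-cycle latency and 4 GHz clock assumed above, each serialized read costs 1 ns, so read counts translate directly into nanoseconds.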
In order to select the DFS engine ratio $\alpha$, we estimate the ratio between the execution times of the Gr-Gen and the DFS engine by a Monte-Carlo simulation of $\tau_{GG}$ and $\tau_{DFS}$ for the distance-11 surface code with an error rate $p = 10^{-3}$, as shown in Figure 13. To estimate these runtimes, we sample families of clusters by generating random errors according to the phenomenological noise model described in Section I C, and by simulating the growth step of the decoder for these errors. This provides us with samples of cluster families from the output of the Gr-Gen module. Our Monte-Carlo simulation produces the result of Figure 13, which shows the correlation between the execution times in the Gr-Gen and the DFS engine. As expected, more time is spent in the Gr-Gen unit. We observe roughly a factor of two between the execution times of the two units, which suggests that one can eliminate half of the DFS units.
D. Optimized decoder block
The results of Section III C encourage us to consider a decoder block with parameters $\alpha = 0.5$ and $\beta = 1$. However, nothing guarantees that this choice leads to a decoder block that is fast enough to satisfy condition (4). In this section, we design an optimized decoder block for the surface code with distance 11 that satisfies (4) and completes decoding within only 325 ns in the noise regime $p = 10^{-3}$, under the memory frequency and latency assumptions above. This demonstrates that our decoder block is fast enough to perform the surface code decoding. The decoder is in fact about 30 times faster than the logical cycle time of the distance-11 surface code, which is about 11 µs [88].
We consider the smallest decoder block with $\alpha = 0.5$ and $\beta = 1$. It includes two logical qubits, that is, four error configurations to correct (two X-type and two Z-type), two Gr-Gen units, one DFS engine and one Corr engine. We refer to this optimized design as the (4, 2, 1, 1)-decoder block. Figure 14 shows our estimate of the execution time of the whole block to decode the two logical qubits, obtained by simulating the whole pipeline of the (4, 2, 1, 1)-block with a Monte-Carlo simulation with importance sampling. We observe that by interrupting the decoding after 325 ns, we obtain a block that satisfies (4).
For $L$ logical qubits, the numbers of Gr-Gen units, DFS engines, and Corr engines required in the optimized architecture are $L$, $L/2$, and $L/2$ respectively. Compared to the baseline, which uses $2L$ units of each type, the total numbers of Gr-Gen units, DFS engines, and Corr engines are thus reduced by 2×, 4×, and 4× respectively.
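The unit counts above can be tabulated directly. The sketch below compares the baseline, which uses $2L$ units of each type, with the optimized counts $L$, $L/2$, $L/2$; the function name `unit_counts` is hypothetical and $L$ is assumed to be even.

```python
def unit_counts(num_logical_qubits):
    """Return (baseline, optimized, fractional saving) in hardware units."""
    L = num_logical_qubits
    baseline = {"gr_gen": 2 * L, "dfs": 2 * L, "corr": 2 * L}
    optimized = {"gr_gen": L, "dfs": L // 2, "corr": L // 2}
    saving = 1 - sum(optimized.values()) / sum(baseline.values())
    return baseline, optimized, saving
```

Counting all three module types together, the total drops from $6L$ to $2L$ units, a saving of two thirds, consistent with the 67% reduction quoted in the abstract.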
The total memory capacity required for the baseline design and for our optimized decoder block is summarized in Table III for 1000 qubits encoded with the distance-11 surface code. The previous (4, 2, 1, 1)-block leads to a saving of about 50% of the memory capacity. In order to reduce the memory requirement further, we can use a shared root table and a shared size table between the two Gr-Gen modules of the decoder block. This leads to a slight slowdown of the decoder because the two Gr-Gen modules cannot perform the growth step simultaneously, but the two STMs can be used in parallel. While the first STM is used by a DFS engine, the second one can be used by a Gr-Gen module to grow clusters. A (4, 2, 1, 1)-block with a single root table and a single size table achieves a 70% (3.5×) memory reduction compared to the naive design.
Figure 14: Distribution of the execution time of the (4, 2, 1, 1)-block with code distance 11 and error rate $10^{-3}$ (vertical axis: probability, log scale). The shaded region has probability smaller than $N p_{Log}$ (for $N = 2$ logical qubits), which means that by interrupting the decoder block after 325 ns, we obtain a timeout failure probability that satisfies condition (4).
IV. SYNDROME DATA COMPRESSION
A major challenge in designing any error decoding architecture is the large bandwidth required between the quantum substrate and the decoding logic. In order to perform error decoding, the syndrome measurement data must be transported from the quantum substrate to the decoding units. For a given qubit plane with $L$ logical qubits, each encoded using a surface code of distance $d$, about $2d^2 L$ bits must be sent at the end of each syndrome measurement cycle, which requires significant bandwidth, on the order of several Gb/s for a reasonable number of logical qubits and code distance. Data transmission at a lower bandwidth, on the other hand, reduces the effective time left for error decoding, since a decoder must provide an estimate of the error within $d$ syndrome measurement cycles. In this section, we consider three compression techniques to reduce the bandwidth needs and we analyze their performance in different noise regimes. We focus on compression schemes that use simple encodings and do not require large hardware complexity.
Figure 15: (a) Dynamic Zero Compression (DZC). The compressed data consists of the zero-bit indicators, which provide the locations of the blocks '000', followed by the content of the non-zero blocks. (b) Sparse Representation. We can store sparse data efficiently by providing the number of non-zero bits and their indices.
A. Can data compression work?
An approach to reduce the memory bandwidth requirements in conventional memory systems is data compression [89] [90] [91] [92] [93] [94] . Data compression works well when data is sparse and has low entropy. To justify the potential of syndrome data compression, we estimate the sparsity of the syndrome data analytically for realistic noise regimes.
For a noise strength $p$ in the phenomenological noise model introduced in Section I C, the expected number of faults in the 3D decoding graph, which contains about $3d^3$ fault locations, is about $3d^3 p$. In the case of a distance-11 surface code with $p = 10^{-3}$, we expect only 4 errors on average, resulting in a non-trivial syndrome vector of length ≈ 1,000 with Hamming weight ≤ 8, since each error is detected by at most two non-trivial syndrome bits. The expected Hamming weight of the syndrome vectors drops even further for lower noise strength $p$.
B. Three low-overhead compression techniques
We consider the following three compression methods, which allow for a simple hardware implementation. Figure 15 provides the basic idea of the Sparse Representation and the Dynamic Zero Compression. The Geometry-based compression is a variant of the Dynamic Zero Compression that exploits the geometry of the lattice of qubits.
Dynamic Zero Compression (DZC): We use a technique based on Dynamic Zero Compression [90], as shown in Figure 15(a), to compress syndrome data. A syndrome vector of length $bm$ is grouped into $b$ blocks of $m$ bits each. We store the indicator vector of non-trivial blocks concatenated with the exact value of the non-trivial blocks.
A syndrome with $b$ blocks has length $bm$. However, if it contains only $w$ non-trivial blocks, it can be transmitted with the DZC technique using only $b$ bits for the non-trivial block indicator vector and $wm$ bits to send the non-trivial blocks, that is, a total of $b + wm$ bits.
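The DZC encoding described above can be sketched as follows. The function names are hypothetical, the syndrome is represented as a Python list of bits whose length is a multiple of the block size `m`, and an indicator bit of 1 marks a non-trivial block (the opposite polarity would work equally well).

```python
def dzc_compress(bits, m):
    """Split bits into blocks of m bits; return (indicator, payload).

    indicator[i] is 1 if block i contains a non-zero bit, else 0.
    payload is the concatenation of the non-trivial blocks only.
    """
    blocks = [bits[i:i + m] for i in range(0, len(bits), m)]
    indicator = [1 if any(block) else 0 for block in blocks]
    payload = [bit for block in blocks if any(block) for bit in block]
    return indicator, payload


def dzc_decompress(indicator, payload, m):
    """Rebuild the original bit list from the indicator and the payload."""
    bits, pos = [], 0
    for flag in indicator:
        if flag:
            bits.extend(payload[pos:pos + m])
            pos += m
        else:
            bits.extend([0] * m)
    return bits
```

For a 12-bit syndrome split into $b = 4$ blocks of $m = 3$ bits with $w = 2$ non-trivial blocks, the compressed form uses $b + wm = 10$ bits.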
Geometry-based compression (Geo-Comp):
This is an adaptation of DZC that also accounts for the geometry of the surface code lattice. The basic idea is that non-trivial syndrome values generally appear in pairs of neighboring bits. We can therefore increase the compression rate of the DZC technique by using square blocks that respect the structure of the lattice. With this block decomposition, two neighboring bits are more likely to fall in the same block, reducing the number $w$ of non-trivial blocks to send.
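The benefit of square blocks can be seen on a toy example. The sketch below counts the non-trivial blocks of a 2D syndrome layer for an arbitrary rectangular block shape; the function name and the nested-list representation of the layer are hypothetical, and the actual block layout used in hardware may differ.

```python
def count_nontrivial_blocks(grid, block_rows, block_cols):
    """Count blocks of shape block_rows x block_cols that contain a 1.

    grid: list of rows, each a list of 0/1 syndrome bits.
    """
    rows, cols = len(grid), len(grid[0])
    count = 0
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            if any(grid[r][c]
                   for r in range(r0, min(r0 + block_rows, rows))
                   for c in range(c0, min(c0 + block_cols, cols))):
                count += 1
    return count


# A vertical pair of neighboring non-trivial syndrome bits on a 4x4 layer.
layer = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
row_blocks = count_nontrivial_blocks(layer, 1, 4)     # linear 4-bit blocks
square_blocks = count_nontrivial_blocks(layer, 2, 2)  # square 4-bit blocks
```

Both decompositions use four blocks of four bits, but the vertical pair touches two row blocks and only one square block, so the compressed size drops from $4 + 2 \cdot 4 = 12$ bits to $4 + 1 \cdot 4 = 8$ bits. A pair can still straddle two square blocks, so the gain is statistical rather than guaranteed.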
In this work, we treat X-type and Z-type errors separately. However, for any of the three compression techniques described above, a slightly better compression rate can be obtained by compressing together the syndrome data corresponding to both types of errors.
In general, the number and size of the DZC blocks can be adjusted for a given noise model by computing the expected number of non-trivial syndrome blocks. Larger regions improve the compression ratio but also lead to more complex hardware by adding to the logic depth. With this in mind, we analyze small block sizes even for very low error rates.
Figure 16: Mean syndrome compression ratio as a function of the code distance $d$ for different physical error rates ($p = 10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$). We plot the best compression ratio among the three techniques considered in the present work: Sparse Representation, DZC and Geometry-based compression.
We determine the effectiveness of a compression scheme by analyzing the compression ratio:
$$\text{Compression Ratio} = \frac{\text{Actual Syndrome Length}}{\text{Compressed Syndrome Length}} \qquad (7)$$
The best compression scheme depends on the noise model. Sparse compression appears to be a good choice in the regime of very low error rates because the syndrome vector is often trivial and it is then sent as a single bit. In a noisy regime and for small-distance codes, we prefer the other compression schemes, DZC and Geometry-based compression. For any noise rate below $10^{-3}$, we achieve a compression ratio that varies between 20× for $d = 3$ and 400× for distance-21 surface codes.
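To illustrate Eq. (7) for the Sparse Representation, the sketch below computes the encoded length of a syndrome and the resulting compression ratio. The exact bit layout is an assumption made for this example, not the format used in the paper: one flag bit (0 for a trivial syndrome), then a count field wide enough to hold values up to the syndrome length, then one fixed-width index per non-zero bit. Function names are hypothetical.

```python
def sparse_encoded_length(bits):
    """Encoded length in bits of a syndrome under the assumed sparse layout."""
    n = len(bits)
    weight = sum(bits)
    if weight == 0:
        return 1                                # trivial syndrome: a single flag bit
    index_bits = max(1, (n - 1).bit_length())   # width of one position index
    count_bits = n.bit_length()                 # width of the non-zero count field
    return 1 + count_bits + weight * index_bits


def compression_ratio(original_length, compressed_length):
    """Eq. (7): actual syndrome length over compressed syndrome length."""
    return original_length / compressed_length
```

Under this layout, a trivial syndrome of 128 bits compresses by a factor of 128, while the same syndrome with two non-zero bits needs 1 + 8 + 2·7 = 23 bits.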
V. DISCUSSION
A. Scalability
In this paper, we use the Union-Find decoder to analyse the scope of architectural optimizations in designing high-performance and scalable decoders. However, the design principles apply to other decoders in general, such as machine learning or graph algorithm-based decoders. For example, machine learning decoders require several network layers. Overall resources can be reduced by pipelining layers in inference and by sharing these layers between logical qubits. Similarly, while our study only focuses on the regular surface code, the same analysis holds true for other types of QEC codes based on Euclidean lattices [95, 96] or color codes [97]. However, it seems non-trivial to adapt the STM used in our design to the non-trivial lattice topology of hyperbolic codes [98] [99] [100], and thus a different micro-architecture is needed. Lastly, the syndrome compression is valid for an arbitrary decoder-quantum substrate interface, independent of their types, although it can be improved by exploiting the code structure as we propose with the geometry-based compression.
B. Assumptions
We assume that all syndrome extraction circuits can be executed in parallel, an assumption used by most decoders [101]. The amount of parallelism depends on the qubit technology: a large amount of parallelism is achievable on modern superconducting qubits, whereas other types of qubits may offer less.
C. Noise Model
We use the phenomenological noise model for our study, and there is scope to further optimize the design for enhanced noise models and to account for correlations in errors. We consider a physical error rate of $10^{-3}$ because QEC codes cannot lower the logical error rate substantially unless the initial physical error rate is lower than the threshold, which is about 1%. For the system sizes we have considered in this paper, with about 100 to 1000 qubits, error rates of about $10^{-3}$ are required to run practical applications of scientific and commercial value.
VI. RELATED WORK
Designing a quantum computer requires full-stack solutions [102] and interdisciplinary research [103]. This has led to developments in programming languages, compilers [116] [130] [131] [132] [133] [134] [135] [136], microarchitecture [137] [138] [139] [140], control circuits [141] [142] [143] [144] [145] [146], and quantum devices. Although existing Noisy Intermediate Scale Quantum (NISQ) computers [147] [148] [149] [150] [151] are expected to scale up to hundreds of qubits and may outperform classical computers for some problems [152] [153] [154], they will still be too small to achieve fault-tolerance. In contrast, the scope of a quantum computer greatly broadens in the presence of fault-tolerance, and therefore designing FTQCs is an important area of research.
QEC plays a seminal role in FTQCs. In addition to the standard references [70, 71], we suggest the following recent reviews to learn more about QEC codes and fault tolerance [155] [156] [157]. Recent experimental results suggest that quantum error correction is reaching an inflection point where the quality of encoded qubits is better than the quality of raw qubits [158, 159]. Few articles consider the hardware aspects of decoder designs, for instance by discussing the potential of GPUs and ASICs [20, 56] and of high-speed circuits [19], or by describing the architecture of the neural networks in the case of ML decoders. These studies focus primarily on achieving higher performance and accuracy for a single logical qubit. In order to support the design of large-scale error-corrected quantum devices, more studies on the hardware aspects of decoder designs and their scalability to many logical blocks are necessary.
In a system-level study [138], Tannu et al. identified the requirement of large bandwidth for sending control instructions from the control processor to the qubits and proposed sharing micro-code between neighboring qubits. Using micro-code to deliver control pulses [139] only addresses communication from the control processor to the qubits, whereas large bandwidth is also essential to transmit syndrome data back from the qubits to the decoders. Since syndromes differ across logical qubits, depending on the error, it is not possible to send one syndrome for multiple qubits (sharing). We therefore explore the possibility of using syndrome compression through low-overhead compression schemes.
VII. CONCLUSION
The decoder is a key component of a fault-tolerant quantum computer, which is in charge of translating the output of the syndrome measurement circuits into error types and locations. Many decoding algorithms have been studied in the past 20 years [14, . They generally focus on improving either the decoder accuracy or the decoder runtime, or on providing a good compromise between them. In this work, we introduce a third constraint, the scalability constraint, which states that the decoder must scale to the regime of practical applications, for which thousands of logical qubits must be decoded simultaneously, and we design a decoder that satisfies the three design constraints: accuracy, latency and scalability. Namely, the decoder properly identifies the error that occurs (accuracy), it is fast enough to avoid accumulation of errors during the computation (latency), and it admits a resource-efficient hardware implementation that scales to the massive size required for industrial applications (scalability).
In order to achieve the scalability constraint, we study the scope and impact of micro-architectural optimizations in designing decoders for QEC and study the Union-Find decoder in depth as a case study. We also investigate a system-level framework for the Error Decoding Architecture whereby, instead of using dedicated decoding units per logical qubit, we multiplex the decoding resources across neighbouring logical qubits while minimizing the timeout errors due to lack of decoding resources and limiting the possibility of system failure. Finally, we investigate the feasibility of low-cost syndrome compression to reduce the memory bandwidth required for transmitting the syndrome information from the quantum substrate to the decoding hardware. Our solutions reduce the number of decoders by more than 50%, the amount of memory required by 70%, and the memory bandwidth by more than 30× for large FTQCs. Although we use the Union-Find decoder and surface codes for our study, the design principles, optimizations, and results from our study apply to other types of decoders and certain other QEC codes as well. The compression schemes discussed apply to any qubit technology and decoder.
In addition to substantial hardware savings, our optimized decoder micro-architecture significantly speeds up the Union-Find decoder. Our numerical simulations suggest that our design provides a decoder that is fast enough to perform error correction with superconducting qubits, assuming a surface code syndrome round of 1 µs [88]. Further study is necessary to confirm the validity of our model in a real device. Ultimately, we would like to consider an FPGA implementation or the fabrication of an ASIC based on our micro-architecture.
