Extensive quantum error correction is necessary in order to scale quantum hardware to the regime of practical applications. As a result, a significant amount of decoding hardware is necessary to process the colossal amount of data required to constantly detect and correct errors occurring over the millions of physical qubits driving the computation. The implementation of a recent highly optimized version of Shor's algorithm to factor a 2,048-bit integer would require more than 7 TBit/s of bandwidth for the sole purpose of quantum error correction and up to 20,000 decoding units.
Hundreds or thousands of high-quality qubits with an error rate of $10^{-10}$ or lower are necessary to implement quantum algorithms with industrial applications. In order to reach such high quality based on current quantum technology, logical qubits must be built from a large number of physical qubits and errors accumulating during the computation must be corrected at regular intervals.
The family of surface codes [1] [2] [3] [4] is the most promising quantum error-correcting scheme to deal with current noise levels that barely reach 0.1%. Using a distance-$d$ surface code, a logical qubit is encoded into a square grid of $d \times d$ data qubits, as Fig. 2 shows. Error correction is based on the measurement of $r = d^2 - 1$ syndrome bits, extracted using syndrome measurement circuits implemented on the plaquettes of the qubit grid as shown in Fig. 2. The syndrome data is collected by the readout device and sent to the decoding unit, which uses this information to detect and correct errors. In order to avoid accumulation of errors during the computation, the syndrome is constantly measured, producing $r$ syndrome bits for each syndrome measurement round. In the present work, we consider a syndrome measurement time of 1 µs, which is the time to implement the four rounds of CNOT gates and the final ancilla measurements of the syndrome measurement circuit.

Consider as an example the recent RSA factorization algorithm of [5], which relies on distance-27 surface codes, encoding $K \approx 10{,}000$ logical qubits (ignoring distillation qubits). The implementation of this algorithm requires a bandwidth of 7.3 TBit/s and two decoding units per logical qubit, that is 20,000 decoders, assuming independent correction of X-type and Z-type Pauli errors. It seems quite challenging to include such a formidable amount of decoding hardware to ensure fault tolerance in a quantum computer.

Figure 1: Left: Standard design with a readout device sending all syndrome data to the decoding unit. Two decoding units are used for each logical qubit, one for each type of error. Right: The readout device is equipped with a lazy decoder unit capable of correcting a large number of easy fault configurations, which avoids transmitting syndrome data to the decoding unit. We show the reduction in terms of bandwidth per logical qubit and number of decoding units obtained by introducing the lazy decoder in the case of a physical error rate $p = 10^{-5}$ with a logical error rate $p_L = 10^{-15}$. In this case, including lazy decoding units saves 99.9% of the bandwidth requirement and 99.9% of the complex decoding units.
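The figures quoted for the RSA example can be checked with a few lines of arithmetic. The sketch below assumes only what the text states ($d^2 - 1$ syndrome bits per round, one round per microsecond, two decoders per logical qubit); the function name `syndrome_bandwidth` is illustrative and not taken from the paper.

```python
def syndrome_bandwidth(d, tau=1e-6):
    """Syndrome bits per second produced by one distance-d logical qubit.

    d   : surface code distance (d*d - 1 syndrome bits per round)
    tau : duration of one syndrome measurement round, in seconds
    """
    return (d * d - 1) / tau


K = 10_000                           # logical qubits in the RSA example
per_qubit = syndrome_bandwidth(27)   # distance-27 surface code
total = K * per_qubit

print(f"{per_qubit / 1e6:.0f} Mbit/s per logical qubit")   # 728
print(f"{total / 1e12:.2f} Tbit/s for the whole machine")  # 7.28
print(f"{2 * K} decoding units (X-type and Z-type)")       # 20000
```

The total of about 7.28 Tbit/s matches the 7.3 TBit/s quoted above after rounding.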
In this work, we propose an alternative to the naive design that allocates one decoding unit for each decoding task by introducing a simple hard-decision decoder, which we refer to as the lazy decoder. Fig. 1 illustrates our design and the saving obtained for a specific set of parameters. The lazy decoder can be seen as a pre-decoder which only attempts to correct easy error configurations.
arXiv:2001.11427v1 [quant-ph] 30 Jan 2020
If no obvious correction exists, it quits and returns a failure mode. In that case, syndrome data is sent to a decoding unit that hosts a more sophisticated decoder achieving a good performance. Many complex decoding algorithms can play this role [1, ...].

The lazy decoder is designed to be as simple as possible. It consists of a single loop over syndrome bits, which makes it an ideal candidate for a low-level hardware implementation with FPGA or CMOS, and it is also easy to parallelize. We picture the lazy decoder as a hardware unit placed as close as possible to the readout device. In the worst case, we need two lazy decoders per logical qubit, but given the speed of this module, one can expect to share this unit among many qubits. We assume that the measurement device (and therefore the lazy decoder) is placed in the proximity of the qubits in order to avoid long feedback loops increasing the physical qubit clock cycle.
The decoding unit, which is significantly more complex than the lazy decoder, may be challenging to implement close to the qubits without introducing additional noise to the quantum plane. Therefore, we consider a decoding unit placed farther from the qubits, which leads to latency issues, justifying our focus on the bandwidth of the link between the readout device and the decoding unit. We ignore the bandwidth between the readout device and the adjacent lazy decoder.
Introducing the lazy decoder reduces the bandwidth required to send syndrome data from the readout device to the decoding unit because in most cases the nearby lazy decoder takes care of the correction. Moreover, the number of decoding units required is significantly reduced if one can rely on the lazy decoder a large fraction of the time. In what follows, we show that this design leads to a reduction of the decoding hardware of several orders of magnitude for good enough qubits, i.e. with error rates below the standard assumption of $10^{-3}$.
I. THE SURFACE CODE
The surface code encodes a logical qubit into a square grid of $d \times d$ data qubits, where $d$ is the minimum distance of the code, as shown in Fig. 2. Error correction relies on the syndrome measurement circuits represented in Fig. 2, consuming an additional $d^2 - 1$ qubits.
All plaquettes are measured simultaneously at regular intervals, producing rounds of syndrome data for the decoder which identifies errors based on this information. Fig. 2 shows the schedule used for the sequence of CNOT gates in order to allow for a parallel implementation and to preserve the code distance despite the propagation of errors by CNOT gates.
We simulate the surface code and the syndrome extraction circuit with a circuit-level noise model that represents imperfections on all qubits, gates, measurements, and waiting steps, by injecting random Pauli faults between any two steps of the circuit. A single-qubit fault is included after each state preparation or waiting step with probability $p$. The Pauli fault is chosen uniformly in the set {X, Y, Z}. The outcome of a single-qubit measurement is flipped with probability $2p/3$. The CNOT noise is modeled by a two-qubit Pauli fault injected after the CNOT with probability $p$. The fault is selected uniformly among the 15 non-trivial two-qubit Pauli operators acting on the support of the CNOT gate.
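The fault-injection rules above can be sketched as three sampling routines. This is a minimal illustration of the stated probabilities only, not the simulator used for the results; the function names and the string encoding of Pauli operators are assumptions made for the example.

```python
import random

SINGLE_QUBIT_PAULIS = ["X", "Y", "Z"]
# The 15 non-trivial two-qubit Paulis: every pair except identity on both qubits.
TWO_QUBIT_PAULIS = [a + b for a in "IXYZ" for b in "IXYZ" if a + b != "II"]


def single_qubit_fault(p, rng):
    """Fault after a state preparation or waiting step: X, Y or Z with total probability p."""
    return rng.choice(SINGLE_QUBIT_PAULIS) if rng.random() < p else "I"


def cnot_fault(p, rng):
    """Fault after a CNOT: one of the 15 non-trivial two-qubit Paulis with total probability p."""
    return rng.choice(TWO_QUBIT_PAULIS) if rng.random() < p else "II"


def noisy_measurement(outcome, p, rng):
    """Flip a single-qubit measurement outcome (0 or 1) with probability 2p/3."""
    return outcome ^ 1 if rng.random() < 2 * p / 3 else outcome


rng = random.Random(1234)  # seeded generator for reproducible sampling
faults = [cnot_fault(1e-3, rng) for _ in range(10_000)]
print(sum(f != "II" for f in faults), "CNOT faults out of 10000")
```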
Equipped with qubits and quantum gates affected by a circuit level noise with rate p, the surface code encoding provides a logical qubit whose error rate drops to [2] 
where $d$ is the surface code minimum distance. One can use this heuristic formula to estimate the minimum distance $d$ required in order to achieve a given target logical error rate. Most practical applications necessitate a logical error rate that varies between $10^{-10}$ and $10^{-15}$, which is out of reach on current hardware without error correction.
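Choosing the distance from a target logical error rate amounts to a short search over odd distances. The sketch below assumes the widely quoted heuristic $p_L \approx 0.1\,(100\,p)^{(d+1)/2}$ for surface codes under circuit-level noise; the constants 0.1 and 100 are an assumption of this example, and the function names are illustrative.

```python
def logical_error_rate(p, d):
    """Heuristic logical error rate (assumed form: 0.1 * (100 p)^((d + 1) / 2))."""
    return 0.1 * (100 * p) ** ((d + 1) / 2)


def min_distance(p, p_target):
    """Smallest odd distance d such that logical_error_rate(p, d) < p_target."""
    if 100 * p >= 1:
        # Under the assumed heuristic, increasing d no longer helps in this regime.
        raise ValueError("physical error rate is not below the assumed threshold")
    d = 3
    while logical_error_rate(p, d) >= p_target:
        d += 2
    return d


print(min_distance(1e-5, 1e-15))  # 9 under the assumed heuristic
```

Because the distance is an integer, the result changes in steps as $p$ varies, which is why hardware requirements derived from it jump discretely.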
After producing an encoded surface code state, we start measuring syndrome data. Non-trivial syndrome values indicate the presence of a fault. For simplicity, we focus on Z-type faults, detected by the measurement of green plaquettes as in Fig. 3. X-type faults can be treated similarly. In what follows, we describe a graph that represents all possible faults in the syndrome measurement circuit. The whole simulation of the syndrome extraction circuit can be implemented based on this graph.
Consider the space-time locations $(x, y, t)$ of syndrome bits, where $(x, y)$ are the coordinates of the center of a plaquette and $t = 0, 1, 2, \dots$ is the index of the syndrome round. Let $s(x, y, t) = 0$ or $1$ be the syndrome value in location $(x, y, t)$. We record the changes of syndrome values, that is $\bar{s}(x, y, t) = s(x, y, t) - s(x, y, t-1) \pmod 2$, setting $\bar{s}(x, y, 0) = 0$ for the first round. A fault in the measurement circuit is detected by a set of syndrome locations $(x, y, t)$ where the syndrome value changes, i.e. $\bar{s}(x, y, t) \neq 0$. A non-trivial fault is detected either in one location $u$ or in a pair of locations $u, v$, leading to a natural graph structure.
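The change-of-syndrome rule can be sketched as follows. The example assumes each round is given as a dictionary mapping plaquette coordinates (x, y) to a bit, which is a representation chosen for illustration; `syndrome_changes` and `flagged_vertices` are hypothetical helper names.

```python
def syndrome_changes(rounds):
    """Return the changes s_bar(x, y, t) = s(x, y, t) - s(x, y, t - 1) mod 2.

    rounds : list of dicts, one per syndrome round, mapping (x, y) -> 0 or 1.
    The first round is set to 0 everywhere, as in the text.
    """
    changes = []
    previous = None
    for t, current in enumerate(rounds):
        if t == 0:
            changes.append({loc: 0 for loc in current})
        else:
            # XOR of two bits is their difference modulo 2.
            changes.append({loc: current[loc] ^ previous[loc] for loc in current})
        previous = current
    return changes


def flagged_vertices(rounds):
    """Space-time locations (x, y, t) where the syndrome value changes."""
    return {
        (x, y, t)
        for t, change in enumerate(syndrome_changes(rounds))
        for (x, y), bit in change.items()
        if bit == 1
    }


rounds = [
    {(0, 0): 0, (1, 1): 0},
    {(0, 0): 1, (1, 1): 0},  # plaquette (0, 0) flips in round 1
    {(0, 0): 1, (1, 1): 1},  # plaquette (1, 1) flips in round 2
]
print(sorted(flagged_vertices(rounds)))  # [(0, 0, 1), (1, 1, 2)]
```

The flagged locations are the vertices that a decoder has to explain with edges or half-edges of the decoding graph.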
The decoding graph represents all possible faults in the syndrome extraction circuit. The vertex set of the decoding graph is the set of syndrome locations. A half-edge $\{u, -\}$ or an edge $\{u, v\}$ is built from each potential fault in the syndrome extraction circuit. The decoding graph is a 3D cubic lattice with additional diagonal edges. Fig. 3 shows a horizontal slice of the decoding graph. For each edge, we also store the probability of all circuit faults that map onto that edge. This edge weight can be processed by the decoder to provide a more accurate correction.
A surface code decoder takes as an input a set of consecutive rounds of syndrome data given by $\bar{s}$ and it aims at identifying the residual error on the $d^2$ data qubits. A standard decoding strategy consists in identifying a minimum set of edges that matches the observed syndrome $\bar{s}$.
II. LAZY DECODER
To simplify, we correct separately X-type and Z-type faults. Our objective is not to design a good decoder but to identify a set of fault configurations that is both very likely and easy to correct. The lazy decoder will correct exclusively this subset of easy configurations.
Let us first describe the most naive version of the lazy decoder. We simply check whether the syndrome is trivial and if so we send no data to the decoding unit. Clearly, it is easy to implement with low hardware costs. However, this does not help to reduce decoding hardware requirements since the probability of observing a trivial syndrome is generally too small. One could design a decoder that corrects any single fault or any fault of weight two, three and so on, but our numerical experiments show that either these sets of faults are still not likely enough for reasonable distances or the lazy decoder design becomes just as complex as the design of the whole decoding unit, defeating the whole purpose of the lazy decoder.
Algorithm 1 proposes a satisfactory version of the lazy decoder for our architecture. We will prove that, when it succeeds, it returns a minimum set of faults explaining the syndrome observed. Our basic idea is to correct all configurations that can be corrected locally. This also guarantees a high potential for parallelism. If faults are sufficiently separated from each other in space and time, we can obtain such a globally minimal solution from obvious locally optimal decisions.
The syndrome is given as an input of Algorithm 1 as a set of vertices $\bar{s} \subset V$ in the decoding graph induced by Z-type faults, and the algorithm returns either an estimation $\tilde{E} \subset E$ of the set of edges supporting faults, or a failure mode. The first block of Algorithm 1 looks for edges that match two neighboring syndrome bits $u, v \in \bar{s}$. Such an edge is locally optimal (it is the minimum number of faults explaining two non-trivial syndrome nodes) and can be safely added to the correction $\tilde{E}$. The second block of Algorithm 1 takes care of the remaining unmatched syndrome vertices, which come from faults on half-edges. The half-edge $\{u, -\}$ is a locally optimal choice to explain the non-trivial syndrome $\bar{s}(u) = 1$ only if $u$ has no neighbor $v$ supporting a non-trivial syndrome value. Otherwise, the choice of $\{u, -\}$ is said to be ambiguous. We use the notation $N_v$ for the set of neighbors of a vertex $v$. In order to guarantee a globally optimal solution, we count the number of ambiguous choices $N_{amb}$ and we return failure if at least two ambiguous half-edges are present in $\tilde{E}$. Theorem 1 proves the optimality of our strategy.
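Algorithm 1 Lazy decoder
Require: A syndrome set $\bar{s} \subset V$.
Ensure: Either a fault set $\tilde{E} \subset E$ such that $\bar{s}(\tilde{E}) = \bar{s}$, or failure.
1: Set $\bar{s}' = \bar{s}$, $\tilde{E} = \emptyset$ and $N_{amb} = 0$.
2: Run over all edges $e = \{u, v\} \in E$ and do:
3:     If $u \in \bar{s}'$ and $v \in \bar{s}'$:
4:         Add $e$ to $\tilde{E}$ and remove $u$ and $v$ from $\bar{s}'$.
5: Run over all half-edges $e = \{u, -\} \in E$ and do:
6:     If $u \in \bar{s}'$ do:
7:         Add $e$ to $\tilde{E}$ and remove $u$ from $\bar{s}'$.
8:         If $N_u \cap \bar{s} \neq \emptyset$, increment $N_{amb}$.
9: If $N_{amb} > 1$ return failure.
10: If $\bar{s}' = \emptyset$ return $\tilde{E}$, otherwise return failure.

Theorem 1. If Algorithm 1 succeeds, it returns a fault set $\tilde{E}$ of minimum size explaining the observed syndrome, that is such that $\bar{s}(\tilde{E}) = \bar{s}$.

Proof. If $\bar{s}(\tilde{E}) = \bar{s}$, the set $\tilde{E}$ contains a set of paths that connect the vertices of $\bar{s}$ either in pairs or to the boundary. Naively, we have $|\tilde{E}| \geq |\bar{s}|/2$, with equality if and only if $\tilde{E}$ pairs each vertex of $\bar{s}$ with one of its neighbors.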
Let $\partial\bar{s}$ be the set of vertices $v$ of $\bar{s}$ incident to a half-edge $\{v, -\}$. Consider the subset $\partial\bar{s}^* \subset \partial\bar{s}$ of vertices $v$ that have no neighbor in $\bar{s}$ (that is $N_v \cap \bar{s} = \emptyset$). For an arbitrary fault set $\tilde{E} \subset E$, any vertex of $\partial\bar{s}^*$ is either part of a half-edge or connected to a vertex at distance $\geq 2$, leading to the bound

$$|\tilde{E}| \geq \frac{|\bar{s}| - |\partial\bar{s}^*|}{2} + |\partial\bar{s}^*|. \qquad (2)$$

This inequality is satisfied by all fault sets $\tilde{E}$ with syndrome $\bar{s}$.

Consider now the fault set $\tilde{E}$ produced by Algorithm 1 in case of success. The first block finds edges that match bulk vertices $v \in \bar{s} \setminus \partial\bar{s}$ to a neighbor. Vertices of $\partial\bar{s}^*$ are linked to a boundary by a half-edge in $\tilde{E}$. Finally, the vertices $v \in \partial\bar{s} \setminus \partial\bar{s}^*$ are all matched to a neighboring vertex except at most one (because in case of success we have $N_{amb} \leq 1$).

This proves that the set $\tilde{E}$, returned by Algorithm 1 in case of success, satisfies

$$|\tilde{E}| \leq \frac{|\bar{s}| - |\partial\bar{s}^*|}{2} + |\partial\bar{s}^*| + \frac{1}{2}. \qquad (3)$$

Together with the lower bound (2), and because $|\tilde{E}|$ is an integer, this demonstrates that the size of $\tilde{E}$ is minimum.
One can perform the lazy decoding on the fly while reading the syndrome rounds. It is enough to store three consecutive rounds of syndrome values to apply the lazy decoder. When a failure of the lazy decoder is detected, we start sending syndrome information to the decoding unit, which accumulates $d$ rounds of data to provide a correction. This leads to an asynchronous decoding between different logical blocks, which can be advantageous for sharing decoding hardware between logical qubits but could induce stalling in the layout of logical operations. We do not explore the consequences of this asynchronous decoding in the current work. The locality of Algorithm 1 suggests an easy parallel implementation. Only the counter $N_{amb}$ is global data.
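The two blocks of the lazy decoder can be sketched in software as follows. This is a minimal illustration on an abstract graph, not a hardware design: the names `lazy_decode`, `edges` and `half_edges` are hypothetical, a half-edge is stored as the vertex it touches, and returning failure when a syndrome vertex is left unmatched is an assumption made so that the output always explains the full syndrome.

```python
def lazy_decode(syndrome, edges, half_edges):
    """Sketch of a lazy pre-decoder on a decoding graph.

    syndrome   : iterable of vertices with a non-trivial syndrome change
    edges      : list of (u, v) pairs
    half_edges : list of vertices u carrying a half-edge {u, -}
    Returns a list of edges (u, v) and half-edges (u, None), or None on failure.
    """
    original = frozenset(syndrome)
    remaining = set(original)

    # Neighborhoods N_u in the decoding graph.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    correction = []
    n_amb = 0

    # First block: match pairs of neighboring syndrome vertices.
    for u, v in edges:
        if u in remaining and v in remaining:
            correction.append((u, v))
            remaining -= {u, v}

    # Second block: explain leftover vertices with half-edges.
    for u in half_edges:
        if u in remaining:
            correction.append((u, None))
            remaining.discard(u)
            # Ambiguous choice: u had a neighbor in the original syndrome.
            if neighbors.get(u, set()) & original:
                n_amb += 1

    if n_amb > 1 or remaining:
        return None  # failure: hand the syndrome to the full decoding unit
    return correction


# Toy graph: a path 0 - 1 - 2 - 3 with half-edges at both ends.
edges = [(0, 1), (1, 2), (2, 3)]
half_edges = [0, 3]
print(lazy_decode({1, 2}, edges, half_edges))     # [(1, 2)]
print(lazy_decode({1, 2, 3}, edges, half_edges))  # [(1, 2), (3, None)]
print(lazy_decode({1}, edges, half_edges))        # None
```

In the last call, vertex 1 has no syndrome neighbor and no half-edge, so no local explanation exists and the sketch reports failure.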
III. BANDWIDTH REDUCTION
Without the lazy decoder, the bandwidth used per logical qubit is $\mathrm{bw}(d) = (d^2 - 1)/\tau$ bits per second, where $d$ is the code distance and $\tau$ is the time required per syndrome extraction round in seconds. All the numerical results of this article are obtained assuming $\tau = 1$ µs.
The bandwidth used by a logical qubit on the link between the readout device and the decoding unit drops to zero while the lazy decoder succeeds. This induces a significant reduction of the average bandwidth used per logical qubit, as we can see in Fig. 4. With a physical error rate $p = 10^{-4}$, the average bandwidth saving ranges from one order of magnitude for the distance-35 surface code to more than three orders of magnitude for distance $d = 5$.
We observed a phenomenon of bandwidth saturation which occurs when using a large-distance code with a qubit quality that is not far enough below the threshold, e.g. $d \geq 15$ with $p = 10^{-3}$. In this regime, the lazy decoder almost constantly fails and we do not observe any reduction of the bandwidth. Then, it may be preferable to remove the lazy decoder to avoid hurting the decoder's performance by additional latency. This suggests that it is necessary to keep improving qubit quality far below the surface code threshold in order to scale up quantum hardware and its classical control to reach the regime of practical applications.
IV. BANDWIDTH REQUIREMENTS
We observed a clear reduction of the average bandwidth usage with the lazy decoder. However, the bandwidth utilization varies with time and the system often requires much more bandwidth than the average use. The required bandwidth and the number of decoding units needed depend on the maximum number of failures of the lazy decoder over the $K$ logical qubits of the quantum computer.
To simplify, we consider a single communication channel connecting the readout devices of all logical qubits to the decoding units. A bandwidth failure occurs if, at a given point in time, the bandwidth needs of the whole system exceed the bandwidth of the readout-decoder channel.
The bandwidth required is defined to be the minimum bandwidth such that the probability of a bandwidth failure is smaller than $p_L$, which guarantees that the bandwidth bottleneck is not the dominant source of system failure (as suggested in [50]). To obtain the bandwidth required, consider the failure probability $p_{fail} = p_{fail}(p, d)$ of the lazy decoder over $d$ consecutive rounds of syndrome measurement for a single logical qubit. We assume that the noise on different logical qubits is independent, so that the probability of at least $m$ failures of the lazy decoder over the $K$ logical qubits is bounded by $\binom{2K}{m} p_{fail}^m$. The bandwidth required for the whole system of $K$ logical qubits, $\mathrm{bw}_{lazy}(p, d, K)$, is given by Eq. (4), where $M = M(p, d, K)$ is the smallest integer satisfying Eq. (5), which ensures that a bandwidth failure occurs with probability at most $p_L$. It may be challenging to evaluate $M(p, d, K)$ numerically based on Eq. (5) for large values of $K$. The numerical results presented below rely on a Chernoff bound to derive an upper bound on $M(p, d, K)$.

Fig. 5 shows the bandwidth required to reach a target logical error rate $p_{target} = 10^{-15}$. Given $p_{target}$ and the physical error rate $p$ of the device, we first pick the smallest minimum distance $d$ that ensures $p_L(p, d) < p_{target}$ using Eq. (1). The minimum distance varies discretely with $p$, inducing abrupt jumps in the system requirements. Once the distance is fixed, we estimate the lazy decoder failure probability $p_{fail}(p, d)$ by a Monte Carlo simulation, from which we derive the value of $\mathrm{bw}_{lazy}(p, d, K)$ based on Eq. (4). A better distribution of resources is achieved for a system that contains many logical qubits, bringing the bandwidth required per logical qubit closer to the average use. The bandwidth saturation appears again in the regime $p = 10^{-3}$, where we require almost 1 GBit/s per logical qubit. For error rates $p \geq 6 \times 10^{-4}$, we observe no saving in bandwidth requirements.
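The number of simultaneous lazy-decoder failures to provision for can be estimated numerically. The sketch below is one possible reading, not the procedure used for the figures (which rely on a Chernoff bound): it assumes that a bandwidth failure means more than $M$ simultaneous failures among the $2K$ decoding tasks, and it bounds that probability by $\binom{2K}{M+1} p_{fail}^{M+1}$. The function names and the stopping criterion are assumptions of this example.

```python
import math


def log_binomial(n, k):
    """Natural logarithm of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def required_units(p_fail, num_tasks, p_target):
    """Smallest M such that C(num_tasks, M + 1) * p_fail^(M + 1) <= p_target.

    p_fail    : failure probability of the lazy decoder for one decoding task
    num_tasks : number of decoding tasks (2K for K logical qubits)
    p_target  : tolerated probability of a bandwidth failure
    """
    if p_fail <= 0:
        return 0
    log_target = math.log(p_target)
    for m in range(num_tasks):
        k = m + 1
        # Work with logarithms to avoid overflow for large num_tasks.
        if log_binomial(num_tasks, k) + k * math.log(p_fail) <= log_target:
            return m
    return num_tasks  # saturation: every task needs its own decoding unit


print(required_units(1e-3, 20, 1e-6))  # 3
print(required_units(1.0, 20, 1e-6))   # 20, the lazy decoder never helps
```

When `p_fail` approaches 1 the function returns `num_tasks`, which mirrors the saturation regime described above where no saving is observed.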
V. DECODING HARDWARE REQUIREMENTS
In addition to a substantial bandwidth reduction, the lazy decoder induces savings in the decoding hardware. Indeed, the value $M(p, d, K)$ introduced in Eq. (4) is the largest number of decoding tasks to perform simultaneously over the whole system of $K$ logical qubits. Instead of allocating one decoding unit for each logical qubit, one can share $M(p, d, K)$ decoding units without notably affecting the failure rate of the quantum computer. Fig. 5 shows the saving in terms of the number of decoding units. In order to reach a target error rate of $10^{-15}$ with a system of $K = 10{,}000$ logical qubits with physical error rate $p = 10^{-4}$ (resp. $10^{-5}$), a naive design uses $2K = 20{,}000$ decoding units while only 377 units (resp. 13) are sufficient with the lazy decoder, saving 98% (resp. 99.9%) of the decoding hardware. Table I shows the saving and the hardware requirements for different target noise rates, qubit qualities and system sizes. Again, the saturation in the regime $p = 10^{-3}$ limits the saving. We need better qubits in order to scale up quantum computers to the massive size required for practical applications.
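The quoted savings follow directly from the unit counts. A short check, using only the numbers stated above (the helper name `hardware_saving` is illustrative):

```python
def hardware_saving(shared_units, num_logical_qubits):
    """Fraction of decoding units saved relative to the naive 2K-unit design."""
    naive_units = 2 * num_logical_qubits
    return 1 - shared_units / naive_units


K = 10_000
print(f"{100 * hardware_saving(377, K):.1f}%")  # 98.1% for p = 1e-4
print(f"{100 * hardware_saving(13, K):.1f}%")   # 99.9% for p = 1e-5
```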
VI. THE LAZY DECODER AS A DECODER ACCELERATOR
The lazy decoder can be considered as a (hardware or software) decoder accelerator. It speeds up any decoding algorithm without significantly degrading the correction capability. Fig. 6 shows the average execution time for our implementation in C of two standard decoding algorithms with and without lazy pre-decoding. The speedup reaches a factor of 10 for the Union-Find (UF) decoder [25], which is already one of the fastest decoding algorithms, and we obtain a 50x acceleration of the Minimum Weight Perfect Matching (MWPM) decoder [1]. Note that both combinations Lazy + UF and Lazy + MWPM achieve a similar average runtime, although the worst-case execution time, which is a central parameter in the design of a decoding architecture [50], is substantially larger for the MWPM decoder.
We also confirmed numerically that the lazy decoder does not deteriorate the performance of the MWPM decoder and the UF decoder as Theorem 1 suggests. On the contrary, the lazy decoder provides a slight improvement of the correction capacity of the UF decoder. This is because these two algorithms perform well on different types of fault configurations. The work of Seth et al. [51] explores further the idea of combining different decoding strategies.
Figure 6: Execution time in seconds of the MWPM decoder and the UF decoder with and without the lazy decoder. The runtime is estimated over $10^6$ trials on the 2D toric code, assuming perfect measurements and an error rate of $10^{-3}$. We use an implementation in C of these algorithms, executed on a single thread of a MacBook Pro 2013 with a 2.4 GHz Intel Core i5 processor. We observe a 10x speed-up of the UF decoder and a 50x speed-up of the MWPM decoder.

Conclusion - Error correction is a major bottleneck in fault-tolerant quantum computing, which leads to a huge overhead in the implementation of quantum algorithms [5, 64, 65]. In this article, we designed a simple decoder that can be used in combination with a more complex decoding unit to correct errors simultaneously on many logical qubits with minimal decoding hardware.
In future work, we plan to explore the impact of serialization latency on the decoder's performance.
Although this work focuses on surface codes, the general principle of this design can be adapted to any quantum error correction code if we can identify a set of easy error configurations that is likely enough.
The lazy decoder applies to any type of surface code, including codes defined on non-trivial topology [52, 53] . The lazy decoder can be directly applied to color codes [54] using for instance the projection decoder [55, 56] .
Beyond topological codes, one can adapt the lazy decoder to quantum LDPC codes [57] [58] [59] [60] [61] [62]. The sparsity of the Tanner graph guarantees the success of our local strategy for a low enough physical error rate.
The basic idea of using a pre-decoder dedicated to the correction of simple configurations is also central in the design of a flash memory controller, where a hard-decision belief propagation (BP) decoder is used as a pre-decoder and, in case of failure, multiple levels of soft-decision BP are performed [63]. However, the noise rate of flash cells is far more favorable than in quantum hardware, which allows a single decoding unit to correct many encoded blocks in flash memory. Note that the execution time of current flash BP decoders (80 µs for the hard-decision decoder plus 80 µs per level of soft BP [63]) is far too long for the quantum setting if we suppose that the decoding must be completed within $d$ µs.
The BP decoder provides a hierarchy of decoding algorithms with growing complexity as a function of the number of propagation levels. This flexibility allows for adjusting the number of levels in order to maximize the success probability of the decoder according to the available decoding time. The Union-Find decoder [25] offers the same advantage by tuning the number of growth rounds. In the future, it would be interesting to explore further the hardware implementation of the lazy decoder following the approach of [50] and to fabricate an FPGA or ASIC prototype in order to obtain a better insight into practical applications of the lazy decoder.
