Magic state distillation is one of the leading candidates for implementing universal fault-tolerant logical gates. However, the distillation circuits themselves are not fault-tolerant, so there is additional cost to first implement encoded Clifford gates with negligible error. In this paper we present a scheme to fault-tolerantly and directly prepare magic states using flag qubits. One of these schemes uses a single extra ancilla, even with noisy Clifford gates. We compare the physical qubit and gate cost of this scheme to the magic state distillation protocol of Meier, Eastin, and Knill, which is efficient and uses a small stabilizer circuit. In some regimes, we show that the overhead can be improved by several orders of magnitude.
I. INTRODUCTION
Certain algorithms can be implemented efficiently on quantum computers, whereas the best known classical algorithms require superpolynomial resources [1, 2]. At present, however, quantum devices are dramatically noisier than their classical counterparts. For all but the shortest-depth quantum computations to succeed with high probability, operations will need to be performed with very low failure rates. Fault-tolerant quantum error correction provides one way to achieve the desired logical failure rates.
A straightforward way to implement fault-tolerant logical gates is to use transversal operations. However, by the Eastin-Knill theorem, for any error correcting code, there will always be at least one gate in a universal gate set which cannot be implemented fault-tolerantly using transversal operations [3]. In recent years, many proposals for fault-tolerant universal gate constructions have been introduced [4] [5] [6] [7] [8] [9]. One of the earliest proposals, known as magic state distillation, uses resource states that, along with stabilizer operations, can simulate a non-Clifford operation [10, 11]. These resource states are called magic states since they can also be distilled using stabilizer operations. Despite considerable effort in alternative schemes, magic state distillation remains a leading candidate for universal fault-tolerant quantum computation.
Recently, very efficient distillation protocols have been developed which require substantially fewer resource states to achieve a desired target error rate of the magic state being distilled [12] [13] [14] [15]. When studying magic state distillation protocols, it is often assumed that all Clifford operations can be implemented perfectly and that only the resource states can introduce errors into the protocol. However, with current quantum devices, two-qubit gates are amongst the noisiest components of the circuits. In practice, Clifford gates with very low failure rates can be achieved by performing encoded versions of the gates in large enough error correcting codes, such that the failure rates of the Clifford operations are negligible compared to the non-Clifford components of the circuit. However, the overhead cost associated with performing magic state distillation with encoded Clifford operations is often not considered (there are exceptions such as [16] [17] [18] [19]). If the overhead cost of encoded Clifford operations is taken into account, it is not clear that magic state distillation schemes which minimize the resource state cost would achieve the lowest overhead when used in a quantum algorithm.
Recently, new schemes using flag qubits have been introduced to implement error correction protocols using the minimal number of ancilla qubits to measure the code's stabilizer generators [20] [21] [22] [23] [24] [25]. The idea behind flag error correction is to use extra ancilla qubits which flag when v ≤ t faults result in a data qubit error of weight greater than v (here t = (d − 1)/2 where d is the distance of the code). The flag information can then be used in addition to the error syndrome to deduce the data qubit error.
In this paper, we propose a new scheme to fault-tolerantly prepare magic states which requires a minimal number of extra ancilla qubits (in our case only one extra qubit). In particular, we consider a full circuit-level noise model (which includes noisy Clifford operations) where gates can be applied between any pair of qubits. In this model, we compare the overhead cost of our scheme to a magic state distillation scheme introduced by Meier, Eastin and Knill (MEK), due to its efficiency and small circuit size [26]. Using the same encoded Clifford operations in both schemes, we show that in some regimes the qubit and gate overhead cost of our scheme can be smaller by several orders of magnitude.
We point out that a fault-tolerant magic state preparation scheme was previously considered by Aliferis, Gottesman and Preskill (AGP) in [27]. However, the scheme proposed by AGP requires the preparation and verification of large cat state ancillas to perform the logical measurements. Furthermore, Steane error correction, which requires a large number of extra ancilla qubits, was used for the error correction circuits. Our scheme does not require the preparation of large ancilla states, uses smaller error correction circuits, and has an improved threshold compared to the scheme proposed by AGP.

arXiv:1811.00566v1 [quant-ph] 1 Nov 2018
The paper is structured as follows. In Section II we introduce the basic notation and noise model that we use throughout the remainder of the manuscript. In Section III we describe our scheme for fault-tolerantly preparing magic states. We provide both an error detecting and an error correcting scheme. Proofs of fault-tolerance are given in Appendix A. In Section IV we briefly review the MEK magic state distillation scheme and in Section V we compare the qubit and gate overhead costs of both schemes. Details of the overhead and numerical analysis are provided in Appendices C to E.
We define $P_n^{(1)}$ to be the n-qubit Pauli group (i.e. n-fold tensor products of the identity I and the three Pauli matrices X, Y, and Z, and all scalar multiples of them by ±1 and ±i). The Clifford group is defined by $P_n^{(2)} = \{U : \forall P \in P_n^{(1)},\ U P U^\dagger \in P_n^{(1)}\}$ and is generated by the single-qubit Hadamard and
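as well as the two-qubit CNOT gate (which we write as $C_X$) with

$$C_X |a\rangle|b\rangle = |a\rangle|a \oplus b\rangle, \tag{2}$$

where $|a\rangle$ and $|b\rangle$ are computational basis states. In general, a controlled-$U$ gate is

$$C_U = |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes U. \tag{3}$$

The state

$$|H\rangle = \cos(\pi/8)\,|0\rangle + \sin(\pi/8)\,|1\rangle \tag{4}$$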
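is the +1 eigenstate of the Hadamard operator and

$$T = e^{-i\pi Y/8} = \begin{pmatrix} \cos(\pi/8) & -\sin(\pi/8) \\ \sin(\pi/8) & \cos(\pi/8) \end{pmatrix} \tag{5}$$

does not belong to the Clifford group¹. We also write

$$|{-H}\rangle = \sin(\pi/8)\,|0\rangle - \cos(\pi/8)\,|1\rangle \tag{6}$$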
as the −1 eigenstate of the Hadamard operator. The |H⟩ state is an example of a magic state. Magic states can be distilled to produce reliable states from a larger number of noisy copies using only stabilizer operations [11]. The reliable magic states can be used, together with stabilizer operations, as resource states for universal quantum computation. In particular, as shown in Fig. 1, the |H⟩ state, along with $Y_{\pi/2}$, $C_Y$, and a measurement in the Y-basis, can be used to simulate a T gate. Note that some schemes choose to distill the state

$$|A_{\pi/4}\rangle = \frac{1}{\sqrt{2}}\left(|0\rangle + e^{i\pi/4}|1\rangle\right), \tag{7}$$

which is equivalent to the |H⟩ state up to products of Clifford gates. For instance, see the $|A_{\pi/4}\rangle$ distillation scheme in [11].
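As a quick numerical sanity check of the eigenstate statements above, the following minimal sketch assumes the standard parametrization |H⟩ = cos(π/8)|0⟩ + sin(π/8)|1⟩ and |−H⟩ = sin(π/8)|0⟩ − cos(π/8)|1⟩. The helper names `apply` and `is_eigenvector` are illustrative and do not come from the paper.

```python
import math

# Assumed parametrization of the two Hadamard eigenstates (real amplitudes).
C, S = math.cos(math.pi / 8), math.sin(math.pi / 8)
H_PLUS = [C, S]     # |H>, expected eigenvalue +1
H_MINUS = [S, -C]   # |-H>, expected eigenvalue -1

# Single-qubit Hadamard matrix.
R = 1 / math.sqrt(2)
HADAMARD = [[R, R], [R, -R]]


def apply(matrix, vec):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def is_eigenvector(matrix, vec, eigenvalue, tol=1e-12):
    """Check matrix @ vec == eigenvalue * vec up to a numerical tolerance."""
    out = apply(matrix, vec)
    return all(abs(o - eigenvalue * v) < tol for o, v in zip(out, vec))


print(is_eigenvector(HADAMARD, H_PLUS, +1))
print(is_eigenvector(HADAMARD, H_MINUS, -1))
```

The check relies only on the identities cos(π/8) + sin(π/8) = √2 cos(π/8) and cos(π/8) − sin(π/8) = √2 sin(π/8).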
Throughout this paper, we will use the following circuit-level noise model in all simulations:
1. With probability p, each single-qubit gate location is followed by a Pauli error drawn uniformly and independently from {X, Y, Z}.
2. With probability p, each two-qubit gate is followed by a two-qubit Pauli error drawn uniformly and independently from $\{I, X, Y, Z\}^{\otimes 2} \setminus \{I \otimes I\}$.
3. With probability 2p/3, the preparation of the |0⟩ state is replaced by |1⟩ = X|0⟩. Similarly, with probability 2p/3, the preparation of the |+⟩ state is replaced by |−⟩ = Z|+⟩.
4. With probability p, the preparation of the |H⟩ state is replaced by P|H⟩, where P is a Pauli error drawn uniformly and independently from {X, Y, Z}.
5. With probability 2p/3, any single-qubit measurement has its outcome flipped.
6. Lastly, with probability p/100, each idle gate location is followed by a Pauli error drawn uniformly and independently from {X, Y, Z}.

¹ Note that some references define the T gate as $T = \frac{e^{i\pi/4}}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ i & -i \end{pmatrix}$, which is a Clifford gate, while others define $T = \mathrm{diag}(1, e^{i\pi/4})$, which is Clifford-equivalent to the gate we have defined as T. We follow the definitions in [26].
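The six rules of the noise model listed above can be sketched as a small sampler. This is only an illustration under stated assumptions: the location labels (`"gate1"`, `"gate2"`, `"prep_0"`, and so on) and the function name `sample_fault` are hypothetical, and faults are returned as plain strings rather than applied to a simulated state.

```python
import itertools
import random

SINGLE_QUBIT_PAULIS = ["X", "Y", "Z"]
# The 15 non-identity two-qubit Paulis, written as two-letter strings.
TWO_QUBIT_PAULIS = [
    a + b for a, b in itertools.product("IXYZ", repeat=2) if a + b != "II"
]


def sample_fault(kind, p, rng=random):
    """Sample the fault at one circuit location of the given (hypothetical) kind.

    Returns a Pauli string ("I"/"II" means no fault). For measurements the
    return value is "flip" or "I", since the fault is an outcome flip.
    """
    if kind in ("gate1", "prep_H"):   # rules 1 and 4
        return rng.choice(SINGLE_QUBIT_PAULIS) if rng.random() < p else "I"
    if kind == "gate2":               # rule 2
        return rng.choice(TWO_QUBIT_PAULIS) if rng.random() < p else "II"
    if kind == "prep_0":              # rule 3: |0> replaced by X|0>
        return "X" if rng.random() < 2 * p / 3 else "I"
    if kind == "prep_plus":           # rule 3: |+> replaced by Z|+>
        return "Z" if rng.random() < 2 * p / 3 else "I"
    if kind == "meas":                # rule 5
        return "flip" if rng.random() < 2 * p / 3 else "I"
    if kind == "idle":                # rule 6
        return rng.choice(SINGLE_QUBIT_PAULIS) if rng.random() < p / 100 else "I"
    raise ValueError(f"unknown location kind: {kind}")


# Example: faults at a few locations for p = 1e-3.
print([sample_fault(k, 1e-3) for k in ("gate1", "gate2", "meas", "idle")])
```

Passing a seeded `random.Random` instance as `rng` makes a Monte Carlo run reproducible.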
Our assumption of p/100 idle error is valid for systems whose gate errors are far from the limits set by coherence times. For example, trapped ion experiments have coherence times $T_2 = 2T_1$ on the order of 1 to 10 seconds with gate times on the order of 10 µs. This suggests idle error probabilities of $10^{-5}$ to $10^{-6}$, while two-qubit gate infidelities are on the order of $10^{-3}$ to $10^{-4}$ [28, 29]. On the other hand, the assumption may not hold in systems such as superconducting qubits, whose gates currently achieve infidelities near the coherence limit [30]. Regardless, the concrete schemes we analyze all use the same underlying quantum code family and Clifford gate implementations, so we expect comparisons between them to be robust.
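The order-of-magnitude estimate above can be reproduced with a short calculation. The sketch assumes a simple exponential decay model, in which the idle error during one gate time is 1 − exp(−t_gate/T_2); this model and the function name are assumptions made for illustration and are not taken from the paper.

```python
import math


def idle_error_probability(gate_time_s, t2_s):
    """Idle error during one gate time, assuming exponential decay with time T2."""
    # -expm1(-x) evaluates 1 - exp(-x) accurately for small x.
    return -math.expm1(-gate_time_s / t2_s)


gate_time = 10e-6  # 10 microseconds, as quoted in the text
for t2 in (1.0, 10.0):  # coherence times of 1 s and 10 s
    print(f"T2 = {t2:4.1f} s -> idle error ~ {idle_error_probability(gate_time, t2):.1e}")
```

For gate times much shorter than T_2, the result is close to the ratio t_gate/T_2, which gives the quoted range of $10^{-5}$ to $10^{-6}$.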
In the following sections, all Clifford gates and resource states will be encoded in the Steane code (see Section III) which, paired with flag qubits, will allow us to obtain encoded magic states with low overhead. The code distance will be increased through code concatenation. Since the encoded version of the gates and states are implemented in a fault-tolerant way, the failure probability for each logical fault E of a gate G at a physical error rate p can be upper bounded as
where $c(k)$ denotes the number of weight-$k$ errors which can lead to a logical fault and $L_G$ is the total number of circuit locations in the logical gate $G$. At the first concatenation level, we performed Monte Carlo simulations with the above noise model to estimate the coefficients $c(k)$ for each logical gate and encoded state. As was shown in [31] [32] [33], Eq. (8) can be generalized to the level-$l$ concatenation level where each physical gate is replaced by its level-$(l − 1)$ Rec² (see [27] for more details). The error rate at the $l$-th concatenation level can be bounded as

In other words, to estimate the logical failure rate of a level-$l$ gate, the probability of failure for all physical gates $G$ in our level-$l$ simulation will be replaced by Pr[mal

For the physical error rates we consider, higher order terms are found to be relatively small, so we only estimate the leading order terms. More details can be found in [31] [32] [33].

² Rectangles (Rec's) are encoded gates with trailing error correction units. Extended rectangles, which are encoded gates with leading and trailing error correction units, are abbreviated as exRec's.

FIG. 2. [...] The CH gate can be decomposed into the gates T, CZ and T†. As was shown in Fig. 1, the T gates can be simulated using only Clifford gates and an |H⟩ resource state.
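The way such a bound propagates through concatenation levels can be sketched as follows. The sketch assumes that the level-1 bound is a polynomial $\sum_k c(k) p^k$ in the physical error rate and that one set of coefficients is reused for every gate and every level. Both are simplifications made here for illustration; the coefficient values are hypothetical, and the analysis in the paper tracks each logical gate separately.

```python
def level1_bound(p, coeffs):
    """Polynomial bound sum_k c(k) * p**k, capped at 1.

    `coeffs` maps the error weight k to a (hypothetical) coefficient c(k).
    """
    return min(1.0, sum(c * p**k for k, c in coeffs.items()))


def level_l_bound(p, coeffs, level):
    """Feed the level-(l-1) failure rate back in as the 'physical' rate."""
    rate = p
    for _ in range(level):
        rate = level1_bound(rate, coeffs)
    return rate


# Hypothetical leading-order coefficient c(2) = 100 for a distance-3 code.
coeffs = {2: 100.0}
for level in range(4):
    print(level, level_l_bound(1e-4, coeffs, level))
```

With only a quadratic term, the failure rate falls doubly exponentially with the level as long as p is below 1/c(2), which is the usual pseudo-threshold picture.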
III. PREPARING MAGIC STATES IN THE STEANE CODE
One way to measure the Hadamard operator using only Clifford gates and an |H⟩ resource state is shown in Fig. 2. To obtain resource states with high fidelity (which is required for universal quantum computation), one method is to encode them into an error detecting or error correcting code and perform several rounds of state distillation. If the physical error rate is below some threshold (which depends on the codes and distillation routine), the error rate can be exponentially suppressed with the number of distillation rounds. Another, more direct method is to fault-tolerantly prepare resource states in a large-distance error correcting code. The distance is chosen to obtain the desired logical error suppression. In particular, in the presence of noisy Clifford gates, distillation routines require encoded Clifford operations which can substantially increase the qubit and gate overhead of the routine. We now show how to fault-tolerantly prepare an |H⟩ state using the Steane code and flag qubits in order to achieve a lower gate and qubit overhead. We will compare the overhead requirements of our methods with those of the Meier-Eastin-Knill (MEK) distillation routine [26]. A review of the MEK distillation routine is given in Section IV.
The Steane code is a [[7, 1, 3]] CSS (Calderbank-Shor-Steane) code [34, 35] with stabilizer generators and logical operators given in [...]. The fault-tolerant preparation of an encoded $|A_{\pi/4}\rangle$ state (see Eq. (7)) has been previously analyzed in [27]. However, Steane error correction (not to be confused with the Steane code) was used for the EC units and cat states were fault-tolerantly prepared in order to measure $T X T^\dagger$. The high cost of preparing cat states and implementing Steane EC's results in a large qubit overhead.
Instead of using Steane-EC circuits, in this work we consider an EC circuit recently introduced by Reichardt [25] for measuring the stabilizers of the Steane code with low circuit depth and which requires only three ancilla qubits (see Fig. 3). The small qubit overhead is achieved with the use of flag qubits, which can detect events where errors from a single fault spread to an X or Z error of weight greater than one on the data. The fault-tolerant properties of the circuit are discussed in Appendix A 1.

FIG. 5. (a) [...] [[7, 1, 3]] Steane code in an error detection scheme. If a single fault results in a data qubit error of weight greater than one, the circuit will flag (the |0⟩ state will be measured as −1) and the |H⟩ state preparation scheme will begin anew. (b) Full fault-tolerant error detection scheme for preparing an encoded |H⟩ state. The EC circuit is given in Fig. 3 and the non-fault-tolerant circuit for preparing the encoded |H⟩ state is given in Fig. 4.
Flag qubits can also be used to fault-tolerantly measure the logical Hadamard operator with only one extra ancilla qubit (and thus do not require the fault-tolerant preparation of cat states). In an error detection scheme, where the |H⟩ state is rejected if errors are detected, the circuit in Fig. 5a can be used to measure the logical Hadamard operator. If a fault results in an X or Z data qubit error of weight greater than one, the flag qubit (ancilla prepared in the |0⟩ state) will be measured as −1 and the |H⟩ state will be rejected. The full error detecting scheme is shown in Fig. 5b. Note that our scheme uses only 10 qubits and is shown to be fault-tolerant in Appendix A 2.

FIG. 6. (a) [...] [[7, 1, 3]] Steane code. The extra flag qubits are used to localize errors occurring near the fourth controlled-Hadamard gate since these cannot be distinguished from errors occurring at the other controlled-Hadamard gates. Note that in order to distinguish errors from faults at other locations, the full six-bit error syndrome must be considered. That is, we cannot correct X and Z errors separately as is usually done in CSS constructions. (b) Full fault-tolerant error correction scheme for preparing an encoded |H⟩ state. The EC circuit is given in Fig. 3 and the non-fault-tolerant circuit for preparing the encoded |H⟩ state is given in Fig. 4. If a flag occurs in the third H [...]
Error detection schemes have extra overhead arising from starting the process anew when a state is rejected (although they achieve much higher pseudo-thresholds). Alternatively, in Fig. 6 we show how the |H⟩ state can be fault-tolerantly prepared in an error correction scheme using three flag qubits. The details of the implementation and a proof of fault-tolerance are given in Appendix A 3. Due to the low pseudo-threshold of the error correction scheme, we will focus on the error detection scheme of Fig. 5.

The main motivation for concatenating the Steane code in order to increase the code distance is that finding flag circuits to measure high-weight operators while maintaining error distinguishability is quite challenging [23]. However, if a flag circuit satisfying the desired fault-tolerance criteria is found for a small code, the same circuit can be used at level-l where each gate is a level-(l − 1) gate. Additionally, if the data used in a quantum computation is encoded in the same code used to prepare the |H⟩ state, the encoded |H⟩ state does not need to be decoded and can be used to directly apply a logical T gate.
Lastly, we point out that following the 2-flag circuit construction of [23], it is possible to obtain a fault-tolerant circuit to measure the logical Hadamard operator of the [[17, 1, 5]] color code. The circuit is given in Fig. 7 and can be used in an error detection scheme analogous to that of Fig. 5. EC circuits for measuring the stabilizers of the [[17, 1, 5]] color code were obtained in [23] and require at most four ancilla qubits. One could also measure the stabilizers using standard topological methods at the cost of requiring O(n) ancillas. This circuit could be useful if a higher distance is required prior to concatenation or if a hybrid state-preparation and magic state distillation scheme is used (see Section V for more details).
IV. MEIER-EASTIN-KNILL DISTILLATION CIRCUITS
In the MEK distillation protocol, |H⟩ states are encoded in the [[4, 2, 2]] error detecting code, whose stabilizer generators and logical operators are given in Table I [...] Hadamard gate to one of the two encoded qubits). The routine accepts the pair of |H⟩ states if both the $H^{\otimes 4}$ and syndrome measurements are trivial. More details can be found in [26].

FIG. 8. [...] The circuit is the same as given in [26] but has been written in detail to illustrate the idle qubit locations (filled rectangles). We reuse the |H⟩ ancilla qubit instead of doing gates in parallel in order to minimize the overhead cost. A total of four CH gates are required to measure the operator $H_1 H_2$.
Suppose that the desired output failure probability of the input resource states is $O(p^{2^l})$. Note that a single fault in some of the two-qubit gates in Fig. 8 can result in multi-qubit errors on the data that go undetected³. Since all of the gates in Fig. 8 fail following the noise model described in Section II, they must be encoded in another code (which we choose to be the level-$l$ concatenated Steane code) in order to achieve the desired logical error rate in the output resource states. To provide a fair comparison between the overhead costs of our state preparation scheme with flag qubits and the MEK scheme, all exRec's contain the EC unit of Fig. 3. These features will play an important role when evaluating the qubit and gate overhead of the MEK scheme.

FIG. 9. Qubit and gate overhead for our fault-tolerant magic state preparation scheme using flag qubits. We consider target logical failure rates ranging between $10^{-6}$ and $10^{-9}$. The different jumps in the curves indicate that an additional concatenation level is required in order to achieve the desired logical failure rate. Note that the y axis of the qubit overhead plot is not displayed on a log scale. This is to show the increase in overhead cost due to the probability of rejecting a state when implementing the error detection scheme.

FIG. 10. Qubit and gate overhead for the MEK protocol applied to level-2 and level-3 |H⟩ states. In order to achieve the desired logical failure rate, we teleport a level-2 |H⟩ state into a level-3 |H⟩ state and apply one round of level-3 MEK. In a level-$l$ simulation, all gates in the MEK circuit are encoded at level-$l$.
Many magic state distillation protocols use a randomization called twirling [11] to diagonalize each magic state in a convenient basis. This greatly simplifies the analysis so that the acceptance probabilities and logical error rates can be found in closed form. However, twirling is not necessary in a physical implementation, since there is always another protocol without the randomization that uses the optimal choice of gates [36] . In our case, we observe that the logical error rates increase if we twirl, so we simply use the input states as they occur.
V. RESOURCE OVERHEAD COMPARISON
In this section we compare the overhead cost of our fault-tolerant magic state preparation scheme for preparing an encoded |H⟩ state in the concatenated Steane code to the overhead cost of the MEK scheme for distilling |H⟩ states (also encoded in the concatenated Steane code). In particular, we will compare the qubit and gate overhead cost of both schemes. In what follows, a level-l state or gate will always correspond to its encoded version in l concatenation levels of the Steane code.
For the MEK magic state distillation protocol, we consider two different approaches. In the first approach, physical |H⟩ states are teleported into level-2 |H⟩ states and a subsequent round of MEK (with level-2 Clifford gates) is applied to produce a state with a logical error rate well-approximated by the form $ap^2 + bp^4$ below the level-2 pseudo-threshold. The $O(p^2)$ term is dominant at low error rates, and the $O(p^4)$ term is dominant at error rates near the pseudo-threshold. To obtain further error suppression, we then teleport the distilled state into a level-3 state and perform a level-3 round of MEK to obtain a state with a logical error rate of the form $a'p^4 + b'p^8$. An additional round of MEK could be performed to obtain an |H⟩ state with logical error rate $O(p^8)$. However, for the physical error rates considered in this work ($p \in [10^{-5}, 10^{-4}]$), we found that there is no advantage in doing so.
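The statement about which term dominates can be made concrete with a short sketch. The coefficients `a` and `b` below are hypothetical placeholders, since the fitted values are not given in this section; the point is only that the two terms of $ap^2 + bp^4$ are equal at $p = \sqrt{a/b}$, with the quadratic term dominating below that rate and the quartic term above it.

```python
import math


def mek_error(p, a, b):
    """Two-term model a*p**2 + b*p**4 for the logical error rate."""
    return a * p**2 + b * p**4


def crossover_rate(a, b):
    """Physical error rate at which a*p**2 equals b*p**4."""
    return math.sqrt(a / b)


# Hypothetical coefficients, chosen only to illustrate the crossover.
a, b = 1.0, 1e4
p_star = crossover_rate(a, b)
print(p_star)
for p in (p_star / 10, p_star, p_star * 10):
    print(p, a * p**2, b * p**4)
```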
In the second approach, we consider a hybrid scheme where a level-2 |H⟩ state is first prepared using our fault-tolerant flag preparation scheme. The state is then teleported into a level-3 state and a round of MEK is performed, resulting in an |H⟩ state with logical error rate $O(p^8)$. All teleportation schemes are performed using the methods described in Appendix B.
The qubit and gate overhead results for the magic state preparation scheme with flag qubits and the MEK schemes are shown in Figs. 9 to 11. Details of the analysis leading to the plots are given in Appendices C and D.
Comparing the results, it is clear that the qubit and gate overhead cost of the fault-tolerant |H⟩ state preparation scheme of Fig. 5 is substantially smaller than that of schemes involving MEK. The primary reasons for this are the high pseudo-threshold of the error detection scheme and the small size of its circuits compared to the circuits used by MEK (where we used the teleportation scheme of Appendix B to inject states at a higher concatenation level). Further, note that the hybrid scheme (Fig. 11) has smaller qubit and gate overhead costs compared to the full MEK scheme (Fig. 10), although the overhead is still much larger than that of the error detection scheme using flag qubits.
Lastly, we point out that due to the lower logical failure rates of the error detection scheme using flag qubits, three concatenation levels were sufficient to achieve logical failure rates of $10^{-9}$ for larger physical error rates compared to MEK (the same is true for other target logical failure rates). For example, suppose one wants to achieve a logical failure rate $p_L \le 10^{-9}$. In our error detection scheme, one must use at least four concatenation levels for physical error rates $p > 6 \times 10^{-5}$. For the MEK scheme, one requires four concatenation levels for physical error rates $p > 4 \times 10^{-5}$. Hence for physical error rates $4 \times 10^{-5} < p \le 6 \times 10^{-5}$, the error detection scheme will require fewer qubits by several orders of magnitude compared to MEK (see Figs. 9 and 10).
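The regime described above can be encoded as a small lookup, using only the two physical error rates quoted in the text for a target of $p_L \le 10^{-9}$. The function and dictionary names are illustrative; the function only distinguishes "three levels suffice" from "at least four levels are needed" and says nothing about other targets.

```python
# Largest physical error rate at which three concatenation levels still reach
# p_L <= 1e-9, as quoted in the text for each scheme.
THREE_LEVEL_LIMIT = {"flag": 6e-5, "MEK": 4e-5}


def min_levels(scheme, p):
    """Return 3 if three levels suffice at rate p, else 4 (meaning at least four)."""
    return 3 if p <= THREE_LEVEL_LIMIT[scheme] else 4


for p in (3e-5, 5e-5, 7e-5):
    print(p, {scheme: min_levels(scheme, p) for scheme in THREE_LEVEL_LIMIT})
```

At p = 5e-5, for instance, the flag scheme stays at three levels while MEK needs a fourth, which is the window in which the qubit savings are largest.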
VI. CONCLUSION
The high cost of universal fault-tolerant logic is a challenging and enduring problem. In this work, we propose an alternative to magic state distillation that is based on recently discovered flag fault-tolerance techniques. We have designed a flag-based magic state preparation scheme that reduces the number of qubits and operations to prepare a reliable magic state, and does so by orders of magnitude in some error regimes. Furthermore, the error-detection-based state preparation circuit at level-1 has a high pseudo-threshold and uses a total of 10 qubits. The circuit accepts around 80% of the time and has logical error rates near $10^{-4}$ at physical error rates near $3 \times 10^{-3}$ (see Table IV), making it an interesting candidate for experimental consideration.
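The acceptance rate quoted above translates directly into a repetition overhead. As a minimal sketch, assuming that successive preparation attempts are independent (an assumption made here, not a statement from the paper), the number of attempts until acceptance is geometrically distributed with mean 1/p_accept.

```python
def expected_attempts(p_accept):
    """Mean number of attempts until acceptance for independent trials."""
    if not 0.0 < p_accept <= 1.0:
        raise ValueError("p_accept must lie in (0, 1]")
    return 1.0 / p_accept


# Acceptance probability of about 80%, as quoted for the level-1 circuit.
print(expected_attempts(0.8))
```

For an 80% acceptance rate this gives an average of 1.25 attempts per accepted state.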
The magic state preparation scheme relies on a transversal Hadamard operator and a flag circuit for measuring it fault-tolerantly. So that the operator's weight does not grow, we concatenate the code with itself to achieve some low logical error rate. The threshold is ultimately that of the concatenated Steane code in all of the schemes we analyze. It would be interesting to account for more realistic constraints on qubit connectivity, and to consider how to extend these techniques to families of higher distance (non-concatenated) codes, but we leave this to future work.
If codes such as topological codes were used to perform the MEK protocol at high physical error rates, then magic state distillation might outperform the flag-based scheme, since a large number of concatenation levels would be necessary. However, the ideas we have presented may be broadly applicable to improve the pseudo-threshold of magic state preparation circuits using codes other than the Steane code. In this direction, we have given a 2-flag circuit for fault-tolerantly measuring the logical Hadamard of the 17-qubit color code. If lower logical error rates are needed, states could then be used in a hybrid approach that takes advantage of the topological code threshold. Lastly, one could use the w-flag circuit construction and EC circuits of [23] to fault-tolerantly prepare magic states for the family of color codes on a hexagonal lattice.
VII. ACKNOWLEDGEMENTS
C.C. acknowledges IBM for its hospitality where all of this work was completed. C.C. also acknowledges the support of NSERC through the PGS D scholarship. We thank Sergey Bravyi, Earl Campbell, Tomas Jochym-O'Connor, and Ted Yoder for useful discussions. We thank Earl Campbell for sharing observations about the MEK protocol.

In Definition 1, ideally decoding is equivalent to performing fault-free error correction. Now suppose |ψ⟩ is the encoded state to be prepared. If there are s ≤ t faults during a state-preparation protocol satisfying the criteria in Definition 1, then the output state will have the form E|ψ⟩ with wt(E) ≤ s (the output state will encode the correct state with no more than s errors).
For CSS codes, this definition can be applied independently to X and Z errors.
Error correction circuit
In this section we provide more details on the properties of the EC circuit of Fig. 3 which is used in all of our schemes.
The first half of the circuit measures the XIXIXIX (green CNOT's), IIIZZZZ (blue CNOT's) and IZZIIZZ (red CNOT's) stabilizers of the Steane code. The second half of the circuit measures the remaining stabilizers. Given that the CNOT gates are not transversal, it is possible for a single fault to result in a weight-two X or Z data-qubit error. However, the ancilla qubits also act as flag qubits which can be used to detect such events.
Let us assume that a single-fault occurs. For the first half of the EC circuit, the possible weight-two errors that
In the case of $Z_4 Z_6$ and $Z_3 Z_7$, the measurement in the X-basis will flag (with the two Z-basis measurements being +1), at which point the entire syndrome measurement is repeated. Since $Z_4 Z_6$ and $Z_3 Z_7$ have different syndromes than all other errors arising from a single fault leading to a − + + measurement outcome of the first three ancilla qubits, these errors can be distinguished and corrected after the syndrome measurement is repeated.
The same argument applies to the case where a single fault leads to the $X_3 X_7$ data-qubit error, except that this time the $X_4$ error has the same syndrome as $X_3 X_7$ ($s(X_4) = s(X_3 X_7) = 010000$). However, the faults causing an $X_3 X_7$ error result in a + − − measurement outcome of the first three ancillas, whereas $X_4$ results in + − +. Thus when the syndrome measurement is repeated and the flag outcomes are taken into account, the errors can be distinguished and corrected.
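The syndrome coincidence between $X_4$ and $X_3 X_7$ can be checked directly. The sketch below uses the Z-type generators IIIZZZZ and IZZIIZZ quoted above and assumes ZIZIZIZ as the third Z-type generator (the standard choice, not stated explicitly in this section). The bit ordering of the output is a convention of this sketch and need not match the six-bit syndrome 010000 used in the text.

```python
# Supports (1-indexed qubits) of the Z-type stabilizer generators.
Z_STABILIZERS = [
    {4, 5, 6, 7},  # IIIZZZZ (quoted in the text)
    {2, 3, 6, 7},  # IZZIIZZ (quoted in the text)
    {1, 3, 5, 7},  # ZIZIZIZ (assumed standard third generator)
]


def x_error_syndrome(support):
    """Syndrome bits of an X-type error acting on the given qubits.

    A bit is 1 when the error anticommutes with the corresponding generator,
    i.e. when the two supports overlap on an odd number of qubits.
    """
    qubits = set(support)
    return tuple(len(qubits & stab) % 2 for stab in Z_STABILIZERS)


def same_syndrome(support_a, support_b):
    """True when two X-type errors cannot be told apart by the syndrome alone."""
    return x_error_syndrome(support_a) == x_error_syndrome(support_b)


print(x_error_syndrome({4}))       # X_4
print(x_error_syndrome({3, 7}))    # X_3 X_7
print(same_syndrome({4}, {3, 7}))
```

Because the two errors differ by the weight-three operator $X_3 X_4 X_7$, which commutes with every Z-type generator, the syndrome alone cannot separate them, and the flag outcomes are needed to choose the right correction.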
An analogous analysis can be applied to the second half of the EC circuit. See [25] for more details.
Fault-tolerance proof for the error detection scheme
Recall that the error detection scheme can be decomposed into three components as shown in Fig. 5b. Since t = 1 for the [[7, 1, 3]] Steane code, we must show that if a single fault occurs in any of the three components, both criteria in Definition 1 will be satisfied.
Case 1: fault in $|H\rangle_{nf}$ (see Fig. 4). Since the Steane code is a perfect CSS code and the $|H\rangle_{nf}$ circuit is not fault-tolerant, the output state of the first component will have the form
where
errors on qubits i and j andĒ
∈ {I, Z} are logical operators of the Steane code. Now the output state of the second component of Fig. 5b, including the contribution from the first ancilla qubit (which will subsequently be measured in the X-basis), is given by
= I, then the single-qubit error will be detected by the subsequent EC and the state will be rejected. Thus let us assume that E
where we used the identities H =
we see that $|\psi^{(2)}\rangle$ will be accepted with probability 1/2, resulting in the state |H⟩. Thus if a single fault occurs in the first component of Fig. 5b, an accepted state will be |H⟩.

Case 2: fault in $H_m^{f_1}$ (see Fig. 5a). Since there are no faults in $|H\rangle_{nf}$, the input to the circuit will be |H⟩. If the fault results in any measurement outcome being −1, the state will be rejected. Thus we only consider faults such that the measurement outcome of all ancillas is +1. Since the circuit will flag if a single fault results in a data-qubit error of weight greater than one, the output state of the circuit will be |ψ
is nontrivial, it will be detected by the subsequent EC circuit and the state will be rejected. Hence an accepted state will be |H⟩.
Case 3: fault in the EC circuit (see Fig. 3). Since there are no faults in the first two components of Fig. 5b, the input state to the EC will be $|\psi^{(3)}\rangle = |H\rangle$. If a fault causes a non-trivial measurement, the state will be rejected. Hence let us consider the case where a single fault results in an error which goes undetected. A single fault in the EC can result in a data-qubit error⁴ of the form $E = E_i^{(x)} E_j^{(z)}$ with $\mathrm{wt}(E_i^{(x)}) \le 1$ and $\mathrm{wt}(E_j^{(z)}) \le 1$. Since errors of the form $E = X_i Z_j$ are correctable by the Steane code, fault-free error correction of an accepted state of the protocol in Fig. 5b will always result in the state |H⟩. We conclude that if there is at most one fault, both criteria in Definition 1 will be satisfied.

⁴ As an example, consider the error Z ⊗ X on the fourth black CNOT of Fig. 3, which results in the data-qubit error $X_6 Z_7$.
Fault-tolerance proof for the error correction scheme
a. Error distinguishability
Recall that for the error correction scheme, the circuit for measuring the logical Hadamard operator is given in Fig. 6a. The first thing to notice is that the controlled-Hadamard gates are implemented in a different order compared to the gates in Fig. 5a. If a fault occurs at one of the controlled-Hadamard gates resulting in an X or Y error on the control qubit, the resulting data qubit error can be expressed as a product of Hadamard errors and Pauli errors (see Fig. 12). For the second to sixth controlled-Hadamard gates, all errors arising from a single fault at one of these locations which propagate to the data qubits (causing at least one of the flag qubits to flag) must be distinguishable (since $\overline{H} = H^{\otimes 7}$, a fault occurring at the first controlled-Hadamard gate will result in an X and Z data-qubit error of weight at most one). This is only possible if errors are corrected based on their full syndrome⁵ and with a carefully chosen ordering of the controlled-Hadamard gates. Performing a numerical search of all 7! = 5040 permutations of the controlled-Hadamard gates, we found no permutation that allowed all possible errors (arising from a single fault) to be distinguished. However, ignoring a fault on the fourth controlled-Hadamard gate, a numerical search showed that the ordering found in Fig. 6a allows errors arising from a single fault to be distinguished. Consequently, we need to have the ability to isolate errors arising from a fault at the fourth controlled-Hadamard location. This can be achieved using the two extra flag qubits shown in Fig. 6a ($|0\rangle_{f_1}$ and $|0\rangle_{f_3}$ can be measured using the same qubit).
Suppose for example that an X error arises on the control qubit of the fourth controlled-Hadamard gate, resulting in the data-qubit error $P_2 H_1 H_3 H_4$, where $P_2 \in \{I, X_2, Y_2, Z_2\}$. Then $|0\rangle_{f0}$, $|0\rangle_{f2}$ and $|0\rangle_{f3}$ will flag. We could then apply $H_1 H_3 H_4$ to the data qubits and be left with the single-qubit Pauli $P_2$. Of course, if a fault occurs on the first CNOT connecting $|0\rangle_{f3}$ to the $|+\rangle$ state, resulting in an $X \otimes X$ error, $|0\rangle_{f3}$ will not flag but $|0\rangle_{f2}$ will flag (in addition to $|0\rangle_{f0}$). After applying $H_1 H_3 H_4$, the resulting data-qubit error would be $H_2$. Going through all possible cases of a single fault at one of the CNOT gates interacting the ancillas $|0\rangle_{f1}$, $|0\rangle_{f2}$ and $|0\rangle_{f3}$ with the $|+\rangle$ state, we can guarantee error distinguishability by applying $H_1 H_3 H_4$ to the data if any of the following combinations of flag qubits flag:
1. $|0\rangle_{f0}$ and $|0\rangle_{f2}$ flag.
2. $|0\rangle_{f0}$ and $|0\rangle_{f3}$ flag.
3. $|0\rangle_{f0}$, $|0\rangle_{f2}$ and $|0\rangle_{f3}$ flag.
Note that $|0\rangle_{f0}$ would not flag if a single measurement error of the flag qubits $|0\rangle_{f1}$, $|0\rangle_{f2}$ or $|0\rangle_{f3}$ occurred.
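The flag rule and the fault-propagation picture above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the search code used for the paper: the gate ordering `ORDER` is hypothetical except that its fourth gate targets qubit 2 and is followed by gates on qubits 1, 3 and 4 (in an arbitrary order), and the propagation model simply assumes that an X fault on the control qubit toggles every later controlled-Hadamard gate. All function names are illustrative.

```python
import math

ALL_QUBITS = frozenset(range(1, 8))

# Hypothetical ordering: entry k-1 is the data qubit targeted by the k-th
# controlled-Hadamard gate. Only "fourth gate on qubit 2, followed by gates
# on qubits 1, 3, 4" is taken from the text; the rest is a placeholder.
ORDER = (5, 6, 7, 2, 1, 3, 4)

def hadamard_support(order, k):
    """Qubits receiving a Hadamard error when an X fault hits the control
    qubit right after the k-th gate (1-indexed), assuming the fault toggles
    all later gates."""
    return frozenset(order[k:])

def reduce_mod_logical_h(support):
    """Multiply by the logical Hadamard H^{(x)7} when that lowers the weight."""
    return support if len(support) <= 3 else ALL_QUBITS - support

# Flag combinations for which H1 H3 H4 is applied (the three listed cases).
H134_FLAG_PATTERNS = (
    frozenset({"f0", "f2"}),
    frozenset({"f0", "f3"}),
    frozenset({"f0", "f2", "f3"}),
)

def apply_h134(flagged):
    """True if exactly one of the three listed flag combinations occurred."""
    return frozenset(flagged) in H134_FLAG_PATTERNS

# Size of the search space over gate orderings.
NUM_ORDERINGS = math.factorial(7)
```

With this ordering, `hadamard_support(ORDER, 4)` returns the support {1, 3, 4} of the example, and a fault after the first gate reduces to a single-qubit Hadamard error modulo the logical Hadamard.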
When following the circuit in Fig. 6b by an EC circuit, we will use a lookup-table containing all errors arising from a single fault resulting in a flag (which are all distinguishable) in order to correct.
b. Fault-tolerance proof
Given the above rule for applying Hadamard corrections to the data qubits depending on the flag outcomes (thus guaranteeing error distinguishability if a single fault occurs), we now show that our error correcting scheme shown in Fig. 6b satisfies both criteria in Definition 1. In what follows, $|\psi\rangle_{\mathrm{final}}$ will correspond to the output state of the circuit in Fig. 6b. Also, the state at the output of the i'th circuit in Fig. 6b will be labeled as $|\psi^{(i)}\rangle$. For instance, the state at the output of $|H\rangle_{nf}$ will be $|\psi^{(1)}\rangle$.
Case 1: fault in $|H\rangle_{nf}$.
The output of |H nf is given by Eq. (A1). To simplify the notation, let E =Ē
j , and E p = HE p H. With this notation, the output to the first H f2 m circuit (where we only write the first ancilla since the flag qubits have no effect) can be written as
There are several cases to consider
If E ∈ {X, Z},
Thus a ±1 measurement outcome gives | ± H and all three H For the case where all three measurement outcomes are −1, applying Y to |ψ final , we will have |ψ final = |H .
2.
In this case
If E ∈ {X, Z}, |ψ
The Y i error will be corrected by the following EC round. For a ±1 measurement outcome, |ψ 
The three $H_m^{f2}$ measurements will be ± − − with $|\psi\rangle_{\mathrm{final}} = |{-}H\rangle$. Applying Y will give the correct state.
Since in both Eqs. (A10) and (A11) the states are linear combinations of products of single-qubit Paulis and logical operators, after the first EC circuit the output state will collapse to $|\psi^{(3)}\rangle = E|H\rangle$ where $E \in \{X, Z\}$ (regardless of whether the first $H_m^{f2}$ measurement outcome was ±1). During the second $H_m^{f2}$ measurement, the output state will become
From Eq. (A12) we conclude that the three $H_m^{f2}$ measurement outcomes will either be ± + + with $|\psi\rangle_{\mathrm{final}} = |H\rangle$ or ± − − with $|\psi\rangle_{\mathrm{final}} = |{-}H\rangle$ (in which case we apply Y).
4. $E_p = X_i Z_j$ with $i \neq j$. $E_p$ will be corrected in the next EC round. A similar calculation as the examples above show that the possible measurement outcomes of the three H
Case 2: fault in the first $H_m^{f2}$ circuit. If the fault results in a data-qubit error of weight greater than one, as was shown in Appendix A 3 a, the possible errors can be distinguished when considering the full six-bit syndrome as well as applying $H_1 H_3 H_4$ or $I$ to the data depending on the flag outcomes. Thus the output state of the first $H_m^{f2}$ measurement can be expressed as $|\psi^{(2)}\rangle = E_f |H\rangle |\pm\rangle$ where $E_f$ will be corrected by the following EC. Hence the possible measurement outcomes of the three $H_m^{f2}$ circuits are ± + + and the final output state will be $|\psi\rangle_{\mathrm{final}} = |H\rangle$.
Case 3: fault in the first EC. The input state to the EC will be $|H\rangle$. If a fault results in a data-qubit error of weight greater than one, there will be a flag and the error will be corrected. The output of the EC will be $|H\rangle$ and all three $H_m^{f2}$ measurements will be +1 with $|\psi\rangle_{\mathrm{final}} = |H\rangle$.
If there are no flags, the output state of the EC can be written as |ψ 
Note that the error will be corrected in the next EC round. However the type of error can affect the H f2 m measurement outcomes.
1. $E_p = Y_i$.
In this case $|\psi^{(4)}\rangle = Y_i |H\rangle |-\rangle$ and the second $H_m^{f2}$ measurement will be −1 (with all three measurement outcomes being + − +) and $|\psi\rangle_{\mathrm{final}} = |H\rangle$.$^6$
2. $E_p \in \{X_i, Z_i\}$.
In this case |ψ 
j . Note however that E p is a correctable error so both criteria in Definition 1 are still satisfied. Table II .
Appendix B: Teleporting into code blocks
In this section we review a general method introduced by Knill for preparing an encoded state $|\bar{\psi}\rangle$ from a physical qubit state $|\psi\rangle$ using teleportation [38].
The circuit used to implement the teleportation protocol is illustrated in Fig. 13. After preparing a logical Bell state, one of the code blocks is decoded. A CNOT is applied between $|\psi\rangle$ and the decoded state. After measuring both qubits in the X and Z basis, a logical Pauli operator is applied to the code block in order to complete the teleportation protocol.
In [39] , Aliferis showed that the probability of a logical fault occurring during the teleportation protocol can be bounded by
Here we are assuming that the code block is encoded with k levels of concatenation. Assuming a stochastic noise model where encoded gates fail with probability at most $p^{(k)}$, the probability of a fault in the encoded Bell state is upper bounded by $3p^{(k)}$. Since there are four locations in the physical part of the teleportation circuit, and with $p^{(0)} = p$, the failure probability of this part is bounded by $4p$. Lastly, $p^{(k)}_{\mathrm{dec}}$ is a bound on the probability of failure of the decoding circuit D.
The level-k block is decoded recursively as follows. The level-k circuit comprises level-(k − 1) gates which, when applied to the code block, result in a level-(k − 1) encoded state. Then D is applied again using level-(k − 2) gates, and so on. Assuming that D has $|D|$ locations, we can bound $p^{(k)}_{\mathrm{dec}}$ by
where p (j) is a bound on the failure probability of level-j gates.
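The structure of these bounds can be illustrated with a short Python sketch. Since Eqs. (B1) and (B2) are not reproduced here, the code assumes the simplest union-bound form suggested by the surrounding description: the three contributions (Bell state, physical part, decoder) are added, and the decoder contributes one term per level from $k-1$ down to $0$, each weighted by the number of locations in D. The function names and argument layout are illustrative only.

```python
def decoding_failure_bound(level_probs, num_locations):
    """Assumed union bound on the recursive decoder.

    level_probs   -- [p^(0), ..., p^(k-1)], failure bounds of the gates
                     used at each stage of the recursive decoding
    num_locations -- number of locations in the decoding circuit D
    """
    return num_locations * sum(level_probs)

def teleport_failure_bound(p_levels, num_locations):
    """Assumed sum of the three contributions described in the text.

    p_levels -- [p^(0), p^(1), ..., p^(k)] with p^(0) = p (physical rate)
    """
    p_physical = p_levels[0]
    p_top = p_levels[-1]
    bell_state = 3 * p_top           # encoded Bell state preparation
    physical_part = 4 * p_physical   # four physical locations
    decoder = decoding_failure_bound(p_levels[:-1], num_locations)
    return bell_state + physical_part + decoder
```

For example, `teleport_failure_bound([1e-3, 1e-5], 10)` adds the three terms for a single level of concatenation and a hypothetical decoder with 10 locations. As noted in the text, such bounds overcount, since not every fault location leads to a logical fault.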
In this paper, instead of using the bounds in Eqs. (B1) and (B2), we perform a direct simulation of the teleportation circuit using the methods in Section II in order to obtain smaller constant pre-factors (since not all fault locations will lead to a logical fault).
Appendix C: Overhead analysis of the |H state preparation scheme
In this section we provide a detailed description of the qubit and gate overhead analysis for preparing an encoded $|H\rangle$ state using our error detection scheme.
Qubit overhead analysis
We begin with a few definitions. Let $p_A^{H(l)}$ be the probability that an encoded $|H\rangle$ state at level-l passes the verification test of the circuit in Fig. 5b, and let $n_T^{H(l)}$ be the total number of qubits used to prepare an encoded $|H\rangle$ state at level-l. At the first concatenation level, the largest component of the circuit in Fig. 5b is the EC, which requires 10 qubits in its implementation. Hence we have
for a physical error rate p of the depolarizing noise model.
FIG. 15. Circuit used to measure the logical Hadamard operator for concatenation levels l ≥ 2. The T gates are implemented using the circuit in Fig. 1. At level-l, two level-(l − 1) $|H\rangle$ resource states are required for the entire circuit. We assume that the two resource states can be reused for each parallel implementation of $T$ and $T^\dagger$.
At higher levels, a few considerations are necessary when considering different contributions to the qubit overhead. First, for the level-l circuit in Fig. 4, it is important to take into account the fault-tolerant preparation of the level-(l − 1) $|0\rangle$ and $|+\rangle$ states. The level-(l − 1) $|+\rangle$ state is obtained by first preparing the state $|+^{(l-2)}\rangle^{\otimes 7}$ (which is a +1 eigenstate of all the X-stabilizers), followed by measuring the Z-stabilizers using the circuit of Fig. 14 (which was shown to be fault-tolerant in [21]). Depending on the measurement and flag outcomes, it might be necessary to repeat the measurement of X or Z stabilizers without the flag qubit. Note that at the first level, the circuit in Fig. 14 requires 11 qubits instead of 10 since there needs to be at least one ancilla qubit prepared in the $|+\rangle$ state in order to detect errors of weight greater than two arising from a single fault. A similar protocol is used to fault-tolerantly prepare a level-(l − 1) $|0\rangle$ state.
Next, the details of the implementation of the controlled-Hadamard gates are considered as follows. Since the controlled-Hadamard gates are decomposed as shown in Fig. 2b, with the level-(l − 1) $T$ and $T^\dagger$ gates implemented using level-(l − 1) $|H\rangle$ states as shown in Fig. 1, at level l ≥ 2 the logical Hadamard gate is measured using the circuit in Fig. 15. Due to the way in which we parallelize the circuit in Fig. 15, only two level-(l − 1) resource states are required at each time step, apart from the first and last time steps, where we only need one resource state. In addition, a level-(l − 1) resource state is required for the circuit in Fig. 4.
In order to minimize the qubit overhead, we consider preparing in parallel m Fig. 15. Since there are three time steps where a single resource state is required and six time steps where two resource states are required, the average number of qubits required to implement our error detection scheme at level-l is given by
Gate overhead analysis
Since the circuit in Fig. 5b can be decomposed into three parts, the number of gates required to implement our error detection scheme at level-l can be written as
where n
is the number of gates used for preparing the
is the number of gates required to measure the logical Hadamard operator at level-l and n
ED is the number of gates used in the EC circuit of Fig. 3 at level-l. We used ED instead of EC since the circuit is used in an error detection scheme. It will also be important to analyze the gate overhead of the circuit in Fig. 3 when used in an error correction scheme since all level-l (l ≥ 2) Clifford gates in our circuits consist of extended rectangles where the EC's are used for error correction instead of error detection.
Performing a gate count of the level-1 circuits in Figs. 3, 4 and 15, we find that
and
$^7$ Note that in this section, all quantities $n^{(l)}_G$ should be written as $\langle n^{(l)}_G \rangle$ since we are computing average quantities. However, to avoid cluttering the notation, we omit the brackets.
l are the number of Z and X-basis measurement locations.
In the above, for a gate $G \in \{T, T^\dagger\}$, $n^{(l)}_G = 7^l$ since, as we will show below, we treat the EC's separately from the gates for l ≥ 2. We split the EC circuit into the two components shown in Fig. 3, which we call EC1 and EC2. As we explained in Appendix A 1, depending on the syndrome measurement outcome, the full syndrome measurement can be repeated. Thus n corresponds to the average number of times that the circuits EC1 and EC2 are implemented at level-l. Similarly, $n^{(l)}_{ED2}$ is the average number of times that EC2 is implemented when the EC circuit is used for error detection. These values were obtained through Monte-Carlo simulations with $10^6$ trials (see Fig. 16).
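As a small illustration of the scaling $n^{(l)}_G = 7^l$, the following Python sketch counts the physical gates in a single encoded gate when the EC units are excluded, assuming each level of concatenation replaces one gate by seven gates on the seven qubits of a Steane block. The function name is hypothetical.

```python
def bare_gate_count(level):
    """Physical gates in one level-`level` encoded gate, EC units excluded,
    assuming a factor of 7 per level of Steane-code concatenation."""
    if level < 0:
        raise ValueError("concatenation level must be non-negative")
    return 7 ** level
```

The EC contributions are tracked separately, which is why this count contains no term from the syndrome measurement circuits.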
Performing a gate count of the circuits EC1 and EC2, we have that
Lastly, at concatenation levels l ≥ 2, the $|0\rangle$ and $|+\rangle$ states are prepared fault-tolerantly using the circuit in Fig. 14 to measure the three Z stabilizers of the Steane code, and a similar circuit to measure the three X stabilizers (see the discussion in Appendix C 1). When preparing a logical $|+\rangle$ state, if there is a flag in the circuit of Fig. 14, then there could be a Z error of weight greater than one. Thus one must measure the X stabilizers of the Steane code (without using a flag qubit) to correct the Z errors. If there are no flags but the Z stabilizer measurement outcome is nontrivial, then one must repeat the Z syndrome measurement without the flag qubit. A similar analysis holds for preparing a logical $|0\rangle$ state. Hence, as in Eqs. (C8) and (C9), we must also take into account the average number of times the non-flagged X and Z stabilizers are measured when counting the number of gates required to prepare level-l $|0\rangle$ and $|+\rangle$ states. The averages are obtained by performing a Monte-Carlo simulation (see Fig. 16). In what follows we define n ZS+/0 to be the average number of times these circuits are used when preparing a level-l $|+\rangle$ or $|0\rangle$ state. Performing the level-1 gate count, we have that CNOT + 8n
with n (1)
and n (1)
We now have all the tools to obtain the gate overhead at arbitrary concatenation levels. Recall that at concatenation levels l ≥ 2, all physical gates G in the circuits of Fig. 5b are represented by extended rectangles, which consist of the logical gate G preceded and followed by EC units (in our case, the circuit in Fig. 3). More details on extended rectangles can be found in [27]. When computing the gate overhead of the error detection scheme (Eq. (C5) with l ≥ 2), we must be careful not to double-count overlapping EC's, since consecutive gates will share an EC (see Fig. 17 for an example).
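The double-counting issue can be made concrete with a small Python sketch. It assumes a simplified picture in which every code block carries one EC before its first gate, one EC after its last gate, and exactly one shared EC between any two consecutive gates acting on it. The circuit representation (a list of tuples of block labels) is an illustrative choice, not the bookkeeping used for the counts in this paper.

```python
from collections import Counter

def count_ecs_shared(gates):
    """Number of EC units when consecutive gates on a block share an EC.

    gates -- list of tuples; each tuple lists the code blocks one gate acts on
    """
    gates_per_block = Counter(block for gate in gates for block in gate)
    # A block touched by n gates needs n + 1 EC units under this assumption.
    return sum(n + 1 for n in gates_per_block.values())

def count_ecs_naive(gates):
    """Number of EC units if every extended rectangle is counted separately."""
    return sum(2 * len(gate) for gate in gates)
```

For a logical CNOT on blocks `a` and `b` followed by a single-block gate on `a`, the naive count is 6 while the shared count is 5, since the EC between the two gates on block `a` is counted once.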
Lastly, for l ≥ 2, the T and T † gates are implemented as shown in Fig. 1 . We thus see that the overhead for these gates can be computed recursively using
With the above considerations and using Eq. (C16), the gate overhead can be computed recursively using the following relations. First, the recursive relations for the EC unit are given by
Next, the recursive relations for the |0 and |+ states are
Using Eqs. (C17) to (C23) and the results in Fig. 16 , we can compute the following expressions
Note that n 0 EC = 0 and n
In this section we provide a detailed description of the qubit and gate overhead analysis required to implement the MEK scheme.
Qubit overhead analysis
We first compute the overhead cost of teleporting a physical $|H\rangle$ state to a level-2 $|H\rangle$ state and then performing one round of level-2 MEK.
Let $n^{(T)}_{qij}$ be the qubit overhead cost of a level-i $|H\rangle$ state teleported to a level-j $|H\rangle$ state. From the teleportation circuit of Fig. 13, a level-2 $|0\rangle$ and $|+\rangle$ state must be prepared, each requiring $11^2$ qubits. We assume that these qubits can be reused when performing the logical gates and EC's that follow (since an EC requires only 10 qubits). Including the qubit for the physical $|H\rangle$ state, we have
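The count described in this paragraph can be checked with a few lines of Python. The sketch assumes that the overhead is simply the two encoded blocks plus the single physical qubit, with all other qubits reused as stated. The function and parameter names are hypothetical, and the generalization of the block size to $11^{j}$ for other target levels is an assumption made only for illustration.

```python
def teleport_qubit_overhead(target_level=2, qubits_per_level1_block=11):
    """Qubits needed to teleport a physical state into a level-`target_level`
    block: two encoded blocks (|0> and |+>) plus one physical qubit."""
    block_size = qubits_per_level1_block ** target_level
    return 2 * block_size + 1
```

For the level-2 case considered here, this gives 2 · 121 + 1 = 243 qubits.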
Next we define $n^{(l)}_{q\mathrm{MEK}}$ to be the qubit overhead cost associated with performing a level-l round of MEK. We also define $a^{(l)}_{\mathrm{MEK}}$ to be the probability that a pair of level-l $|H\rangle$ states are accepted in a level-l round of MEK. When implementing a level-l MEK circuit, we must prepare two level-l $|+\rangle$ states and one level-l $|0\rangle$ state. As was explained in Appendix C 1, we require two additional level-l $|H\rangle$ states in order to perform the $C_H$ gates. Thus the level-2 MEK circuit qubit overhead with level-2 $|H\rangle$ states teleported from physical $|H\rangle$ states is given by
Since the MEK circuit produces two distilled $|H\rangle$ states, when we compare the qubit overhead cost to our $|H\rangle$ state preparation error detection scheme, we will divide Eq. (D2) by two. We note that several additional optimizations are possible. For instance, some of the qubits used for teleporting the $|H\rangle$ states that are used to perform the $C_H$ gates could be reused. Additionally, the extra qubits used for preparing the level-l $|0\rangle$ and $|+\rangle$ states for the MEK circuit could also be reused in various parts of the protocol. However, these additional optimizations are likely to depend strongly on the particular architecture being used and its associated constraints. Therefore, to be as general as possible, they will be omitted.
We now consider the overhead cost for performing a round of level-3 MEK. We first teleport the distilled level-2 $|H\rangle$ states to level-3 $|H\rangle$ states. We have that
Using the level-3 $|H\rangle$ states, the qubit overhead for a level-3 MEK simulation is given by
where we will divide Eq. (D4) by two when comparing with the qubit overhead of our error detection scheme. For the case when a level-2 |H state is prepared using our error detection scheme, we must first teleport the level-2 |H state to a level-3 |H state. Defining n
to be the overhead cost for the teleportation step, we have
where $n_T^{H(2)}(p)$ is obtained from Eq. (C4). Hence the overhead cost for performing a round of level-3 MEK is given by
where a
MEKED is the acceptance probability for the level-3 MEK scheme when the level-2 |H states (teleported to level-3) were obtained from our error detection scheme.
Gate overhead analysis
We now consider the gate overhead analysis for the various rounds of MEK considered in Appendix D 1.
Let $n^{(T)}_{gij}$ be the gate overhead cost of a level-i $|H\rangle$ state teleported to a level-j $|H\rangle$ state. Two EC's must be performed after applying the logical CNOT gate. We must also take into account the gate cost for the decoding circuit. The decoding circuit requires the preparation of four $|+\rangle$ states and three $|0\rangle$ states. There are eight CNOT gates and the circuit requires a total of 16 EC's. Let $n^{(D)}_{ij}$ be the number of gates in the decoding circuit for the level-i to level-j teleportation scheme. We have that
where n 
The last term in Eq. (D8) is due to the four physical operations of the decoding circuit. Next we compute the gate overhead of the level-2 MEK scheme. The circuit has 48 logical gates, 177 EC's and requires the preparation of two level-2 $|+\rangle$ states and one level-2 $|0\rangle$ state. Defining n 
Again, when comparing the gate overhead of a level-2 MEK circuit to the gate overhead required to prepare a level-2 $|H\rangle$ state using our error detection scheme, we will divide Eq. (D9) by two since an MEK circuit produces two distilled $|H\rangle$ states. Following the same steps that led to Eq. (D9), it is straightforward to compute the gate overhead for a level-3 MEK circuit. Using 
we get |0 + 48(7 3 ) + 177n
MEK (p)
.
The gate overhead for a level-3 MEK scheme where a level-2 $|H\rangle$ state was prepared from our error detection scheme is obtained as follows. First we compute the gate overhead to teleport the level-2 $|H\rangle$ state to a level-3 $|H\rangle$ state. It is given by 
MEKED is the acceptance probability of the level-3 MEK circuit when the level-2 |H states (teleported to level-3) are obtained from our error detection scheme.
Appendix E: State vector simulations
Here we describe how we simulate each $|H\rangle$ state preparation and MEK magic state distillation circuit. A common method for simulating fault-tolerant error correction is to apply the Gottesman-Knill theorem to track how Pauli errors propagate through stabilizer circuits. However, unlike error correction circuits, circuits for magic state preparation and distillation contain non-Clifford gates that map Pauli errors outside the Pauli group, complicating the analysis. Since all of the circuits we consider act on relatively few qubits, we use the Qiskit state vector simulation [40] to accurately track error propagation through non-Clifford gates.
Each gate, preparation, idle and measurement location is subject to Pauli channel errors whose probabilities are functions of the physical error rate p. These probabilities are determined by the noise model at level-1 and by the logical error probabilities of fault-tolerant gates at level-2 and above (see Section II). For each value of p, we run between $10^6$ and $10^8$ Monte-Carlo samples, where each sample draws an error at each fault location. The circuits are composed with an ideal decoder so that the output state is one or two physical qubits for state preparation or distillation, respectively. The output for each sample i is a pure state vector $|\psi_i\rangle$ and a measurement record $m_i$ on which we post-select.
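The sampling step can be sketched in plain Python as follows. This is a minimal illustration, not the simulation code used for the paper: it assumes a single-qubit channel in which each location suffers a uniformly random non-identity Pauli with probability p, whereas the actual channels (including those for two-qubit gates and for logical gates at higher levels) are the ones defined in Section II. All names are illustrative.

```python
import random

PAULIS = ("I", "X", "Y", "Z")

def sample_faults(num_locations, p, rng):
    """Draw one Pauli per fault location for a single Monte-Carlo sample.

    Assumes a depolarizing-like single-qubit channel: with probability p
    the location gets X, Y or Z (chosen uniformly), otherwise the identity.
    """
    faults = []
    for _ in range(num_locations):
        if rng.random() < p:
            faults.append(rng.choice(PAULIS[1:]))
        else:
            faults.append("I")
    return faults

def fraction_with_faults(num_locations, p, num_samples, seed=0):
    """Fraction of samples containing at least one non-identity fault."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        if any(f != "I" for f in sample_faults(num_locations, p, rng)):
            hits += 1
    return hits / num_samples
```

In a full simulation, each sampled fault pattern would be inserted into the circuit before running the state vector simulator, and the measurement record would then be used for post-selection.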
For simulations of $|H\rangle$ state preparation, we compute overlaps $\langle\psi_i|P|\psi_i\rangle$ for each Pauli P to determine the logical error class, since in this case the logical error is always of this form after ideal decoding. However, for the MEK distillation protocol there are additional failure modes. For example, the two output qubits can be in a maximally entangled state. This can be seen by placing Y errors on the fourth and seventh $|H\rangle$ states, which corresponds to failures of the first $T$ gate and third $T^\dagger$ gate. Therefore, we solve for Pauli channel parameters using an estimate of the output density matrix $\tilde{\rho} \approx \sum_i |\psi_i\rangle\langle\psi_i|$. First, we compute the reduced state $\rho = \mathrm{Tr}_2\,\tilde{\rho}$ of one output qubit. Next, we model the output state as a Pauli channel applied to the state $\rho_H = |H\rangle\langle H|$,
For the parameter ranges we considered, we can solve for the Pauli channel parameters in terms of the matrix elements of ρ,
We substitute these parameters back into the model and verify the density matrices are equal to machine precision. The logical error probabilities and rejection probabilities are fit to functions of the physical error p. State preparation results are summarized in Table IV and distillation results in Table V. For the second round of distillation, the $O(p^4)$ contribution to the logical error is too small to observe with $10^8$ Monte-Carlo trials, so we do two such simulations. The first case uses ideal $|H\rangle$ states and noisy stabilizer operations, while the second case uses noisy $|H\rangle$ states and ideal stabilizer operations. We simulate the second case at higher physical error rates to estimate the coefficient of the $O(p^4)$ term. Finally, we approximate the logical error rate and acceptance probability of the noisy circuit as a piecewise function (see the caption of Table V).
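The density-matrix estimate and the partial trace used for the distillation analysis can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the code used for the results above: the average is normalized by the number of samples, the two-qubit basis is ordered as |00>, |01>, |10>, |11> with the first label being the qubit that is kept, and the function names are hypothetical.

```python
import math

def outer(psi):
    """Return |psi><psi| as a nested list of complex numbers."""
    return [[complex(a) * complex(b).conjugate() for b in psi] for a in psi]

def average_density_matrix(states):
    """Average of |psi_i><psi_i| over the accepted samples (assumed 1/N weight)."""
    dim = len(states[0])
    rho = [[0j] * dim for _ in range(dim)]
    for psi in states:
        proj = outer(psi)
        for r in range(dim):
            for c in range(dim):
                rho[r][c] += proj[r][c] / len(states)
    return rho

def trace_out_second_qubit(rho):
    """Reduced state of the first qubit of a two-qubit density matrix."""
    return [[rho[2 * a][2 * b] + rho[2 * a + 1][2 * b + 1] for b in range(2)]
            for a in range(2)]

# Single-qubit |H> state, the +1 eigenstate of the Hadamard operator.
H_STATE = [math.cos(math.pi / 8), math.sin(math.pi / 8)]
```

For a maximally entangled output such as `[1/math.sqrt(2), 0, 0, 1/math.sqrt(2)]`, the reduced state is the maximally mixed state, which illustrates why single-sample Pauli overlaps are not sufficient for the distillation circuits.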
