We propose two distinct methods of improving quantum computing protocols based on surface codes. First, we analyze the use of dislocations instead of holes to produce logical qubits, potentially reducing the spacetime volume required. Dislocations 8 induce defects which, in many respects, behave like Majorana quasi-particles. We construct circuits to implement these codes and present fault-tolerant measurement methods for these and other defects which may reduce spatial overhead. One advantage of these codes is that Hadamard gates take exactly 0 time to implement. We numerically study the performance of these codes with a minimum-weight decoder and a greedy decoder, using finite-size scaling. Second, we consider state injection of arbitrary ancillas to produce arbitrary rotations. This avoids the logarithmic (in precision) overhead in online cost required if T gates are used to synthesize arbitrary rotations. While this has been considered before 4, we also consider the parallel performance of this protocol. Arbitrary ancilla injection leads to a probabilistic protocol in which there is a constant chance of success on each round; we use an amortized analysis to show that even in a parallel setting this leads to only a constant factor slowdown as opposed to the logarithmic slowdown that might be expected naively.
The surface code [1] [2] [3], in several different variants, is a promising potential platform for fault-tolerant quantum computation. Some results indicate that present-day hardware is approaching the threshold for fault-tolerance 5. In these schemes, the idea is first to implement the Clifford group in a topologically protected way by encoding logical qubits within a particular type of stabilizer code; then, some additional operations are added in a way that is not topologically protected. These extra operations enable universality, and given the ability to implement the Clifford group fault-tolerantly, it is possible to error-correct these additional operations up to a relatively high threshold.
One proposal for achieving universality is to distill magic states which allow implementing T gates by state injection 18. In Ref. 6, it was proposed to implement a quantum computer using alternating rounds of Clifford and T gates. By teleportation, the Clifford gates are implemented in constant time, regardless of their complexity. Essentially, one prepares in advance some entangled state using Clifford operations; then, a measurement is used to teleport the logical qubits of the code while implementing the Clifford operation. Regardless of the complexity of the Clifford operations when expressed in terms of CNOT and Hadamard gates, the time remains constant, but, if the implementation is done with a surface code, the spacetime volume does increase with increasing complexity. This scheme is in fact not specific to the surface code, but can be applied to much more general quantum error-correcting codes. In this scheme, T gates are implemented by state injection; this applies the T gate probabilistically, applying T half the time and $T^\dagger$ half the time; however, since $T^2$ is in the Clifford group, this error can be corrected using a Clifford operation from the surface code. To implement an arbitrary rotation to accuracy δ still requires a time logarithmic in 1/δ, even assuming that the magic states used for the T gates are prepared perfectly, as the rotation must be compiled into T gates and Clifford operations.
In this paper we consider two distinct topics relevant to this scheme. First, we consider using dislocations 8 , instead of holes, to implement logical qubits in a toric code. Second, we consider injecting arbitrary states instead of T gates; this avoids the logarithmic in 1/δ overhead mentioned above.
We provide circuits to implement a dislocation code "in software", by repeatedly measuring the syndromes of the code. In such schemes, where classical control is used to perform error-correction, there remain many interesting questions, such as the ideal circuits to use and the best error-correction scheme to consider. Candidate error-correction schemes include minimum weight matching 2, renormalization-group decoders 12, and matrix-product decoders 13. We do not consider these questions here, ignoring all issues of the classical control required.
We then discuss implementing logical operations. We show that Hadamard can be performed in exactly 0 time. We show how to perform CNOT gates by joint measurements. We discuss several schemes for fault-tolerant measurement, including one that may be useful for other settings. Part of our analysis is based on the distance of the code, and on the complexity of certain operations (Hadamard in particular being much simpler, though CNOT is more complicated). Additionally, we study the statistical properties of error-correction in this code assuming perfect stabilizer measurements; we leave the question of studying imperfect measurements for future simulations.
We then turn to the question of arbitrary ancilla injection in a scheme similar to that mentioned in Ref. 4. Errors in injecting a given ancilla require injecting further ancillas, and so on; this leads to an interesting question: do the possible delays in implementing a gate on one qubit slow down the other gates by a logarithmically divergent amount? We show that this does not hold (given some assumptions on the complexity of the intermediate Clifford gates used); this result may be of interest elsewhere in other schemes such as in forced measurement 17 or repeat-until-success 22.
[...] the same operator is used on every plaquette, while we refer to the choice with Z and X operators on light and dark plaquettes as the "original gauge". When we later refer to working in one gauge or another, we do not mean that we actually apply the Hadamards to the physical qubits to change the stabilizers; rather, this is done for notational reasons to simplify certain operators. The idea of adding dislocations to a toric code (or more generally any $Z_N$ abelian gauge theory) was introduced in Ref. 8.
One simply introduces a lattice dislocation, using this uniform gauge to define the operator on all plaquettes with four qubits in them. Fig. 3 shows 4 dislocations in a square lattice. Each dislocation gives one plaquette surrounded by 5 qubits; on this plaquette we choose a stabilizer which acts on these 5 qubits, with a Pauli Y on the qubit marked with a circle. Dislocations must arise in pairs so that the boundary of the lattice has an even number of qubits around it.
If we introduce k pairs of dislocations to the lattice, there are k − 1 logical qubits (assuming the boundary conditions are chosen either to be magnetic or electric everywhere, without change from one to another; changing boundary conditions introduces additional logical qubits). Borrowing the language of topology, we will say that a cycle is a product of Pauli operators that exactly commutes with the stabilizers. We regard two cycles which are equivalent up to multiplication by stabilizers as being in the same homology class. A trivial cycle is one which is in the same homology class as the identity operator. In an annulus surrounding an odd number of dislocations it is impossible to consistently color plaquettes as light and dark so that plaquettes of the same color do not neighbor each other, but around a pair of dislocations such a coloring is possible. Returning to the uniform gauge in an annulus around a dislocation pair, there are cycles which measure either the electric or magnetic charge inside that annulus; however, these two charges must be equal to each other, so that the annulus contains either the identity particle or the em particle. Now consider 4 dislocations as shown in Fig. 4. Considering the two different nontrivial cycles shown, we find that they anticommute and can thus serve as logical X and Z operators using 4 dislocations to encode a qubit.
We now analyze the performance of these codes using dislocations. Some of our comparisons to other choices of defects will be based on analyzing either the distance of the code or the time complexity of logical operators. At the end of the section, we numerically analyze the probability of logical error at non-zero physical error rate; the threshold for the dislocation code seems to be the same as other surface codes to within finite-size error, as expected since all correspond to the same phase transition in a random-field Ising model or random-plaquette gauge model 2.
We begin by giving circuits to perform syndrome measurements in the dislocation code and discuss error correction. We then compare the performance of the code as a quantum memory in terms of dislocation density. Then, we discuss operations, including a fault tolerant method for measurements.
B. Circuits and Implementation
In order to physically implement the dislocation code, we need a means of measuring stabilizers. For the usual surface codes, it is possible to measure stabilizers using an operation that consists of a total of 8 rounds as reviewed in Ref. 10, using the original gauge. One adds an additional ancilla qubit for each stabilizer. We will refer to the qubits in the code as "data qubits" to differentiate them from these ancilla qubits. Physically the ancilla qubits are centered in the middle of a plaquette so that the four data qubits in the corresponding stabilizer are its neighbors. Then, in the first round, all ancilla qubits are initialized to Z = +1. In the second round, Hadamards are executed on the ancillas corresponding to stabilizers which are products of four X operators; i.e., those in the dark plaquettes. On the third through sixth rounds, CNOT gates are executed between data qubits and ancilla qubits. On the seventh round, the Hadamards are again executed on the ancillas corresponding to products of four X operators. Finally, on the eighth round, the ancillas are measured to read out the stabilizers.
The CNOT gates are executed in a particular pattern. On each of the four rounds from the third through the sixth, each ancilla qubit participates in a CNOT with one of the four neighboring data qubits. The neighboring qubits are visited in a "Z-shaped pattern", such as northwest, northeast, southwest, southeast from the third through the sixth round, respectively. For the ancillas in the light plaquettes, the CNOT uses the ancilla as a target and the data qubit as a source, while for the ancillas in the dark plaquettes, the orientation is reversed.
This pattern is chosen so that in each of these four rounds, every data qubit participates in exactly one CNOT, and further, the sequence of measurements is chosen to maintain the correct commutation relations between stabilizers. That is, consider a pair of stabilizers, 1 and 2, which overlap on a pair of qubits s, t. Let $a_1$, $a_2$ be the ancilla qubits corresponding to the stabilizers. Then, either $a_1$ interacts with s before $a_2$ interacts with s and $a_1$ interacts with t before $a_2$ interacts with t, or else $a_2$ interacts with s before $a_1$ interacts with s and $a_2$ interacts with t before $a_1$ interacts with t.
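The two conditions above can be checked mechanically. The following is a minimal sketch, assuming a small square grid of plaquettes in which the plaquette labeled (r, c) has its northwest qubit at (r, c); the function name `check_schedule` and the grid size are illustrative choices, not taken from the paper. It tests that no data qubit is used twice in one round and that any two plaquettes sharing two qubits touch both in the same relative order.

```python
from itertools import product

# Corner offsets (row, col) of a plaquette whose northwest qubit is at (r, c).
CORNERS = {"NW": (0, 0), "NE": (0, 1), "SW": (1, 0), "SE": (1, 1)}

def check_schedule(order, size=4):
    """Return (one_cnot_per_round, consistent_ordering) for a corner order.

    order lists the corner visited in each of the four CNOT rounds.
    """
    # time[(plaquette, qubit)] = round in which that ancilla touches the qubit
    time = {}
    for r, c in product(range(size), repeat=2):
        for rnd, corner in enumerate(order):
            dr, dc = CORNERS[corner]
            time[((r, c), (r + dr, c + dc))] = rnd

    # Condition 1: no data qubit takes part in two CNOTs in the same round.
    used = set()
    one_per_round = True
    for (plaq, qubit), rnd in time.items():
        if (qubit, rnd) in used:
            one_per_round = False
        used.add((qubit, rnd))

    # Condition 2: plaquettes sharing two qubits touch both in the same order.
    consistent = True
    plaquettes = list(product(range(size), repeat=2))
    for p1, p2 in product(plaquettes, repeat=2):
        if p1 >= p2:
            continue
        shared = [q for (p, q) in time if p == p1 and (p2, q) in time]
        if len(shared) == 2:
            p1_first = {time[(p1, q)] < time[(p2, q)] for q in shared}
            if len(p1_first) != 1:
                consistent = False
    return one_per_round, consistent
```

In this sketch the Z-shaped order `["NW", "NE", "SW", "SE"]` passes both checks, while a clockwise order such as `["NW", "NE", "SE", "SW"]` keeps one CNOT per qubit per round but breaks the ordering condition.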
At first, it seems difficult to implement these measurements using the same number of rounds for the dislocation code, because we cannot use the original gauge. However, in fact there is no obstacle. It is possible to again present an 8 round protocol using the uniform gauge. We begin by describing how to measure all stabilizers except the stabilizers involving five qubits. Compared to the case above, all stabilizers are treated equally, rather than having different operations for light and dark plaquettes. Again we introduce one ancilla per stabilizer, and again on the first round the ancilla qubits are initialized to Z = +1. Then, on the second round, we do a CNOT from the data qubit on the northwest corner to the given ancilla. On the third round, we execute a Hadamard on all ancilla qubits. On the fourth round, we do a CNOT from the ancilla to the data qubit on the northeast corner. On the fifth round, we do a CNOT from the ancilla to the data qubit on the southwest corner. On the sixth round, we again execute a Hadamard on all ancillas. On the seventh round, we execute a CNOT from the data qubit on the southeast corner to the ancilla. On the eighth round, we measure the ancilla. By using the same "Z-shaped pattern" for measurements, we again maintain the correct commutation relations.
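One way to see which operator this 8 round circuit reads out is to pull the final ancilla measurement back through the gates. The sketch below does this with Pauli operators stored as (x, z) bit pairs and signs ignored; the helper names `h`, `cnot` and `measured_operator` are hypothetical, and the corner labels follow the text. Under these assumptions the ancilla ends up recording a product of the form Z X X Z on the NW, NE, SW, SE corners, which should be compared against the paper's own definition of the uniform-gauge stabilizer.

```python
def h(pauli, q):
    """Conjugate by a Hadamard on qubit q: X <-> Z (signs ignored)."""
    x, z = pauli.get(q, (0, 0))
    pauli[q] = (z, x)

def cnot(pauli, control, target):
    """Conjugate by CNOT: X_c -> X_c X_t and Z_t -> Z_c Z_t (signs ignored)."""
    xc, zc = pauli.get(control, (0, 0))
    xt, zt = pauli.get(target, (0, 0))
    pauli[control] = (xc, zc ^ zt)
    pauli[target] = (xt ^ xc, zt)

def measured_operator():
    # Start from the final measurement, Z on the ancilla, and pull it back
    # through the circuit, last gate first (each gate is its own inverse).
    pauli = {"anc": (0, 1)}
    cnot(pauli, "SE", "anc")   # round 7
    h(pauli, "anc")            # round 6
    cnot(pauli, "anc", "SW")   # round 5
    cnot(pauli, "anc", "NE")   # round 4
    h(pauli, "anc")            # round 3
    cnot(pauli, "NW", "anc")   # round 2
    names = {(1, 0): "X", (0, 1): "Z", (1, 1): "Y", (0, 0): "I"}
    return {q: names[bits] for q, bits in pauli.items()}
```

Because the ancilla is initialized to Z = +1 in the first round, the remaining factor of Z on the ancilla contributes +1, so the measurement outcome is the value of the data-qubit product alone.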
This procedure does not allow us to measure the stabilizers at dislocations which involve five data qubits, and so these stabilizers cannot be measured in every round. However, it is possible to measure them less frequently, by also measuring some of the neighboring stabilizers less frequently, so that all stabilizers far from the dislocation are measured in every round and those near are measured in a constant fraction of the rounds.
Error correction using minimum weight matching is in principle the same as other surface codes. Given every pair of stabilizers, one first must compute the minimum weight pattern of errors that leads to defects on that pair of stabilizers. Unlike codes without dislocations, where the only pairs that can be connected are those where both stabilizers correspond to plaquettes of the same color, now it is possible to connect arbitrary pairs of stabilizers. Then, given these weights for pairs of plaquettes, one can compute a minimum weight matching given a set of syndrome measurements.
In fact, this error correction does not require knowledge of the stabilizer near the dislocation that acts on five qubits. For the value of this stabilizer to change, it is necessary for the value of some other stabilizer to change also. Hence, one might try error correction without explicitly measuring these stabilizers.
C. Comparison as Memory
We now consider the density of logical qubits when the code is used as a quantum memory. In the original toric code on a torus, there are 2 logical qubits. On a code of size L-by-L, the distance is d = L. Hence, to achieve distance d means a ratio of physical to logical qubits equal to $d^2/2$. Consider a dislocation code, with a square lattice of dislocations, with dislocations separated from each other by distance L (we assume that the square lattice is much larger than L). Then, the distance of the code is equal to 2L + O(1) (note that the shortest nontrivial cycle must encircle a pair of dislocations). Thus, using 4 dislocations to represent a logical qubit, the ratio of physical to logical qubits is equal to
$$4L^2 = d^2 + O(d).$$
This can be improved using only 2 dislocations for a logical qubit, giving a ratio
$$2L^2 = d^2/2 + O(d);$$
if our goal were simply to obtain a dense memory we could do this, but using 4 dislocations simplifies the implementation of gates, in particular the Hadamard. Now consider a surface code using a square patch of size L-by-L with electric boundary conditions on the north and south sides and magnetic boundary conditions on the east and west sides. In this case, we have 1 logical qubit and again arrive at the ratio of physical to logical qubits of $d^2$. Thus, this choice matches that of the dislocation code; however, as we have noted, the dislocation code can be made denser by up to a factor of 2 using a denser encoding, and further the dislocation code will have advantages when performing operations. We note that this result for the ratio of physical to logical qubits occurs only using the patch oriented as we have shown, with the plaquettes parallel to the boundaries of the square. Often, one instead writes the toric code with the degrees of freedom on the edges of a square lattice; in this case, if the edges of the patch are parallel to the edges of the lattice, we find that the stabilizers are at a 45-degree angle to the edges of the square. Then, the ratio of physical to logical qubits is $2d^2 - O(d)$, which is worse; as an example of this rotated geometry, see Fig. 3 of Ref. 10.
Finally, we can consider a surface code with holes in it, with either electric or magnetic boundary conditions on each hole. Consider a square lattice of holes, with each hole having boundaries of size l and the separation between hole centers equal to L. The distance is equal to min(4l, [...]). Then, using a pair of holes for a logical qubit, the ratio of physical to logical qubits is equal to 2[...] corrections, giving a significantly worse ratio. In fact, the ratio is in practice even worse than that, as in practice the physical qubits inside the holes will also be present, even if they are not being used. The scheme of Ref. 6, and other schemes using the surface code, require the use of holes to perform CNOTs, so it seems not to be possible to use only patches.
Thus, by reviewing these different possibilities, the dislocation code offers asymptotically the highest possible density of any planar code in its dense encoding, and in its sparse encoding it has asymptotically the same density as the optimal patches, while also allowing logical operations. This motivates further consideration of the code.
It is worth also remarking that it is very natural to consider the dislocation code using three dislocations per qubit; note that the logical operators in Fig. 4 only ever encircle some subset of the first three dislocations and do not use the fourth. In this case, it is necessary to have an even number of qubits to have an even total number of dislocations. This increases the density by a factor of 4/3, making it exceed the optimum density for patches and in fact all the logical operations can still be performed as readily using three dislocations per qubit, rather than four.
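The leading-order densities compared in this subsection can be tabulated with a few lines of arithmetic. This is a minimal sketch under stated assumptions: subleading O(d) terms are dropped, the dislocation spacing is taken as L = d/2, and each dislocation is assumed to occupy one L-by-L cell of the square lattice; the function name and dictionary keys are illustrative only. The hole-based layout is omitted.

```python
def physical_per_logical(d):
    """Leading-order physical qubits per logical qubit at code distance d."""
    spacing = d / 2                  # dislocation spacing L, from d = 2L + O(1)
    per_dislocation = spacing ** 2   # assumed: one dislocation per L-by-L cell
    return {
        "toric code on a torus": d ** 2 / 2,
        "square patch": d ** 2,
        "rotated patch": 2 * d ** 2,
        "dislocations, 4 per qubit": 4 * per_dislocation,
        "dislocations, 3 per qubit": 3 * per_dislocation,
        "dislocations, 2 per qubit": 2 * per_dislocation,
    }
```

With these assumptions, four dislocations per qubit match the square patch, two per qubit halve the cost, and three per qubit improve on four by the factor of 4/3 quoted above.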
We have only considered a square lattice of dislocations above. Using a triangular lattice of dislocations (or holes) inside the square lattice of the physical code offers no improvement in distance.
D. Logical Operations and Fault Tolerant Measurements
We now describe how to perform logical operations from the Clifford group in the dislocation code, using a sparse encoding with either three or four dislocations per qubit. We will describe how to implement Hadamard gates and CNOT gates.
Consider a given quadruplet of dislocations, labeled 1, 2, 3, 4. We use a cycle encircling $\gamma_1 \gamma_2$ to measure logical Z for that qubit. We use a cycle encircling $\gamma_1 \gamma_3$ to measure logical X. We implement the Hadamard gates without actually implementing them: each time we implement a Hadamard gate on that qubit, we simply interchange whether we will use $\gamma_1 \gamma_2$ or $\gamma_1 \gamma_3$ to measure logical Z on that qubit when a measurement is next performed. Logical Z or logical X operations can be performed by executing the appropriate products of Pauli operators. Initialization of a qubit in a desired Z = +1 state can be performed by measuring Z and then, if the measurement is −1, performing X (or simply modifying the subsequent operations to take into account the fact that the qubit started in a −1 state). However, it is easier to perform the logical X, Y, or Z operations without actually doing any implementation: these unitaries simply change the sign of the expectation value of certain operators. An S gate can be performed similarly. That is, any single qubit Clifford operation does not need to actually be executed as a circuit, but simply modifies which measurements will be performed in the future.
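The bookkeeping described here amounts to a small classical record per logical qubit. The following is a minimal sketch, assuming the pairs (1, 2) and (1, 3) label the cycles for logical Z and X as in the text; the class name `DislocationQubit` and its methods are hypothetical, and the S gate and overall signs on Y are left out.

```python
class DislocationQubit:
    """Software record of which dislocation pair measures each logical Pauli."""

    def __init__(self):
        # Each entry is [pair of dislocation labels, sign applied to the outcome].
        self.frame = {"Z": [(1, 2), +1], "X": [(1, 3), +1]}

    def hadamard(self):
        # No physical gate: swap which cycle is read out as Z and which as X.
        self.frame["Z"], self.frame["X"] = self.frame["X"], self.frame["Z"]

    def pauli_x(self):
        # Logical X flips the sign of later Z measurements.
        self.frame["Z"][1] *= -1

    def pauli_z(self):
        # Logical Z flips the sign of later X measurements.
        self.frame["X"][1] *= -1

    def measurement_plan(self, basis):
        """Return (dislocation pair to encircle, sign to apply to the outcome)."""
        pair, sign = self.frame[basis]
        return pair, sign
```

For example, after `hadamard()` a request for a Z measurement returns the pair (1, 3); no time step of the quantum circuit is consumed by the update.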
The key to implementing a CNOT is that we have the ability to implement a joint parity measurement $Z_a Z_b$ on two different logical qubits a, b by measuring the appropriate cycle. Using the ability to perform joint parity measurements and an ancilla, we can perform a CNOT; see for example Ref. 16. Thus, all Clifford operations are performed simply by measurements. Note that all these measurements (after arbitrary sequences of Hadamards and S gates) use only three of the dislocations, rather than four, so if the number of qubits is even one can equally well use three dislocations per qubit, which slightly increases the density at the same distance.
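As an illustration of a CNOT built from joint parity measurements, the sketch below simulates one standard construction on three idealized qubits: an ancilla prepared in |+>, a ZZ measurement on control and ancilla, an XX measurement on ancilla and target, a Z measurement of the ancilla, and Pauli corrections. This particular sequence and the correction rule are assumptions chosen for illustration and may differ in detail from the construction of Ref. 16; the qubits here are bare qubits, not encoded logical qubits.

```python
import math
import random

C, A, T = 0, 1, 2  # bit positions of control, ancilla, target

def apply_pauli(state, ops):
    """Apply a product of single-qubit X/Z operators given as [(op, qubit), ...]."""
    for op, q in ops:
        if op == "X":
            state = [state[i ^ (1 << q)] for i in range(len(state))]
        else:  # "Z"
            state = [-amp if (i >> q) & 1 else amp for i, amp in enumerate(state)]
    return state

def measure(state, ops, rng):
    """Projectively measure a Pauli product; bit = 1 means outcome -1."""
    flipped = apply_pauli(state, ops)
    p_plus = sum(abs((a + b) / 2) ** 2 for a, b in zip(state, flipped))
    bit = 0 if rng.random() < p_plus else 1
    sign = 1 if bit == 0 else -1
    proj = [(a + sign * b) / 2 for a, b in zip(state, flipped)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in proj))
    return bit, [a / norm for a in proj]

def cnot_by_parity_measurements(psi_ct, rng):
    """psi_ct: 4 amplitudes indexed by c + 2*t. Returns the state after CNOT(c -> t)."""
    state = [0j] * 8
    for c in (0, 1):
        for t in (0, 1):
            for a in (0, 1):  # ancilla starts in |+>
                state[(c << C) | (a << A) | (t << T)] = psi_ct[c + 2 * t] / math.sqrt(2)
    m1, state = measure(state, [("Z", C), ("Z", A)], rng)
    m2, state = measure(state, [("X", A), ("X", T)], rng)
    m3, state = measure(state, [("Z", A)], rng)
    if m2:           # fix the sign picked up by X_c X_t
        state = apply_pauli(state, [("Z", C)])
    if m1 ^ m3:      # fix the sign picked up by Z_c Z_t
        state = apply_pauli(state, [("X", T)])
    # The ancilla is now in |m3>; read off the control/target amplitudes.
    return [state[(c << C) | (m3 << A) | (t << T)] for t in (0, 1) for c in (0, 1)]
```

The output agrees with a directly applied CNOT up to a global phase, whatever the three random measurement outcomes are; in the dislocation code the corrections would themselves be absorbed into the measurement bookkeeping rather than applied as gates.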
There are two basic techniques described in the literature to perform fault tolerant measurements of logical operators. The operator that one desires to measure is, for example, the product of Zs around some loop. The first method, described in Ref. 11, is to measure all qubits in the Z basis in an annulus that includes that loop. Using classical processing, one can then make the probability of an error in the logical measurement exponentially small in the width of the annulus. This procedure can readily be adapted to the uniform gauge; in this case, we measure each qubit in either the Z or X basis depending upon its location relative to the desired logical operator.
If the loop encircles a single hole (as in Ref. 11) or a pair of dislocations, then in fact one should just measure all qubits in the Z basis inside the loop. Suppose the loop instead encircles, for example, 4 dislocations from two different logical qubits in a dislocation code, and the goal is to measure a joint parity $Z_a Z_b$ on both qubits. In this case, to avoid measuring $Z_a$ and $Z_b$ separately, one must measure just in an annulus near the loop. See Fig. 5, where logical $Z_a = \gamma_1 \gamma_2$ and logical $Z_b = \gamma_5 \gamma_6$. If we measure all qubits inside the dashed line in the appropriate basis, we will measure both $Z_a$ and $Z_b$ separately; if the goal is to measure only $Z_a Z_b$, then we must restrict to measurements inside an annulus.
This technique allows us to perform a logical CNOT in a time that is O(1), independent of the distance between the dislocations. This procedure of using joint parity measurements instead of braiding is reminiscent of the idea of measurement-only topological quantum computing 17. The above measurement technique is used in the teleportation scheme of Ref. 6 to allow teleportation in time O(1).
A second technique for fault tolerant measurement is as in Fig. 15 of Ref. 10. This involves changing stabilizers during the measurement. This measurement technique is used to reduce the spacetime volume required to perform measurements.
We now describe a third fault tolerant measurement that may be useful for measuring logical operators in a dislocation code (or indeed in any other code where we wish to measure a logical operator which is a product of Paulis around a loop). Consider some location in a lattice, around which we wish to measure a logical operator. This location may contain some even number of dislocations, and the particular geometry inside this region is not important for this construction. We choose a gauge such that this operator is a product of Zs, for example, around a loop. Refer to Fig. 6. The original lattice is in (a), with the lattice extending further outside the area shown. The dashed line indicates a location with some unspecified geometry inside. We wish to measure a product of Zs in a loop containing this region. We do this by changing stabilizers, turning off the stabilizers in a ring of plaquettes in (a); for example, we will turn them off in the ring just inside the outermost ring shown in (a). This then disconnects the code into two pieces, as shown in (b). We call these the inner and outer pieces; the inner piece contains the dashed line. The Zs inside ovals in (b) denote additional stabilizers which we turn on involving qubits on the outer boundary of the inner piece. The new stabilizers are the same as those in Fig. 1. This choice of stabilizers prevents there from being any operators supported on the boundary which commute with the stabilizers but which are not themselves products of stabilizers. We similarly turn on stabilizers which are products of Zs on the inner boundary of the outer piece; these stabilizers are not shown.
Crucially, the product of the added stabilizers around the outer boundary of the inner piece is equal to the desired logical operator. Hence, the measurement of these stabilizers gives the logical operator. We cannot expect, of course, that these stabilizers will be measured perfectly. However, we can determine the product using error correction, such as minimum weight matching. The values of these boundary stabilizers are random (subject to the constraint on their product) after they are turned on; this can be mimicked by assuming that the value is equal to +1 for all of them, but allowing there to be an error which flips the sign of a pair of neighboring stabilizers which occurs with probability 1/2. Thus, if we are matching errors in the stabilizers (for example, if one of the boundary stabilizers is equal to −1), the weight to match any two boundary stabilizers is equal to 0. We can alternately assume that the value is equal to +1 for all of them, except for one of them, and repeat the same matching. This gives two different choices, corresponding to whether the product of the stabilizers is +1 or −1; identifying the choice with the lowest weight determines what the product of stabilizers is. This matching can be carried out over some number of rounds, matching errors in spacetime.
In fact, the same matching can also be applied to the stabilizers on the inner boundary of the outer piece, giving an additional way to infer the desired logical measurement. Further, the product of a stabilizer ZZ on the outer boundary of the inner piece and another stabilizer ZZ on the inner boundary of the outer piece is a product of four Zs around a plaquette and the initial value of this product is known with some confidence (as that plaquette stabilizer would have been measured in the configuration in (a) in the figure). Using all this information, however, goes beyond a matching algorithm.
[Figure caption, partial] Stabilizers are turned off in a ring, and additional stabilizers are turned on; we only indicate the new stabilizers turned on which act on the outer boundary of the inner piece, and additionally we do not show four stabilizers acting on the four qubits on the corners of the inner piece as in Fig. 1.
E. Numerical Results
We have numerically analyzed the performance of the dislocation code in the simplest possible error model, assuming that we prepare an initial state, apply random noise, and then attempt to error correct using perfect measurement of the stabilizers. A more realistic treatment, including errors in stabilizer measurement and a more detailed study of circuits, is left for the future. The noise model considered was that errors are produced with probability p on each qubit, and if there is an error it is equally likely to be an X, Y, or Z error.
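The noise model just described is straightforward to sample. A minimal sketch, where the function name `sample_depolarizing` and the list-of-labels output format are illustrative assumptions rather than the authors' simulation code:

```python
import random

def sample_depolarizing(num_qubits, p, rng):
    """Return one of 'I', 'X', 'Y', 'Z' per qubit under the stated noise model."""
    errors = []
    for _ in range(num_qubits):
        if rng.random() < p:
            errors.append(rng.choice("XYZ"))  # equally likely X, Y or Z
        else:
            errors.append("I")
    return errors
```

A decoder study would draw such a pattern, compute the resulting syndrome with perfect stabilizer measurements, and check whether the correction returned by the decoder leaves a logical error.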
We considered two different decoders. One is the standard minimum-weight perfect matching decoder. The second decoder is a greedy decoder, similar to that in Ref. 14. This greedy decoder finds a pair of defect plaquettes with the minimum distance, and matches them; then it finds another pair with minimum weight among those remaining, and matches those, and so on. The main difference from Ref. 14 is in our treatment of edges. When applying this decoder to a patch with open boundary conditions, where it is possible to match a defect in the bulk to an edge, if such a match is possible with weight w, we treat that as being equivalent in cost to matching two defects in the bulk with weight 2w. The reason for this choice is that in this way, it is equally costly to match two bulk defects to the edge each with weight w or match them to each other with weight 2w.
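The greedy rule with the doubled edge cost can be sketched as follows. This is an illustration only: defects are taken to be (row, column) coordinates, the weight between two defects is assumed to be their Manhattan distance, and `edge_distance` is a hypothetical callable giving the weight w of matching a defect to the nearest allowed edge; none of these choices is claimed to reproduce the decoder of Ref. 14 or the one used for the figures.

```python
def greedy_match(defects, edge_distance=None):
    """Greedily pair defects; returns a list of (defect, partner or 'edge', cost)."""
    remaining = list(defects)
    matches = []
    while remaining:
        candidates = []
        for i, a in enumerate(remaining):
            for b in remaining[i + 1:]:
                weight = abs(a[0] - b[0]) + abs(a[1] - b[1])  # assumed metric
                candidates.append((weight, a, b))
            if edge_distance is not None:
                # An edge match of weight w costs the same as a bulk match of weight 2w.
                candidates.append((2 * edge_distance(a), a, "edge"))
        if not candidates:
            raise ValueError("unmatched defect and no boundary available")
        cost, a, b = min(candidates, key=lambda c: c[0])
        matches.append((a, b, cost))
        remaining.remove(a)
        if b != "edge":
            remaining.remove(b)
    return matches
```

With this costing, two defects each at weight w from an edge and at weight 2w from each other are an exact tie, which is the balance the text describes; ties are broken here by enumeration order.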
Various comparisons are made using patches (i.e., squares with alternating electric and magnetic boundary conditions) and dislocations. In the figures, "square" refers to a choice of a patch using the geometry in this paper, while "planar" refers to a geometry rotated 45 degrees as in Fig. 3 of Ref. 10, which has a larger number of physical qubits for the given distance (i.e., the patch is still planar, but it is not square to the lattice directions but rotated). For the dislocation code analysis, we took four dislocations with toroidal boundary conditions; it would be better for analysis of large codes to take a larger number and we leave this for the future.
At low error rates and large enough distances, the performance of the two decoders is comparable; see Fig. 7 . The distance for the patch refers to the linear size of the patch, which in this case is the same as the code distance. At distance 10, the two decoders are equivalent within statistical noise, while they differ at distance 5.
However, the performance of the greedy decoder at larger noise is worse, and its threshold seems to be around p = 0.109. See Fig. 8 for this case for several geometries. The threshold seems to be the same for both types of defects, as expected. In Fig. 9 we show a finite-size scaling collapse, plotting the logical error probability as a function of $(p - p_c) L^{\theta}$ with $p_c = 0.109$ and $\theta = 0.6$. All curves for all different sizes for a given geometry are plotted with the same symbol.
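The rescaling used for such a collapse is a one-line transformation of the horizontal axis. A minimal sketch, with the defaults set to the quoted values of p_c and θ; the function names and the `curves` input format (a mapping from size L to a list of (p, logical error) pairs) are assumptions for illustration:

```python
def scaling_variable(p, L, p_c=0.109, theta=0.6):
    """Rescaled error rate (p - p_c) * L**theta used on the horizontal axis."""
    return (p - p_c) * L ** theta

def collapse(curves, p_c=0.109, theta=0.6):
    """curves maps L -> list of (p, logical_error); returns rescaled, sorted points."""
    return sorted(
        (scaling_variable(p, L, p_c, theta), err)
        for L, points in curves.items()
        for p, err in points
    )
```

If the scaling hypothesis holds, points from different sizes L fall on a single curve after this rescaling; how well they do so is what the choice of p_c and θ is tuned against.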
Finally, we show the performance of the minimum-weight perfect matching decoder in Fig. 10 . Much more substantial drift in the crossing of the curves is observed, and we were not able to obtain a good finite-size scaling collapse, but the crossing is increasing and is consistent with the expectation that p = 1.5 * 0.1094 = 0.1641, using the value of 0.1094 from Ref. 15 for the critical point of the Ising spin glass on the Nishimori line. Interestingly, for the minimum-weight decoder, the dislocation code has a lower logical error rate at threshold than the patch does; the opposite was true for the greedy decoder.
NON-CLIFFORD OPERATIONS
The schemes of Refs. 6,18 rely on using magic states to perform T gates. Using these T gates, it is possible to perform universal quantum computation. However, a logarithmic overhead in the accuracy of the desired rotations arises: to represent a single-qubit rotation by an arbitrary angle φ to an accuracy δ will require logarithmically many (in 1/δ) rounds of T gates and Cliffords. However, the use of state injection need not be limited to implementing T gates. Here, we show that we can make the online cost independent of the desired accuracy δ by injecting arbitrary angles. To define the online cost, we assume that some ancilla factory separately is preparing states of the form $\cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$ for a variety of desired angles, and our goal is to minimize the depth of the circuit that uses these ancillas. One reason to consider this online cost is that the ancilla factory can produce many ancillas in parallel, and so we may wish to cost the gates used to generate the ancillas separately. We assume that the desired quantum circuit is given as a combination of arbitrary single qubit rotations and Clifford operations. The single qubit operations will be by angles θ drawn from some set of angles $\{\theta_1, \theta_2, ...\}$. Assume that we want to perform a rotation by each angle $\theta_i$ a total of $n_i$ times. We will prepare in advance $n_i$ copies of $Y(\theta_i)$ for each i. We will also prepare some number (explained below) of copies of states $Y(2\theta_i)$, $Y(4\theta_i)$, ...
Then, when we wish to perform $R(\theta_i)$, we perform state injection using a copy of $Y(\theta_i)$. If successful, this performs $R(\theta_i)$. If unsuccessful, it performs $R(-\theta_i)$. In the latter case, we immediately follow with a state injection of $Y(2\theta_i)$. If successful, the net operation is $R(2\theta_i)R(-\theta_i) = R(\theta_i)$. If again unsuccessful, we perform state injection with $Y(4\theta_i)$, continuing in this way until we are successful. See Refs. 4; however, those references focused on the case of a single qubit and did not consider the parallelization issue which we now address.
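As a minimal sketch of the repeat-until-success procedure just described, the following Python simulation tracks the net rotation angle and the number of ancillas consumed for a single gate. The function name `inject_until_success`, the parameter `p_success`, and the use of a classical random number generator in place of measurement outcomes are illustrative assumptions; the success probability of 1/2 per attempt is the value used in the text.

```python
import random


def inject_until_success(theta, rng, p_success=0.5):
    """Simulate one rotation R(theta) implemented by repeated state injection.

    Each injection of an ancilla Y(phi) applies R(+phi) on success and
    R(-phi) on failure. Returns (net_angle, ancillas_used).
    """
    net = 0.0      # rotation angle accumulated so far
    phi = theta    # angle of the next ancilla: theta, 2*theta, 4*theta, ...
    used = 0
    while True:
        used += 1
        if rng.random() < p_success:
            net += phi           # success: the correction lands on theta
            return net, used
        net -= phi               # failure: we have applied R(-phi)
        phi *= 2.0               # next ancilla must undo it and add theta
```

Under these assumptions every run ends with a net angle of $\theta$, since after $k$ failures the accumulated angle is $-(2^k - 1)\theta$ and the next ancilla carries $2^k\theta$. The number of ancillas used per gate is geometrically distributed with mean 2.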
Naively, this scheme still leads to a logarithmic overhead in time: each attempt to perform the desired rotation gate has a probability 1/2 of succeeding. If we have $N$ qubits and need to perform rotation gates on all of them, it will take a time roughly $\log_2(N)$ until all of the rotation gates succeed. However, there is no need for other qubits to wait until all of the rotations are done. Suppose, for example, that $N = 4$, and on the first round of the circuit we apply single-qubit rotations to all 4 qubits; on the second round, we wish to apply a CNOT gate from qubit 1 to qubit 2 and another CNOT gate from qubit 3 to qubit 4; and on the third round we again wish to apply single-qubit rotations on all four qubits. Suppose further that the single-qubit rotations succeed on qubits 1, 2, 3 on the first attempt on the first round of the circuit, but fail on qubit 4. In this case, we can perform the CNOT gate from qubit 1 to qubit 2 without waiting for qubit 4 to succeed, while qubit 3 must wait idle until qubit 4 finishes. Then, qubits 1, 2 can start their single-qubit rotations on round 3 of the circuit immediately after doing the CNOT, without waiting. Thus, the parallelization question arises: if we have a circuit with gates, and a gate can execute once all of its input gates finish, and if a gate has a certain probability of finishing each time we try it, what is the overall slowdown? We define this scheme more formally in the next subsection and we show that, given a quantum circuit with $r_{\rm tot}$ rounds of gates and $N$ qubits, the probability that it takes a time $T$ to finish decays exponentially in $T - T_0$ for $T > T_0$, where $T_0 = {\rm const.} \times r_{\rm tot} + {\rm const.} \times \log(N)$. Finally, in subsection 2 C, we discuss the number of ancillas required and certain implementation details.
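To illustrate the naive estimate quoted above, in which every qubit waits until all rotations in a round have succeeded, the following sketch estimates the mean number of attempts needed for $N$ simultaneous rotations. The helper names `rounds_until_all_succeed` and `mean_rounds`, the trial count, and the seed are assumptions made for illustration only.

```python
import random


def rounds_until_all_succeed(n_qubits, rng, p_success=0.5):
    """Attempts needed until every one of n_qubits rotations has succeeded."""
    pending = n_qubits
    attempts = 0
    while pending > 0:
        attempts += 1
        # Each pending rotation fails independently with probability 1 - p_success.
        pending = sum(1 for _ in range(pending) if rng.random() >= p_success)
    return attempts


def mean_rounds(n_qubits, trials=2000, seed=0):
    """Monte Carlo estimate of the mean synchronous waiting time."""
    rng = random.Random(seed)
    total = sum(rounds_until_all_succeed(n_qubits, rng) for _ in range(trials))
    return total / trials
```

With success probability 1/2, multiplying `n_qubits` by 64 should raise `mean_rounds` by roughly $\log_2 64 = 6$, which is the logarithmic growth that the parallel scheme is designed to avoid paying on every round.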
One important aspect of the parallelization scheme is that we will have to assume that all gates in the circuit have a bounded number of wires, and the results depend upon this bound. That is, it is advantageous to have a circuit which does not use completely arbitrary Cliffords but rather uses those Cliffords which can be decomposed into a product of gates acting only on a few qubits at a time. To understand this, consider the example above and suppose that instead of performing a CNOT from qubit 1 to 2 and a CNOT from qubit 3 to 4 on the second round of the circuit, we desired to perform some general Clifford operation on all four qubits which could not be decomposed as a product of operations on fewer than four qubits; in this case, the Clifford would have to wait until all qubits finished the first round. This would be a potential advantage of using the circuits in Ref. 19, for example, in quantum chemistry applications which use fewer Clifford gates. That is, while it is sometimes supposed that there is little advantage in reducing the number of Clifford gates since the non-Clifford gates dominate the costs, in any kind of scheme where the success of the non-Clifford gates is probabilistic, there may be an advantage to simplifying the Clifford gates, as it may allow one to start some of the non-Clifford gates earlier.
A. Parallel Scheme
We now define a formal setting for the parallelization issue raised above. This setting in fact has nothing to do with quantum mechanics; it would be applicable to any situation in which computation is done in a circuit, where some circuit elements take a time to finish that is drawn from an exponential distribution, and a given element cannot start computing until all of its inputs have finished.
We consider a circuit diagram with gates connected by wires. There will be $N$ incoming wires at the start of the circuit and $N$ outgoing wires at the end. The gates are organized into "rounds". Inputs to gates may be either incoming wires or outputs of other gates. Gates whose inputs consist solely of incoming wires to the circuit will be on round 0; otherwise, the round of a gate $G$ is equal to 1 plus the maximum round of gates $G'$ such that an output wire of $G'$ is an input to $G$. Let there be a total of $r_{\rm tot}$ rounds. Each gate has some number of incoming wires, where the number of incoming wires is bounded by some constant $D$. We assume that each gate has the same number of incoming as outgoing wires (one could perhaps consider generalizing to the case that this is not true; note however that if the total number of wires entering or leaving all gates in a given round is bounded and the number of incoming and outgoing wires on every gate is bounded then we can add some "dummy wires" to return to the case where each gate has the same number of incoming and outgoing wires).
Time will proceed discretely and will be labelled by an integer. The evolution will start at time 1. We will define a discrete Markov process, which will label each wire in the circuit by some integer. It will also label each gate by a time at which that gate "finishes". We start by labelling all the incoming wires to the circuit diagram by the integer 0. Initially, no gates have "finished" and so all gates are unlabelled. At time $t$, let $\mathcal{P}(t)$ denote the set of gates which have not yet finished and for which all incoming wires are labelled by a time less than $t$. For each gate in $\mathcal{P}(t)$, with probability $P$, we label the gate as finishing at time $t$ and we also label the wires leaving the gate with time $t$; otherwise, with probability $1 - P$, no change is made to that gate and those wires. These choices are made independently for each gate in $\mathcal{P}(t)$. The circuit is considered to finish when all gates have finished.
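The process just defined can be sketched as a short simulation. This is an illustrative sketch rather than a definitive implementation: the function name `simulate`, the representation of a circuit as a list of rounds whose gates are tuples of wire indices, and the assumption that every wire passes through exactly one gate in every round are all choices made here for simplicity. The example circuit is the four-qubit circuit discussed earlier, with every gate (including the CNOTs) given the same finishing probability.

```python
import random


def simulate(rounds, n_wires, p_finish, rng):
    """Run the discrete Markov process; return the time the last gate finishes.

    rounds[r] lists the gates of round r; each gate is a tuple of wire
    indices, and every wire appears in exactly one gate per round.
    """
    label = [0] * n_wires   # time label carried by each wire (0 = circuit input)
    done = [0] * n_wires    # number of rounds each wire has completed
    t = 0
    while min(done) < len(rounds):
        t += 1
        # Gates whose inputs are all present and labelled by a time less than t.
        ready = [g for r, gates in enumerate(rounds) for g in gates
                 if all(done[w] == r and label[w] < t for w in g)]
        for g in ready:
            if rng.random() < p_finish:   # independent choice for each gate
                for w in g:
                    done[w] += 1
                    label[w] = t
    return t


# Rotations on all four qubits, then two CNOTs, then rotations again.
EXAMPLE_ROUNDS = [[(0,), (1,), (2,), (3,)],
                  [(0, 1), (2, 3)],
                  [(0,), (1,), (2,), (3,)]]
```

Calling `simulate(EXAMPLE_ROUNDS, 4, 0.5, random.Random(0))` returns one sample of the finishing time; with `p_finish=1.0` the circuit finishes in exactly as many time steps as it has rounds.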
We will assume that all gates have the same probability P of finishing; this is not true for the quantum application above, as the Clifford gates always succeed. However, assuming a probability P of finishing for all gates gives a more pessimistic estimate than assuming a probability P of finishing for some gates and a probability 1 of finishing for others.
Our main result, proven in the next subsection, is:
Theorem 2.1. Given any $D$ and any $P > 0$, there exist constants $c_1$, $c_2$ and $c_3 > 0$ such that the following holds. For any such circuit with $N$ incoming and outgoing wires on each round (this $N$ is the number of qubits) and $r_{\rm tot}$ rounds, the probability that the circuit has not finished in a time $T$ is bounded by $\exp(-c_3 (T - T_0))$, where
$$T_0 = c_1 r_{\rm tot} + c_2 \log(N).$$
In fact, our proof will allow adversarial adjustment of the circuit. In particular, consider any sequence of events up to time t. Then, there will be some set W of wires which are already labelled but which enter gates which have not yet finished. We allow the adversary to change arbitrarily the gates which have not yet finished, subject to the constraints that each gate have the same number of incoming and outgoing wires, with at most D such wires.
B. Amortized Analysis of Parallel Scheme
We now perform an amortized analysis of the parallel scheme. For analysis purposes, we modify the circuit as follows. Given any circuit, we add additional gates with at most K wires on rounds r + 1, r + 2, ..., extending the circuit indefinitely. These gates may be thought of as identity gates so that they simply preserve the data without doing any further computation. Then, the original circuit finishes when all gates in the first r tot rounds finish in the modified circuit. This is done to simplify the analysis so that we do not need to separately handle the time at which the circuit finishes. When we refer to the computation "finishing" later, we mean that all gates in the first r tot rounds finish.
Also, to simplify the analysis, we add additional identity gates to the circuit as follows. Suppose that a wire leaves a gate $G$ on round $r$ and enters a gate $G'$ on some round $r' > r + 1$. In this case, we modify the circuit by adding additional identity gates $G_1, G_2, \ldots$ on rounds $r + 1, r + 2, \ldots, r' - 1$ with 1 incoming and 1 outgoing wire each, and connect the wire leaving $G$ into gate $G_1$, then connect the output of $G_1$ into $G_2$, and so on, and finally into $G'$. In this way, there will always be a total of $N$ incoming wires to each round.
We define a weight function after r rounds of the circuit. Let n(t, r) be the number of wires that leave gates at round r which finish at time t; i.e., these wires and gates are both labelled by t.
Let
$$C(t, r) = \sum_{s \le r} n(t, s),$$
so that $C(t, r)$ counts the number of wires that leave gates on round at most $r$ which finish at time $t$. Define a function $W(t, r)$ by
$$W(t, r) = r - \frac{1}{A} \ln C(t, r),$$
where $A$ is a constant to be optimized later. Define the weight after time $t$, $W(t)$, by
$$W(t) = \min_r W(t, r),$$
where the minimum is over all $r$ for which $C(t, r)$ is non-zero. Define $r_{\rm last}(t)$ to be the minimum $r$ such that $n(t, r) > 0$. Note that
$$W(t) \le r_{\rm last}(t),$$
and hence if the computation is not finished then $W(t) \le r_{\rm tot}$. Our analysis is based on showing that the average value of $W(t + 1)$ is at least equal to $W(t)$ plus some positive constant computed below; further, we will show that $W(t) \ge W(0) + vt$ for some $v > 0$ with probability that is exponentially close (in $t$) to 1; see lemma 2.2. This implies the theorem since $W(0) = -A^{-1} \log(N)$ and since if $W(t) > r_{\rm tot}$ then the computation has finished. To obtain the constants in the theorem, put $c_1 = 1/v$, $c_2 = 1/(Av)$.
We briefly motivate the choice of the weight function as follows (this paragraph is purely heuristic and does not play any role in the proof). Suppose at some time $t$ we have some given $n(t, r)$. Suppose all the gates are single-wire gates so that $D = 1$. Then, roughly $(1 - P)\, n(t, r_{\rm last}(t))$ of the gates in round $r_{\rm last}(t)$ will not have finished at time $t + 1$; roughly $(1 - P)^2\, n(t, r_{\rm last}(t))$ will not have finished at time $t + 2$, and so on. We expect that eventually, at a time roughly $t + \log_{1/(1-P)}(n(t, r_{\rm last}(t)))$, the last such gate will finish. Thus, the computation might get delayed by an amount roughly $t - r_{\rm last}(t) + \log_{1/(1-P)}(n(t, r_{\rm last}(t)))$ at time $t + \log_{1/(1-P)}(N)$, where the delay at time $t$ is equal to $t - r_{\rm last}(t)$. However, we can also apply the same analysis to the set of $C(t, r_{\rm last}(t) + i)$ gates on rounds $r_{\rm last}(t), \ldots, r_{\rm last}(t) + i$ which finished at time $t$; at time roughly $t' = t + \log_{1/(1-P)}(C(t, r_{\rm last}(t) + i))$ we expect that the last of these will have completed at least one more round, and hence at that time we will have $r_{\rm last}(t')$ at most equal to $r_{\rm last}(t) + 1$ (it may of course be less if it is one of the gates from rounds less than $r_{\rm last}(t) + i$ that have not completed). The function $W(t)$ we have defined is a minimum over all choices of $r$ of a quantity inspired by this delay calculation: roughly, it is $t$ minus an estimate of the delay. The reason for the constant $A$ is for technically optimizing estimates later.
One further reason for our definition that $W(t, r) = r - \frac{1}{A} \ln(C(t, r))$, rather than $W(t, r) = r - \frac{1}{A} \ln(n(t, r))$, is that given our definition of $W(t, r)$, we have the property that $W(t + 1, r) \ge W(t, r)$ always.
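As a small worked illustration of the weight function, the sketch below evaluates $W(t) = \min_r [\, r - \frac{1}{A} \ln C(t, r) \,]$ for a single time slice, given the counts $n(t, r)$ as a list indexed by round. The function name `weight` and the list representation are assumptions made for this example.

```python
import math


def weight(n_by_round, A):
    """Return W(t) = min over r of [r - ln(C(t, r)) / A] for one time slice.

    n_by_round[r] plays the role of n(t, r); C(t, r) is its running sum over
    rounds <= r. Rounds with C(t, r) == 0 are skipped, as in the definition.
    Returns None if every count is zero.
    """
    best = None
    running = 0                      # C(t, r), accumulated round by round
    for r, n in enumerate(n_by_round):
        running += n
        if running == 0:
            continue
        w = r - math.log(running) / A
        if best is None or w < best:
            best = w
    return best
```

For instance, with all $N$ wires on round 0, `weight([N], A)` evaluates to $-\ln(N)/A$, in agreement with the value of $W(0)$ used in the text. Since the first non-empty round contributes $r - \ln(C)/A \le r$, the result never exceeds the index of that round.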
We now prove the following lemma:
Lemma 2.2. Given any $D$ and any $P > 0$, there exist constants $v > 0$ and $c > 0$ such that the probability that
$$W(t) < W(0) + vt$$
is bounded by $\exp(-ct)$.
Proof. Consider some given situation after time $t$; i.e., our analysis is for a given situation of events on previous times. We will first estimate the average increase $\overline{W(t + 1) - W(t)}$, where the overline denotes the averaging over possible events at time $t + 1$. Consider a given $r$. Let $S(t, r)$ be the set of wires leaving gates in round $r$ which finish at time $t$ (so that $|S(t, r)| = n(t, r)$). On round $r + 1$, each of these wires must participate in either a one-wire gate or a multi-wire gate. If it participates in a one-wire gate, then it has a probability $P$ of finishing round $r + 1$ at time $t + 1$.
If it participates in a multi-wire gate, it is possible that it must wait for some other gate in round $r - 1$ to finish if the other wire in the gate is not in $S(t, r)$; note that the other wires in the gate cannot be in $S(t, s)$ for $s > r$ (the addition of identity gates described above prevents this case). If $n(t, r) \le (D - 1)\, C(t, r - 1)$, it is possible that every single wire in $S(t, r)$ must wait for some other gate to finish round $r - 1$. However, if $n(t, r) > (D - 1)\, C(t, r - 1)$, then there must be some gates in round $r + 1$ which do not need to wait. Indeed, the number of wires entering gates in round $r + 1$ which do not need to wait is at least equal to
$$K(t, r) \equiv n(t, r) - (D - 1)\, C(t, r - 1).$$
If $W(t, r) > W(t) + 1/2$, then we say that round $r$ is not important. We will show that, for sufficiently large $A$, the quantity $W(t, r)$ is likely to increase (and we estimate how likely it is to increase) by at least a constant for the rounds $r$ which are important. We do not consider the change in $W(t, r)$ for the rounds which are not important, as they will have little effect on the minimum over $r$.
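Assume then that $r$ is important. Then $C(t, r) \ge \exp(A/2)\, C(t, r - 1)$. Note that $C(t, r) - C(t, r - 1) = n(t, r)$. Hence,
$$n(t, r) \ge \left(1 - e^{-A/2}\right) C(t, r),$$
and so
$$K(t, r) = n(t, r) - (D - 1)\, C(t, r - 1) \ge \left(1 - e^{-A/2}\right) C(t, r) - (D - 1)\, e^{-A/2}\, C(t, r) = \left(1 - D e^{-A/2}\right) C(t, r) \equiv \omega\, C(t, r),$$
where the last equality serves as a definition of $\omega$. On average at least $P K(t, r)$ of the wires in $S(t, r)$ enter gates which finish round $r + 1$ at time $t + 1$. Hence, on average, $C(t, r) - C(t + 1, r) \ge P K(t, r)$. Using concavity of the logarithm,
$$\overline{W(t + 1, r)} - W(t, r) \ge -\frac{1}{A} \ln(1 - P\omega).$$
For sufficiently large $A$ so that $\omega > 0$, this means that on average $W(t + 1, r) - W(t, r)$ is greater than some positive constant. This does not yet give what we want; we want to show some lower bound on the probability that $W(t + 1, r) - W(t, r)$ is greater than some positive constant for all important rounds $r$. However, by the assumption that $r$ is an important round, we have $C(t, r) \ge \exp(A(r - r_{\rm last} - 1/2))$, so
$$K(t, r) \ge \omega \exp\!\left(A\,(r - r_{\rm last} - 1/2)\right).$$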
Hence, $K(t, r)$ is exponentially large in $r$. The probability that fewer than $P K(t, r)/2$ wires in $S(t, r)$ enter gates which finish round $r + 1$ at time $t + 1$ is exponentially small in $K(t, r)$, as the probabilities that different gates finish are independent (the particular constant $P/2$ in $P K(t, r)/2$ is unimportant and any constant in $(0, P)$ would suffice). Since this probability is exponentially small in $K(t, r)$, it is doubly exponentially small in $r$. If at least $P K(t, r)/2$ wires do enter gates which finish round $r + 1$ at time $t + 1$, then $W(t + 1, r) \ge W(t, r) - \frac{1}{A} \ln(1 - P\omega/2)$. We sum over these probabilities and apply a union bound to upper bound the probability that there is an important round $r$ such that $W(t, r)$ does not increase by some strictly positive constant. For sufficiently large $A$, we can bound this probability by a constant less than 1. Hence, for some sufficiently large $A$, there are some constants $p, c > 0$ such that, with probability at least $p$, $W(t + 1, r) - W(t, r) \ge c$ for all important rounds. Hence, for sufficiently large $A$,
$$W(t + 1) - W(t) \ge c'$$
with probability at least $p > 0$, where $c' = \min(1/2, c)$. This already gives sufficient information to show that $\overline{W(t)} \ge W(0) + p c' t$. Note that $W(t + 1) \ge W(t)$ always. However, we can also show that it is exponentially unlikely (exponentially in $t$) for $W(t)$ not to be at least a constant times $t$ larger than $W(0)$; the proof of this will be similar to the proof of the Chernoff bound. Let $a$ be a negative constant to be chosen later. Note that $\overline{\exp(aW(t + 1))} \le \left((1 - p) + p \exp(ac')\right) \exp(aW(t))$. Hence
$$\overline{\exp(aW(t))} \le \left((1 - p) + p \exp(ac')\right)^t \exp(aW(0)).$$
By Markov's inequality, the probability that $W(t) < W(0) + vt$ is then at most $\left[\left((1 - p) + p \exp(ac')\right) \exp(-av)\right]^t$. For any given $v < pc'$, we can find an $a < 0$ such that this bound is exponentially small in $t$. Hence the lemma follows. To obtain the constants in the lemma, fix some definite $v < pc'$ and minimize the expression on the right-hand side over choice of $a$ in Eq. (13).
C. Number and Accuracy of Ancillas Required, and Implementation Details
Finally, we consider the number of copies of the states $Y(2\theta_i), Y(4\theta_i), \ldots$ that we will need. We consider two different regimes, depending on the magnitude of $n_i$. Suppose there are a total of $A$ different angles that we need in the entire circuit, indexed by $i = 1, \ldots, A$, and suppose that $n_i = 1$ for all $i$. In this case, for each $i$, we need roughly $\log_2(A)$ ancillas, with one copy each of $Y(\theta_i), Y(2\theta_i), \ldots, Y(2^{\log_2(A)}\theta_i)$, in order for the entire computation to be likely to succeed. This leads to an unfortunate logarithmic overhead in the number of ancillas that we need, which may be expensive, as the ancillas $Y(\theta_i)$ already need to be prepared to high accuracy (we discuss the accuracy needed in the ancillas below). However, in many applications, we will have $n_i \gg 1$. For example, in applications in quantum chemistry using Trotter-Suzuki evolution, it may be necessary to have a large number $A \gg 1$ of angles (encoding the large number of coupling constants in the Hamiltonian), but each angle will be used many times, as there will be many different Trotter steps. If $n_i = \Omega(\log^2(A))$ for all $i$, then there is indeed only a constant overhead in the number of ancillas required, as may be seen as follows. Pick some constant $c$ with $1/2 < c < 1$. It is exponentially unlikely (exponentially in $n_i$) that more than $c n_i$ of the gates $R(\theta_i)$ will fail on the first attempt. Hence, if we prepare $c n_i$ copies of $Y(2\theta_i)$ for each $i$, then it is exponentially unlikely that we will not have enough copies of $Y(2\theta_i)$ for the given $i$. Hence, if $n_i \gg \log(A)$, it is unlikely that there will be any $i$ for which we will not have enough copies of $Y(2\theta_i)$. Similarly, it is exponentially unlikely that $c^2 n_i$ copies of $Y(4\theta_i)$ will not suffice, and in general $c^a n_i$ copies of $Y(2^a\theta_i)$ will suffice so long as $c^a n_i$ is large compared to $\log(A)$.
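The claim that more than $c n_i$ first-attempt failures are exponentially unlikely can be checked numerically with an exact binomial tail. The sketch below assumes, as in the text, that each attempt fails independently with probability 1/2; the function name `prob_more_than` is chosen here for illustration.

```python
from math import comb, floor


def prob_more_than(n, c):
    """Exact P[X > c*n] for X ~ Binomial(n, 1/2).

    X models the number of the n gates R(theta_i) that fail on the first
    attempt, so this is the chance that c*n copies of Y(2*theta_i) run out.
    """
    first = floor(c * n) + 1     # smallest integer strictly greater than c*n
    return sum(comb(n, k) for k in range(first, n + 1)) / 2**n
```

For a fixed $c > 1/2$ the returned probability shrinks rapidly as `n` grows, which is the behaviour the counting argument relies on.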
Once we reach large enough $a$ that $c^a n_i$ is of order $\log(A)$, this estimate breaks down, but then we know that $c^a n_i \log(A) \sim \log^2(A)$ extra ancillas suffice (by the argument above in the regime that $n_i = 1$). By summing the geometric series $n_i, c n_i, c^2 n_i, \ldots$ up until $c^a n_i$ and then adding $\log^2(A)$, we find indeed that there is at most a constant overhead. This regime is quite relevant to the quantum chemistry simulation considered in Ref. 20; further, that regime has the advantage that by coalescing different terms by different magnitude we can change the angles needed, possibly reducing the number of different angles required 21.
Now we consider the accuracy of the ancillas that we need. Suppose we need to prepare $Y(\theta_i)$ to an accuracy $\delta$ in order to implement the gate $R(\theta_i)$ to the desired accuracy. One may worry that we will need to prepare the ancillas $Y(2\theta_i), Y(4\theta_i), \ldots$ to higher and higher accuracy. Suppose however that there are a total of $R$ different rotation gates in the circuit that we wish to implement. It is common to argue that ancillas should be prepared to an accuracy $\delta \ll 1/R$ to ensure that the total error will be small. This estimate may be pessimistic, as it assumes that errors add in the worst case rather than possibly averaging. However, if we continue to use this estimate, then since the average number of ancillas that we use to implement the circuit is only a constant factor larger than the number of gates (in fact, twice as large, since each gate has probability 1/2 of succeeding), we need only increase the accuracy $\delta$ by a factor of 2. Further, the accuracy required depends slightly upon the type of errors that arise in state preparation. Suppose that the dominant error is that rather than preparing a state $Y(\theta)$ with the desired angle $\theta$, we prepare the state $Y(\theta')$ for some other angle $\theta'$. In that case, rather than trying to prepare the ancillas $Y(\theta), Y(2\theta), Y(4\theta), \ldots$, we prepare a sequence of ancillas $Y(\theta'), Y(\theta_2), Y(\theta_3), \ldots$, where $\theta'$ is an approximation to $\theta$ within accuracy $\delta$, and $\theta_2$ is an approximation to $\theta + \theta'$ to within accuracy $\delta$, and in general $\theta_{k+1}$ is an approximation to $\theta + \theta' + \theta_2 + \ldots + \theta_k$ to within accuracy $\delta$.
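The effect of this adaptive choice of angles can be illustrated with a short simulation. The sketch reads the sequence above as follows: each new ancilla targets $\theta$ plus the sum of the angles of all ancillas already consumed, and is itself prepared with an error of at most $\delta$. The function name `adaptive_injection`, the uniform error model, and the assumption that the controller knows the angles actually prepared are illustrative choices, not details taken from the text.

```python
import random


def adaptive_injection(theta, delta, rng, p_success=0.5):
    """Net rotation angle when every ancilla angle is off by at most delta.

    After failed injections with prepared angles a_1, ..., a_k the qubit has
    undergone R(-(a_1 + ... + a_k)), so the next ancilla targets
    theta + a_1 + ... + a_k.
    """
    prepared_sum = 0.0   # sum of the angles of the failed injections so far
    while True:
        target = theta + prepared_sum
        actual = target + rng.uniform(-delta, delta)   # imperfect preparation
        if rng.random() < p_success:
            return actual - prepared_sum   # success: R(actual) is applied
        prepared_sum += actual             # failure: R(-actual) is applied
```

In this model the net rotation differs from $\theta$ only by the preparation error of the final, successful ancilla, so the error stays within $\delta$ however many attempts were needed.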
In actual practice, using this scheme requires many teleportation steps. For example, if the first state injection fails, one might teleport the qubit elsewhere and try a second state injection. If that state injection also fails, the qubit must be further teleported elsewhere for yet another state injection. This means that even after a state injection succeeds on the qubit, some time might be spent teleporting it back to the desired location. However, this sequence of teleports to bring it back can be done in a time that at most doubles the time to do the state injection. Imagine a sequence
