Most work in quantum circuit optimization has been performed in isolation from the results of quantum fault-tolerance. Here we present a polynomial-time algorithm for optimizing quantum circuits that takes the actual implementation of fault-tolerant logical gates into consideration. Our algorithm re-synthesizes quantum circuits composed of Clifford group and T gates, the latter being typically the most costly gate in fault-tolerant models, e.g., those based on the Steane or surface codes, with the purpose of minimizing both T -count and T -depth. A major feature of the algorithm is the ability to re-synthesize circuits with additional ancillae to reduce T -depth at effectively no cost. The tested benchmarks show up to 65.7% reduction in T -count and up to 87.6% reduction in T -depth without ancillae, or 99.7% reduction in T -depth using ancillae.
Introduction
Quantum computers have the potential to efficiently solve important computational problems, including integer factorization [29] and quantum simulation [18] , for which there are no known efficient classical algorithms. However, even with recent advances in quantum information processing technologies [6, 7, 8, 26] , the prospects of scalable quantum computing without some systematic way of mitigating physical errors and noise are bleak.
The active and vibrant fields of quantum error correction and fault-tolerance provide such tools for constructing scalable quantum computers. By combining physical qubits through the use of error correcting codes and providing fault-tolerant logical operations, larger computations can be achieved with high fidelity -by concatenating codes, or in topological codes by increasing code distance -provided the physical operations achieve a certain threshold fidelity. With recent improvements to fault-tolerant thresholds [5, 14, 15] , scalable quantum computation is becoming more and more viable, resulting in a growing need for efficient automated design tools targeting fault-tolerant quantum computers.
Quantum circuit synthesis and optimization is particularly important, given the prevalence of the circuit model of quantum computation, but previous work has been largely isolated from the unique concerns of fault-tolerance. While at the physical level, coupled gates are generally the hardest to perform, most of the common quantum error-correcting codes have efficient CN OT implementations. Moreover, for fault-tolerant models based on (double even, self-dual) CSS codes, e.g., the popular Steane code, as well as the promising surface codes, the Clifford group can be implemented as logical gates with little cost [14, 33] .
For universal quantum computing, however, at least one non-Clifford group gate is needed, which typically requires large ancilla factories and gate teleportation to implement fault-tolerantly. As the non-Clifford T gate has known constructions in most of the common error correction schemes, the standard universal fault-tolerant gate set is taken to be "Clifford + T ". Given the high cost of the fault tolerant implementations of the T gate [1, 14] , exceeding the cost of Clifford group gates by as much as a factor of a hundred or more, it has recently been proposed that efficient circuits should minimize the number of T gates, and more specifically the number of T gates that cannot be performed in parallel [2, 13] -we define these metrics as a circuit's T -count and T -depth, respectively. Indeed, Fowler [13] shows how to perform fault-tolerant computations in time proportional to one round of measurement per layer of T gates, and as a result the T -depth directly determines a circuit's runtime. Likewise, reducing the number of T gates reduces the number of ancilla states that require preparation, vastly reducing circuit volume and at the same time increasing the fidelity of the computation. While the primary purpose of our work is to optimize T -depth (circuit runtime), our algorithm also provides significant reductions to T -count (circuit volume).
Some recent work has been done concerning minimization of T -depth [2, 28] , though these previous results focus on finding small optimal two-and three-qubit circuits [2] , and on classes of circuits that can be parallelized to T -depth 1 by adding ancillae [2, 28] . By contrast, we report a scalable automated tool for the optimization of T -depth that functions with or without ancillae, and is not limited to a few qubits or a specific class of circuits. In particular, we present a polynomialtime algorithm for optimizing both the T -depth and T -count of quantum circuits composed of Clifford group and T gates. The algorithm also makes automatic use of ancillae to optimize T -depth, with the addition of ancillae typically decreasing the runtime of our software implementation. Our experiments show on average 61.1% reduction in T -depth and 39.9% reduction in T -count without adding any ancillae, using the available benchmarks. When the use of ancillae is allowed, the average T -depth reduction is demonstrated to be as high as 80.7% (the more ancillae are allowed the more parallelization becomes possible, in some cases reducing T -depth by as much as 99.7%).
The rest of this paper is structured as follows: Section 2 reviews some background on quantum and reversible computation, and introduces the notations we will use. Sections 3 and 4 describe the algorithmic core -a procedure that optimally parallelizes the T gates in a circuit composed of CN OT and T gates by performing matroid partitioning. Section 5 develops a heuristic extending the optimal {CN OT, T } core to a universal gate set, while Section 6 describes the final algorithm. Section 7 reports our experimental results, and Section 8 concludes the paper.
Preliminaries
We begin by reviewing some basic facts about quantum and reversible circuits necessary for this paper.
In the classical circuit model, the state of a system of n bits is represented as a binary string of length n, with classical gates corresponding to operators that map length-n binary strings to length-m binary strings. More precisely, length-n binary strings are vectors of F n 2 , where F 2 is the two-element finite field with addition corresponding to Boolean exclusive-OR (EXOR, ⊕) and multiplication corresponding to Boolean AND (∧). We then represent classical gates as operators f : F n 2 → F m 2 , and we typically refer to f as a (classical) function. For brevity, if m = 1 we call f Boolean.
The quantum circuit model, one of the prominent models of quantum computation [24] , generalizes the classical circuit model to deal with quantum effects. In particular, it describes the state space of a system of n qubits as a vector in a 2 n -dimensional complex vector space H spanned by the (classical) n bit states. By convention, we refer to the classical states as the standard or computational basis of H and write them in Dirac notation: |x , x ∈ F n 2 . In contrast to the classical circuit model, quantum gates are restricted to a subset of all operators on H -specifically, quantum gates are linear operators U : H → H that preserve the L 2 norm. Such operators U satisfy U † U = U U † = I, where U † denotes the adjoint of U , and are known as unitary. Given that unitary operators are invertible, we see that the subset of quantum transformations that permute the computational basis states are exactly the set of invertible classical transformationswe call such functions reversible, with the intuition that any computation performed by reversible functions can be undone or reversed. The Toffoli gate,
is an example of a reversible function.
We can also have classical/quantum computations that use ancillae -being bits/qubits that can be initialized to the 0/|0 or 1/|1 state and act as a temporary register. Without loss of generality, we require that all ancillae are initialized in the 0/|0 state. In the case of a circuit with n bits, m of which are data bits (i.e. n − m is the number of ancillae), we describe the state space as some subspace V of F n 2 with dimension m. We will typically use n to represent the total number of bits in a system, and m to refer to the number of data bits.
While all reversible classical gates are linear as operators over H, they need not be linear as operators over F n 2 . In particular, we call f :
For instance, the reversible controlled-NOT gate CN OT : |x |y → |x |x ⊕ y is linear over F n 2 . It is a known result that the set of all linear reversible functions are those that can be computed by a circuit consisting of only CN OT gates [25] .
Throughout this paper we will also be interested in linear Boolean functions and their relation to linear reversible functions. For convenience, we refer to the set of n-ary linear Boolean functions F n 2 → F 2 as the dual vector space (F n 2 ) * of F n 2 , and note that a linear Boolean function f : F n 2 → F 2 can be represented as a row vector over F n 2 -i.e., x T for some x ∈ F n 2 . Furthermore, for a set of linear Boolean functions S ⊆ (F n 2 ) * , we define rank(S) as the maximum number of independent (row) vectors in S, or equivalently the dimension of the subspace V * spanned by S.
As this paper is concerned with the optimization of quantum circuits, we also define some quantum gates commonly used in fault tolerant models. In particular, we define the T and Hadamard gates,
We will show that circuits over the gate set {CN OT, T } implement linear reversible functions with discrete phases corresponding to the eighth roots of unity. A side result of this paper is a proof that {CN OT, T } circuits can be simulated on a classical computer in polynomial time. If this set is further extended with the Hadamard gate we achieve a gate set that is universal for quantum computation. We call {H, CN OT, T } the "Clifford + T " gate set, as it contains the Clifford group generators {H, P := T 2 , CN OT } along with the T gate. Moreover, {H, CN OT, T } is a minimal generating set for the Clifford + T gate set. Since quantum gates are commonly defined by unitary matrices, we provide the equivalent matrix definitions below:
Computable sets of linear Boolean functions
While every linear reversible function f over n inputs can be written as n linear Boolean functions, i.e. f (a 1 , a 2 , ..., a n ) = (f 1 (a 1 , ..., a n ), ..., f n (a 1 , ..., a n )) , not every set of linear Boolean functions defines a reversible function. For instance, f (a 1 , a 2 , a 3 ) = (a 1 , a 2 , a 1 ⊕ a 2 ) is irreversible since the input a 3 is effectively destroyed. It is easy to verify that a set of n linear Boolean functions forms the outputs of an n-ary linear reversible function if and only if they have rank equal to the dimension of the input space.
Lemma 1. Given a subspace V of F n 2 and a set of linear Boolean functions S = {f 1 , f 2 . . . f n } ⊆ V * , the linear function f : V → W defined as f (a 1 , a 2 , . . . , a n ) = (f 1 (a 1 , . . . , a n ), . . . , f n (a 1 , . . . , a n )) is reversible if and only if rank(S) = dim(V ).
Since the unitary quantum circuit model is reversible, a set of linear Boolean functions S can only be computed simultaneously (i.e. there exists a quantum circuit implementing the transformation |a 1 a 2 ...a n → |b 1 b 2 ...b n where for each f ∈ S, f (a 1 , a 2 , ..., a n ) = b i for some i) if it defines a reversible function. We call such a set of linear Boolean functions (reversibly) computable over a particular input space V -as we will be concerned strictly with reversible computations, we frequently omit the qualifier reversible.
We may also want compute a set S of linear Boolean functions with a linear reversible function on n > |S| qubits. In this case, since every n-ary linear reversible function corresponds to some set of n linear Boolean functions, a linear reversible function computes S if and only if there exists some computable superset S ′ of S. Equivalently from Lemma 1, a set S ⊆ (F n 2 ) * is (reversibly)
Figure 1: A circuit computing the functions
computable over a subspace V of F n 2 if and only if there exists a superset S ′ of S with cardinality n such that rank(S ′ ) = dim(V ).
Before moving on, we establish the condition under which a computable superset of linear Boolean functions exists. Given a set S = {f 1 , f 2 , ..., f k } ⊆ (F n 2 ) * and subspace V , we can observe that we only need to find some f k+1 , ..., f n ∈ (F n 2 ) * such that rank ({f 1 , f 2 , ..., f n }) = dim(V ). It is not hard to see that such f k+1 , ..., f n exist if and only if the number of linearly independent vectors in (F n 2 ) * needed is at most n − k.
Lemma 2. Given a subspace V of F n 2 and a set of linear Boolean functions S ⊆ (F n 2 ) * , there exists a superset S ′ of S with cardinality n such that rank (S ′ ) = dim(V ) if and only if
We can note that inequality (1) implies |S| = rank(S), i.e., S is linearly independent, when dim(V ) = n.
{CN OT , T } circuits
We first consider circuits over the gate set {CN OT, T, P := T 2 , Z := T 4 , T † := T 7 , P † := T 6 }, as they have a particular property that will be crucial to synthesizing low T -depth circuits. We remind the reader that the even powers of the T gate are all Clifford gates, whereas all odd powers lie outside the Clifford group. This is an essential observation for practical considerations. Furthermore, no power of the T gate requires more than a single non-Clifford T gate to implement. We usually omit the extraneous gates and refer to this gate set by the generating set {CN OT, T }.
It can be observed that since CN OT |x |y = |x |x ⊕ y and T |x = e iπx 4 |x for x, y ∈ F 2 , a {CN OT, T } circuit can be described as computing a linear reversible function on the input basis state, with an added phase that is some power of ω := e iπ/4 . Stated more precisely [2] , Lemma 3. A unitary U ∈ U (2 n ) is exactly implementable by an n-qubit circuit over {CN OT, T } if and only if
for some linear reversible function g ∈ F n 2 → F n 2 and linear Boolean functions f 1 , f 2 , ..., f l ∈ (F n 2 ) * with coefficients c 1 , c 2 , ..., c l ∈ Z 8 .
As a result of Lemma 3, we can fully characterize any unitary U ∈ U (2 n ) implementable by a {CN OT, T } circuit with a set S ⊆ Z 8 × (F n 2 ) * of linear Boolean functions together with coefficients in Z 8 , and a linear reversible output function g : F n 2 → F n 2 , with the interpretation
Moreover, S and g are efficiently computable given a {CN OT, T } circuit, taking time linear in the number of qubits and gates. In computing S it also becomes apparent when T gates, possibly physically separated within the circuit, are applied to the same value and can thus be replaced by a phase gate such as P : |x → ω 2·x |x . Our experimental results show that a large number of T gates can be removed in this way.
Example 1. Consider the following circuit.
If we track the effect of the CN OT gates we see that T 1 and T † 2 (indices are used to mark different T gates within the circuit) are both applied to qubits in the state |x 1 , resulting in a cumulative phase of ω x 1 +7x 1 = ω 8x 1 = 1 x 1 = 1. As such, both gates can be removed. Likewise, T 3 and T 4 are both applied to qubits in the state |x 3 ⊕ x 4 , and their phases combine to ω 2(x 3 ⊕x 4 ) -this pair of T gates can thus be replaced with a single P gate. This results in an optimized circuit:
which may be optimized further to the form SW AP (x 1 , x 2 )CN OT (x 4 , x 3 )P (x 3 ) by rewriting the linear reversible section.
Once S and g have been computed, the proof of Lemma 3 [2] gives a constructive method for synthesizing a circuit implementing U S,g . However, this naïve method of re-synthesis may end up with worse T -depth than the original circuit, despite possibly reduced T -count.
We can instead recall from Section 2 that if a subset A ⊆ S is reversibly computable (i.e. if the linear Boolean functions in A are reversibly computable), we can construct a linear reversible function with outputs simultaneously computing the functions in A; in this case |A| of the necessary phase factors could then be applied in parallel. Synthesis can thus proceed by partitioning S into computable subsets, then for each partition A ⊆ S first compute a (reversible) superset of A with a stage of CN OT gates; many efficient algorithms exist [4, 25, 20] that can decompose a linear reversible function into CN OT gates. Next we apply the relevant phase gates in parallel to add the phase ω (c,f )∈A c·f (x 1 ,x 2 ,...,xn) , and finally uncompute by reversing the CN OT stage. Given that at most one T gate is used to implement any integer power of T , every partition will have a T -depth of at most 1.
Figure 2: {CN OT, T } circuit implementing the doubly controlled Z gate.
As a result, any unitary U implementable over {CN OT, T } can be implemented in T -depth k where k is the minimum number of sets partitioning 1 S 1 = {(c, f ) ∈ S|c ≡ 1 mod 2} into computable subsets, as elements in S 0 = S \ S 1 do not require T gates to implement. In fact, we can trivially see that given a specific set S and output g, k is the minimal T -depth, as any layer of T gates in a circuit implementing U S,g corresponds to a computable subset of S.
It is important to note that while this method reduces and maximally parallelizes the T gates, it will not necessarily find the optimal T -count and by extension T -depth. Specifically, given a {CN OT, T } circuit with phase defined by the set S there may be some distinct S ′ that defines an equivalent computation using fewer T gates. Consider, for instance, the set
2 ) * . We note that the integer sum
2 , and 0 for (0, 0, 0, 0). Since ω 8 = ω 0 = 1, S ∅ computes the trivial phase on every input and is therefore equivalent to the empty set ∅. In this case, all |S ∅ | = 15 T gates can then be removed.
It turns out, through a brute force search, that for n < 4, no two sets S, S ′ requiring distinct numbers of T gates define equivalent computations. As a result, we obtain a direct proof that the doubly controlled-Z gate requires 7 T gates to implement over {CN OT, T } with any number of ancillae. For n ≥ 4 however, the problem of minimizing T -count in {CN OT, T } circuits reduces to a minimization problem over degree 1 polynomials in mixed arithmetic; moreover, since every such polynomial defines the global phase for some {CN OT, T } circuit, the two problems are in fact equivalent.
Parallelizing Λ 2 (Z)
To illustrate {CN OT, T } re-synthesis, consider the circuit in Figure 2 , implementing the doubly controlled-Z gate,
over {CN OT, T }. We track the effect of each CN OT gate on the state of the qubits, as annotated in the circuit. When a T /T † gate is applied, we add/subtract a term in the exponent of ω corresponding to the state of the target qubit. The resulting transformation is then
The T -stages in Figure 2 also correspond to a partition of S, specifically
Figure 3: T -depth 2 implementation of Figure 2 with one ancilla.
It can easily be verified that for each subset S in this partition, dim(V ) − rank(S) ≤ n − |S| (i.e. S satisfies (1)), since dim(V ) = n = 3 and each subset is linearly independent. By contrast, the partition
could not have been synthesized as the set {x 1 ⊕ x 3 , x 2 ⊕ x 3 , x 1 ⊕ x 2 } has rank 2, thus the mapping
is not reversible. While we haven't yet described how to find a minimal computable partition algorithmically, we can observe that the T -depth 3 partition
is computable since each subset satisfies (1), and is also minimal. If, however, we added an ancilla when re-synthesizing Figure 2 , we can use the extra qubit to generate a smaller partition. In particular, we know
is as well. We can then compute the value x 1 ⊕ x 2 ⊕ x 3 into the ancilla with 2 CN OT gates and apply 4 T gates (one for each qubit) at the same time. To connect this intuitive idea with equation (1), we observe that n = 4, dim(V ) = 3 since the input |x 1 x 2 x 3 |0 spans a space of dimension 3, and Figure 3 shows the resulting circuit, implementing Λ 2 (Z) in T -depth 2.
Matroids
We now turn our attention to the problem of determining a minimal partition of a set of linear Boolean functions into computable sets. Due to the connection between the computability condition (1) on sets of linear Boolean functions and linear independence, we are able to phrase the problem as a matroid partitioning problem. To do so, we first introduce the concept of a matroid, an algebraic structure that generalizes the idea of linear independence in vector spaces.
Definition 1.
A finite matroid is a pair (S, I) where S is a finite set and I is a set of subsets of S such that
2. For all A, B ⊂ S, if A ∈ I and B ⊂ A, then B ∈ I.
3. For all A, B ∈ I, if |A| > |B|, then there exists some a ∈ A such that B ∪ {a} ∈ I.
It turns out that a set of linear Boolean functions, together with an independence relation defined by the equality (1), forms a matroid:
Lemma 4. For any subspace V of F n 2 with dimension m and set of linear Boolean functions S = {f 1 , f 2 , . . . , f k } ⊆ V * , let I denote the set
The pair (S, I) is a finite matroid.
Proof. We verify that (S, I) satisfies all three conditions of Definition 1.
1. m − rank(∅) ≤ n − |∅| is trivially true since m ≤ n. Thus ∅ ∈ I. Otherwise, rank(A) > rank(B) and we can let A ′ and B ′ be maximal linearly independent subsets of A and B, respectively. Since
With Lemmas 2 and 4 we see that the problem of finding a minimal partition of the phase factors in a {CN OT, T } circuit reduces to the more general matroid partitioning problem. Figure 4 : The directed graph G s constructed when adding x 1 ⊕ x 2 to the minimal partition
Matroid partitioning
The matroid partitioning problem can be defined as follows:
Matroid partitioning can, perhaps surprisingly, be solved in polynomial time, given an independence oracle for the matroid [12] . As a result, given an oracle for (1), the T gates in a {CN OT, T } circuit can be optimally partitioned efficiently. Since the condition in Lemma 2 can be checked by using Gaussian elimination to compute the matrix rank in O(n 3 ) time, 2 we thus see that a minimal partition can be computed in polynomial time. The rest of this section describes an algorithm for computing such a minimal partition.
The algorithm we use for solving the matroid partitioning problem is based on an algorithm due to Edmonds [12] . Given a matroid (S, I) and a minimal (matroid) partition P of S ′ ⊂ S, we take an element s ∈ S \ S ′ not already partitioned and construct a minimal partition of S ′ ∪ {s}. To create the new partition, we construct a directed graph G s containing a vertex u for every u ∈ S ′ ∪ {s} as well as a vertex ⊥ p for every subset p ∈ P . The edges of G s represent changes to the partition that are invariant under the property of each subset being independent. In particular, for any u, v ∈ S ′ ∪ {s} there is a directed edge v → u in G s if and only if u is contained in some subset p ∈ P and (p \ {u}) ∪ {v} ∈ I, i.e. v can be added to p if we remove u. Additionally, given a subset p ∈ P and element u ∈ S ′ ∪ {s}, there exists an edge u → ⊥ p if and only if p ∪ {u} ∈ I. A path from s to ⊥ p for some subset p gives a set of updates to P that produce a valid partition P ′ of S ′ ∪ {s}. Likewise, if there is no such path, there is no partition of size |P | partitioning S ′ ∪ {s} (see [12] for a proof), and so a new subset {s} is added to P . Figure 4 shows the full graph G s for one iteration when computing a minimal partition for Λ 2 (Z) (Figure 2) .
Rather than generating the graph G s explicitly for each element s, we try to construct a path from s to some ⊥ p breadth-first (Algorithm 1). The time complexity of breadth-first search is O(|E| + |V |) for a graph with edge set E and vertices V . We can note that there are |S ′ | + |P | + 1 vertices and at most |S ′ | 2 + |P | · (|S ′ | + 1) + (|S ′ | + |P |) edges in G s , as well as the fact that 
Algorithm 1 Matroid partitioning algorithm
function Partition(s, P, (S, I)) /* I denotes the independence oracle, P is a minimal partition */ Create path queue Q, Q.enqueue(s → ∅) Unmark each element of S, mark s while Q non-empty do t := Q.dequeue()
Mark u end if end for end if end for end while If no path was found, set P := P ∪ {s} end function Algorithm 1 details the algorithm for adding an element s to a partition P of matroid (S, I) -the full algorithm follows by iteratively adding each element of the ground set to the initially empty partition, and correctness follows from the property that if P is minimal for (S, I), then the new partition P ′ is minimal for (S ∪ {s}, I). Since adding a single element to a partition of i elements takes O (2i) 2 · n 3 + 2i time, and
, we see that Algorithm 1 can be used to partition the full set S in O(|S| 3 · n 3 ) time.
Extending to a universal gate set
In the previous sections we described a method for re-synthesizing {CN OT, T } circuits that removes redundant T -gates by computing the total phase and parallelizing the phase gates through matroid partitioning. However, the usefulness of such an algorithm on its own is marred by the fact that {CN OT, T } circuits are a restricted class of quantum circuits -in particular, they do not create superpositions or interference between basis states. To apply the optimization procedure to more complex quantum circuits, we extend these ideas to deal with the universal gate set {H, CN OT, T }.
We recall that a Hadamard gate H has the effect
on a basis state x 1 ∈ F 2 . We call x 2 a path variable, following in the tradition of similar representations of quantum circuits [10, 27] , called sum over path representations. We note that the state |x 1 effectively ceases to exist, having been replaced with |x 2 . Circuits over {H, CN OT, T } can then be described by a phase polynomial and set of linear Boolean outputs, similar to Lemma 3.
Lemma 5. If a unitary U ∈ U (2 n ) is exactly implementable by an n-qubit circuit over {H, CN OT, T } with k H gates, then for x 1 x 2 ...x n ∈ F n 2 ,
..,x n+k ) |y 1 y 2 ...y n where y i = h i (x 1 , x 2 , . . . , x n+k ) and
for some linear Boolean functions h i , f i , g i and coefficients c i ∈ Z 8 . The k path variables x n+1 , ..., x n+k result from the application of Hadamard gates.
Proof. Follows from the effect of each gate on the computational basis states.
It can be noted that unlike Lemma 3, the converse is not true -some computations of the form in Lemma 5 do not define unitary transformations.
Synthesis of a circuit based on the above representation is more challenging. In particular, the Hadamard gates in effect destroy values and create new ones, changing the state space to some new subspace of F n+k 2 , possibly with greater dimension if the destroyed value was linearly dependent with the other qubits (e.g. an initialized ancilla). Each linear Boolean function is likewise computable only in some of the possible state spaces. To re-synthesize we then need to apply Hadamard gates in such a way as to be able to pick up each phase factor and end in the state space span ({y 1 , y 2 , ..., y n }).
Since the application of a Hadamard gate changes the state space, the state of each qubit must be chosen so that the qubits span a suitable space afterwards. As an illustration consider the transformation
We could achieve the correct phase by applying a Hadamard gate on the second qubit first, but the resulting state would be |x 1 x 3 , from which we cannot directly construct the output state |(x 1 ⊕ x 2 )x 3 . The simplest way to choose a suitable state for each qubit before applying a Hadamard gate is to use the qubit's state in the original circuit, though there may be other ways of computing such states. During re-synthesis we then transform the state to match the state in the original circuit before a particular Hadamard gate is applied.
In this sense, the Hadamard gates are fixed by the original circuit and the re-synthesis process needs to place the remaining terms of the phase (i.e. c i · f i (x 1 , x 2 , ..., x n+k ) in between them. One approach is to use the greedy nature of Algorithm 1 to maintain a partition of those functions that are computable in the current state space of the circuit. Specifically, we iterate through the Hadamard gates and for each one we partition any elements that are in the new state space -this step relies on the fact that Algorithm 1 is greedy to avoid having to repartition the elements that were already in the old state space. For any block in the partition containing functions that will not be computable after the next Hadamard gate, we remove the block and synthesize a {CN OT, T } circuit applying those phases. A more detailed description is given in Section 6.
While this method is heuristic, we note that the partition is always minimal for the set of currently computable functions. In particular, Algorithm 1 produces a minimal partition when given a minimal partition, and removing blocks from a partition does not affect minimality -given a subset P ′ of a minimal partition P , if there existed a partition P 0 of the elements in P ′ such that |P 0 | < |P ′ |, then P 0 ∪ P \ P ′ is a partition of the elements in P into fewer sets.
One problem arises in that the dimension of the state space may increase in the next subcircuit (e.g. if the Hadamard is applied to an ancilla qubit). In this case, the independence condition (1) of the matroid changes, and previous partitions may now be invalid under the new inequality. However, as a trivial consequence of the fact that the dimension increases by at most 1, a partition that is no longer independent can be modified to satisfy it by removing exactly one linearly dependent element. Furthermore, if all partitions are modified to satisfy the new independence condition in this way, the new partition is minimal and the elements that were removed can be repartitioned by Algorithm 1.
Lemma 6. Given a subspace V of F n 2 with dim(V ) = m and a set of linear Boolean functions S ⊆ V * , let
If P is a minimal partition of (S, I m ), then the partition P ′ defined by removing one linearly dependent element from every A ∈ P if A / ∈ I m+1 is a minimal partition of (S ′ , I m+1 ), where S ′ is the set of elements partitioned by P ′ .
Proof. Suppose there exists some partition P 0 of (S ′ , I m+1 ) with |P 0 | < |P ′ |. We then see that one element of S \ S ′ can be added to any A ∈ P 0 to give a set in I m . In particular, consider any A ∈ P 0 . Since m + 1 − rank(A) ≤ n − |A| we see that m − rank(A) ≤ n − |A| − 1 = n − |A ∪ {s}| for any s ∈ S \ S ′ . We also note that n − |A ∪ {s}| ≥ 0 as required, since any subset of S has rank at most m, so for any A ∈ I m+1 , d + 1 − rank(A) > 0 and thus |A| < n. Therefore, for any A ∈ P 0 , s ∈ S \ S ′ we have A ∪ {s} ∈ I m .
Next we note that for any T ⊆ S there exists a partition of (T, I m ) with size at most |T | − 1. This is a simple result of the fact that m < n, as any subset A ⊆ T of size 2 has rank at least 1, so
Additionally, any size 1 subset of T is trivially independent under I m .
Thus, since we can add one element to every partition in P 0 , and we can partition the remaining |S \ S ′ | − |P 0 | elements into at most |S \ S ′ | − |P 0 | − 1 partitions, we see that there exists a partition of (S, I m ) with size at most
Since |S \ S ′ | − 1 < |P | we obtain a contradiction.
The T par algorithm
In the last section we described a heuristic for optimizing T -count and depth over the gate set {H, CN OT, T }. In this section, we present the concrete algorithm, T par, and enlarge the gate set to include the Pauli gates
We ignore the irrelevant global phase i in Y = iXY , though it can be recovered by applying XP XP = iI to any qubit. Appendix A gives a demonstration of the T par algorithm on a simple circuit.
Recall that in the computational basis, the input space of a circuit with n qubits, n − m ancillae and k Hadamard gates is a dimension m subspace V of F n+k 2
. We represent the state of a qubit as a vector in the dual space of F , along with a Boolean value b, called the parity, which is used to record bit flips. Specifically, we note that X : |x → |1 ⊕ x , so we can model bit flips with a single parity bit. We denote the set of states F 2 × F (n+k) * 2 as S. Given a Clifford + T circuit C, written as a sequence of gates over {X, Y, Z, P, P † , H, CN OT, T, T † }, we first compute a triple S, Q, H ∈ D representing C. S = {(c, f )|c ∈ Z 8 , f ∈ S} stores the T k phase factors as linear Boolean functions with parity and multiplicity, Q = (g 1 , g 2 , ..., g n ) ∈ S n tracks the state of each qubit, and H = (h 1 , h 2 , ...h k ) gives a sequence of Hadamard gates where each entry h i stores the input and output states, h i .Q I and h i .Q O respectively. We define the initial state of the circuit as Q 0 = ((0, x 1 ), (0, x 2 ), ..., (0, x m ), (0, 0), ..., (0, 0)), which is understood as the state |x 1 x 2 ...x m |0 ⊗n−m . To compute S, Q, H , we use the function U : D → D ( Figure 5 ) to sequentially update the triple S, Q, H for each gate U in the circuit, starting from the initial value ∅, Q 0 , ∅ . The T par algorithm (Algorithm 2) proceeds as follows: after computing S, Q, H , we synthesize a new circuit by iterating through the Hadamard gates in H while updating a partition P of the functions of S that are computable in the current subcircuit. In particular, we divide S into S P and S −P , where S P are the already partitioned elements and S −P are those not already partitioned. For a given h i = {Q I , O O }, for every (c, f ) ∈ S −P we check whether f ∈ span(Q I ). If so, we add (c, f ) to the current partition using Algorithm 1 with the independence relation A ⊆ S ∈ I if and only if rank(Q I ) − rank(A) ≤ n − |A|. After partitioning, we update S P and S −P accordingly. The tests for inclusion of each function f in span(h i .Q I ) requires O |S −P | · (n + k) 3 time, and we can loosely bound the partitioning time as the time to partition the entire set S using Algorithm 1, O( |S| 3 · (n + k) 3 ; since |S −P | ≤ |S| the entire step thus takes O |S| 3 · (n + k) 3 time. A tighter analysis is possible, though we omit it as the algorithm is heuristic in nature.
We next iterate through P and for each block A ∈ P , if f / ∈ span(Q O ) for some (c, f ) ∈ A we remove A from the partition and synthesize a circuit computing the relevant phase factors. While we Figure 5 : Semantic function · . We define S ⊎ T as the union of S and T where any f such that (c 1 , f ) ∈ S, (c 2 , f ) ∈ T is given coefficient c 1 + c 2 mod 8. U i denotes the gate U applied to qubit i and CN OT (i,j) specifies i as the control qubit and j as the target.
defer the discussion of the synthesis procedure for now, we remark that it requires O (n + k) 3 time. Otherwise, if A is no longer an independent set under the new independence relation A ⊆ S ∈ I if and only if rank(Q O ) − rank(A) ≤ n − |A|, we remove a linearly dependent element from A and update S P and S −P so that the deleted element will be re-partitioned on the next iteration. As rank(A) and a linearly dependent element can both be found with one application of Gaussian elimination, this step also requires O (n + k) 3 time, and so the entire loop takes O |P | · (n + k) 3 time.
Finally, we synthesize a circuit applying the Hadamard gate in O (n + k) 3 time, and repeat the entire process for the next Hadamard. The entire algorithm, shown in Algorithm 2, thus runs in time
As |C| · n is in most cases negligible compared to the k · (n + k) 3 · |S| 3 factor, and |P | ≤ |S|, we describe the runtime as simply O k · |S| 3 · (n + k) 3 . Moreover, it should be noted that if no repartitioning is done, the runtime is bounded by O |S| 3 · (n + k) 3 , as each element is partitioned exactly once, rather than the worst case of k times in general.
Synthesizing partitions
We now describe the general synthesis procedure, SYNTHESIZE(A, Q I , Q O ), from Algorithm 2. The procedure synthesizes a circuit with inputs Q I and outputs Q O applying the phases given by a computable partition A of linear Boolean functions. The algorithm proceeds by first extending A with n − |A| linear Boolean functions to form a set A ′ with rank equal to rank(Q I ) -this is accomplished by using row operations to reduce A to a subset of Q I , then adding the vectors in Q I \ A. Next we synthesize a circuit computing A ′ by reducing Q I and A ′ to the same basis using Gaussian elimination in O((n + k) 3 ) time, where
addition of two rows corresponds to the application of a CN OT gate, and parity changes correspond to X gates. The circuit reducing Q I to this basis is applied forwards, while the circuit reducing A ′ is applied in reverse, giving a circuit mapping |Q I → |A ′ . The phase factors are applied by constructing a combination of T, P and Z gates, corresponding to the relevant coefficients, then the circuit mapping |Q I to |A ′ is reversed to compute |Q I . In the case when Q O = Q I , the corresponding Hadamard gate is applied to compute the output |Q O .
As alluded to before, we now see that the synthesis procedure has time complexity O (n + k) 3 since it requires a constant number (three) of applications of Gaussian elimination. Moreover, the T -depth of the resulting circuit is 1.
As an important practical issue, Gaussian elimination based synthesis produces linear reversible circuits that are non-optimal in terms of the number of gates or depth, resulting in a potential increase in the number of CN OT gates after re-synthesizing, as compared to the original design. While our focus was on the optimization of T gates, there exist algorithms, [20, 25] , that produce more efficient circuits for linear reversible functions. Specifically, [25] provides an algorithm to synthesize linear reversible circuits with Θ(n 2 / log(n)) gates, and [20] reports an O(n)-depth algorithm. More recently, [17] described an optimization procedure for stabilizer circuits that could be applied afterward to further optimize linear reversible and Clifford group subcircuits. In practical implementations an advanced linear reversible synthesis algorithm should be used. Compared to the optimization of T -depth, we considered the optimization of the linear reversible circuit stages to be a second order improvement and did not pursue it in this work.
Results
We implemented Algorithm 2 in C++ 3 and used it to optimize various quantum circuits, specifically arithmetic and reversible ones, from the literature. Individual circuits were written in the standard fault-tolerant universal gate set {X, Y, Z, H, P, P † , CN OT, T, T † }, using the decompositions found in [2] where applicable. As most arithmetic circuits are dominated by Toffoli gates, we used the lowest T -depth implementation of the Toffoli without ancillae known [2] .
Results are reported in Tables 1 and 2 . They were generated in Debian Linux running on a quad-core 64-bit Intel Core i7 2.40GHz processor and 8 GB RAM. Table 1 gives gate counts for the circuits before and after optimization. Table 2 shows T -depths before and after optimization using either 0, n, or ∞ ancillae 4 where n denotes the original number of qubits in the circuit. The T -depth for each circuit before optimization was computed by parallelizing the T -gates and Toffoli gates by hand, and writing each group of parallel Toffoli gates in T -depth 3.
With no extra ancillae added, all the tested benchmarks show significant reductions in terms of both T -count and T -depth, with average reductions of 39.9% and 54.3% respectively. The algorithm is particularly effective in cases where adjacent Toffoli gates share either controls or targets, as many of the phases cancel -each of the GF(2 m ) multipliers share this structure, and as a result show large reductions in T -count and T -depth. In fact, the T par algorithm will parallelize any GF(2 m ) multiplication circuit to T -depth 2 when given sufficiently many ancillae, by noting that each such circuit contains two stages of Toffoli gates that result in one {CN OT, T } stage each after removing trivial identities. By comparison, since the Toffoli gates cannot be all written in parallel, the minimum T -depth achievable using T -depth 1 Toffoli implementations [28] is 12(m−1). Those circuits that mix controls and targets between adjacent Toffoli gates are less affected by the optimization, e.g., CSUM-MUX 9 , as the Hadamard gates create barriers to T parallelization.
The runtimes show that algorithm scales well to large circuits, the largest tested circuit having 192 qubits, 28672 T gates and 8192 Hadamard gates. This stands in contrast to most previous efforts to optimize quantum circuits, which have generally been limited in usefulness to a few qubits and a small number of gates. While the inclusion of Hadamard gates adds significant complexity to the algorithm, it has actually reduced runtime of the algorithm compared to experiments where T par was applied only to {CN OT, T } subcircuits which is likely a result of the greater T -count reductions. As a result, T par appears to be an effective heuristic algorithm for the large-scale optimization of fault-tolerant circuits.
We also tested our algorithm's ability to make use of ancillae to optimize T -depth (Table 2 ). For each of the benchmark circuits, we applied our algorithm with an added n ancillae, where the original circuit contained n qubits. We also report the minimum T -depth achievable for each circuit using our algorithm. It can be noted that our algorithm usually decreases in running time when ancillae are added, due to the reduced number of partitions and thus faster matroid partitioning. Specifically, when there are many ancillae, for the majority of the time when an item s is partitioned it can be directly added to one of the partitions in time O(|P | · (n + k) 3 ). The algorithm is thus very flexible, and the experimental data illustrates a great potential for exploring space-time trade-off in quantum circuits.
As an important application, our experiments include instances of the multiple control Toffoli gates, Λ k (X), which are widely used in the construction of reversible circuits. We report the results using two different implementations -the Barenco et al. implementation using k − 2 ancillae with arbitrary initial states [3] , and the Nielsen-Chuang implementation using k − 2 ancillae initialized in the state |0 [24] . In both constructions the ancillae are returned to their initial state. Our optimization of the Barenco et al. version reduces the T -count from 7(4k − 8) to 3(4k − 8) + 4 and T -depth from 3(4k − 8) to 4k − 8 with unbounded ancillae, in the instances we tried. Likewise, our optimization of the Nielsen-Chuang implementation reduced the T -count from 7(2k − 3) to 4(2k − 3) + 3 and T -depth from 3(2k − 3) to 2k − 3. These formulae in fact hold for every k ≥ 3, a result of the simple structure of the circuits. As the two versions use 4k − 8 and 2k − 3 sequential Toffoli gates, respectively, we note that T par parallelizes each Toffoli to T -depth 1 when sufficiently many ancillae are available. Moreover, the reductions in T -count can be observed to correspond directly to each shared target or control -in this way, T par achieves the same T count and depth reductions as the multiply-controlled gate construction reported in [28] , but applies to more general cases and does not require implementing controls with the less intuitive Λ 2 (±iX) gates. As a final remark, we note that our algorithm reproduces many of the previous results regarding the optimization of T -depth. In particular, Figure 6 shows the circuit produced by running T par Table 1 : T -count benchmarks. We report the gate counts after optimizing circuits with T par, using no extra ancillae. N specifies the number of qubits. x C reports the number of CN OT gates, x T reports the number of T gates and x U reports the number of other gates. x ′ denotes the number of gates after optimization.
on an implementation of the Toffoli gate. The circuit mirrors the T -depth 1 Toffoli reported in [28] . Moreover, the full range of T -depths possible with different numbers of ancillae can be observed, as seen in Figure 7 . We also show a re-synthesized controlled-T gate [2] using one ancilla to reduce the T -depth from 5 to 3 (Figure 8) , and a re-synthesized Barenco et al. implementation of the Λ 3 (X) gate using no ancillae ( Figure 9 ).
Conclusion
We have described an algorithm for re-synthesizing Clifford + T circuits with reduced T -count and depth. The algorithm uses a circuit representation based on linear Boolean functions, allowing T gates to be combined and then parallelized through the use of matroid partitioning algorithms. The algorithm has worst case runtime that is cubic in the number of T gates, qubits, and Hadamard gates, though our experiments show that the algorithm is sufficiently fast for practical circuit sizes.
Our benchmarks (Tables 1 and 2) show that large gains can be obtained in reducing the T -count and T -depth of quantum circuits. In some cases, T -count was reduced by as much as 65.7%, while Table 2 : T -depth benchmarks. We report the T -depth after no optimization (original) and after optimization with 0 (cf. Table 1) , n, or unbounded ancillae.
the T -depth could be reduced by up to 87.6% without ancillae. Furthermore, the benchmarks illustrate that ancillae can be used to parallelize T gates further, and given the runtimes reported the algorithm can be seen to provide substantial flexibility in exploring the trade-off between ancilla usage and T -depth. In the most extreme case we were able to reduce the T -depth of GF(2 m )-Mult circuits from 12(m − 1) to a constant of 2, using unbounded ancillae. While the benchmarks were all arithmetic or otherwise reversible operations, such operations typically require the majority of the resources in circuits for quantum algorithms of interest [16] .
We close by noting that as a consequence of the T par algorithm, reducing the number of terms in the mixed arithmetic polynomials describing the phase corresponds directly to reducing the Tcomplexity of quantum circuits; in fact, it was observed that minimization of T -count in {CN OT, T } circuits is equivalent to minimizing the number of odd coefficients in the phase polynomial. A natural avenue of future work is then to develop methods for optimizing such polynomials for Tcount and depth. This work also represents the first instance, to the authors' knowledge, of the use of sum over paths style representations in quantum circuit synthesis and optimization. While this representation has proven effective in optimizing circuit T -count and T -depth, the questions of synthesis for more general phases, e.g. the sum over paths representation of {H, CN OT, T }, and of the precise form of phases synthesizeable over the "Clifford + T " gate set remain. Moreover, we leave it as a topic of future research to find new applications to optimization over different gate sets and cost metrics. Efficient practical synthesis of linear reversible circuits is another important direction that would directly contribute to improving the results of this work.
A Parallelizing (Λ 2 (X) ⊗ I)(I ⊗ Λ 2 (X))
In this section we illustrate the workings of the T par algorithm using the following circuit:
We first expand this circuit by using the T -depth 3 Toffoli gate implementation [2] :
Next we compute S, Q, H by applying the function · to each gate in sequence, starting with ∅, (x 1 , x 2 , x 3 , x 4 )) , ∅ . The result is S = x 1 , 2 · x 2 , x 5 , 7·(x 1 ⊕ x 2 ), 7·(x 1 ⊕ x 5 ), 7·(x 2 ⊕ x 5 ), (x 1 ⊕ x 2 ⊕ x 5 ), x 6 , x 7 , 7·(x 2 ⊕ x 6 ), 7·(x 2 ⊕ x 7 ), 7·(x 6 ⊕ x 7 ), (x 2 ⊕ x 6 ⊕ x 7 ) , Q = (x 1 , x 2 , x 6 , x 8 ), H = (h 1 , h 2 , h 3 , h 4 ) where h 1 = {Q I = (x 1 , x 2 , x 3 , x 4 ), Q O = (x 1 , x 2 , x 5 , x 4 )} , h 2 = {Q I = (x 1 , x 2 , x 5 , x 4 ), Q O = (x 1 , x 2 , x 6 , x 4 )} , h 3 = {Q I = (x 1 , x 2 , x 6 , x 4 ), Q O = (x 1 , x 2 , x 6 , x 7 )} , h 4 = {Q I = (x 1 , x 2 , x 6 , x 7 ), Q O = (x 1 , x 2 , x 6 , x 8 )} .
Starting with h 1 , we see that the terms x 1 , 2 · x 2 , 7 · (x 1 ⊕ x 2 ) are computable, so we partition them into blocks satisfying 4 − rank(A) ≤ 4 − |A|, giving P = {{x 1 , 2 · x 2 }, {7 · (x 1 ⊕ x 2 )}} . As neither partition will become uncomputable after h 1 , we simply apply the first Hadamard gate: Figure 9 : An optimized implementation of the Λ 3 (X) gate (CNOT stages optimized by templates [22] ).
At the second Hadamard gate, the path variable x 5 is available, so x 5 , 7 · (x 1 ⊕ x 5 ), 7 · (x 2 ⊕ x 5 ), x 1 ⊕ x 2 ⊕ x 5 are now computable. We add them to the partition to get P = {{x 1 , 2 · x 2 , 7 · (x 2 ⊕ x 5 )}, {7 · (x 1 ⊕ x 2 )}{x 1 ⊕ x 2 ⊕ x 5 , 7 · (x 1 ⊕ x 5 ), x 5 }} .
We trivially see that x 2 ⊕ x 5 / ∈ span({x 1 , x 2 , x 6 , x 4 }) and likewise neither is x 5 , so we synthesize a circuit computing the partitions {x 1 , 2 · x 2 , 7 · (x 2 ⊕ x 5 )} and {x 1 ⊕ x 2 ⊕ x 5 , 7 · (x 1 ⊕ x 5 ), x 5 } and allow {7 · (x 1 ⊕ x 2 )} to move past the second Hadamard gate.
Again, x 6 is now available, so insert the newly computable terms x 6 , 7 · (x 2 ⊕ x 6 ) into the current partition P = {{7 · (x 1 ⊕ x 2 )}} to get P = {{7 · (x 1 ⊕ x 2 ), x 6 , 7 · (x 2 ⊕ x 6 )}}. As the single partition block will still be computable after applying h 3 , we apply the next Hadamard gate:
We now add the last of the terms to the partition, x 7 , 7 · (x 6 ⊕ x 7 ), x 2 ⊕ x 6 ⊕ x 7 , giving {{7 · (x 1 ⊕ x 2 ), x 6 , 7 · (x 2 ⊕ x 6 ), x 7 }, {7 · x 2 ⊕ x 7 ), 7 · (x 6 ⊕ x 7 ), x 2 ⊕ x 6 ⊕ x 7 }} .
As both partitions contain the value x 7 which will be destroyed by h 4 , we apply the remaining partitions, followed by the last Hadamard gate:
The final circuit, shown above, reduces the original circuit by 2 T gates (from 14 to 12) and 2 levels of T -depth (from 6 to 4).
Note that the partitions used in the above are minimal, but are not the partitions Algorithm 1 actually produces. Instead, these partitions have been created to best demonstrate the algorithm. Additionally, for simplicity, we described states without parity, as there are no bit flip gates in this example.
