In this work, we introduce a new circuit optimization technique to reduce the number of T gates in Clifford+T circuits by treating T gates conjugated by Clifford gates as π/4 rotations around Pauli operators. The tested benchmarks show up to 71.43% and an average of 42.67% reduction in T-count, both surpassing the best performance previously reported. The worst-case complexity of our algorithm is $O(nk^2)$, where $n$ is the number of qubits and $k$ is the number of T gates in the original Clifford+T circuit.
I. INTRODUCTION
Much effort has been devoted to building the world's first meaningful quantum computer, which could help scientists and engineers develop new materials, from drugs to batteries, and solve optimization problems, from finance to logistics. In the past several years, prototypes of early-stage quantum computers have been demonstrated by universities and tech companies. To understand the ability of these noisy intermediate-scale quantum (NISQ) devices and the capability of future practical quantum computers, it is essential to develop efficient implementations of quantum algorithms, which are typically expressed in terms of quantum circuits. Quantum compilation, or more specifically circuit optimization, plays an important role in both regimes.
There are many factors to take into consideration when executing quantum circuits on NISQ devices. Just as in classical computer science, efficiency measures the execution time necessary for an algorithm to finish its job. The limited connectivity of existing quantum hardware has a significant impact on the cost of quantum algorithms. This raises a related problem called qubit allocation or quantum scheduling [1, 2]. Furthermore, there may be significant variation in the error rates of qubits and links, or in system reliability in general, on existing hardware [3]. Such variation makes it even more complicated to implement a given circuit on NISQ hardware with good performance.
Among all the above-mentioned factors, the execution time of a circuit is crucial, since it not only affects performance but also decides whether a circuit is realizable on hardware. Quantum states are very fragile due to decoherence, which limits the size of the quantum circuits that can be executed on hardware. As a demonstration, various circuit optimization passes were implemented and tested in a recent paper [4], and a measurable change in the performance of NISQ devices was observed.
Building NISQ devices is not our eventual goal. We expect to be able to protect quantum systems from various types of noise and to scale up quantum devices using quantum error correction techniques. In the regime of fault-tolerant quantum computing, one promising approach is to decompose logical operations on a surface code into Clifford+T circuits [5]. If we take the number of physical qubits involved (space cost) and the protocol implementation time (time cost) into consideration, then the T gate is much more expensive than Clifford gates. For example, in a distance-$d$ surface code, the space-time cost of a logical T gate is of order $O(d^3)$ when employing Bravyi-Haah codes [6-8]. Although the cost of Clifford gates is of the same order, the constant pre-factor is smaller by several orders of magnitude.
Various Clifford+T circuit optimization schemes have been proposed in the literature [9-12]. The algorithm of [9] computes the T-count exactly by a meet-in-the-middle brute-force search, which guarantees optimality but does not scale: it works for two qubits with m = 12, or three qubits with m = 6, where m denotes the T-count. It also introduces the notation R(±P), which is convenient for our approach. T-par [10] primarily optimizes T-depth by resynthesizing the T gates in each CNOT+T section, but it also improves T-count by cancelling pairs of T gates that are moved to the same place. It deals with the Hadamard gate H by synthesizing all T gates that "will not be computable" before synthesizing the H, which is not quite optimal because some of those T gates may become "computable" again after the next H gate. RM [11] further optimizes CNOT+T sections by showing that optimizing T-count in such circuits is equivalent to the minimum distance decoding problem of a certain Reed-Muller code. Although the decoding problem itself is hard, by using existing Reed-Muller decoders, RM is able to achieve an improved T-count on many benchmark circuits. The handling of H gates is still the same as in T-par. TOpt [12] uses the idea of gadgetization to deal with H gates: Hadamard gates are eliminated from the first part of the circuit using an ancilla beginning in the |+⟩ state, which is later measured and classically controls a Clifford gate (which may not be a Pauli, so the quantum controlled version may not be a Clifford). It also uses a new heuristic method on the resulting CNOT+T circuit to reduce T-count in polynomial time.
In this paper, we adapt the R(±P) notation used in [9] to regard any Clifford+T circuit as a series of π/4 rotations, then cancel pairs of T gates with the help of certain commutation relations. Compared to T-par [10], our method handles all Clifford gates in a unified way instead of having to consider H as a special case. This allows our method to achieve a strictly higher T-count reduction.
The tested benchmarks show up to 71.43% and an average of 42.67% reduction in T-count, both surpassing the best performance previously reported. It is worth mentioning that, unlike other T-count optimizers, which usually cause more than a 100% increase in CNOT-count, our optimization procedure does not increase CNOT-count while reducing T-count. Furthermore, one may apply our optimization procedure to circuits already optimized by other tools, whether they optimize CNOT-count or T-count, to obtain an even smaller T-count without affecting other performance parameters (CNOT-count).
The detailed benchmarking results can be found in Sec. VI.
II. PRELIMINARIES
In this section we establish the notation needed for presenting our work. Throughout this paper, we use $\mathcal{U}$ to denote a unitary gate and $U$ to denote the matrix that corresponds to $\mathcal{U}$.
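The Pauli operators on a single qubit are
$$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$
The set of Pauli operators on $n$ qubits is defined as $\mathcal{P}_n = \{\sigma_1 \otimes \cdots \otimes \sigma_n \mid \sigma_i \in \{I, X, Y, Z\}\}$, and $|\mathcal{P}_n| = 4^n$. We use subscripts to indicate the qubits on which an operator acts. For example, in a 3-qubit system, $Z_2 = I \otimes Z \otimes I$.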
The single-qubit Clifford group is generated by the Hadamard gate H and the phase gate S, where
$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad S = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}.$$
The $n$-qubit Clifford group $\mathcal{C}_n$ is generated by these two gates along with the CNOT gate ($|0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes X$), acting on any one or two of those $n$ qubits. It is worth mentioning that the Clifford+T universal gate set has attracted a lot of attention and has been widely studied in the literature [13, 14]. "Universal" here means that any multi-qubit unitary gate can be implemented to any given accuracy using a sequence of gates from the Clifford+T set. The number of T gates appearing in this sequence is defined as the T-count of that Clifford+T circuit representation. Similarly, we can define the CNOT-count.
Some T gates on distinct qubits may be performed simultaneously; we will say these T gates are in the same T-cycle. For a given Clifford+T representation, there may be different ways to group the T gates into T-cycles. The minimum number of T-cycles is defined as the T-depth of that representation.
A fundamental problem arises: given a quantum algorithm represented as a unitary gate, what is the most hardware-efficient way to implement it? As we explained in the introduction, much effort has been devoted to understanding hardware efficiency. The choice of cost function may depend on the architecture of the hardware.
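Let us assume that a unitary $U$ has an exact representation over the Clifford+T gate set. Thus there exist Clifford operators, interleaved with layers of T gates acting on different qubits, whose product is $U$. By substituting variables, there exist Clifford operators $C_i$ such that $U$ can be written as a product of factors indexed by $i$, together with a single remaining Clifford operator. Each factor is a sequence of T gates acting on different qubits conjugated by the Clifford operator $C_i$.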
III. T GATES AS π/4 ROTATIONS
The T gate can be defined as a π/4 phase rotation around the single-qubit Pauli Z:
$$T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix} = \frac{1 + e^{i\pi/4}}{2} I + \frac{1 - e^{i\pi/4}}{2} Z.$$
The key insight is that, when the T gate acting on the $i$-th qubit is conjugated by any multi-qubit Clifford $C$, it remains a π/4 phase rotation around a multi-qubit Pauli:
$$C T_i C^\dagger = R(C Z_i C^\dagger),$$
where we borrow the following notation from [9]:
$$R(P) = \frac{1 + e^{i\pi/4}}{2} I + \frac{1 - e^{i\pi/4}}{2} P.$$
By the definition of the Clifford group, $C Z_i C^\dagger \in \pm\mathcal{P}_n^*$, where $\mathcal{P}_n^* = \mathcal{P}_n \setminus \{I\}$. Also notice that
$$R(I) = I, \qquad R(-I) = e^{i\pi/4} I.$$
While those are not π/4 rotations, they are both the identity up to a phase, and we choose to allow them because they may be convenient in some circuit transformations. Another way of understanding $R(P)$ is that the set $\{R(P)\}$ is invariant under commutation with any Clifford $C$:
$$C R(P) = R(C P C^\dagger) C. \tag{7}$$
By applying (7) to shift all Clifford gates to the back, any Clifford+T circuit can be represented as a series of π/4 rotations followed by a Clifford (up to a global phase):
$$U = e^{i\phi} \left( \prod_{j=1}^{m} R(P_j) \right) C_0, \tag{20}$$
where $P_j \in \pm\mathcal{P}_n^*$, $C_0 \in \mathcal{C}_n$, and $m$ is the number of T gates in the circuit. This same form is also presented in [9], except that here we allow negative Paulis as $P_j$ (because they are sometimes convenient), and we do not assume that $m$ is the minimum T-count (because computing the minimum is likely to be a computationally hard problem).
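The identities of this section can be checked numerically. The following is a minimal sketch, assuming the form $R(P) = \frac{1}{2}(1 + e^{i\pi/4}) I + \frac{1}{2}(1 - e^{i\pi/4}) P$ for the notation borrowed from [9]; the helper names `rot` and `conj` are hypothetical and chosen only for this illustration.

```python
import numpy as np

W = np.exp(1j * np.pi / 4)                      # e^{i*pi/4}
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)                        # Hadamard gate
T = np.diag([1, W])                             # T gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)  # control = first qubit


def rot(pauli):
    """Assumed form of R(P) for a (possibly negated) Pauli matrix P."""
    dim = pauli.shape[0]
    return (1 + W) / 2 * np.eye(dim) + (1 - W) / 2 * pauli


def conj(c, m):
    """Return C M C^dagger."""
    return c @ m @ c.conj().T


checks = {
    "T = R(Z)": np.allclose(rot(Z), T),
    "R(I) = I": np.allclose(rot(I2), I2),
    "R(-I) = e^{i pi/4} I": np.allclose(rot(-I2), W * I2),
    # Conjugating T by H gives a pi/4 rotation around H Z H^dagger = X.
    "H T H^dagger = R(X)": np.allclose(conj(H, T), rot(X)),
    # Conjugating T on the target qubit by CNOT gives a rotation around Z (x) Z.
    "CNOT T_2 CNOT^dagger = R(ZZ)": np.allclose(
        conj(CNOT, np.kron(I2, T)), rot(np.kron(Z, Z))),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Each check is an instance of $C T_i C^\dagger = R(C Z_i C^\dagger)$ or of the two special cases $R(\pm I)$, evaluated on small matrices only.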
IV. A SIMPLE ALGORITHM FOR OPTIMIZING THE T-COUNT
Once we have written a Clifford+T circuit in the form of (20), there are several transformations we can apply to simplify it. The most obvious ones are:
$$R(P) R(-P) = e^{i\pi/4} I, \qquad R(P) R(P) = \frac{1 + i}{2} I + \frac{1 - i}{2} P, \tag{9}$$
i.e. two opposite π/4 rotations cancel each other (up to a global phase), and two identical π/4 rotations can be combined into a π/2 rotation, which is a Clifford (namely, it is $C S_1 C^\dagger$ for any Clifford $C$ such that $C Z_1 C^\dagger = P$) and can be shifted to the back with (7). Applying either transformation reduces the T-count of the circuit by 2.
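Both rules in (9) can be verified on concrete Paulis. The sketch below again assumes $R(P) = \frac{1}{2}(1 + e^{i\pi/4}) I + \frac{1}{2}(1 - e^{i\pi/4}) P$; the function names are hypothetical.

```python
import numpy as np

W = np.exp(1j * np.pi / 4)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)
S = np.diag([1, 1j])


def rot(pauli):
    """Assumed form of R(P) for a (possibly negated) Pauli matrix P."""
    dim = pauli.shape[0]
    return (1 + W) / 2 * np.eye(dim) + (1 - W) / 2 * pauli


def check_rule_9(pauli):
    """Check cancellation and merging for one Pauli matrix."""
    eye = np.eye(pauli.shape[0])
    cancels = np.allclose(rot(pauli) @ rot(-pauli), W * eye)
    merges = np.allclose(rot(pauli) @ rot(pauli),
                         (1 + 1j) / 2 * eye + (1 - 1j) / 2 * pauli)
    return cancels and merges


results = [check_rule_9(p) for p in (Z, X, np.kron(X, Z), -np.kron(Z, Z))]

# For P = X = H Z H^dagger, the merged rotation is the Clifford H S H^dagger.
merged_is_clifford = np.allclose(rot(X) @ rot(X), H @ S @ H.conj().T)
print(results, merged_is_clifford)
```

The cancellation leaves only the scalar $e^{i\pi/4}$, which is why it can be absorbed into the global phase of the circuit.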
Of course, it is rare for a Clifford+T circuit to have two π/4 rotations around the same Pauli directly adjacent to each other. However, it is not rare for two adjacent π/4 rotations to commute with each other. For example, if a group of T gates can be executed in parallel, then their corresponding π/4 rotations all commute pairwise. In the $R(P)$ representation, it is easy to determine which rotations commute:
$$[P, Q] = 0 \implies R(P) R(Q) = R(Q) R(P). \tag{10}$$
As it turns out, when we take (10) into account, there are many more opportunities to apply (9). Specifically, in the CNOT+T circuits studied in [10], all the "rotation axis" Paulis are either I or Z on every single qubit; hence, as [10] shows, they all commute with each other. By studying the Clifford+T circuit as a whole, however, we can find more commuting π/4 rotations than [10]. For example, in the Mod5_4 circuit, there are several appearances of $(I \otimes H)\,\mathrm{CNOT}\,(I \otimes H)$, which is really a CZ gate in disguise. It is possible to rewrite the CZ gate as a CNOT+S circuit:
$$CZ = (S \otimes S)\,\mathrm{CNOT}\,(I \otimes S^\dagger)\,\mathrm{CNOT}. \tag{11}$$
If we apply this transformation before feeding the circuit to the algorithm of [10], then it correctly recognizes that there is a CNOT+T section of the circuit encompassing all the T gates, and optimizes the T-count down to 8, as opposed to 16 when this transformation is not applied. Below, we describe our algorithm that applies (9) and (10) to reduce the T-count. We note that our algorithm does not explicitly use any transformation like (11); instead, because it first transforms the circuit into the form of (20), it treats all equivalent Clifford circuits in the same way, ensuring that the representation of the CZ circuit does not affect the result.
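As a sanity check on the CZ discussion, the following sketch verifies numerically that $(I \otimes H)\,\mathrm{CNOT}\,(I \otimes H)$ equals CZ, and that CZ can be written with CNOT and S gates only. The particular CNOT+S decomposition used here is one valid choice picked for this illustration; it is an assumption and not necessarily the decomposition the authors had in mind.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.diag([1, 1j])
S_DAG = S.conj().T
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)  # control = first qubit
CZ = np.diag([1, 1, 1, -1]).astype(complex)

# CNOT sandwiched between Hadamards on the target is a CZ gate.
hadamard_form = np.kron(I2, H) @ CNOT @ np.kron(I2, H)

# One CNOT+S circuit for CZ (illustrative choice): the phase on |a, b> is
# i^a * i^b * (-i)^(a XOR b), which is -1 only for a = b = 1.
cnot_s_form = np.kron(S, S) @ CNOT @ np.kron(I2, S_DAG) @ CNOT

print(np.allclose(hadamard_form, CZ), np.allclose(cnot_s_form, CZ))
```

Because both expressions are the same Clifford operator, a method that works on the rotation form of the whole circuit sees no difference between them.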
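1. Let $U_0$ be an empty circuit, and $U_1$ be the input circuit transformed into the form of (20), with the phase factor $e^{i\phi}$ removed.
   • Over the course of the algorithm, both $U_0$ and $U_1$ will be kept in the form of (20):
   $$U_0 = \left( \prod_j R(Q_j) \right) C_0, \qquad U_1 = \left( \prod_j R(P_j) \right) C_1.$$
   • The product $U_0 U_1$ is a loop invariant (up to a global phase).
2. While the $R(P)$ product part of $U_1$ is not empty:
   (a) Remove the leftmost factor $R(P)$ from $U_1$, and insert $R(C_0 P C_0^\dagger)$ to the left of $C_0$ in $U_0$.
   (b) Scan through each other factor $R(Q)$ in $U_0$, from right to left, stopping when any of the following is true:
       • $Q = \pm C_0 P C_0^\dagger$, so that (9) can be applied;
       • $[Q, C_0 P C_0^\dagger] \neq 0$, so that (10) cannot be used to move past $R(Q)$.
   (c) If we found some $R(Q)$ such that $Q = \pm C_0 P C_0^\dagger$:
       • If $Q = -C_0 P C_0^\dagger$, remove both $R(Q)$ and $R(C_0 P C_0^\dagger)$ from $U_0$;
       • If $Q = C_0 P C_0^\dagger$, remove both factors from $U_0$ and let $C_0 \leftarrow R(Q)^2 C_0$.
3. Let $C_0 \leftarrow C_0 C_1$, and $C_1 \leftarrow I$.
4. Transform $U_0$ back into a Clifford+T circuit, and return it as the optimized circuit.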
The bottleneck of this algorithm is step 2(b); in the worst case, for each $R(P)$ factor in the input circuit, we need to scan through every $R(Q)$ factor in $U_0$, checking equality and commutativity each time. Therefore the worst-case complexity is $O(k^2 n)$, where $n$ is the number of qubits and $k$ is the number of T gates in the original circuit.
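The scanning loop can be sketched on signed Pauli strings. This is an illustrative reading of the algorithm under several assumptions, not the authors' implementation: the input is taken to be already in rotation form with $C_1 = I$, each rotation is a hypothetical `(sign, string)` pair, the scan runs from right to left and stops at the first equal or non-commuting factor, and $C_0$ is stored as a list of Paulis $Q$ standing for the Cliffords $R(Q)^2$. It also assumes $R(P) = \frac{1}{2}(1 + e^{i\pi/4}) I + \frac{1}{2}(1 - e^{i\pi/4}) P$, which gives $R(Q)^2 P R(Q)^{-2} = -iQP$ whenever $P$ and $Q$ anticommute.

```python
import numpy as np


def commute(p, q):
    """Pauli strings commute iff they clash on an even number of qubits."""
    clashes = sum(a != "I" and b != "I" and a != b for a, b in zip(p, q))
    return clashes % 2 == 0


def mul1(a, b):
    """Single-qubit product a*b as (power of i, resulting letter)."""
    if a == "I":
        return 0, b
    if b == "I":
        return 0, a
    if a == b:
        return 0, "I"
    letter = ({"X", "Y", "Z"} - {a, b}).pop()
    return (1 if a + b in ("XY", "YZ", "ZX") else 3), letter


def conj_half_pi(q, p):
    """Return D P D^dagger for D = R(Q)^2, with q, p as (sign, string)."""
    if commute(q[1], p[1]):
        return p
    power, letters = 3, []                     # prefactor -i = i^3
    for a, b in zip(q[1], p[1]):
        k, c = mul1(a, b)
        power += k
        letters.append(c)
    sign = q[0] * p[0] * (1 if power % 4 == 0 else -1)
    return sign, "".join(letters)


def reduce_t_count(rotations):
    """Return (kept rotations, Paulis Q whose R(Q)^2 make up C_0)."""
    kept, cliffords = [], []
    for p in rotations:
        for q in cliffords:                    # P <- C_0 P C_0^dagger
            p = conj_half_pi(q, p)
        hit = None
        for idx in range(len(kept) - 1, -1, -1):
            if kept[idx][1] == p[1]:           # same axis up to sign: rule (9)
                hit = idx
                break
            if not commute(kept[idx][1], p[1]):
                break                          # rule (10) cannot pass this one
        if hit is None:
            kept.append(p)
            continue
        q = kept.pop(hit)
        if q[0] == p[0]:                       # equal signs merge into R(Q)^2
            cliffords.append(q)
    return kept, cliffords


MATS = {"I": np.eye(2), "X": np.array([[0, 1], [1, 0]]),
        "Y": np.array([[0, -1j], [1j, 0]]), "Z": np.diag([1, -1])}


def rot(p):
    """Matrix of R(P) for a (sign, string) pair."""
    m = np.array([[1.0 + 0j]])
    for c in p[1]:
        m = np.kron(m, MATS[c])
    w = np.exp(1j * np.pi / 4)
    return (1 + w) / 2 * np.eye(len(m)) + (1 - w) / 2 * p[0] * m


def unitary(n, rotations, cliffords=()):
    """Matrix of (prod R(P_j)) C_0, where C_0 <- R(Q)^2 C_0 for each Q."""
    u = np.eye(2 ** n, dtype=complex)
    for p in rotations:
        u = u @ rot(p)
    c0 = np.eye(2 ** n, dtype=complex)
    for q in cliffords:
        c0 = rot(q) @ rot(q) @ c0
    return u @ c0


circuit = [(1, "ZI"), (1, "IZ"), (-1, "ZI"), (1, "XI"),
           (1, "IZ"), (1, "XZ"), (1, "IX")]
kept, cliffords = reduce_t_count(circuit)
overlap = abs(np.trace(unitary(2, circuit).conj().T
                       @ unitary(2, kept, cliffords)))
print(len(circuit), "->", len(kept), "rotations; overlap =", round(overlap, 6))
```

In this toy input the seven rotations shrink to three, and the overlap equal to the dimension (4) indicates that the reduced circuit matches the original up to a global phase. Each scan step compares two length-$n$ strings, which is where the $O(k^2 n)$ bound comes from.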
V. THE STRUCTURE OF π/4 ROTATIONS AS A DAG AND APPLICATION TO T-DEPTH OPTIMIZATION
As we have seen, the commutation relation (10) proves quite useful in manipulating Clifford+T circuits. Although it does not decrease the T-count on its own, it enables other rules like (9) to be applied. Furthermore, reordering T gates may be interesting for its own sake if we consider other goals of circuit optimization, such as minimizing the T-depth.
Since (10) just swaps adjacent π/4 rotations, applying (10) repeatedly is equivalent to applying some permutation to all the π/4 rotations in a circuit. However, it may not be immediately obvious which permutations are permissible. Fortunately, there is a simple description for the set of permissible permutations if we regard the structure of π/4 rotations as a directed acyclic graph (DAG).
Definition 1. The T-graph of a circuit $U$ in the form of
$$U = \left( \prod_{j=1}^{m} R(P_j) \right) C_0 \tag{13}$$
is defined as $G_T(U) = (V, E)$, where $V = \{v_j\}$ for $j = 1, \cdots, m$, and
$$E = \{(v_i, v_j) \mid i < j,\ [P_i, P_j] \neq 0\},$$
i.e. $G_T(U)$ has a vertex for each π/4 rotation in $U$, and an edge for each pair of π/4 rotations whose Paulis anticommute with each other, with the direction of the edge determined by the order in which the two rotations appear in the original circuit.
The T-graph $G_T(U)$ of a circuit is always a DAG, since the order $v_1, \cdots, v_m$ is a topological ordering of $G_T(U)$. In fact, we have the following stronger result:
Theorem 1. By applying only (10), a circuit in the form of (13) can be transformed into
$$\left( \prod_{j=1}^{m} R(P_{p_j}) \right) C_0 \tag{15}$$
if and only if $v_{p_1}, \cdots, v_{p_m}$ is a topological ordering of $G_T(U)$.
Proof. First, suppose that repeated applications of (10) can indeed transform (13) into (15). Let $e = (v_{p_i}, v_{p_j})$ be any edge of $G_T(U)$. By the definition of $G_T(U)$, in the original circuit (13):
• $R(P_{p_i})$ appears before $R(P_{p_j})$;
• $[P_{p_i}, P_{p_j}] \neq 0$.
The latter condition means that applications of (10) cannot change the relative order of $R(P_{p_i})$ and $R(P_{p_j})$, so in (15), $R(P_{p_i})$ still appears before $R(P_{p_j})$, i.e. $i < j$. Since this argument is valid for every edge $e = (v_{p_i}, v_{p_j})$ of $G_T(U)$, it follows that $v_{p_1}, \cdots, v_{p_m}$ is a topological ordering of $G_T(U)$.
Conversely, suppose that $v_{p_1}, \cdots, v_{p_m}$ is indeed a topological ordering of $G_T(U)$. Then (13) can be transformed into (15) by doing a bubble sort on the π/4 rotations until they match the order $R(P_{p_1}), \cdots, R(P_{p_m})$. At each time step, such a bubble sort will only swap two adjacent rotations, $R(P_{p_j})$ and $R(P_{p_i})$, if both of the following are true:
• $p_j < p_i$ (so in (13) $R(P_{p_j})$ comes before $R(P_{p_i})$);
• $i < j$ (so in (15) $R(P_{p_i})$ comes before $R(P_{p_j})$).
The first condition guarantees that $(v_{p_i}, v_{p_j})$ is not an edge of $G_T(U)$, and the second condition, together with the assumption that $v_{p_1}, \cdots, v_{p_m}$ is a topological ordering, guarantees that $(v_{p_j}, v_{p_i})$ is not an edge of $G_T(U)$ either. Therefore $[P_{p_i}, P_{p_j}] = 0$, and (10) can indeed be applied to make the swap.
We note that the "only if" direction of Theorem 1 allows only applications of (10). For example, both $R(Z)R(-Z)R(X)R(-X)$ and $R(Z)R(X)R(-X)R(-Z)$ evaluate to the identity (up to a global phase) by (9), so they are indeed equivalent, but that is not covered by Theorem 1.
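The T-graph and the ordering criterion of Theorem 1 are straightforward to compute from Pauli strings. The sketch below is illustrative only: signs are dropped because they do not affect commutation, vertices are 0-indexed rather than 1-indexed, and the function names are hypothetical.

```python
def commute(p, q):
    """Pauli strings commute iff they clash on an even number of qubits."""
    clashes = sum(a != "I" and b != "I" and a != b for a, b in zip(p, q))
    return clashes % 2 == 0


def t_graph(paulis):
    """Edges (i, j), i < j, for every anticommuting pair of rotation axes."""
    m = len(paulis)
    return {(i, j) for i in range(m) for j in range(i + 1, m)
            if not commute(paulis[i], paulis[j])}


def is_topological(order, edges):
    """True iff every edge points forward in the given vertex order."""
    position = {v: k for k, v in enumerate(order)}
    return all(position[i] < position[j] for i, j in edges)


paulis = ["ZI", "IZ", "XI", "ZZ"]
edges = t_graph(paulis)
print(sorted(edges))
print(is_topological([1, 0, 2, 3], edges))   # reachable with rule (10) alone
print(is_topological([2, 0, 1, 3], edges))   # would move XI ahead of ZI
```

Here the only anticommuting pairs are (ZI, XI) and (XI, ZZ), so the rotation around IZ is free to move anywhere while the other three must keep their relative order.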
We have alluded to the relation between the representation (20) and the T-depth of a circuit when we mentioned that T gates that can be applied in parallel always translate to commuting π/4 rotations. Unfortunately, the converse is not always true: a group of pairwise commuting π/4 rotations cannot always be translated into a group of T gates applied in parallel. Instead, we need an extra condition:
Theorem 2. A group of π/4 rotations $\prod_{j=1}^{m} R(P_j)$ can be translated into a Clifford+T circuit with T-depth 1, without ancilla qubits, if both of the following conditions hold:
• All $P_j$ commute with each other;
• There does not exist a non-empty subset $S$ of $\{1, \cdots, m\}$ such that $\prod_{j \in S} P_j = \pm I$.
Proof. Those two conditions are exactly the conditions under which there exists a Clifford $C \in \mathcal{C}_n$ such that $C P_j C^\dagger = Z_j$ holds for $j = 1, \cdots, m$. The circuit is then given by
$$\prod_{j=1}^{m} R(P_j) = C^\dagger \left( \prod_{j=1}^{m} T_j \right) C.$$
Notice that translating a T-depth 1 circuit back into π/4 rotations with (7) will indeed result in a circuit in the form of (20) satisfying the conditions of Theorem 2. However, as written, the converse of Theorem 2 is not true, because a circuit that initially does not satisfy those conditions may be simplified or otherwise transformed to satisfy them.
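The two conditions of Theorem 2 can be tested mechanically. A minimal sketch, under the assumption that Paulis are given as unsigned strings: a subset of commuting Paulis multiplies to $\pm I$ exactly when the corresponding binary (x, z) vectors sum to zero over GF(2), so the second condition becomes a linear-independence test. The function names are hypothetical.

```python
def commute(p, q):
    """Pauli strings commute iff they clash on an even number of qubits."""
    clashes = sum(a != "I" and b != "I" and a != b for a, b in zip(p, q))
    return clashes % 2 == 0


def to_bits(p):
    """Encode a Pauli string as an integer with two bits (z, x) per qubit."""
    code = {"I": 0, "X": 1, "Z": 2, "Y": 3}    # Y = X xor Z up to phase
    value = 0
    for c in p:
        value = (value << 2) | code[c]
    return value


def gf2_rank(values):
    """Rank over GF(2) of integers viewed as bit vectors (XOR basis)."""
    basis = []
    for v in values:
        for b in basis:
            v = min(v, v ^ b)                  # clear the leading bit of b
        if v:
            basis.append(v)
    return len(basis)


def meets_theorem2_conditions(paulis):
    """Pairwise commuting, and no non-empty subset multiplies to +-I."""
    pairwise = all(commute(p, q)
                   for i, p in enumerate(paulis) for q in paulis[i + 1:])
    independent = gf2_rank([to_bits(p) for p in paulis]) == len(paulis)
    return pairwise and independent


for group in (["ZI", "IZ"], ["ZI", "IZ", "ZZ"], ["XX", "ZZ"], ["XI", "ZI"]):
    print(group, meets_theorem2_conditions(group))
```

The group ZI, IZ, ZZ commutes pairwise but fails the second condition because the three Paulis multiply to the identity, which is the kind of case where ancilla qubits become useful.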
As [10] points out, when the goal is to minimize T-depth, additional ancilla qubits initialized in the $|0\rangle$ state may be useful. However, they are also tricky: when using ancilla qubits to implement a unitary, they must return to a constant state at the end of the circuit, or otherwise the output state may be entangled with the ancilla instead of being a pure state.
An obvious way to satisfy this condition is to ensure that, when the circuit is transformed into the form of (20), all the ancilla qubits stay in the state $|0\rangle$ throughout the circuit. It turns out that this is more or less what [10] does, too.
Lemma 1. Let $P = P_0 \otimes Q$, where $P \in \mathcal{P}_{n+t}$, $P_0 \in \mathcal{P}_n$, and $Q \in \mathcal{P}_t$. Then the following statements are equivalent:
The left hand side can be expanded as follows:
The last line should be equal to $|\varphi\rangle \otimes |0\rangle^{\otimes t}$. Equivalently, both of the following must hold:
The desired statement, $Q \in \{I, Z\}^{\otimes t}$, follows from (18).
(3) ⇒ (4): Notice that, in (19), the left hand side evaluates to $R(P_0)|\psi\rangle$. The rest easily follows.
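By repeated application of "(3) ⇒ (4)", we immediately get an easy way to incorporate ancilla qubits into our circuits:
Corollary 1. A circuit in the form of (20) implementing the unitary $U$ can always be extended with $t$ ancilla qubits as follows:
$$U' = e^{i\phi} \left( \prod_{j=1}^{m} R(P_j \otimes Q_j) \right) (C_0 \otimes I^{\otimes t}),$$
where each $Q_j \in \{I, Z\}^{\otimes t}$ can be chosen freely.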
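Theorem 3. Adding ancilla qubits to a circuit $U$ using Corollary 1 does not change its T-graph $G_T(U)$.
Proof. It suffices to show that $G_T(U)$ and $G_T(U')$ have the same set of edges, which is equivalent to showing that
$$[P_i \otimes Q_i, P_j \otimes Q_j] = 0 \iff [P_i, P_j] = 0.$$
This holds because all $Q_j \in \{I, Z\}^{\otimes t}$ commute with each other.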
Now we revisit Theorem 2. With the help of ancilla qubits, we can remove the second condition.
Theorem 4. With the help of ancilla qubits, a group of π/4 rotations $\prod_{j=1}^{m} R(P_j)$ can be translated into a Clifford+T circuit with T-depth 1 if all $P_j$ commute with each other.
Proof. We apply Corollary 1 to extend the group of rotations into $\prod_{j=1}^{m} R(P_j \otimes Q_j)$, where $Q_j = Z_j \in \{I, Z\}^{\otimes m}$. Now the second condition of Theorem 2 is naturally satisfied, and the first condition of Theorem 2 continues to be satisfied due to the proof of Theorem 3. Hence an application of Theorem 2 to the extended group of rotations gives the desired result.
For some Clifford+T circuits, the combination of the commutation relation (10) and Theorem 4 is a powerful tool for optimizing the T-depth (with an unlimited number of ancilla qubits).
Theorem 5. Using only (10) and Theorem 4, the minimum T-depth that can be achieved for a circuit $U$ is exactly equal to the length of (i.e. the number of vertices on) the longest path in $G_T(U)$.
Proof. First, we show that the minimum T-depth cannot be less than the length of the longest path in $G_T(U)$. Suppose the longest path in $G_T(U)$ is $v_{j_1}, v_{j_2}, \cdots, v_{j_k}$. When using (10) to swap the rotations, the relative order of $R(P_{j_1}), R(P_{j_2}), \cdots, R(P_{j_k})$ cannot change; then, when using Theorem 4 to group the rotations into layers, each of them must go into a different layer, since the rotations in the same layer must pairwise commute. Therefore the T-depth that can be achieved this way is at least $k$, the length of the longest path.
Next, we show that a T-depth of $k$ is indeed achievable. To this end, we first sort the vertices of $G_T(U)$ in ascending order of the length of the longest path ending at each vertex. Since the minimum length of such a path is 1, and the maximum is $k$, this essentially divides the vertices of $G_T(U)$ into $k$ layers. Edges in $G_T(U)$ can only go from a layer to a later layer, not to any prior layer nor to the same layer. Therefore, this sort is a topological ordering, and by Theorem 1, we can reorder the π/4 rotations in $U$ accordingly. The fact that there does not exist any edge between vertices in the same layer also means that all rotations corresponding to those vertices commute with each other, so by applying Theorem 4, we can transform each layer into a T-layer. The end result is a Clifford+T circuit with T-depth $k$.
Again, Theorem 5 restricts the rules of transformation allowed. In particular, cancellation with (9) is not taken into account. In our experiments, we utilize cancellation by applying the algorithm described in Sec. IV before applying Theorem 5; this is also similar to the approach taken in [10].
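The layering used in the proof of Theorem 5 can be sketched as a single pass over the rotations, assigning each one the length of the longest anticommuting chain that ends at it. This is an illustrative sketch on unsigned Pauli strings with 0-indexed positions; `t_layers` is a hypothetical name, and the translation of each layer into actual T gates (Theorem 4) is not shown.

```python
def commute(p, q):
    """Pauli strings commute iff they clash on an even number of qubits."""
    clashes = sum(a != "I" and b != "I" and a != b for a, b in zip(p, q))
    return clashes % 2 == 0


def t_layers(paulis):
    """Group rotation indices by the longest T-graph path ending at each."""
    depth = []
    for j, p in enumerate(paulis):
        earlier = [depth[i] for i in range(j) if not commute(paulis[i], p)]
        depth.append(1 + max(earlier, default=0))
    layers = [[] for _ in range(max(depth, default=0))]
    for j, d in enumerate(depth):
        layers[d - 1].append(j)
    return layers


paulis = ["ZI", "IZ", "XI", "IX", "ZZ"]
layers = t_layers(paulis)
print(layers)
print("T-depth:", len(layers))
```

For this input the five rotations fall into three layers, and every pair inside a layer commutes, which is the property Theorem 4 needs in order to turn the layer into one T-cycle.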
VI. BENCHMARKING
In this section, we benchmark our T-Optimizer software against two leading Clifford+T optimizers, T-par and TOpt. The specifications of the circuit inputs that T-par and TOpt take are slightly different. T-Optimizer takes a circuit in the .qc format that T-par accepts. A description of the .qc format can be found at https://github.com/aparent/QCViewer. For the sake of readability, we briefly explain the basics of a .qc file, which serves as a textual representation of what the circuit consists of. In the header, .v defines the names of the circuit qubits, while .i and .o specify which qubits accept primary inputs and report primary outputs, respectively. In the body, H refers to the Hadamard gate and tof refers to the Toffoli gate. The identifiers following the gate names refer to the qubits the gate is applied to. For instance, the line H a indicates a Hadamard gate on the qubit named a. In the case where there is more than one qubit, as with the Toffoli gate, the rightmost one is the target qubit and the rest are the control qubits.
As mentioned in [12], T-count does not account for the full space-time cost of quantum computation. Thus, in our benchmarking results, we present both the CNOT-count and the T-count. The other two leading circuit optimization packages, T-par and TOpt, often produce output circuits with increased CNOT-count. We are also aware that these two open-source packages have remained under active development since their results were reported in [10-12]. For some instances we tested, the output is actually better than the reported numbers. Thus we choose to benchmark T-Optimizer against these two open-source packages rather than against the reported results. TOpt has a heuristic subroutine, so we run each instance 100 times and pick the best pair (CNOT-count, T-count) in the sense of reverse lexicographical order on the Cartesian product; if it is worse than the reported number, we run it 100 more times.
VII. SUMMARY AND FUTURE WORK
In this work, we present a new optimization technique to reduce the number of T gates in Clifford+T circuits by treating every T gate conjugated by Clifford operators as a π/4 rotation around a Pauli operator. For benchmark circuits like Adder_8 and Mod5_4, T-Optimizer removes significantly more T gates than any other circuit optimization software.
As we learn through the benchmarking result, all these benchmarked quantum circuit optimizers have "sweet spots" in which their performance has no equal. A unified framework of circuit optimization would be very interesting.
Another avenue of work is to start from the original unitary gate $U$, which may not be exactly representable over Clifford+T. Many attempts have been made to address this fundamental open question. An algorithm based on exhaustive search was presented in [25]. Given its exponential runtime, it is hard to use in practice for reasonably small accuracy $\epsilon$. For quantum algorithms based on phase estimation, phase kickback tricks introduce ancillary qubits to perform phase gates [26]. The Solovay-Kitaev algorithm and its variants are also well studied in the literature [27]. The algorithm runs in $O(\log^{2.71}(1/\epsilon))$ time and produces a quantum circuit of size $O(\log^{3.97}(1/\epsilon))$ to approximate the desired unitary gate up to accuracy $\epsilon$. In the past several years, several efficient algorithms with improved circuit size (compared with the Solovay-Kitaev algorithm) have been proposed for single-qubit unitary approximation [28-31]. For instance, [30] presents an efficient algorithm that achieves T-counts of about $10 + 4\log_2(1/\epsilon)$, which is close to the information-theoretic lower bound of $K + 3\log_2(1/\epsilon)$. However, to efficiently approximate general multi-qubit unitary gates, there is still ample room for improvement in circuit size.
Note added: Simultaneously with our results, Kissinger and van de Wetering demonstrated a method to reduce the number of T gates in a quantum circuit by representing the circuit as a ZX-diagram and then using the phase teleportation technique [32]. Surprisingly, the KW method produces benchmarking results identical to ours, and neither method changes the number or locations of any non-phase gates.
TABLE I: We report the CNOT-count and T-count after no optimization (original) and after T-par, TOpt and T-Optimizer optimizations with no ancillae.
