Polynomial-time T-depth Optimization of Clifford+T circuits via Matroid
  Partitioning by Amy, Matthew et al.
arXiv:1303.2042v2 [quant-ph] 13 Dec 2013
Polynomial-time T-depth Optimization of Clifford+T circuits via
Matroid Partitioning
Matthew Amy,1 Dmitri Maslov,2 and Michele Mosca3,4
1 Department of Computer Science
University of Toronto, Toronto, Ontario, Canada
2 National Science Foundation
Arlington, Virginia, USA
3 Institute for Quantum Computing, and Dept. of Combinatorics & Optimization
University of Waterloo, Waterloo, Ontario, Canada
4 Perimeter Institute for Theoretical Physics
Waterloo, Ontario, Canada
December 16, 2013
Abstract
Most work in quantum circuit optimization has been performed in isolation from the results of
quantum fault-tolerance. Here we present a polynomial-time algorithm for optimizing quantum
circuits that takes the actual implementation of fault-tolerant logical gates into consideration.
Our algorithm re-synthesizes quantum circuits composed of Clifford group and T gates, the
latter being typically the most costly gate in fault-tolerant models, e.g., those based on the
Steane or surface codes, with the purpose of minimizing both T -count and T -depth. A major
feature of the algorithm is the ability to re-synthesize circuits with additional ancillae to reduce
T -depth at effectively no cost. The tested benchmarks show up to 65.7% reduction in T -count
and up to 87.6% reduction in T -depth without ancillae, or 99.7% reduction in T -depth using
ancillae.
1 Introduction
Quantum computers have the potential to efficiently solve important computational problems,
including integer factorization [29] and quantum simulation [18], for which there are no known efficient
classical algorithms. However, even with recent advances in quantum information processing
technologies [6, 7, 8, 26], the prospects of scalable quantum computing without some systematic way
of mitigating physical errors and noise are bleak.
The active and vibrant fields of quantum error correction and fault-tolerance provide such
tools for constructing scalable quantum computers. By combining physical qubits through the
use of error correcting codes and providing fault-tolerant logical operations, larger computations
can be achieved with high fidelity – by concatenating codes, or in topological codes by increasing
code distance – provided the physical operations achieve a certain threshold fidelity. With recent
improvements to fault-tolerant thresholds [5, 14, 15], scalable quantum computation is becoming
more and more viable, resulting in a growing need for efficient automated design tools targeting
fault-tolerant quantum computers.
Quantum circuit synthesis and optimization is particularly important, given the prevalence of
the circuit model of quantum computation, but previous work has been largely isolated from the
unique concerns of fault-tolerance. While at the physical level, coupled gates are generally the
hardest to perform, most of the common quantum error-correcting codes have efficient CNOT
implementations. Moreover, for fault-tolerant models based on (double even, self-dual) CSS codes,
e.g., the popular Steane code, as well as the promising surface codes, the Clifford group can be
implemented as logical gates with little cost [14, 33].
For universal quantum computing, however, at least one non-Clifford group gate is needed,
which typically requires large ancilla factories and gate teleportation to implement fault-tolerantly.
As the non-Clifford T gate has known constructions in most of the common error correction schemes,
the standard universal fault-tolerant gate set is taken to be “Clifford + T”. Given the high cost
of the fault tolerant implementations of the T gate [1, 14], exceeding the cost of Clifford group
gates by as much as a factor of a hundred or more, it has recently been proposed that efficient
circuits should minimize the number of T gates, and more specifically the number of T gates
that cannot be performed in parallel [2, 13] – we define these metrics as a circuit’s T -count and
T -depth, respectively. Indeed, Fowler [13] shows how to perform fault-tolerant computations in
time proportional to one round of measurement per layer of T gates, and as a result the T -depth
directly determines a circuit’s runtime. Likewise, reducing the number of T gates reduces the
number of ancilla states that require preparation, vastly reducing circuit volume and at the same
time increasing the fidelity of the computation. While the primary purpose of our work is to
optimize T -depth (circuit runtime), our algorithm also provides significant reductions to T -count
(circuit volume).
Some recent work has been done concerning minimization of T -depth [2, 28], though these
previous results focus on finding small optimal two- and three-qubit circuits [2], and on classes of
circuits that can be parallelized to T -depth 1 by adding ancillae [2, 28]. By contrast, we report a
scalable automated tool for the optimization of T -depth that functions with or without ancillae, and
is not limited to a few qubits or a specific class of circuits. In particular, we present a polynomial-
time algorithm for optimizing both the T -depth and T -count of quantum circuits composed of
Clifford group and T gates. The algorithm also makes automatic use of ancillae to optimize T -depth,
with the addition of ancillae typically decreasing the runtime of our software implementation. Our
experiments show on average 61.1% reduction in T -depth and 39.9% reduction in T -count without
adding any ancillae, using the available benchmarks. When the use of ancillae is allowed, the
average T -depth reduction is demonstrated to be as high as 80.7% (the more ancillae are allowed
the more parallelization becomes possible, in some cases reducing T -depth by as much as 99.7%).
The rest of this paper is structured as follows: Section 2 reviews some background on quantum
and reversible computation, and introduces the notations we will use. Sections 3 and 4 describe
the algorithmic core – a procedure that optimally parallelizes the T gates in a circuit composed of
CNOT and T gates by performing matroid partitioning. Section 5 develops a heuristic extending
the optimal {CNOT, T} core to a universal gate set, while Section 6 describes the final algorithm.
Section 7 reports our experimental results, and Section 8 concludes the paper.
2 Preliminaries
We begin by reviewing some basic facts about quantum and reversible circuits necessary for this
paper.
In the classical circuit model, the state of a system of n bits is represented as a binary string
of length n, with classical gates corresponding to operators that map length-n binary strings to
length-m binary strings. More precisely, length-n binary strings are vectors of $\mathbb{F}_2^n$, where $\mathbb{F}_2$ is
the two-element finite field with addition corresponding to Boolean exclusive-OR (EXOR, ⊕) and
multiplication corresponding to Boolean AND (∧). We then represent classical gates as operators
$f : \mathbb{F}_2^n \to \mathbb{F}_2^m$, and we typically refer to f as a (classical) function. For brevity, if m = 1 we call f
Boolean.
The quantum circuit model, one of the prominent models of quantum computation [24],
generalizes the classical circuit model to deal with quantum effects. In particular, it describes the state
space of a system of n qubits as a vector in a $2^n$-dimensional complex vector space H spanned
by the (classical) n bit states. By convention, we refer to the classical states as the standard or
computational basis of H and write them in Dirac notation: $|x\rangle$, $x \in \mathbb{F}_2^n$.
In contrast to the classical circuit model, quantum gates are restricted to a subset of all operators
on H – specifically, quantum gates are linear operators $U : H \to H$ that preserve the $L^2$ norm. Such
operators U satisfy $U^\dagger U = U U^\dagger = I$, where $U^\dagger$ denotes the adjoint of U, and are known as unitary.
Given that unitary operators are invertible, we see that the subset of quantum transformations that
permute the computational basis states are exactly the set of invertible classical transformations –
we call such functions reversible, with the intuition that any computation performed by reversible
functions can be undone or reversed. The Toffoli gate,
$\Lambda_2(X) : |x\rangle|y\rangle|z\rangle \mapsto |x\rangle|y\rangle|z \oplus (x \wedge y)\rangle,$
is an example of a reversible function.
We can also have classical/quantum computations that use ancillae – being bits/qubits that can
be initialized to the 0/|0〉 or 1/|1〉 state and act as a temporary register. Without loss of generality,
we require that all ancillae are initialized in the 0/|0〉 state. In the case of a circuit with n bits, m
of which are data bits (i.e. n −m is the number of ancillae), we describe the state space as some
subspace V of $\mathbb{F}_2^n$ with dimension m. We will typically use n to represent the total number of bits
in a system, and m to refer to the number of data bits.
While all reversible classical gates are linear as operators over H, they need not be linear as
operators over $\mathbb{F}_2^n$. In particular, we call $f : \mathbb{F}_2^n \to \mathbb{F}_2^m$ linear if $f(x \oplus y) = f(x) \oplus f(y)$. For instance,
the reversible controlled-NOT gate
$\mathrm{CNOT} : |x\rangle|y\rangle \mapsto |x\rangle|x \oplus y\rangle$
is linear over $\mathbb{F}_2^n$. It is a known result that the linear reversible functions are exactly those that
can be computed by a circuit consisting of only CNOT gates [25].
Throughout this paper we will also be interested in linear Boolean functions and their relation
to linear reversible functions. For convenience, we refer to the set of n-ary linear Boolean functions
$\mathbb{F}_2^n \to \mathbb{F}_2$ as the dual vector space $(\mathbb{F}_2^n)^*$ of $\mathbb{F}_2^n$, and note that a linear Boolean function $f : \mathbb{F}_2^n \to \mathbb{F}_2$
can be represented as a row vector over $\mathbb{F}_2^n$ – i.e., $x^T$ for some $x \in \mathbb{F}_2^n$. Furthermore, for a set of
linear Boolean functions $S \subseteq (\mathbb{F}_2^n)^*$, we define rank(S) as the maximum number of independent
(row) vectors in S, or equivalently the dimension of the subspace $V^*$ spanned by S.
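The rank of a set of linear Boolean functions can be computed by Gaussian elimination over $\mathbb{F}_2$. The following is a minimal Python sketch, not taken from the paper: it assumes each function is encoded as an integer bitmask in which bit i−1 is set when $x_i$ appears in the XOR, and the names `rank_f2` and `evaluate` are illustrative.

```python
def evaluate(f, x):
    """Value of the linear Boolean function f (a bitmask) on input x (a bitmask)."""
    return bin(f & x).count("1") % 2


def rank_f2(vectors):
    """Rank over F_2 of bitmask-encoded row vectors, by Gaussian elimination."""
    basis = {}  # leading-bit position -> reduced vector with that leading bit
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v      # v is independent of everything seen so far
                break
            v ^= basis[lead]         # cancel the leading bit and keep reducing
    return len(basis)


# Assumed encoding: x1 = 0b001, x2 = 0b010, x3 = 0b100.
x1_xor_x3, x2_xor_x3, x1_xor_x2 = 0b101, 0b110, 0b011
print(rank_f2([x1_xor_x3, x2_xor_x3, x1_xor_x2]))  # 2: the third is the XOR of the first two
```

With this encoding, XOR of functions is XOR of integers, so linear dependence becomes a bitwise question.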
As this paper is concerned with the optimization of quantum circuits, we also define some
quantum gates commonly used in fault tolerant models. In particular, we define the T and Hadamard
gates,
$T : |x\rangle \mapsto e^{i\pi x/4}|x\rangle, \qquad H : |x\rangle \mapsto \frac{|0\rangle + (-1)^x|1\rangle}{\sqrt{2}}.$
We will show that circuits over the gate set {CNOT, T} implement linear reversible functions with
discrete phases corresponding to the eighth roots of unity. A side result of this paper is a proof
that {CNOT, T} circuits can be simulated on a classical computer in polynomial time. If this set
is further extended with the Hadamard gate we achieve a gate set that is universal for quantum
computation. We call {H,CNOT, T} the “Clifford + T” gate set, as it contains the Clifford group
generators $\{H, P := T^2, \mathrm{CNOT}\}$ along with the T gate. Moreover, {H,CNOT, T} is a minimal
generating set for the Clifford + T gate set.
Since quantum gates are commonly defined by unitary matrices, we provide the equivalent
matrix definitions below:
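$$H := \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad \mathrm{CNOT} := \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad T := \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}.$$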
2.1 Computable sets of linear Boolean functions
While every linear reversible function f over n inputs can be written as n linear Boolean functions,
i.e.
f(a1, a2, ..., an) = (f1(a1, ..., an), ..., fn(a1, ..., an)) ,
not every set of linear Boolean functions defines a reversible function. For instance, f(a1, a2, a3) =
(a1, a2, a1 ⊕ a2) is irreversible since the input a3 is effectively destroyed. It is easy to verify that
a set of n linear Boolean functions forms the outputs of an n-ary linear reversible function if and
only if they have rank equal to the dimension of the input space.
Lemma 1. Given a subspace V of $\mathbb{F}_2^n$ and a set of linear Boolean functions $S = \{f_1, f_2, \ldots, f_n\} \subseteq V^*$,
the linear function $f : V \to W$ defined as
f(a1, a2, . . . , an) = (f1(a1, . . . , an), . . . , fn(a1, . . . , an))
is reversible if and only if rank(S) = dim(V ).
Since the unitary quantum circuit model is reversible, a set of linear Boolean functions S can only
be computed simultaneously (i.e. there exists a quantum circuit implementing the transformation
$|a_1 a_2 \ldots a_n\rangle \mapsto |b_1 b_2 \ldots b_n\rangle$ where for each f ∈ S, $f(a_1, a_2, \ldots, a_n) = b_i$ for some i) if it defines a
reversible function. We call such a set of linear Boolean functions (reversibly) computable over
a particular input space V – as we will be concerned strictly with reversible computations, we
frequently omit the qualifier reversible.
We may also want to compute a set S of linear Boolean functions with a linear reversible function
on n > |S| qubits. In this case, since every n-ary linear reversible function corresponds to some
set of n linear Boolean functions, a linear reversible function computes S if and only if there exists
some computable superset S′ of S. Equivalently from Lemma 1, a set $S \subseteq (\mathbb{F}_2^n)^*$ is (reversibly)
computable over a subspace V of $\mathbb{F}_2^n$ if and only if there exists a superset S′ of S with cardinality
n such that rank(S′) = dim(V).
Figure 1: A circuit computing the functions x1 ⊕ x2 ⊕ x3 ⊕ x4, x2 ⊕ x4, and x2 ⊕ x3. [Circuit diagram not reproduced.]
Before moving on, we establish the condition under which a computable superset of linear
Boolean functions exists. Given a set $S = \{f_1, f_2, \ldots, f_k\} \subseteq (\mathbb{F}_2^n)^*$ and subspace V, we can observe
that we only need to find some $f_{k+1}, \ldots, f_n \in (\mathbb{F}_2^n)^*$ such that $\mathrm{rank}(\{f_1, f_2, \ldots, f_n\}) = \dim(V)$. It is
not hard to see that such $f_{k+1}, \ldots, f_n$ exist if and only if the number of linearly independent vectors
in $(\mathbb{F}_2^n)^*$ needed is at most n − k.
Lemma 2. Given a subspace V of $\mathbb{F}_2^n$ and a set of linear Boolean functions $S \subseteq (\mathbb{F}_2^n)^*$, there exists
a superset S′ of S with cardinality n such that rank(S′) = dim(V) if and only if
$\dim(V) - \mathrm{rank}(S) \le n - |S|. \qquad (1)$
We can note that inequality (1) implies |S| = rank(S), i.e., S is linearly independent, when
dim(V ) = n.
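Inequality (1) translates directly into a computability test. The sketch below is illustrative only: it assumes linear Boolean functions are encoded as integer bitmasks (bit i−1 set when $x_i$ appears), and the names `rank_f2` and `is_computable` are hypothetical, not from the paper.

```python
def rank_f2(vectors):
    """Rank over F_2 of bitmask-encoded row vectors."""
    basis = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def is_computable(funcs, n, dim_v):
    """Check inequality (1): dim(V) - rank(S) <= n - |S| for the set S = funcs."""
    s = set(funcs)  # S is a set, so duplicates are ignored
    return dim_v - rank_f2(s) <= n - len(s)


a1, a2 = 0b001, 0b010
# The irreversible example from Section 2.1: outputs (a1, a2, a1 XOR a2) on 3 bits.
print(is_computable([a1, a2, a1 ^ a2], n=3, dim_v=3))  # False
# Dropping the dependent output leaves room to restore a3.
print(is_computable([a1, a2], n=3, dim_v=3))           # True
```

When dim(V) = n the test reduces to linear independence of S, in line with the remark following Lemma 2.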
3 {CNOT , T} circuits
We first consider circuits over the gate set $\{\mathrm{CNOT}, T, P := T^2, Z := T^4, T^\dagger := T^7, P^\dagger := T^6\}$, as
they have a particular property that will be crucial to synthesizing low T -depth circuits. We remind
the reader that the even powers of the T gate are all Clifford gates, whereas all odd powers lie outside
the Clifford group. This is an essential observation for practical considerations. Furthermore, no
power of the T gate requires more than a single non-Clifford T gate to implement. We usually omit
the extraneous gates and refer to this gate set by the generating set {CNOT, T}.
It can be observed that since $\mathrm{CNOT}|x\rangle|y\rangle = |x\rangle|x \oplus y\rangle$ and $T|x\rangle = e^{i\pi x/4}|x\rangle$ for $x, y \in \mathbb{F}_2$, a
{CNOT, T} circuit can be described as computing a linear reversible function on the input basis
state, with an added phase that is some power of $\omega := e^{i\pi/4}$. Stated more precisely [2],
Lemma 3. A unitary $U \in U(2^n)$ is exactly implementable by an n-qubit circuit over {CNOT, T}
if and only if
$U|x_1 x_2 \ldots x_n\rangle = \omega^{p(x_1, x_2, \ldots, x_n)}|g(x_1, x_2, \ldots, x_n)\rangle$
where $x_1 x_2 \ldots x_n \in \mathbb{F}_2^n$ and
$p(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{l} c_i \cdot f_i(x_1, x_2, \ldots, x_n)$
for some linear reversible function $g : \mathbb{F}_2^n \to \mathbb{F}_2^n$ and linear Boolean functions $f_1, f_2, \ldots, f_l \in (\mathbb{F}_2^n)^*$
with coefficients $c_1, c_2, \ldots, c_l \in \mathbb{Z}_8$.
As a result of Lemma 3, we can fully characterize any unitary $U \in U(2^n)$ implementable by a
{CNOT, T} circuit with a set $S \subseteq \mathbb{Z}_8 \times (\mathbb{F}_2^n)^*$ of linear Boolean functions together with coefficients
in $\mathbb{Z}_8$, and a linear reversible output function $g : \mathbb{F}_2^n \to \mathbb{F}_2^n$, with the interpretation
$U_{\langle S,g\rangle} : |x_1 x_2 \ldots x_n\rangle \mapsto \omega^{\sum_{(c,f) \in S} c \cdot f(x_1, x_2, \ldots, x_n)}|g(x_1, x_2, \ldots, x_n)\rangle.$
Moreover, S and g are efficiently computable given a {CNOT, T} circuit, taking time linear in
the number of qubits and gates. In computing S it also becomes apparent when T gates, possibly
physically separated within the circuit, are applied to the same value and can thus be replaced by
a phase gate such as $P : |x\rangle \mapsto \omega^{2x}|x\rangle$. Our experimental results show that a large number of T
gates can be removed in this way.
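The computation of S and g by a single pass over the circuit can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the gate encoding (`("cnot", control, target)` and `("phase", qubit, k)` for the gate $T^k$) and the function name `phase_polynomial` are assumptions, and linear functions are stored as integer bitmasks.

```python
def phase_polynomial(n, gates):
    """Track a {CNOT, T} circuit and return (S, outputs).

    S maps each linear Boolean function (bitmask) to its coefficient in Z_8;
    outputs[i] is the linear function held by qubit i at the end, i.e. g.
    """
    state = [1 << i for i in range(n)]  # qubit i starts in the state x_{i+1}
    phase = {}
    for gate in gates:
        if gate[0] == "cnot":
            _, control, target = gate
            state[target] ^= state[control]
        elif gate[0] == "phase":
            _, qubit, k = gate
            f = state[qubit]            # value the phase gate acts on
            phase[f] = (phase.get(f, 0) + k) % 8
        else:
            raise ValueError(f"unsupported gate: {gate!r}")
    S = {f: c for f, c in phase.items() if c != 0}
    return S, state


def t_count(S):
    """Only odd coefficients need a T gate; even ones are Clifford phases."""
    return sum(1 for c in S.values() if c % 2 == 1)


# Hypothetical two-qubit circuit: T and a later T-dagger both see x1 and cancel,
# and two T gates on the same value x2 merge into one P gate.
gates = [("phase", 0, 1), ("cnot", 0, 1), ("cnot", 1, 0), ("cnot", 0, 1),
         ("phase", 1, 7), ("phase", 0, 1), ("phase", 0, 1)]
S, outputs = phase_polynomial(2, gates)
print(S, outputs, t_count(S))
```

Each gate is processed in constant time on machine-word bitmasks, which mirrors the linear-time claim for small n; for large n the XOR itself costs time proportional to n.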
Example 1. Consider the following circuit.
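[Circuit diagram not reproduced: four qubits with inputs x1, x2, x3, x4, several CNOT gates, and the gates T1, T2†, T3 and T4; the outputs are x2, x1, x3 ⊕ x4 and x4.]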
If we track the effect of the CNOT gates we see that $T_1$ and $T_2^\dagger$ (indices are used to mark
different T gates within the circuit) are both applied to qubits in the state $|x_1\rangle$, resulting in a
cumulative phase of $\omega^{x_1 + 7x_1} = \omega^{8x_1} = 1^{x_1} = 1$. As such, both gates can be removed. Likewise, $T_3$
and $T_4$ are both applied to qubits in the state $|x_3 \oplus x_4\rangle$, and their phases combine to $\omega^{2(x_3 \oplus x_4)}$ –
this pair of T gates can thus be replaced with a single P gate. This results in an optimized circuit:
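[Circuit diagram not reproduced: the same circuit with T1 and T2† removed and with T3 and T4 replaced by a single P gate on the third qubit; the outputs are x2, x1, x3 ⊕ x4 and x4.]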
which may be optimized further to the form SWAP (x1, x2)CNOT (x4, x3)P (x3) by rewriting the
linear reversible section.
Once S and g have been computed, the proof of Lemma 3 [2] gives a constructive method for
synthesizing a circuit implementing $U_{\langle S,g\rangle}$. However, this naïve method of re-synthesis may end up
with worse T -depth than the original circuit, despite possibly reduced T -count.
We can instead recall from Section 2 that if a subset A ⊆ S is reversibly computable (i.e. if
the linear Boolean functions in A are reversibly computable), we can construct a linear reversible
function with outputs simultaneously computing the functions in A; in this case |A| of the necessary
phase factors could then be applied in parallel. Synthesis can thus proceed by partitioning S into
computable subsets. For each block A ⊆ S of the partition, we first compute a (reversible) superset of A with
a stage of CNOT gates; many efficient algorithms exist [4, 25, 20] that can decompose a linear
reversible function into CNOT gates. Next we apply the relevant phase gates in parallel to add the
phase $\omega^{\sum_{(c,f) \in A} c \cdot f(x_1, x_2, \ldots, x_n)}$, and finally uncompute by reversing the CNOT stage. Given that at
most one T gate is used to implement any integer power of T, every block will have a T-depth
of at most 1.
Figure 2: {CNOT, T} circuit implementing the doubly controlled Z gate. [Circuit diagram not reproduced.]
As a result, any unitary U implementable over {CNOT, T} can be implemented in T-depth
k where k is the minimum number of sets partitioning¹ $S_1 = \{(c, f) \in S \mid c \equiv 1 \bmod 2\}$ into
computable subsets, as elements in $S_0 = S \setminus S_1$ do not require T gates to implement. In fact, we
can trivially see that given a specific set S and output g, k is the minimal T-depth, as any layer of
T gates in a circuit implementing $U_{\langle S,g\rangle}$ corresponds to a computable subset of S.
It is important to note that while this method reduces and maximally parallelizes the T
gates, it will not necessarily find the optimal T -count and by extension T -depth. Specifically,
given a {CNOT, T} circuit with phase defined by the set S there may be some distinct S′
that defines an equivalent computation using fewer T gates. Consider, for instance, the set
$S_\emptyset = \{(1, f) \mid f \in (\mathbb{F}_2^4)^*\}$. We note that the integer sum $\sum_{f \in (\mathbb{F}_2^4)^*} f(x_1, x_2, x_3, x_4)$ is equal to 8
for any non-zero $(x_1, x_2, x_3, x_4) \in \mathbb{F}_2^4$, and 0 for (0, 0, 0, 0). Since $\omega^8 = \omega^0 = 1$, $S_\emptyset$ computes the
trivial phase on every input and is therefore equivalent to the empty set ∅. In this case, all $|S_\emptyset| = 15$
T gates can then be removed.
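The counting claim behind this example is small enough to check exhaustively. The sketch below is an illustration only: it encodes inputs and linear Boolean functions on four bits as integer bitmasks and sums the 15 non-zero functions, since the zero function contributes nothing to the phase.

```python
def parity(v):
    """Parity of the number of set bits in v."""
    return bin(v).count("1") % 2


def total_exponent(x, n=4):
    """Integer sum of f(x) over all non-zero linear Boolean functions f on n bits."""
    return sum(parity(f & x) for f in range(1, 2 ** n))


exponents = [total_exponent(x) for x in range(16)]
print(exponents)  # 0 for the all-zero input, 8 for every other input
```

Every non-zero input is mapped to 1 by exactly half of the 16 linear functionals, which is where the value 8 comes from.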
It turns out, through a brute force search, that for n < 4, no two sets S, S′ requiring distinct
numbers of T gates define equivalent computations. As a result, we obtain a direct proof that the
doubly controlled-Z gate requires 7 T gates to implement over {CNOT, T} with any number of
ancillae. For n ≥ 4 however, the problem of minimizing T -count in {CNOT, T} circuits reduces
to a minimization problem over degree 1 polynomials in mixed arithmetic; moreover, since every
such polynomial defines the global phase for some {CNOT, T} circuit, the two problems are in fact
equivalent.
3.1 Parallelizing Λ2(Z)
To illustrate {CNOT, T} re-synthesis, consider the circuit in Figure 2, implementing the doubly
controlled-Z gate,
$\Lambda_2(Z) : |x_1 x_2 x_3\rangle \mapsto \omega^{4 \cdot x_1 \wedge x_2 \wedge x_3}|x_1 x_2 x_3\rangle$
over {CNOT, T}. We track the effect of each CNOT gate on the state of the qubits, as
annotated in the circuit. When a T/T† gate is applied, we add/subtract a term in the
exponent of ω corresponding to the state of the target qubit. The resulting transformation is then
$\Lambda_2(Z) : |x_1 x_2 x_3\rangle \mapsto \omega^{p(x_1, x_2, x_3)}|x_1 x_2 x_3\rangle$, where
$p(x_1, x_2, x_3) = x_1 + x_2 + x_3 - (x_1 \oplus x_2) - (x_1 \oplus x_3) - (x_2 \oplus x_3) + (x_1 \oplus x_2 \oplus x_3).$
In fact, since $2 \cdot (x \wedge y) = x + y - (x \oplus y)$, we see that $\omega^{p(x_1, x_2, x_3)} = \omega^{4 \cdot (x_1 \wedge x_2 \wedge x_3)}$ as expected.
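The identity between the phase polynomial and the doubly controlled phase can be confirmed over all eight inputs. This is a brute-force check written for illustration; the function names are not from the paper.

```python
from itertools import product


def p(x1, x2, x3):
    """Phase polynomial of the doubly controlled-Z circuit, as an integer."""
    return (x1 + x2 + x3
            - (x1 ^ x2) - (x1 ^ x3) - (x2 ^ x3)
            + (x1 ^ x2 ^ x3))


def identity_holds():
    """True if p(x) = 4 * (x1 AND x2 AND x3) on every input in F_2^3."""
    return all(p(a, b, c) == 4 * (a & b & c)
               for a, b, c in product((0, 1), repeat=3))


print(identity_holds())  # True
```

The equality holds over the integers, so it holds modulo 8 in the exponent of ω as well.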
The T -stages in Figure 2 also correspond to a partition of S, specifically
{{−(x1),−(x2)}, {x1 ⊕ x3, x2 ⊕ x3}, {−(x1 ⊕ x2 ⊕ x3)}, {x1 ⊕ x2,−(x3)}} .
¹ For the remainder of this paper we do not separate S0 and S1 to simplify the presentation, though our algorithm
can be trivially modified to partition S0 and S1 separately.
Figure 3: T-depth 2 implementation of Figure 2 with one ancilla. [Circuit diagram not reproduced.]
It can easily be verified that for each subset S in this partition, dim(V ) − rank(S) ≤ n − |S| (i.e.
S satisfies (1)), since dim(V ) = n = 3 and each subset is linearly independent. By contrast, the
partition
{{−(x1),−(x2)}, {x1 ⊕ x3, x2 ⊕ x3, x1 ⊕ x2}, {−(x1 ⊕ x2 ⊕ x3)}, {−(x3)}}
could not have been synthesized as the set {x1⊕x3, x2⊕x3, x1⊕x2} has rank 2, thus the mapping
$|x_1 x_2 x_3\rangle \mapsto |(x_1 \oplus x_3)(x_2 \oplus x_3)(x_1 \oplus x_2)\rangle$ is not reversible.
While we haven’t yet described how to find a minimal computable partition algorithmically, we
can observe that the T -depth 3 partition
{{−(x1),−(x2), x1 ⊕ x3}, {−(x1 ⊕ x2 ⊕ x3)}, {−(x3), x1 ⊕ x2, x2 ⊕ x3}}
is computable since each subset satisfies (1), and is also minimal. If, however, we added an ancilla
when re-synthesizing Figure 2, we can use the extra qubit to generate a smaller partition. In
particular, we know $|x_1 x_2 x_3\rangle \mapsto |x_1 x_2 (x_1 \oplus x_3)\rangle$ is reversible so $|x_1 x_2 x_3\rangle|0\rangle \mapsto |x_1 x_2 (x_1 \oplus x_3)\rangle|0\rangle$
is as well. We can then compute the value x1 ⊕ x2 ⊕ x3 into the ancilla with 2 CNOT gates and
apply 4 T gates (one for each qubit) at the same time. To connect this intuitive idea with equation
(1), we observe that n = 4, dim(V ) = 3 since the input |x1x2x3〉|0〉 spans a space of dimension 3,
and
rank({−(x1),−(x2), x1 ⊕ x3,−(x1 ⊕ x2 ⊕ x3)}) = 3,
so dim(V )− rank(S) = 0 ≤ 0 = n− |S|. Figure 3 shows the resulting circuit, implementing Λ2(Z)
in T -depth 2.
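The partitions discussed in this section can be checked mechanically against inequality (1). The following sketch is an assumption-laden illustration: functions are bitmasks with x1 = 0b001, x2 = 0b010, x3 = 0b100, the signs of the coefficients are ignored because they do not affect computability, and the second block of the ancilla partition is simply the three remaining functions. The helper names are hypothetical.

```python
def rank_f2(vectors):
    """Rank over F_2 of bitmask-encoded row vectors."""
    basis = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def valid_partition(partition, n, dim_v):
    """True if every block satisfies dim(V) - rank(A) <= n - |A|."""
    return all(dim_v - rank_f2(block) <= n - len(block) for block in partition)


X1, X2, X3 = 0b001, 0b010, 0b100
figure2 = [{X1, X2}, {X1 ^ X3, X2 ^ X3}, {X1 ^ X2 ^ X3}, {X1 ^ X2, X3}]
rejected = [{X1, X2}, {X1 ^ X3, X2 ^ X3, X1 ^ X2}, {X1 ^ X2 ^ X3}, {X3}]
depth3 = [{X1, X2, X1 ^ X3}, {X1 ^ X2 ^ X3}, {X3, X1 ^ X2, X2 ^ X3}]
with_ancilla = [{X1, X2, X1 ^ X3, X1 ^ X2 ^ X3}, {X3, X1 ^ X2, X2 ^ X3}]

print(valid_partition(figure2, n=3, dim_v=3))       # True
print(valid_partition(rejected, n=3, dim_v=3))      # False
print(valid_partition(depth3, n=3, dim_v=3))        # True
print(valid_partition(with_ancilla, n=4, dim_v=3))  # True
```

The last call uses n = 4 with dim(V) = 3, which is how the single ancilla enters the inequality.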
4 Matroids
We now turn our attention to the problem of determining a minimal partition of a set of linear
Boolean functions into computable sets. Due to the connection between the computability condition
(1) on sets of linear Boolean functions and linear independence, we are able to phrase the problem as
a matroid partitioning problem. To do so, we first introduce the concept of a matroid, an algebraic
structure that generalizes the idea of linear independence in vector spaces.
Definition 1. A finite matroid is a pair (S, I) where S is a finite set and I is a set of subsets of S
such that
1. ∅ ∈ I.
2. For all A,B ⊂ S, if A ∈ I and B ⊂ A, then B ∈ I.
3. For all A,B ∈ I, if |A| > |B|, then there exists some a ∈ A such that B ∪ {a} ∈ I.
It turns out that a set of linear Boolean functions, together with an independence relation
defined by the inequality (1), forms a matroid:
Lemma 4. For any subspace V of $\mathbb{F}_2^n$ with dimension m and set of linear Boolean functions
$S = \{f_1, f_2, \ldots, f_k\} \subseteq V^*$, let I denote the set
$\{A \subseteq S \mid m - \mathrm{rank}(A) \le n - |A|\}.$
The pair (S, I) is a finite matroid.
Proof. We verify that (S, I) satisfies all three conditions of Definition 1.
1. m− rank(∅) ≤ n− |∅| is trivially true since m ≤ n. Thus ∅ ∈ I.
2. Suppose A,B ⊂ S, where A ∈ I and B ⊂ A. Since A = B ∪ (A \B) we see that
rank(A) ≤ rank(B) + rank(A \B)
≤ rank(B) + (|A| − |B|),
and so rank(A)− rank(B) ≤ |A| − |B|. Since m− rank(A) ≤ n− |A|, we see that
m ≤ n+ rank(A)− |A| ≤ n+ rank(B)− |B|,
and thus m− rank(B) ≤ n− |B|.
3. Suppose A,B ∈ I and |A| > |B|. If rank(A) ≤ rank(B), then
m− rank(B) ≤ m− rank(A) ≤ n− |A| < n− |B|,
and so m− rank(B ∪ {s}) ≤ n− |B ∪ {s}| for any s ∈ A.
Otherwise, rank(A) > rank(B) and we can let A′ and B′ be maximal linearly independent
subsets of A and B, respectively. Since $A' \not\subseteq \mathrm{span}(B')$, for any $s \in A \setminus \mathrm{span}(B')$, $B' \cup \{s\}$ is
linearly independent. Then,
m− rank(B′ ∪ {s}) = m− rank(B ∪ {s})
= m− rank(B)− 1
≤ n− |B| − 1
= n− |B ∪ {s}|.
With Lemmas 2 and 4 we see that the problem of finding a minimal partition of the phase
factors in a {CNOT, T} circuit reduces to the more general matroid partitioning problem.
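As a sanity check on Lemma 4, the three conditions of Definition 1 can be tested by brute force on a small instance. This sketch is only a finite check under assumed encodings (bitmask functions, hypothetical helper names) and is no substitute for the proof.

```python
from itertools import combinations


def rank_f2(vectors):
    """Rank over F_2 of bitmask-encoded row vectors."""
    basis = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def is_matroid(ground, n, m):
    """Brute-force check of Definition 1 for I = {A : m - rank(A) <= n - |A|}."""
    ground = list(ground)
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    indep = {A for A in subsets if m - rank_f2(A) <= n - len(A)}
    if frozenset() not in indep:
        return False
    # Hereditary property: closure under single removals implies closure
    # under arbitrary subsets by induction.
    if any((A - {a}) not in indep for A in indep for a in A):
        return False
    # Exchange property, with the exchanged element taken from A \ B.
    for A in indep:
        for B in indep:
            if len(A) > len(B) and not any((B | {a}) in indep for a in A - B):
                return False
    return True


# All seven non-zero linear functions of x1, x2, x3, with and without one ancilla.
print(is_matroid(range(1, 8), n=3, m=3), is_matroid(range(1, 8), n=4, m=3))
```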
Figure 4: The directed graph $G_s$ constructed when adding x1 ⊕ x2 to the minimal partition
{{x1 ⊕ x3, x2 ⊕ x3}, {x1, x2, x1 ⊕ x2 ⊕ x3}}. A minimum length path is shown in solid lines,
resulting in the new partition {{x1 ⊕ x3, x2 ⊕ x3, x1}, {x1 ⊕ x2, x2, x1 ⊕ x2 ⊕ x3}}. [Graph drawing not reproduced.]
4.1 Matroid partitioning
The matroid partitioning problem can be defined as follows:
Definition 2. (Matroid partitioning)
Given a matroid (S, I), find a partition {A1, A2, . . . , Ak} of S such that Ai ∈ I for each 1 ≤ i ≤ k
and for any other partition {A′1, A′2, . . . , A′k′} into independent subsets, k′ ≥ k.
Matroid partitioning can, perhaps surprisingly, be solved in polynomial time, given an
independence oracle for the matroid [12]. As a result, given an oracle for (1), the T gates in a {CNOT, T}
circuit can be optimally partitioned efficiently. Since the condition in Lemma 2 can be checked by
using Gaussian elimination to compute the matrix rank in $O(n^3)$ time,² we thus see that a minimal
partition can be computed in polynomial time. The rest of this section describes an algorithm for
computing such a minimal partition.
The algorithm we use for solving the matroid partitioning problem is based on an algorithm
due to Edmonds [12]. Given a matroid (S, I) and a minimal (matroid) partition P of S′ ⊂ S, we
take an element s ∈ S \ S′ not already partitioned and construct a minimal partition of S′ ∪ {s}.
To create the new partition, we construct a directed graph Gs containing a vertex u for every
u ∈ S′ ∪ {s} as well as a vertex ⊥p for every subset p ∈ P . The edges of Gs represent changes to
the partition that are invariant under the property of each subset being independent. In particular,
for any u, v ∈ S′ ∪ {s} there is a directed edge v → u in Gs if and only if u is contained in some
subset p ∈ P and (p \ {u}) ∪ {v} ∈ I, i.e. v can be added to p if we remove u. Additionally, given
a subset p ∈ P and element u ∈ S′ ∪ {s}, there exists an edge u→ ⊥p if and only if p∪ {u} ∈ I. A
path from s to ⊥p for some subset p gives a set of updates to P that produce a valid partition P ′
of S′ ∪ {s}. Likewise, if there is no such path, there is no partition of size |P | partitioning S′ ∪ {s}
(see [12] for a proof), and so a new subset {s} is added to P . Figure 4 shows the full graph Gs for
one iteration when computing a minimal partition for Λ2(Z) (Figure 2).
Rather than generating the graph $G_s$ explicitly for each element s, we try to construct a path
from s to some ⊥p breadth-first (Algorithm 1). The time complexity of breadth-first search is
O(|E| + |V|) for a graph with edge set E and vertices V. We can note that there are |S′| + |P| + 1
vertices and at most $|S'|^2 + |P| \cdot (|S'| + 1) + (|S'| + |P|)$ edges in $G_s$, as well as the fact that
|P| ≤ |S′|. Since each edge requires a single test for independence in $O(n^3)$ time, the breadth first
search requires time in $O(|S'|^2 \cdot n^3 + |S'|)$.
² In practice we reduce this to $O(m^2 n)$ by storing vectors in $V^*$ as length dim(V) = m vectors. The $O(n^3)$ bound
is used for simplicity.
Algorithm 1 Matroid partitioning algorithm
function Partition(s, P, (S, I))
    /* I denotes the independence oracle,
       P is a minimal partition */
    Create path queue Q, Q.enqueue(s → ∅)
    Unmark each element of S, mark s
    while Q non-empty do
        t := Q.dequeue()
        for each A ∈ P do
            Set A′ := A ∪ {head(t)}
            if A′ ∈ I then
                Set A := A′
                for each u → v in path t do
                    Replace u with v in its current partition
                end for
            else
                for each unmarked u ∈ A do
                    if A′ \ {u} ∈ I then
                        Q.enqueue(u → t)
                        Mark u
                    end if
                end for
            end if
        end for
    end while
    If no path was found, set P := P ∪ {s}
end function
Algorithm 1 details the algorithm for adding an element s to a partition P of matroid (S, I) –
the full algorithm follows by iteratively adding each element of the ground set to the initially empty
partition, and correctness follows from the property that if P is minimal for (S, I), then the new
partition P′ is minimal for (S ∪ {s}, I). Since adding a single element to a partition of i elements
takes $O((2i)^2 \cdot n^3 + 2i)$ time, and $\sum_{i=1}^{|S|} i^2 = \frac{|S|^3}{3} + \frac{|S|^2}{2} + \frac{|S|}{6}$, we see that Algorithm 1 can be used
to partition the full set S in $O(|S|^3 \cdot n^3)$ time.
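The augmenting-path idea behind Algorithm 1 can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code: functions are integer bitmasks, the independence oracle is inequality (1), a BFS parent map replaces the explicit path queue, a block is never tested against one of its own elements, and the search returns as soon as a path is applied. All names are hypothetical.

```python
from collections import deque


def rank_f2(vectors):
    """Rank over F_2 of bitmask-encoded row vectors."""
    basis = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def make_oracle(n, m):
    """Independence oracle for inequality (1): m - rank(A) <= n - |A|."""
    return lambda A: m - rank_f2(A) <= n - len(A)


def add_element(s, partition, independent):
    """Add s to `partition` (a list of sets) along a shortest augmenting path."""
    parent = {s: None}  # parent[u] is the element that takes u's place
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for block in partition:
            if v in block:
                continue
            if independent(block | {v}):
                # v joins `block`; each predecessor then fills the vacated slot.
                dest, cur = block, v
                while cur is not None:
                    src = next((b for b in partition if cur in b), None)
                    if src is not None:
                        src.remove(cur)
                    dest.add(cur)
                    dest, cur = src, parent[cur]
                return
            for u in block:
                if u not in parent and independent((block - {u}) | {v}):
                    parent[u] = v
                    queue.append(u)
    partition.append({s})  # no path exists, so a new block is needed


def partition_matroid(elements, independent):
    """Partition `elements` into independent blocks, one element at a time."""
    partition = []
    for s in elements:
        add_element(s, partition, independent)
    return partition


functions = [1, 2, 3, 4, 5, 6, 7]  # the seven phase terms of the doubly controlled-Z gate
print(len(partition_matroid(functions, make_oracle(3, 3))))  # 3 blocks without ancillae
print(len(partition_matroid(functions, make_oracle(4, 3))))  # 2 blocks with one ancilla
```

The two block counts agree with the T-depth 3 and T-depth 2 partitions described in Section 3.1.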
5 Extending to a universal gate set
In the previous sections we described a method for re-synthesizing {CNOT, T} circuits that removes
redundant T -gates by computing the total phase and parallelizing the phase gates through matroid
partitioning. However, the usefulness of such an algorithm on its own is marred by the fact that
{CNOT, T} circuits are a restricted class of quantum circuits – in particular, they do not create
superpositions or interference between basis states. To apply the optimization procedure to more
complex quantum circuits, we extend these ideas to deal with the universal gate set {H,CNOT, T}.
We recall that a Hadamard gate H has the effect
$H : |x_1\rangle \mapsto \frac{1}{\sqrt{2}} \sum_{x_2 \in \mathbb{F}_2} \omega^{4 \cdot x_1 \cdot x_2}|x_2\rangle$
on a basis state $x_1 \in \mathbb{F}_2$. We call $x_2$ a path variable, following in the tradition of similar
representations of quantum circuits [10, 27], called sum over path representations. We note that the state
$|x_1\rangle$ effectively ceases to exist, having been replaced with $|x_2\rangle$. Circuits over {H,CNOT, T} can
then be described by a phase polynomial and set of linear Boolean outputs, similar to Lemma 3.
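The path-sum form of the Hadamard gate can be compared numerically with its matrix definition. The sketch below is a small illustration using only the Python standard library; the function names are not from the paper.

```python
import cmath
import math

OMEGA = cmath.exp(1j * math.pi / 4)  # omega = e^{i*pi/4}
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]


def path_sum_amplitudes(x1):
    """Amplitudes of |0> and |1> in (1/sqrt 2) * sum over x2 of omega^(4*x1*x2) |x2>."""
    return [OMEGA ** (4 * x1 * x2) / math.sqrt(2) for x2 in (0, 1)]


def matrix_amplitudes(x1):
    """Column x1 of the Hadamard matrix, i.e. the state H|x1>."""
    return [H[x2][x1] for x2 in (0, 1)]


def forms_agree(tol=1e-12):
    """True if both descriptions give the same amplitudes on both basis states."""
    return all(abs(a - b) < tol
               for x1 in (0, 1)
               for a, b in zip(path_sum_amplitudes(x1), matrix_amplitudes(x1)))


print(forms_agree())
```

The comparison uses a tolerance because $\omega^4$ is only equal to −1 up to floating-point rounding.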
Lemma 5. If a unitary $U \in U(2^n)$ is exactly implementable by an n-qubit circuit over {H,CNOT, T}
with k H gates, then for $x_1 x_2 \ldots x_n \in \mathbb{F}_2^n$,
$U|x_1 x_2 \ldots x_n\rangle = \frac{1}{\sqrt{2^k}} \sum_{x_{n+1} \ldots x_{n+k} \in \mathbb{F}_2^k} \omega^{p(x_1, x_2, \ldots, x_{n+k})}|y_1 y_2 \ldots y_n\rangle$
where $y_i = h_i(x_1, x_2, \ldots, x_{n+k})$ and
$p(x_1, x_2, \ldots, x_{n+k}) = \sum_{i=1}^{l} c_i \cdot f_i(x_1, \ldots, x_{n+k}) + 4 \cdot \sum_{i=1}^{k} x_{n+i} \cdot g_i(x_1, \ldots, x_{n+k})$
for some linear Boolean functions $h_i, f_i, g_i$ and coefficients $c_i \in \mathbb{Z}_8$. The k path variables
$x_{n+1}, \ldots, x_{n+k}$ result from the application of Hadamard gates.
Proof. Follows from the effect of each gate on the computational basis states.
It can be noted that unlike Lemma 3, the converse is not true – some computations of the form
in Lemma 5 do not define unitary transformations.
Synthesis of a circuit based on the above representation is more challenging. In particular,
the Hadamard gates in effect destroy values and create new ones, changing the state space to
some new subspace of $\mathbb{F}_2^{n+k}$, possibly with greater dimension if the destroyed value was linearly
dependent with the other qubits (e.g. an initialized ancilla). Each linear Boolean function is
likewise computable only in some of the possible state spaces. To re-synthesize we then need to
apply Hadamard gates in such a way as to be able to pick up each phase factor and end in the
state space span ({y1, y2, ..., yn}).
Since the application of a Hadamard gate changes the state space, the state of each qubit must
be chosen so that the qubits span a suitable space afterwards. As an illustration consider the
transformation
$|x_1 x_2\rangle \mapsto \frac{1}{\sqrt{2}} \sum_{x_3 \in \mathbb{F}_2} \omega^{4 \cdot x_3 \cdot x_2}|(x_1 \oplus x_2) x_3\rangle.$
We could achieve the correct phase by applying a Hadamard gate on the second qubit first, but
the resulting state would be |x1x3〉, from which we cannot directly construct the output state
|(x1 ⊕ x2)x3〉. The simplest way to choose a suitable state for each qubit before applying a
Hadamard gate is to use the qubit’s state in the original circuit, though there may be other ways
of computing such states. During re-synthesis we then transform the state to match the state in
the original circuit before a particular Hadamard gate is applied.
In this sense, the Hadamard gates are fixed by the original circuit and the re-synthesis process
needs to place the remaining terms of the phase (i.e. $c_i \cdot f_i(x_1, x_2, \ldots, x_{n+k})$) in between them. One
approach is to use the greedy nature of Algorithm 1 to maintain a partition of those functions
that are computable in the current state space of the circuit. Specifically, we iterate through the
Hadamard gates and for each one we partition any elements that are in the new state space – this
step relies on the fact that Algorithm 1 is greedy to avoid having to repartition the elements that
were already in the old state space. For any block in the partition containing functions that will not
be computable after the next Hadamard gate, we remove the block and synthesize a {CNOT, T}
circuit applying those phases. A more detailed description is given in Section 6.
While this method is heuristic, we note that the partition is always minimal for the set of
currently computable functions. In particular, Algorithm 1 produces a minimal partition when
given a minimal partition, and removing blocks from a partition does not affect minimality – given
a subset P′ of a minimal partition P, if there existed a partition P0 of the elements in P′ such that
|P0| < |P′|, then P0 ∪ (P \ P′) is a partition of the elements in P into fewer sets.
One problem arises in that the dimension of the state space may increase in the next subcircuit
(e.g. if the Hadamard is applied to an ancilla qubit). In this case, the independence condition (1) of
the matroid changes, and previous partitions may now be invalid under the new inequality. However,
as a trivial consequence of the fact that the dimension increases by at most 1, a block that is
no longer independent can be modified to satisfy the new condition by removing exactly one linearly
dependent element. Furthermore, if all blocks are modified to satisfy the new independence condition
in this way, the new partition is minimal and the elements that were removed can be repartitioned by
Algorithm 1.
Lemma 6. Given a subspace V of $\mathbb{F}_2^n$ with dim(V) = m and a set of linear Boolean functions
S ⊆ V*, let
$$\mathcal{I}_i = \{A \subseteq S \mid i - \mathrm{rank}(A) \le n - |A|\}.$$
If P is a minimal partition of $(S, \mathcal{I}_m)$, then the partition P′ defined by removing one linearly
dependent element from every A ∈ P if $A \notin \mathcal{I}_{m+1}$ is a minimal partition of $(S', \mathcal{I}_{m+1})$, where S′ is
the set of elements partitioned by P′.
Proof. Suppose there exists some partition P0 of $(S', \mathcal{I}_{m+1})$ with |P0| < |P′|. We then see that one
element of S \ S′ can be added to any A ∈ P0 to give a set in $\mathcal{I}_m$. In particular, consider any
A ∈ P0. Since m + 1 − rank(A) ≤ n − |A| we see that
m − rank(A) ≤ n − |A| − 1 = n − |A ∪ {s}|
for any s ∈ S \ S′, and since rank(A ∪ {s}) ≥ rank(A) the same bound holds for m − rank(A ∪ {s}).
We also note that n − |A ∪ {s}| ≥ 0 as required, since any subset of S has
rank at most m, so for any $A \in \mathcal{I}_{m+1}$, m + 1 − rank(A) > 0 and thus |A| < n. Therefore, for any
A ∈ P0, s ∈ S \ S′ we have $A \cup \{s\} \in \mathcal{I}_m$.
Next we note that for any T ⊆ S there exists a partition of (T, Im) with size at most |T | − 1.
This is a simple result of the fact that m < n, as any subset A ⊆ T of size 2 has rank at least 1, so
m− rank(A) ≤ m− 1 ≤ n− 2 = n− |A|.
Additionally, any size 1 subset of T is trivially independent under Im.
Thus, since we can add one element to every partition in P0, and we can partition the remaining
|S \S′|− |P0| elements into at most |S \S′|− |P0|−1 partitions, we see that there exists a partition
of (S, Im) with size at most
|P0|+ (|S \ S′| − |P0| − 1) = |S \ S′| − 1.
Since |S \ S′| − 1 < |P | we obtain a contradiction.
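To make the independence condition of Lemma 6 concrete, here is a small sketch under stated assumptions: functions are encoded as bitmasks (an encoding chosen for this illustration), and `gf2_rank`, `is_independent` and `drop_dependent` are hypothetical helper names. The dependent element is found by repeated rank computations for clarity, rather than by the single Gaussian elimination an efficient implementation would use.

```python
def gf2_rank(vectors):
    """Rank over GF(2) of a list of bitmask-encoded vectors."""
    basis = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def is_independent(block, dim, n):
    """Membership test for I_dim: dim - rank(A) <= n - |A|."""
    return dim - gf2_rank(block) <= n - len(block)


def drop_dependent(block):
    """Remove one element whose removal leaves the rank unchanged."""
    full = gf2_rank(block)
    for i in range(len(block)):
        rest = block[:i] + block[i + 1:]
        if gf2_rank(rest) == full:
            return rest, block[i]
    return block, None   # block is linearly independent


n = 4
block = [0b0001, 0b0010, 0b0011, 0b0100]     # x1, x2, x1+x2, x3

ok_at_m3 = is_independent(block, 3, n)       # 3 - 3 <= 4 - 4
ok_at_m4 = is_independent(block, 4, n)       # 4 - 3 <= 4 - 4 fails
repaired, removed = drop_dependent(block)
ok_after_repair = is_independent(repaired, 4, n)
```

In this example the block is independent for dimension 3 but not for dimension 4; removing a single dependent element (here x1) restores independence, as the lemma describes.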
6 The Tpar algorithm
In the last section we described a heuristic for optimizing T-count and depth over the gate set
{H, CNOT, T}. In this section, we present the concrete algorithm, Tpar, and enlarge the gate set
to include the Pauli gates
$$X := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y := \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad Z := \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$
We ignore the irrelevant global phase i in Y = iXZ, though it can be recovered by applying
XPXP = iI to any qubit. Appendix A gives a demonstration of the Tpar algorithm on a simple
circuit.
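The two identities used here can be verified numerically. The sketch below assumes the conventional matrix forms of the Pauli gates and of the phase gate P = diag(1, i); the helper names `matmul` and `scale` are introduced only for this illustration.

```python
def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def scale(c, a):
    """Multiply every entry of matrix a by the scalar c."""
    return [[c * entry for entry in row] for row in a]


X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]      # conventional Pauli Y (assumed form)
Z = [[1, 0], [0, -1]]
P = [[1, 0], [0, 1j]]        # phase gate diag(1, i)
I2 = [[1, 0], [0, 1]]

# Y equals XZ up to the global phase i.
y_matches = scale(1j, matmul(X, Z)) == Y

# XPXP is the identity times i, so it restores the dropped phase.
xp = matmul(X, P)
phase_matches = matmul(xp, xp) == scale(1j, I2)
```

Both `y_matches` and `phase_matches` evaluate to `True`.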
Recall that in the computational basis, the input space of a circuit with n qubits, n − m ancillae
and k Hadamard gates is a dimension m subspace V of $\mathbb{F}_2^{n+k}$. We represent the state of a qubit as
a vector in the dual space of $\mathbb{F}_2^{n+k}$, $(\mathbb{F}_2^{n+k})^*$, along with a Boolean value b, called the parity, which
is used to record bit flips. Specifically, we note that X : |x〉 ↦ |1 ⊕ x〉, so we can model bit flips
with a single parity bit. We denote the set of states $\mathbb{F}_2 \times (\mathbb{F}_2^{n+k})^*$ as $\mathcal{S}$.
Given a Clifford + T circuit C, written as a sequence of gates over {X, Y, Z, P, P†, H, CNOT, T,
T†}, we first compute a triple 〈S, Q, H〉 ∈ $\mathcal{D}$ representing C. S = {(c, f) | c ∈ Z8, f ∈ $\mathcal{S}$} stores the
T^k phase factors as linear Boolean functions with parity and multiplicity, Q = (g1, g2, ..., gn) ∈ $\mathcal{S}^n$
tracks the state of each qubit, and H = (h1, h2, ..., hk) gives a sequence of Hadamard gates where
each entry hi stores the input and output states, hi.QI and hi.QO respectively. We define the initial
state of the circuit as Q0 = ((0, x1), (0, x2), ..., (0, xm), (0, 0), ..., (0, 0)), which is understood as the
state $|x_1 x_2 \dots x_m\rangle |0\rangle^{\otimes (n-m)}$. To compute 〈S, Q, H〉, we use the function $\llbracket U \rrbracket : \mathcal{D} \to \mathcal{D}$ (Figure 5) to
sequentially update the triple 〈S, Q, H〉 for each gate U in the circuit, starting from the initial value
〈∅, Q0, ∅〉.
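The update rules of Figure 5 can be sketched in a few lines. This is an illustrative reading, not the authors' C++ implementation: it assumes no ancillae (m = n), 0-indexed qubits, states stored as `(parity, bitmask)` pairs, and a hypothetical gate-list format such as `("CNOT", control, target)`. The function name `run_circuit` and the gate labels `"Pdg"` and `"Tdg"` are likewise assumptions.

```python
def run_circuit(gates, n):
    """Compute (S, Q, H) for a gate list, following the Figure 5 rules."""
    phases = {}                                # S: state -> coefficient in Z_8
    q = [(0, 1 << i) for i in range(n)]        # Q0: qubit i holds (0, x_{i+1})
    hadamards = []                             # H: list of (Q_I, Q_O) pairs
    next_var = n                               # bit index of the next path variable
    coeff = {"T": 1, "P": 2, "Z": 4, "Y": 4, "Pdg": 6, "Tdg": 7}
    for gate in gates:
        name, i = gate[0], gate[1]
        if name in coeff:
            # Phase gates add a term on the qubit's current state (mod 8).
            phases[q[i]] = (phases.get(q[i], 0) + coeff[name]) % 8
        if name in ("X", "Y"):
            q[i] = (q[i][0] ^ 1, q[i][1])      # bit flip toggles the parity
        elif name == "CNOT":
            j = gate[2]                        # target receives g_j + g_i
            q[j] = (q[j][0] ^ q[i][0], q[j][1] ^ q[i][1])
        elif name == "H":
            before = tuple(q)
            q[i] = (0, 1 << next_var)          # fresh path variable
            next_var += 1
            hadamards.append((before, tuple(q)))
    return phases, tuple(q), hadamards


example = [("T", 0), ("CNOT", 0, 1), ("T", 1), ("H", 0), ("T", 0), ("T", 1)]
S, Q, H = run_circuit(example, 2)
```

In the example, the two T gates applied while qubit 1 holds x1 ⊕ x2 merge into a single term with coefficient 2, which is the cancellation mechanism the algorithm relies on.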
The Tpar algorithm (Algorithm 2) proceeds as follows: after computing 〈S, Q, H〉, we synthesize
a new circuit by iterating through the Hadamard gates in H while updating a partition P of the
functions of S that are computable in the current subcircuit. In particular, we divide S into SP and
S−P, where SP are the already partitioned elements and S−P are those not already partitioned.
For a given hi = {QI, QO}, for every (c, f) ∈ S−P we check whether f ∈ span(QI). If so, we add
(c, f) to the current partition using Algorithm 1 with the independence relation A ⊆ S ∈ I if and
only if rank(QI) − rank(A) ≤ n − |A|. After partitioning, we update SP and S−P accordingly. The
tests for inclusion of each function f in span(hi.QI) require O(|S−P| · (n + k)^3) time, and we can
loosely bound the partitioning time as the time to partition the entire set S using Algorithm 1,
O(|S|^3 · (n + k)^3); since |S−P| ≤ |S| the entire step thus takes O(|S|^3 · (n + k)^3) time. A tighter
analysis is possible, though we omit it as the algorithm is heuristic in nature.
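$$
\begin{aligned}
\llbracket X_i \rrbracket \langle S, Q, H \rangle &= \langle S,\ (g_1, \dots, g_{i-1}, 1 \oplus g_i, \dots, g_n),\ H \rangle \\
\llbracket Z_i \rrbracket \langle S, Q, H \rangle &= \langle S \uplus \{(4, g_i)\},\ Q,\ H \rangle \\
\llbracket Y_i \rrbracket \langle S, Q, H \rangle &= \langle S \uplus \{(4, g_i)\},\ Q',\ H \rangle \quad \text{where } Q' = (g_1, \dots, g_{i-1}, 1 \oplus g_i, \dots, g_n) \\
\llbracket P_i \rrbracket \langle S, Q, H \rangle &= \langle S \uplus \{(2, g_i)\},\ Q,\ H \rangle \\
\llbracket P_i^{\dagger} \rrbracket \langle S, Q, H \rangle &= \langle S \uplus \{(6, g_i)\},\ Q,\ H \rangle \\
\llbracket H_i \rrbracket \langle S, Q, (h_1, h_2, \dots, h_j) \rangle &= \langle S,\ Q',\ H' \rangle \quad \text{where } Q' = (g_1, \dots, g_{i-1}, (0, x_{n+j+1}), \dots, g_n), \\
&\qquad H' = (h_1, h_2, \dots, h_j, \{Q_I = Q, Q_O = Q'\}) \\
\llbracket \mathrm{CNOT}_{(i,j)} \rrbracket \langle S, Q, H \rangle &= \langle S,\ Q',\ H \rangle \quad \text{where } Q' = (g_1, \dots, g_{j-1}, g_j \oplus g_i, \dots, g_n) \\
\llbracket T_i \rrbracket \langle S, Q, H \rangle &= \langle S \uplus \{(1, g_i)\},\ Q,\ H \rangle \\
\llbracket T_i^{\dagger} \rrbracket \langle S, Q, H \rangle &= \langle S \uplus \{(7, g_i)\},\ Q,\ H \rangle
\end{aligned}
$$
Figure 5: Semantic function $\llbracket \cdot \rrbracket$. We define S ⊎ T as the union of S and T where any f such that
(c1, f) ∈ S, (c2, f) ∈ T is given coefficient c1 + c2 mod 8. Ui denotes the gate U applied to qubit i
and CNOT(i,j) specifies i as the control qubit and j as the target.

We next iterate through P and for each block A ∈ P, if f ∉ span(QO) for some (c, f) ∈ A we
remove A from the partition and synthesize a circuit computing the relevant phase factors. While we
defer the discussion of the synthesis procedure for now, we remark that it requires O((n + k)^3)
time.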
Otherwise, if A is no longer an independent set under the new independence relation A ⊆ S ∈ I
if and only if rank(QO) − rank(A) ≤ n − |A|, we remove a linearly dependent element from A
and update SP and S−P so that the deleted element will be re-partitioned on the next iteration.
As rank(A) and a linearly dependent element can both be found with one application of Gaussian
elimination, this step also requires O((n + k)^3) time, and so the entire loop takes O(|P| · (n + k)^3)
time.
Finally, we synthesize a circuit applying the Hadamard gate in O((n + k)^3) time, and repeat
the entire process for the next Hadamard. The entire algorithm, shown in Algorithm 2, thus runs
in time
$$O\left(|C| \cdot n + k \cdot (n + k)^3 \cdot (|S|^3 + |P| + 1)\right).$$
As |C| · n is in most cases negligible compared to the k · (n + k)^3 · |S|^3 factor, and |P| ≤ |S|,
we describe the runtime as simply O(k · |S|^3 · (n + k)^3). Moreover, it should be noted that if no
repartitioning is done, the runtime is bounded by O(|S|^3 · (n + k)^3), as each element is partitioned
exactly once, rather than the worst case of k times in general.
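The simplification of the bound can be spelled out as follows. This is a short worked step under the assumptions that S is non-empty and that the |C| · n term is dominated, as stated in the text.

```latex
\begin{align*}
|C|\,n + k\,(n+k)^3\,\bigl(|S|^3 + |P| + 1\bigr)
  &\le |C|\,n + k\,(n+k)^3\,\bigl(|S|^3 + |S| + 1\bigr)
     && \text{since } |P| \le |S| \\
  &\le |C|\,n + 3\,k\,|S|^3\,(n+k)^3
     && \text{since } |S| \ge 1 \\
  &= O\bigl(k\,|S|^3\,(n+k)^3\bigr)
     && \text{when } |C|\,n = O\bigl(k\,|S|^3\,(n+k)^3\bigr).
\end{align*}
```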
6.1 Synthesizing partitions
We now describe the general synthesis procedure, SYNTHESIZE(A,QI , QO), from Algorithm 2. The
procedure synthesizes a circuit with inputs QI and outputs QO applying the phases given by a
computable partition A of linear Boolean functions.
The algorithm proceeds by first extending A with n − |A| linear Boolean functions to form a
set A′ with rank equal to rank(QI); this is accomplished by using row operations to reduce A
to a subset of QI, then adding the vectors in QI \ A. Next we synthesize a circuit computing A′
by reducing QI and A′ to the same basis using Gaussian elimination in O((n + k)^3) time, where
addition of two rows corresponds to the application of a CNOT gate, and parity changes correspond
to X gates. The circuit reducing QI to this basis is applied forwards, while the circuit reducing
A′ is applied in reverse, giving a circuit mapping |QI〉 ↦ |A′〉. The phase factors are applied by
constructing a combination of T, P and Z gates, corresponding to the relevant coefficients, then
the circuit mapping |QI〉 to |A′〉 is reversed to compute |QI〉. In the case when QO ≠ QI, the
corresponding Hadamard gate is applied to compute the output |QO〉.

Algorithm 2 T-parallelization algorithm
function Tpar(Clifford + T circuit C)
    C′ := ∅
    〈S, Q, H〉 := ⟦C_|C|⟧ · · · ⟦C_1⟧〈∅, Q0, ∅〉
    Set SP := ∅; S−P := S; P := ∅
    for each 1 ≤ i ≤ k do
        I := {A ⊆ S | rank(hi.QI) − rank(A) ≤ n − |A|}
        for each (c, f) ∈ S−P do
            if f ∈ span(hi.QI) then
                P := Partition((c, f), P, (SP, I))
                SP := SP ∪ {(c, f)}; S−P := S−P \ {(c, f)}
            end if
        end for
        for each A ∈ P do
            if i = k or ∃(c, f) ∈ A s.t. f ∉ span(hi.QO) then
                Append(C′, Synthesize(A, hi.QI, hi.QI))
                P := P \ {A}
            else if rank(hi.QO) − rank(A) > n − |A| then
                Choose (c, f) ∈ A such that rank(A) = rank(A \ {(c, f)})
                A := A \ {(c, f)}
                SP := SP \ {(c, f)}; S−P := S−P ∪ {(c, f)}
            end if
        end for
        Append(C′, Synthesize(∅, hi.QI, hi.QO))
    end for
    return C′
end function

function Synthesize(A, QI, QO)
    /* Synthesize a circuit implementing
       U : |QI〉 ↦ ω^(Σ_{(c,(b,f))∈A} c·(b ⊕ f(x1, x2, ..., xn+k))) |QO〉 */
    Compute A′ ⊇ A s.t. rank(A′) = rank(QI), |A′| = n
    Synthesize {CNOT, X} circuit C1 : |QI〉 ↦ |A′〉
    Synthesize {Z, P, T} circuit C2 : |A′〉 ↦ ω^(Σ_{(c,(b,f))∈A′} c·(b ⊕ f(x1, x2, ..., xn+k))) |A′〉
    Synthesize {CNOT, X, H} circuit C3 : |A′〉 ↦ |QO〉
    return C1C2C3
end function
As alluded to before, we now see that the synthesis procedure has time complexity O((n + k)^3)
since it requires a constant number (three) of applications of Gaussian elimination. Moreover, the
T-depth of the resulting circuit is 1.
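The correspondence between row additions and CNOT gates can be illustrated with a small Gauss-Jordan sketch. It is a simplified stand-in for the synthesis step, assuming a square invertible matrix over F2 whose row i (a bitmask) is the linear function held by qubit i; parities, X gates and the extension from A to A′ are omitted, and the names `cnot_reduce`, `apply_cnots` and `synthesize_linear` are hypothetical.

```python
def cnot_reduce(rows):
    """CNOTs (control, target) that reduce an invertible GF(2) matrix to I."""
    rows = list(rows)
    n = len(rows)
    ops = []
    for col in range(n):
        if not (rows[col] >> col) & 1:
            # Bring a 1 onto the diagonal by adding a lower row.
            pivot = next((r for r in range(col + 1, n)
                          if (rows[r] >> col) & 1), None)
            if pivot is None:
                raise ValueError("matrix is not invertible over GF(2)")
            rows[col] ^= rows[pivot]
            ops.append((pivot, col))
        for r in range(n):
            # Clear column `col` in every other row.
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]
                ops.append((col, r))
    return ops


def apply_cnots(rows, ops):
    """Apply CNOTs to a state: the target row absorbs the control row."""
    rows = list(rows)
    for control, target in ops:
        rows[target] ^= rows[control]
    return rows


def synthesize_linear(target_rows):
    """CNOT circuit mapping the identity state to `target_rows`."""
    # Each CNOT is self-inverse, so the reducing circuit is run backwards.
    return list(reversed(cnot_reduce(target_rows)))


target = [0b011, 0b010, 0b110]            # (x1+x2, x2, x2+x3)
circuit = synthesize_linear(target)
result = apply_cnots([0b001, 0b010, 0b100], circuit)
```

Here `result` equals `target`. As the text notes next, circuits obtained this way are generally not optimal in CNOT count or depth.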
As an important practical issue, Gaussian elimination based synthesis produces linear reversible
circuits that are non-optimal in terms of the number of gates or depth, resulting in a potential
increase in the number of CNOT gates after re-synthesizing, as compared to the original design.
While our focus was on the optimization of T gates, there exist algorithms [20, 25] that produce
more efficient circuits for linear reversible functions. Specifically, [25] provides an algorithm to
synthesize linear reversible circuits with Θ(n^2 / log(n)) gates, and [20] reports an O(n)-depth
algorithm. More recently, [17] described an optimization procedure for stabilizer circuits that could be
applied afterward to further optimize linear reversible and Clifford group subcircuits. In practical
implementations an advanced linear reversible synthesis algorithm should be used. Compared to
the optimization of T-depth, we considered the optimization of the linear reversible circuit stages
to be a second order improvement and did not pursue it in this work.
7 Results
We implemented Algorithm 2 in C++³ and used it to optimize various quantum circuits, specifically
arithmetic and reversible ones, from the literature. Individual circuits were written in the standard
fault-tolerant universal gate set {X, Y, Z, H, P, P†, CNOT, T, T†}, using the decompositions found
in [2] where applicable. As most arithmetic circuits are dominated by Toffoli gates, we used the
lowest T-depth implementation of the Toffoli without ancillae known [2].
Results are reported in Tables 1 and 2. They were generated in Debian Linux running on a
quad-core 64-bit Intel Core i7 2.40GHz processor and 8 GB RAM. Table 1 gives gate counts for
the circuits before and after optimization. Table 2 shows T-depths before and after optimization
using either 0, n, or ∞ ancillae⁴, where n denotes the original number of qubits in the circuit. The
T-depth for each circuit before optimization was computed by parallelizing the T gates and Toffoli
gates by hand, and writing each group of parallel Toffoli gates in T-depth 3.
With no extra ancillae added, all the tested benchmarks show significant reductions in terms
of both T-count and T-depth, with average reductions of 39.9% and 54.3% respectively. The
algorithm is particularly effective in cases where adjacent Toffoli gates share either controls or
targets, as many of the phases cancel; each of the GF(2^m) multipliers shares this structure, and as
a result shows large reductions in T-count and T-depth. In fact, the Tpar algorithm will parallelize
any GF(2^m) multiplication circuit to T-depth 2 when given sufficiently many ancillae, by noting
that each such circuit contains two stages of Toffoli gates that result in one {CNOT, T} stage each
after removing trivial identities. By comparison, since the Toffoli gates cannot be all written in
parallel, the minimum T-depth achievable using T-depth 1 Toffoli implementations [28] is 12(m − 1).
Those circuits that mix controls and targets between adjacent Toffoli gates are less affected by the
optimization, e.g., CSUM-MUX9, as the Hadamard gates create barriers to T parallelization.

³ C++ source code available at http://code.google.com/p/tpar/.
⁴ In order to give an example of the trade-off between number of ancillae and the T-depth for some non-constant
number of ancillae, we arbitrarily chose to illustrate results with n ancillae. Our implementation allows any other
computable value.
The runtimes show that the algorithm scales well to large circuits, the largest tested circuit having
192 qubits, 28672 T gates and 8192 Hadamard gates. This stands in contrast to most previous
efforts to optimize quantum circuits, which have generally been limited in usefulness to a few qubits
and a small number of gates. While the inclusion of Hadamard gates adds significant complexity
to the algorithm, it has actually reduced the runtime of the algorithm compared to experiments where
Tpar was applied only to {CNOT, T} subcircuits, which is likely a result of the greater T-count
reductions. As a result, Tpar appears to be an effective heuristic algorithm for the large-scale
optimization of fault-tolerant circuits.
We also tested our algorithm's ability to make use of ancillae to optimize T-depth (Table 2).
For each of the benchmark circuits, we applied our algorithm with an added n ancillae, where the
original circuit contained n qubits. We also report the minimum T-depth achievable for each circuit
using our algorithm. It can be noted that our algorithm usually decreases in running time when
ancillae are added, due to the reduced number of partitions and thus faster matroid partitioning.
Specifically, when there are many ancillae, for the majority of the time when an item s is partitioned
it can be directly added to one of the partitions in time O(|P| · (n + k)^3). The algorithm is thus very
flexible, and the experimental data illustrates a great potential for exploring space-time trade-offs
in quantum circuits.
As an important application, our experiments include instances of the multiple control Toffoli
gates, Λk(X), which are widely used in the construction of reversible circuits. We report the results
using two different implementations: the Barenco et al. implementation using k − 2 ancillae with
arbitrary initial states [3], and the Nielsen-Chuang implementation using k − 2 ancillae initialized
in the state |0〉 [24]. In both constructions the ancillae are returned to their initial state. Our
optimization of the Barenco et al. version reduces the T-count from 7(4k − 8) to 3(4k − 8) + 4 and
T-depth from 3(4k − 8) to 4k − 8 with unbounded ancillae, in the instances we tried. Likewise,
our optimization of the Nielsen-Chuang implementation reduced the T-count from 7(2k − 3) to
4(2k − 3) + 3 and T-depth from 3(2k − 3) to 2k − 3. These formulae in fact hold for every k ≥ 3, a
result of the simple structure of the circuits. As the two versions use 4k − 8 and 2k − 3 sequential
Toffoli gates, respectively, we note that Tpar parallelizes each Toffoli to T-depth 1 when sufficiently
many ancillae are available. Moreover, the reductions in T-count can be observed to correspond
directly to each shared target or control. In this way, Tpar achieves the same T-count and depth
reductions as the multiply-controlled gate construction reported in [28], but applies to more general
cases and does not require implementing controls with the less intuitive Λ2(±iX) gates.
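The closed-form counts quoted above can be tabulated directly and compared with the Λk(X) rows of Tables 1 and 2. The sketch below only evaluates the formulas from the text; the function names and the dictionary layout are assumptions made for this illustration.

```python
def barenco_counts(k):
    """T-count and T-depth for the Barenco et al. version, from the text."""
    toffolis = 4 * k - 8
    return {
        "t_count_before": 7 * toffolis,
        "t_count_after": 3 * toffolis + 4,
        "t_depth_before": 3 * toffolis,
        "t_depth_after": toffolis,       # with unbounded ancillae
    }


def nielsen_chuang_counts(k):
    """T-count and T-depth for the Nielsen-Chuang version, from the text."""
    toffolis = 2 * k - 3
    return {
        "t_count_before": 7 * toffolis,
        "t_count_after": 4 * toffolis + 3,
        "t_depth_before": 3 * toffolis,
        "t_depth_after": toffolis,       # with unbounded ancillae
    }


for k in (3, 4, 5, 10):
    print(k, barenco_counts(k), nielsen_chuang_counts(k))
```

For k = 10, for instance, the formulas give T-counts 224 → 100 and 119 → 71, and T-depths 96 → 32 and 51 → 17, in agreement with the tabulated values.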
Figure 6: T -depth 1 implementation of the Toffoli gate.
As a final remark, we note that our algorithm reproduces many of the previous results regarding
the optimization of T-depth. In particular, Figure 6 shows the circuit produced by running Tpar
on an implementation of the Toffoli gate. The circuit mirrors the T-depth 1 Toffoli reported in [28].
Moreover, the full range of T-depths possible with different numbers of ancillae can be observed,
as seen in Figure 7. We also show a re-synthesized controlled-T gate [2] using one ancilla to reduce
the T-depth from 5 to 3 (Figure 8), and a re-synthesized Barenco et al. implementation of the
Λ3(X) gate using no ancillae (Figure 9).

| Benchmark | N | x_C | x_T | x_g | x′_C | x′_T | x′_g | Time (s) | T-count reduction (%) |
|---|---|---|---|---|---|---|---|---|---|
| Mod 5_4 [21] | 5 | 32 | 28 | 9 | 48 | 16 | 12 | 0.000 | 42.9 |
| VBE-Adder3 [32] | 10 | 80 | 70 | 20 | 114 | 24 | 23 | 0.001 | 65.7 |
| CSLA-MUX3 [31] | 15 | 90 | 70 | 20 | 425 | 62 | 21 | 0.001 | 11.4 |
| CSUM-MUX9 [31] | 30 | 196 | 196 | 84 | 411 | 112 | 70 | 0.005 | 42.9 |
| QCLA-Com7 [11] | 24 | 215 | 203 | 73 | 583 | 95 | 73 | 0.003 | 53.2 |
| QCLA-Mod7 [11] | 26 | 441 | 413 | 132 | 1185 | 249 | 138 | 0.008 | 39.7 |
| QCLA-Adder10 [11] | 36 | 273 | 238 | 86 | 737 | 162 | 73 | 0.018 | 31.9 |
| Adder8 [30] | 24 | 466 | 399 | 126 | 920 | 215 | 153 | 0.007 | 46.1 |
| RC-Adder6 [9] | 14 | 104 | 77 | 30 | 234 | 63 | 29 | 0.001 | 18.2 |
| Mod-Red21 [19] | 11 | 121 | 119 | 58 | 301 | 73 | 51 | 0.001 | 38.7 |
| Mod-Mult55 [19] | 9 | 55 | 49 | 36 | 166 | 37 | 20 | 0.000 | 24.5 |
| Λ3(X) [3] | 5 | 28 | 28 | 8 | 54 | 16 | 12 | 0.000 | 42.9 |
| Λ3(X) [24] | 5 | 21 | 21 | 6 | 41 | 15 | 9 | 0.000 | 28.6 |
| Λ4(X) [3] | 7 | 56 | 56 | 16 | 90 | 28 | 23 | 0.000 | 50.0 |
| Λ4(X) [24] | 7 | 35 | 35 | 10 | 63 | 23 | 16 | 0.000 | 34.3 |
| Λ5(X) [3] | 9 | 84 | 84 | 24 | 132 | 40 | 34 | 0.001 | 52.4 |
| Λ5(X) [24] | 9 | 49 | 49 | 14 | 94 | 31 | 23 | 0.000 | 36.7 |
| Λ10(X) [3] | 19 | 224 | 224 | 64 | 328 | 100 | 89 | 0.004 | 55.4 |
| Λ10(X) [24] | 19 | 119 | 119 | 34 | 232 | 71 | 58 | 0.002 | 40.3 |
| GF(2^4)-Mult [23] | 12 | 115 | 112 | 32 | 324 | 68 | 27 | 0.001 | 39.3 |
| GF(2^5)-Mult [23] | 15 | 179 | 175 | 50 | 535 | 111 | 36 | 0.004 | 36.6 |
| GF(2^6)-Mult [23] | 18 | 257 | 252 | 72 | 649 | 150 | 43 | 0.008 | 40.5 |
| GF(2^7)-Mult [23] | 21 | 349 | 343 | 98 | 992 | 217 | 36 | 0.031 | 36.7 |
| GF(2^8)-Mult [23] | 24 | 468 | 448 | 128 | 1256 | 264 | 40 | 0.052 | 41.1 |
| GF(2^9)-Mult [23] | 27 | 575 | 567 | 162 | 1701 | 351 | 44 | 0.110 | 38.1 |
| GF(2^10)-Mult [23] | 30 | 709 | 700 | 200 | 2176 | 410 | 69 | 0.227 | 41.4 |
| GF(2^16)-Mult [23] | 48 | 1856 | 1792 | 512 | 6592 | 1040 | 82 | 5.079 | 42.0 |
| GF(2^32)-Mult [23] | 96 | 7291 | 7168 | 2048 | 33269 | 4128 | 166 | 602.577 | 42.4 |
| GF(2^64)-Mult [23] | 192 | 28860 | 28672 | 8192 | 180892 | 16448 | 334 | 95447.466 | 42.6 |
| Average | | | | | | | | | 39.9 |
| Maximum | | | | | | | | | 65.7 |

Table 1: T-count benchmarks. We report the gate counts after optimizing circuits with Tpar, using
no extra ancillae. N specifies the number of qubits. x_C reports the number of CNOT gates, x_T
reports the number of T gates and x_g reports the number of other gates. x′ denotes the number
of gates after optimization.
8 Conclusion
We have described an algorithm for re-synthesizing Clifford + T circuits with reduced T-count and
depth. The algorithm uses a circuit representation based on linear Boolean functions, allowing T
gates to be combined and then parallelized through the use of matroid partitioning algorithms. The
algorithm has worst case runtime that is cubic in the number of T gates, qubits, and Hadamard
gates, though our experiments show that the algorithm is sufficiently fast for practical circuit sizes.
Our benchmarks (Tables 1 and 2) show that large gains can be obtained in reducing the T-count
and T-depth of quantum circuits. In some cases, T-count was reduced by as much as 65.7%, while
the T-depth could be reduced by up to 87.6% without ancillae. Furthermore, the benchmarks
illustrate that ancillae can be used to parallelize T gates further, and given the runtimes reported
the algorithm can be seen to provide substantial flexibility in exploring the trade-off between ancilla
usage and T-depth. In the most extreme case we were able to reduce the T-depth of GF(2^m)-Mult
circuits from 12(m − 1) to a constant of 2, using unbounded ancillae. While the benchmarks were
all arithmetic or otherwise reversible operations, such operations typically require the majority of
the resources in circuits for quantum algorithms of interest [16].

| Benchmark | T-depth original | T-depth 0 ancilla | Red. (%) | T-depth N ancilla | Time (s) | Red. (%) | T-depth ∞ ancilla | Time (s) | Red. (%) |
|---|---|---|---|---|---|---|---|---|---|
| Mod 5_4 [21] | 12 | 6 | 50.0 | 3 | 0.000 | 75.0 | 3 | 0.000 | 75.0 |
| VBE-Adder3 [32] | 24 | 9 | 62.5 | 5 | 0.000 | 79.2 | 5 | 0.000 | 79.2 |
| CSLA-MUX3 [31] | 21 | 8 | 61.9 | 4 | 0.004 | 81.0 | 4 | 0.001 | 81.0 |
| CSUM-MUX9 [31] | 18 | 9 | 50.0 | 4 | 0.003 | 77.8 | 3 | 0.005 | 83.3 |
| QCLA-Com7 [11] | 27 | 12 | 55.6 | 7 | 0.003 | 74.1 | 7 | 0.004 | 74.1 |
| QCLA-Mod7 [11] | 57 | 29 | 49.1 | 14 | 0.008 | 75.4 | 14 | 0.010 | 75.4 |
| QCLA-Adder10 [11] | 24 | 11 | 54.2 | 6 | 0.005 | 75.0 | 6 | 0.006 | 75.0 |
| Adder8 [30] | 69 | 30 | 56.5 | 15 | 0.007 | 78.3 | 15 | 0.008 | 78.3 |
| RC-Adder6 [9] | 33 | 22 | 33.3 | 11 | 0.002 | 66.7 | 11 | 0.001 | 66.7 |
| Mod-Red21 [19] | 48 | 25 | 47.9 | 15 | 0.002 | 68.8 | 15 | 0.011 | 68.8 |
| Mod-Mult55 [19] | 15 | 7 | 53.3 | 4 | 0.000 | 73.3 | 4 | 0.005 | 73.3 |
| Λ3(X) [3] | 12 | 8 | 33.3 | 4 | 0.001 | 66.7 | 4 | 0.000 | 66.7 |
| Λ3(X) [24] | 9 | 6 | 33.3 | 3 | 0.000 | 66.7 | 3 | 0.001 | 66.7 |
| Λ4(X) [3] | 24 | 13 | 45.8 | 8 | 0.001 | 66.7 | 8 | 0.002 | 66.7 |
| Λ4(X) [24] | 15 | 9 | 40.0 | 5 | 0.001 | 66.7 | 5 | 0.000 | 66.7 |
| Λ5(X) [3] | 36 | 18 | 50.0 | 12 | 0.001 | 66.7 | 12 | 0.004 | 66.7 |
| Λ5(X) [24] | 21 | 12 | 42.9 | 7 | 0.001 | 66.7 | 7 | 0.001 | 66.7 |
| Λ10(X) [3] | 96 | 43 | 55.2 | 32 | 0.005 | 66.7 | 32 | 0.032 | 66.7 |
| Λ10(X) [24] | 51 | 27 | 47.1 | 17 | 0.003 | 66.7 | 17 | 0.008 | 66.7 |
| GF(2^4)-Mult [23] | 36 | 6 | 83.3 | 4 | 0.001 | 88.9 | 2 | 0.001 | 94.4 |
| GF(2^5)-Mult [23] | 48 | 9 | 81.3 | 5 | 0.002 | 89.6 | 2 | 0.002 | 95.8 |
| GF(2^6)-Mult [23] | 60 | 9 | 85.0 | 5 | 0.005 | 91.7 | 2 | 0.003 | 96.7 |
| GF(2^7)-Mult [23] | 72 | 12 | 83.3 | 7 | 0.026 | 90.3 | 2 | 0.004 | 97.2 |
| GF(2^8)-Mult [23] | 84 | 13 | 84.5 | 7 | 0.035 | 91.7 | 2 | 0.006 | 97.6 |
| GF(2^9)-Mult [23] | 96 | 15 | 84.4 | 7 | 0.058 | 92.7 | 2 | 0.010 | 97.9 |
| GF(2^10)-Mult [23] | 108 | 16 | 85.2 | 7 | 0.157 | 93.5 | 2 | 0.012 | 98.1 |
| GF(2^16)-Mult [23] | 180 | 24 | 86.7 | 12 | 3.128 | 93.3 | 2 | 0.061 | 98.9 |
| GF(2^32)-Mult [23] | 372 | 47 | 87.4 | 23 | 644.189 | 93.8 | 2 | 1.246 | 99.5 |
| GF(2^64)-Mult [23] | 756 | 94 | 87.6 | 44 | 127287.329 | 94.2 | 2 | 78.641 | 99.7 |
| Average | | | 61.1 | | | 78.5 | | | 80.7 |
| Maximum | | | 87.6 | | | 94.2 | | | 99.7 |

Table 2: T-depth benchmarks. We report the T-depth after no optimization (original) and after
optimization with 0 (cf. Table 1), n, or unbounded ancillae.
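The GF(2^m) figures can be reproduced from the stated formula. The short sketch below assumes only what the text gives, namely an original T-depth of 12(m − 1) and an optimized T-depth of 2 with unbounded ancillae; the function names are hypothetical.

```python
def gf2m_original_t_depth(m):
    """Original T-depth of the GF(2^m) multiplier, 12(m - 1) per the text."""
    return 12 * (m - 1)


def reduction_percent(before, after):
    """Percentage reduction, rounded to one decimal as in Table 2."""
    return round(100.0 * (before - after) / before, 1)


for m in (4, 5, 6, 7, 8, 9, 10, 16, 32, 64):
    before = gf2m_original_t_depth(m)
    print(m, before, reduction_percent(before, 2))
```

For m = 64 this gives an original T-depth of 756 and a reduction of 99.7%, matching the last row of Table 2.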
We close by noting that as a consequence of the Tpar algorithm, reducing the number of terms
in the mixed arithmetic polynomials describing the phase corresponds directly to reducing the
T-complexity of quantum circuits; in fact, it was observed that minimization of T-count in {CNOT, T}
circuits is equivalent to minimizing the number of odd coefficients in the phase polynomial. A
natural avenue of future work is then to develop methods for optimizing such polynomials for
T-count and depth. This work also represents the first instance, to the authors' knowledge, of the
use of sum over paths style representations in quantum circuit synthesis and optimization. While
this representation has proven effective in optimizing circuit T-count and T-depth, the questions of
synthesis for more general phases, e.g. the sum over paths representation of {H, CNOT, T}, and
of the precise form of phases synthesizable over the "Clifford + T" gate set remain. Moreover, we
leave it as a topic of future research to find new applications to optimization over different gate
sets and cost metrics. Efficient practical synthesis of linear reversible circuits is another important
direction that would directly contribute to improving the results of this work.

Figure 7: T-depth 2 implementation of the Toffoli gate.
Figure 8: T-depth 3 implementation of the controlled-T gate (CNOT stages optimized by templates
[22]).
A Parallelizing (Λ2(X)⊗ I)(I ⊗ Λ2(X))
In this section we illustrate the workings of the Tpar algorithm using the following circuit:
We first expand this circuit by using the T -depth 3 Toffoli gate implementation [2]:
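Next we compute 〈S, Q, H〉 by applying the function $\llbracket \cdot \rrbracket$ to each gate in sequence, starting with
〈∅, (x1, x2, x3, x4), ∅〉. The result is
$$
\begin{aligned}
S &= \{\, x_1,\ 2 \cdot x_2,\ x_5,\ 7 \cdot (x_1 \oplus x_2),\ 7 \cdot (x_1 \oplus x_5),\ 7 \cdot (x_2 \oplus x_5),\ (x_1 \oplus x_2 \oplus x_5), \\
  &\qquad x_6,\ x_7,\ 7 \cdot (x_2 \oplus x_6),\ 7 \cdot (x_2 \oplus x_7),\ 7 \cdot (x_6 \oplus x_7),\ (x_2 \oplus x_6 \oplus x_7) \,\}, \\
Q &= (x_1, x_2, x_6, x_8), \\
H &= (h_1, h_2, h_3, h_4) \text{ where} \\
h_1 &= \{Q_I = (x_1, x_2, x_3, x_4),\ Q_O = (x_1, x_2, x_5, x_4)\}, \\
h_2 &= \{Q_I = (x_1, x_2, x_5, x_4),\ Q_O = (x_1, x_2, x_6, x_4)\}, \\
h_3 &= \{Q_I = (x_1, x_2, x_6, x_4),\ Q_O = (x_1, x_2, x_6, x_7)\}, \\
h_4 &= \{Q_I = (x_1, x_2, x_6, x_7),\ Q_O = (x_1, x_2, x_6, x_8)\}.
\end{aligned}
$$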
Starting with h1, we see that the terms x1, 2 · x2, 7 · (x1 ⊕ x2) are computable, so we partition
them into blocks satisfying 4 − rank(A) ≤ 4 − |A|, giving P = {{x1, 2 · x2}, {7 · (x1 ⊕ x2)}} . As
neither partition will become uncomputable after h1, we simply apply the first Hadamard gate:
Figure 9: An optimized implementation of the Λ3(X) gate (CNOT stages optimized by templates
[22]).
At the second Hadamard gate, the path variable x5 is available, so x5, 7 · (x1 ⊕ x5), 7 · (x2 ⊕
x5), x1 ⊕ x2 ⊕ x5 are now computable. We add them to the partition to get
P = {{x1, 2 · x2, 7 · (x2 ⊕ x5)}, {7 · (x1 ⊕ x2)}, {x1 ⊕ x2 ⊕ x5, 7 · (x1 ⊕ x5), x5}} .
We trivially see that x2 ⊕ x5 /∈ span({x1, x2, x6, x4}) and likewise neither is x5, so we synthesize a
circuit computing the partitions {x1, 2 · x2, 7 · (x2 ⊕ x5)} and {x1 ⊕ x2 ⊕ x5, 7 · (x1 ⊕ x5), x5} and
allow {7 · (x1 ⊕ x2)} to move past the second Hadamard gate.
Again, x6 is now available, so we insert the newly computable terms x6, 7 · (x2 ⊕ x6) into the
current partition P = {{7 · (x1 ⊕ x2)}} to get P = {{7 · (x1 ⊕ x2), x6, 7 · (x2 ⊕ x6)}}. As the single
partition block will still be computable after applying h3, we apply the next Hadamard gate:
We now add the last of the terms to the partition, x7, 7 · (x6 ⊕ x7), x2 ⊕ x6 ⊕ x7, giving
{{7 · (x1 ⊕ x2), x6, 7 · (x2 ⊕ x6), x7}, {7 · (x2 ⊕ x7), 7 · (x6 ⊕ x7), x2 ⊕ x6 ⊕ x7}} .
As both partitions contain the value x7 which will be destroyed by h4, we apply the remaining
partitions, followed by the last Hadamard gate:
The final circuit, shown above, reduces the original circuit by 2 T gates (from 14 to 12) and 2
levels of T-depth (from 6 to 4).
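The counts in this example can be rechecked from the four synthesized blocks. The sketch below is illustrative only: it encodes x1, ..., x8 as bit positions (an assumption of this example), counts a T gate for every odd coefficient, and tests each block against the condition 4 − rank(A) ≤ 4 − |A| used above; `gf2_rank` and `var` are hypothetical helpers.

```python
def gf2_rank(vectors):
    """Rank over GF(2) of a list of bitmask-encoded vectors."""
    basis = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def var(*indices):
    """Bitmask for the XOR of the variables x_i with the given indices."""
    mask = 0
    for i in indices:
        mask ^= 1 << (i - 1)
    return mask


# Blocks of (coefficient, function) as synthesized in the walkthrough.
blocks = [
    [(1, var(1)), (2, var(2)), (7, var(2, 5))],
    [(1, var(1, 2, 5)), (7, var(1, 5)), (1, var(5))],
    [(7, var(1, 2)), (1, var(6)), (7, var(2, 6)), (1, var(7))],
    [(7, var(2, 7)), (7, var(6, 7)), (1, var(2, 6, 7))],
]
n = 4

# Odd coefficients need a T or T-dagger gate; even ones are Clifford phases.
t_count = sum(1 for block in blocks for c, _ in block if c % 2 == 1)
# Each block containing an odd coefficient is applied in one T layer.
t_depth = sum(1 for block in blocks if any(c % 2 for c, _ in block))
all_independent = all(
    n - gf2_rank([f for _, f in block]) <= n - len(block) for block in blocks
)
```

This yields `t_count == 12`, `t_depth == 4`, and `all_independent == True`, consistent with the totals stated above.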
Note that the partitions used in the above are minimal, but are not the partitions Algorithm 1
actually produces. Instead, these partitions have been created to best demonstrate the algorithm.
Additionally, for simplicity, we described states without parity, as there are no bit flip gates in this
example.
Acknowledgments
We would like to thank Martin Rötteler and Bill Cunningham for many helpful discussions.
Supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via
Department of Interior National Business Center Contract number D11PC20166. The U.S. Government
is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those
of the authors and should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of IARPA, DoI/NBC or the U.S. Government.
This material is based upon work partially supported by the National Science Foundation
(NSF), during D. Maslov’s assignment at the Foundation. Any opinion, findings, and conclusions
or recommendations expressed in this material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation.
Michele Mosca is also supported by Canada’s NSERC, MPrime, CIFAR, and CFI. IQC and
Perimeter Institute are supported in part by the Government of Canada and the Province of
Ontario.
References
[1] P. Aliferis, D. Gottesman, and J. Preskill, “Quantum accuracy threshold for concatenated
distance-3 codes,” Quantum Info. Comput., vol. 6, pp. 97–165, 2006, quant-ph/0504218.
[2] M. Amy, D. Maslov, M. Mosca, and M. Rötteler, “A meet-in-the-middle algorithm for fast
synthesis of depth-optimal quantum circuits,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 32, no. 6, pp. 818–830, 2013, arXiv:1206.0758.
[3] A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo, N. Margolus, P. Shor, T. Sleator,
J. A. Smolin, and H. Weinfurter, “Elementary gates for quantum computation,” Phys. Rev.
A, vol. 52, pp. 3457–3467, 1995, quant-ph/9503016.
[4] T. Beth and M. Rötteler, “Quantum algorithms: Applicable algebra and quantum physics,”
in Quantum Information, vol. 173 of Springer Tracts in Modern Physics, pp. 96–150, Springer
Berlin Heidelberg, 2001.
[5] H. Bombin, R. S. Andrist, M. Ohzeki, H. G. Katzgraber, and M. A. Martin-Delgado, “Strong
resilience of topological codes to depolarization,” Phys. Rev. X, vol. 2, p. 021004, 2012,
arXiv:1202.1852.
[6] J. W. Britton, B. C. Sawyer, A. C. Keith, C.-C. J. Wang, J. K. Freericks, H. Uys, M. J.
Biercuk, and J. J. Bollinger, “Engineered two-dimensional ising interactions in a trapped-ion
quantum simulator with hundreds of spins,” Nature, no. 7395, pp. 489–492, 2012.
[7] K. R. Brown, A. C. Wilson, Y. Colombe, C. Ospelkaus, A. M. Meier, E. Knill, D. Leibfried,
and D. J. Wineland, “Single-qubit-gate error below 10^{-4} in a trapped ion,” Phys. Rev. A,
vol. 84, p. 030303, 2011, arXiv:1104.2552.
[8] J. M. Chow, J. M. Gambetta, A. D. Córcoles, S. T. Merkel, J. A. Smolin, C. Rigetti, S. Poletto,
G. A. Keefe, M. B. Rothwell, J. R. Rozen, M. B. Ketchen, and M. Steffen, “Universal quantum
gate set approaching fault-tolerant thresholds with superconducting qubits,” Phys. Rev. Lett.,
vol. 109, p. 060501, 2012, arXiv:1202.5344.
[9] S. A. Cuccaro, T. G. Draper, S. A. Kutin, and D. Petrie Moulton, “A new quantum ripple-carry
addition circuit,” ArXiv e-prints, 2004, quant-ph/0410184.
[10] C. M. Dawson, A. P. Hines, D. Mortimer, H. L. Haselgrove, M. A. Nielsen, and T. J. Osborne,
“Quantum computing and polynomial equations over the finite field Z_2,” Quantum
Info. Comput., vol. 5, pp. 102–112, 2005, quant-ph/0408129.
[11] T. G. Draper, S. A. Kutin, E. M. Rains, and K. M. Svore, “A logarithmic-depth quantum carry-
lookahead adder,” Quantum Info. Comput., vol. 6, pp. 351–369, 2006, quant-ph/0406142.
[12] J. Edmonds, “Minimum partition of a matroid into independent subsets,” Journal of Research
of the National Bureau of Standards, vol. 69B, pp. 67–72, Jan. 1965.
[13] A. G. Fowler, “Time-optimal quantum computation,” ArXiv e-prints, 2012, arXiv:1210.4626.
[14] A. G. Fowler, A. M. Stephens, and P. Groszkowski, “High-threshold universal quantum
computation on the surface code,” Phys. Rev. A, vol. 80, p. 052312, 2009, arXiv:0803.0272.
[15] A. G. Fowler, A. C. Whiteside, and L. C. L. Hollenberg, “Towards practical classical processing
for the surface code,” Phys. Rev. Lett., vol. 108, p. 180501, 2012, arXiv:1110.5133.
[16] IARPA Quantum Computer Science Program, 2011-2013,
http://www.iarpa.gov/Programs/sso/QCS/qcs.html.
[17] V. Kliuchnikov and D. Maslov, “Optimization of Clifford circuits,” Phys. Rev. A, vol. 88,
p. 052307, arXiv:1305.0810.
[18] S. Lloyd, “Universal quantum simulators,” Science, vol. 273, no. 5278, pp. 1073–1078, 1996.
[19] I. L. Markov and M. Saeedi, “Constant-optimized quantum circuits for modular multiplication
and exponentiation,” Quantum Info. Comput., vol. 12, pp. 361–394, 2012, arXiv:1202.6614.
[20] D. Maslov, “Linear depth stabilizer and quantum Fourier transformation circuits with no
auxiliary qubits in finite-neighbor quantum architectures,” Phys. Rev. A, vol. 76, p. 052310,
2007, quant-ph/0703211.
[21] D. Maslov, “Reversible logic synthesis benchmarks page,”
http://webhome.cs.uvic.ca/~dmaslov/, last accessed October 2013.
[22] D. Maslov, G. W. Dueck, D. M. Miller, and C. Negrevergne, “Quantum Circuit Simplification
and Level Compaction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 27, no. 3, pp. 436–444, 2008, quant-ph/0604001.
[23] D. Maslov, J. Mathew, D. Cheung, and D. K. Pradhan, “An O(m^2)-depth quantum algorithm
for the elliptic curve discrete logarithm problem over GF(2^m),” Quantum Info. Comput., vol. 9,
pp. 610–621, 2009, arXiv:0710.1093.
[24] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information. Cambridge
University Press, 2000.
[25] K. N. Patel, I. L. Markov, and J. P. Hayes, “Optimal synthesis of linear reversible circuits,”
Quantum Info. Comput., vol. 8, pp. 282–294, 2008, quant-ph/0302002.
[26] C. Rigetti, J. M. Gambetta, S. Poletto, B. L. T. Plourde, J. M. Chow, A. D. Córcoles, J. A.
Smolin, S. T. Merkel, J. R. Rozen, G. A. Keefe, M. B. Rothwell, M. B. Ketchen, and M. Steffen,
“Superconducting qubit in a waveguide cavity with a coherence time approaching 0.1 ms,”
Phys. Rev. B, vol. 86, p. 100506, 2012, arXiv:1202.5533.
[27] T. Rudolph, “Simple encoding of a quantum circuit amplitude as a matrix permanent,” Phys.
Rev. A, vol. 80, p. 054302, 2009, arXiv:0909.3005.
[28] P. Selinger, “Quantum circuits of T-depth one,” Phys. Rev. A, vol. 87, p. 042302, 2013,
arXiv:1210.0974.
[29] P. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,”
Foundations of Computer Science, pp. 124–134, 1994, quant-ph/9508027v2.
[30] Y. Takahashi, S. Tani, and N. Kunihiro, “Quantum addition circuits and unbounded fan-out,”
Quantum Info. Comput., vol. 10, pp. 872–890, 2010, arXiv:0910.2530.
[31] R. Van Meter and K. M. Itoh, “Fast quantum modular exponentiation,” Phys. Rev. A, vol. 71,
p. 052320, 2005, quant-ph/0408006.
[32] V. Vedral, A. Barenco, and A. Ekert, “Quantum networks for elementary arithmetic
operations,” Phys. Rev. A, vol. 54, pp. 147–153, 1996, quant-ph/9511018.
[33] X. Zhou, D. W. Leung, and I. L. Chuang, “Methodology for quantum logic gate construction,”
Phys. Rev. A, vol. 62, p. 052316, 2000, quant-ph/0002039.
