Due to the decoherence of state-of-the-art physical implementations of quantum computers, it is essential to parallelize quantum circuits to reduce their depth. Two decades ago, Moore and Nilsson [1] demonstrated that additional qubits (or ancillae) could be used to design "shallow" parallel circuits for quantum operators. They proved that any n-qubit CNOT circuit could be parallelized to $O(\log n)$ depth with $O(n^2)$ ancillae. However, near-term quantum technologies can only support a limited number of qubits, making the space-depth trade-off a fundamental research subject for quantum-circuit synthesis.
In this work, we establish an asymptotically optimal space-depth trade-off for CNOT circuits: for any $m \ge 0$, any n-qubit CNOT circuit can be parallelized to $O\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ depth, with m ancillae. We show that this bound is tight by a counting argument, and further show that even when arbitrary two-qubit quantum gates are allowed to approximate CNOT circuits, the depth lower bound still meets our construction, illustrating the robustness of our result. Our work improves upon two previous results, one by Moore and Nilsson [1] for $O(\log n)$-depth quantum synthesis, and one by Patel, Markov, and Hayes [2] for m = 0. For the former, we reduce the need for ancillae by a factor of $\log^2 n$ by showing that $m = O(n^2/\log^2 n)$ additional qubits, which is asymptotically optimal, suffice to build $O(\log n)$-depth, $O(n^2/\log n)$-size CNOT circuits. For the latter, we reduce the depth by a factor of n to the asymptotically optimal bound $O(n/\log n)$. Our results can be directly extended to stabilizer circuits using an earlier result by Aaronson and Gottesman [3]. In addition, we provide relevant hardness evidence for synthesis optimization of CNOT circuits in terms of both size and depth.
Introduction
One of the most important tasks in quantum computing is quantum-circuit synthesis. Given any n-qubit unitary operator, synthesis algorithms aim to implement it as a sequence of low-level gates and to optimize the circuit size and depth [4, 5]. During the last decade, quantum synthesis algorithms have been developed to achieve asymptotically optimal size [6, 7, 8, 9]. To reduce the circuit depth, synthesis algorithms commonly use ancillae. For example, with sufficient ancillae, the Quantum Fourier Transform can be approximated by an $O(\log n + \log\log(1/\epsilon))$-depth circuit [10], and stabilizer circuits can be parallelized to $O(\log n)$ depth [1]. However, near-term quantum devices only have a small number of qubits [11], which may seriously limit the number of ancillae. This practical concern gives rise to the following fundamental space-depth trade-off problem in quantum circuit synthesis:
Can we characterize the relationship between the number of ancillae and the possible optimal depth?
Because the controlled-NOT gate (CNOT) and single-qubit operations form a universal set for quantum computing [4, 5], CNOT-circuit optimization has been widely studied. For circuit size, Patel, Markov, and Hayes [2] proved that each n-qubit CNOT circuit can be synthesized with $O(n^2/\log n)$ CNOT gates and that this bound is asymptotically tight. When topological constraints are taken into consideration, i.e., when there is limited two-qubit connectivity among the addressable qubits, synthesis algorithms have been designed by Kissinger and de Griend [12] and by Nash, Gheorghiu, and Mosca [13] to build circuits of size $O(n^2)$. For circuit depth, Moore and Nilsson [1] proved that given $O(n^2)$ ancillae, any n-qubit CNOT circuit can be parallelized into $O(\log n)$ depth. In addition, Aaronson and Gottesman [3] established a strong connection between CNOT circuits and stabilizer circuits. They proved that stabilizer circuits have a canonical form of 11 blocks, where each block consists of only one type of gate from the gate set {CNOT, Phase, Hadamard}. Since each block of Phase or Hadamard gates has depth 1 and size at most n, optimization of CNOT circuits can be generalized to stabilizer circuits.
Our main contribution:
In this paper, we establish an asymptotically optimal space-depth trade-off in CNOT-circuit synthesis, as stated in the following theorem.
Theorem 1. For any integer $m \ge 0$, any n-qubit CNOT circuit can be parallelized to $O\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ depth with m ancillae. Moreover, there is an $O(n^\omega)$-time synthesis algorithm for achieving this (here $\omega$ is the matrix multiplication exponent [14]).
Theorem 1 can be readily extended to stabilizer circuits thanks to the result of Aaronson and Gottesman [3], which states that for any stabilizer circuit there exists an equivalent circuit that applies a block of Hadamard gates only (H), then a block of CNOT gates only (C), then a block of Phase gates only (P), and so on in the 11-block sequence H-C-P-C-P-C-H-P-C-P-C. Since Hadamard and Phase gates are single-qubit gates and can be merged, each such block has depth at most one and size at most n. Therefore, it suffices to optimize the CNOT blocks. Besides, Theorem 1 can be extended to CNOT+$R_z(\theta)$ circuits, as all of the $R_z(\theta)$ gates can be moved to the end of the circuit [12]. We summarize these consequences in the following corollary.
Corollary 1. For any integer $m \ge 0$, any n-qubit stabilizer circuit can be parallelized to $O\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ depth with m ancillae. The same statement also holds for CNOT+$R_z(\theta)$ circuits.
Our result in Theorem 1 improves upon two previous results concerning parallel CNOT-circuit synthesis, with sufficient ancillae and without any ancillae, respectively. By parallelizing any CNOT circuit into an $O(\log n)$-depth, $O(n^2/\log n)$-size equivalent circuit with $m = O(n^2/\log^2 n)$ ancillae, we reduce the number of ancillae needed by Moore and Nilsson [1] by a factor of $\log^2 n$. By achieving the asymptotically optimal depth bound of $O(n/\log n)$ in parallel CNOT-circuit synthesis without any ancillae, i.e., m = 0, we reduce the depth implied by the work of Patel, Markov, and Hayes [2] by a factor of n.
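To make the two regimes concrete, the following minimal sketch evaluates the expression inside the O-notation of Theorem 1 for given n and m. The function name `depth_bound`, the use of base-2 logarithms, and the omission of hidden constants are assumptions made for illustration only.

```python
import math


def depth_bound(n: int, m: int) -> float:
    """Evaluate max{log n, n^2 / ((n + m) log(n + m))} with base-2 logs.

    Constants hidden by the O-notation are ignored, so only the growth
    behaviour (not the absolute value) is meaningful.
    """
    log_n = math.log2(n)
    tradeoff = n * n / ((n + m) * math.log2(n + m))
    return max(log_n, tradeoff)


n = 1 << 10
ancilla_free = depth_bound(n, 0)       # reduces to n / log n
many_ancillae = depth_bound(n, n * n)  # the log n term dominates
```

With m = 0 the second term equals n / log n, and once m is large enough the bound saturates at log n, matching the two special cases discussed above.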
These improvements are significant theoretically, because we also prove, by a counting argument, that our space-depth trade-off is asymptotically tight. The tightness of Theorem 1 is proved in a more general setting by the following theorem. That is, even if arbitrary two-qubit quantum gates are allowed, rather than only CNOT gates, to approximately implement the given CNOT circuit, the construction still meets the lower bound. Roughly speaking, an $\epsilon$-approximate circuit outputs a quantum state close to the CNOT circuit's output under the $\ell_2$ norm.
Theorem 2. For a $1 - o(1)$ fraction of n-qubit CNOT circuits, any $\epsilon$-approximate n-qubit m-ancilla quantum circuit has depth $\Omega\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$, where $\epsilon < \frac{\sqrt{2}}{2}$ is a constant.
Besides the depth, for $m = O(n^2/\log^2 n)$, our construction has size $O(n^2/\log n)$. It is easy to generalize the technique in [15, 16] to show that such an n-qubit m-ancilla circuit must have size $\Omega(n^2/\log n)$. Thus our construction also achieves the asymptotically optimal size. Mathematically, our synthesis method for Theorem 1 is based on carefully designed Gaussian eliminations. As observed by Patel et al. [2], any n-qubit CNOT circuit can be represented by an invertible matrix $M \in \mathbb{F}_2^{n\times n}$, and the synthesis of a CNOT circuit is equivalent to transforming M to the identity by Gaussian eliminations. As our aim in this paper is to reduce the circuit depth, we use parallel Gaussian eliminations instead. We minimize the number of parallel Gaussian eliminations by the following two techniques:
• For the case without any ancillae, we first establish that if the structure of the matrix is near random, then it is amenable to effective parallel Gaussian elimination. We then use a popular idea from oblivious routing [17] to ensure the existence of a close-to-random structure. This randomization step is then derandomized by a standard approach with somewhat tricky conditional expectations.
• For the case with non-zero ancillae, we adopt the idea from the Method of Four Russians. Recall that in [2], Patel et al. use Four Russians to eliminate $\log n$ columns, namely $n\log n$ elements, by $O(n)$ Gaussian eliminations. In this work, we deal with $\log^2 n$ columns, namely $n\log^2 n$ elements, by $O(\log n)$ parallel Gaussian eliminations. This is done by preparing the additive basis of Boolean vectors [18] and properly balancing the trade-off between resource and cost.
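The correspondence between CNOT synthesis and Gaussian elimination can be sketched as follows. This is plain sequential Gauss-Jordan elimination over $\mathbb{F}_2$, which uses $O(n^2)$ gates; it is not the parallel algorithm of this paper, nor the Four Russians method of [2]. The function names `synthesize_cnot` and `apply_circuit`, the 0-indexed qubits, and the list-of-lists matrix encoding are assumptions for illustration.

```python
def synthesize_cnot(M):
    """Return CNOT gates (control, target), in circuit order, implementing M.

    M is an invertible 0/1 matrix given as a list of rows. Each row
    operation "add row i to row j" is recorded as a CNOT with control i
    and target j. Raises StopIteration if M is singular.
    """
    n = len(M)
    A = [row[:] for row in M]
    ops = []

    def add_row(i, j):
        A[j] = [a ^ b for a, b in zip(A[j], A[i])]
        ops.append((i, j))

    for col in range(n):
        if A[col][col] == 0:
            # Bring a 1 onto the diagonal using a lower row.
            pivot = next(r for r in range(col + 1, n) if A[r][col] == 1)
            add_row(pivot, col)
        for r in range(n):
            if r != col and A[r][col] == 1:
                add_row(col, r)
    # The recorded steps reduce M to I; reversing them builds M from I.
    return ops[::-1]


def apply_circuit(gates, n):
    """Matrix over F_2 of the circuit that applies `gates` in order."""
    A = [[int(r == c) for c in range(n)] for r in range(n)]
    for control, target in gates:
        A[target] = [a ^ b for a, b in zip(A[target], A[control])]
    return A


M = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
gates = synthesize_cnot(M)
```

Because every row operation is its own inverse over $\mathbb{F}_2$, the elimination steps read in reverse order give a circuit for M.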
Both our ancilla-based and ancilla-free synthesis algorithms for CNOT circuits rely on the 1-factorization of almost regular bipartite graphs [19, 20], a direct application of Hall's marriage theorem, to get an ideal ordering. We show that both our algorithms run in time $O(n^\omega)$, where $\omega$ is the universal constant for matrix multiplication.
Our results have some direct implications for matrix decomposition over finite fields. More precisely, we show that an arbitrary $M \in GL(n, 2)$ can be decomposed into $O(n/\log n)$ parallel row-elimination matrices. Our technique can be easily generalized to the finite field $\mathbb{F}_q$ for constant q.
Theorem 3. For any $M \in GL(n, q)$, where q is a constant, M can be transformed to the identity by $O(n/\log n)$ parallel Gaussian eliminations.
Our construction indicates that there might be a parallel Gaussian elimination algorithm which solves linear equations over $\mathbb{F}_q$ by $O(n/\log n)$ parallel row-elimination matrices in $O(n/\log n)$ parallel time. Some related work can be found in [21, 22, 23, 24]. We leave this as an open problem for future research.
A related fundamental problem is to construct an equivalent circuit with optimal depth for any given CNOT circuit. Note that our parallel CNOT-circuit synthesis algorithm is optimal only in the asymptotic sense. Specifically, given any matrix $M \in GL(n, 2)$ and a pair of integers k, m, determine whether there exists an n-qubit m-ancilla CNOT circuit for M with depth at most k. This decision problem is similar to the Minimum Circuit Size Problem (MCSP), the famous problem which is unlikely to be proven in P or NP-complete by natural proof [25, 26]. In this paper, we provide what we consider to be relevant hardness evidence by proving hardness results for optimizing CNOT circuits in slightly different scenarios. In the first scenario, one aims to optimize the depth of a CNOT circuit under certain topological constraints [27, 28] with ancillae. In the second scenario, one aims to optimize a sub-circuit of a CNOT circuit with ancillae. We briefly summarize the inapproximability result as Theorem 4; the formal statement is in Section 5.2.
Theorem 4 (Informal). It is NP-hard to approximate the solution of the following problems within any constant factor:
• Global Constrained Minimization (GCM): Given an n-qubit CNOT circuit, integer m and topological constraints on the n + m qubits, output an equivalent n-qubit m-ancilla CNOT circuit of minimum size or depth.
• Local Size Minimization (LSM): Given an n-qubit CNOT circuit, integer m and a specific part of the circuit, output an equivalent n-qubit m-ancilla CNOT circuit which optimizes the size of the specified sub-circuit.
At a high level, Global Constrained Minimization (GCM) aims to find the optimal size or depth under certain topological constraints S, where CNOT gate with control i and target j is legal iff (i, j) ∈ S. This restriction is common in existing quantum devices [27, 28] and CNOT circuit optimization on such devices has been discussed [13, 12] . The Local Size Minimization (LSM) aims to optimize the size of a selected part of the circuit while leaving other parts unchanged. Hardness for general quantum circuit optimization over topological constraints can be seen in [29, 30] .
Organization of the paper. In Section 2, we review notations and basic definitions used in this paper. In Section 3, we first present our parallel CNOT-circuit synthesis algorithm without using any ancillae. Besides, we prove that any tree-based CNOT circuit can be parallelized to depth $O(\log^2 n)$ without any ancilla. In Section 4, we present our ancilla-based synthesis algorithm and complete the proof of Theorem 1. In Section 5, we give the lower bound and related hardness result. Finally, in Section 6, we summarize the paper and present some open problems.
Preliminary
Basic Notations: We use $\tilde{O}$ to hide polylogarithmic terms, $[n]$ to denote $\{1, \dots, n\}$, $\mathbb{C}$ to denote the complex domain, $\mathbb{F}_q$ to denote the field with q elements, $\oplus$ to denote addition over $\mathbb{F}_2$, $GL(n, q)$ to denote the set of $n \times n$ invertible matrices with entries from $\mathbb{F}_q$, the superscript $\top$ to denote the transpose of a matrix or vector, $I_n$ to denote the $n \times n$ identity matrix (its subscript is omitted if the context is clear), $1_{p,q}$ with $p \ne q$ to denote the all-zero matrix except that the $(p, q)$-th entry equals 1, and $M[i, j]$, $M[i, *]$, $M[*, j]$ to denote, respectively, the $(i, j)$-th entry, the i-th row, and the j-th column of matrix M.
CNOT Gate and Circuits: A CNOT gate maps a Boolean tuple $(x, y)$ to $(x, x \oplus y)$. Because it is the invertible linear map $\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$ over $\mathbb{F}_2^2$, any n-qubit CNOT circuit C can be viewed as an invertible linear map over $\mathbb{F}_2^n$, represented as an invertible matrix and denoted by $M_C \in \mathbb{F}_2^{n\times n}$.
Ancilla-Based CNOT Circuits: An n-qubit m-ancilla CNOT circuit has m ancillae with initial assignment $|0\rangle^{\otimes m}$, and satisfies the following key property: after the evaluation of the circuit, all ancillae are restored, regardless of the input of the n qubits. An n-qubit m-ancilla CNOT circuit C implements an invertible matrix $M \in \mathbb{F}_2^{n\times n}$ if for any $x \in \mathbb{F}_2^n$ and input $|x\rangle|0\rangle^{\otimes m}$, the output of the circuit is $|Mx\rangle|0\rangle^{\otimes m}$. In other words, the matrix representation for C is $\begin{pmatrix} M & E \\ 0 & F \end{pmatrix}$ for some $E \in \mathbb{F}_2^{n\times m}$ and invertible $F \in \mathbb{F}_2^{m\times m}$. Since E and F do not affect the output when the input is $|x\rangle|0\rangle^{\otimes m}$, we abbreviate them as $*$. In particular, $*_m$ represents some matrix in $\mathbb{F}_2^{m\times m}$. In the remainder of this paper, we say such C is an n-qubit m-ancilla CNOT circuit for M.
Equivalent Ancilla-Based CNOT Circuits: We say an n-qubit $m_1$-ancilla CNOT circuit $C_1$ is equivalent to an n-qubit $m_2$-ancilla CNOT circuit $C_2$ (denoted by $C_1 \cong C_2$) if for any $x \in \{0, 1\}^n$, the outputs on the first n qubits are the same for $C_1|x\rangle|0\rangle^{\otimes m_1}$ and $C_2|x\rangle|0\rangle^{\otimes m_2}$.
Row-Elimination Matrices: Mathematically, a CNOT gate with control qubit i and target qubit j can be represented as a row elimination from i to j (i.e., adding row i to row j). Thus, a CNOT circuit can be viewed as the product of a sequence of row-elimination matrices.
Definition 1 (Row-Elimination Matrix). We say matrix $R \in \mathbb{F}_2^{n\times n}$ is a row-elimination matrix if $R = I$ or there exist $i, j \in [n]$, $i \ne j$, such that $R = I + 1_{j,i}$.
Note that for any row-elimination matrix R, we have $R^2 = I$. We use $R(i, j)$ to denote $I + 1_{j,i}$. A row-elimination matrix represents exactly one single step in the process of Gaussian elimination. For any matrix, left-multiplication by $R(i, j)$ represents adding row i to row j.
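The properties of $R(i, j)$ stated above can be checked directly on a small instance. The sketch below is an illustration only; the helper names `row_elimination` and `mat_mul`, the 0-indexed rows, and the list-of-lists encoding are assumptions.

```python
def row_elimination(n, i, j):
    """R(i, j) = I + 1_{j,i} over F_2 (0-indexed): adds row i to row j."""
    if i == j:
        raise ValueError("control and target must differ")
    R = [[int(r == c) for c in range(n)] for r in range(n)]
    R[j][i] = 1
    return R


def mat_mul(A, B):
    """Matrix product over F_2 for matrices given as lists of rows."""
    return [
        [sum(A[r][k] & B[k][c] for k in range(len(B))) % 2
         for c in range(len(B[0]))]
        for r in range(len(A))
    ]


identity3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
R = row_elimination(3, 0, 2)
M = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
RM = mat_mul(R, M)  # row 2 of RM is row 0 of M added to row 2 of M
```

For n = 2, multiplying `row_elimination(2, 0, 1)` by the column vector $(x, y)^\top$ yields $(x, x \oplus y)^\top$, which is exactly the action of a CNOT gate.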
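Parallel Row-Elimination Matrices: A basic concept in parallel CNOT-circuit synthesis is parallel row elimination (or, equivalently, parallel Gaussian elimination).
Definition 2 (Parallel Row-Elimination Matrix). We say matrix $R \in \mathbb{F}_2^{n\times n}$ is a parallel row-elimination matrix if $R = I$ or there exist distinct $i_1, j_1, \dots, i_k, j_k \in [n]$ such that $R = R(i_1, j_1) R(i_2, j_2) \cdots R(i_k, j_k)$.
A parallel row-elimination matrix represents several independent steps in Gaussian elimination. Since all $i_k, j_k$ are distinct, there is no need to name a particular order. When i, j are clear from the context (as in Section 3 and Section 4), we use R to denote a parallel row-elimination matrix, and $R_1, R_2, \dots$ to denote a sequence of parallel row-elimination matrices.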
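The link between circuit depth and parallel row elimination can be illustrated by packing a gate list into layers of qubit-disjoint gates, each layer corresponding to one parallel row-elimination matrix. The following is a simple greedy as-soon-as-possible scheduler, given only as a sketch; it does not minimize depth in general and is not the synthesis algorithm of this paper. The name `layer_gates` is hypothetical.

```python
def layer_gates(gates, n):
    """Greedily pack CNOT gates (control, target) into qubit-disjoint layers.

    Each gate goes into the earliest layer that comes after every earlier
    gate sharing one of its qubits, so the layered circuit computes the
    same linear map as the original gate sequence.
    """
    free_at = [0] * n  # index of the first layer in which each qubit is free
    layers = []
    for control, target in gates:
        k = max(free_at[control], free_at[target])
        if k == len(layers):
            layers.append([])
        layers[k].append((control, target))
        free_at[control] = free_at[target] = k + 1
    return layers


layers = layer_gates([(0, 1), (2, 3), (1, 2), (0, 3)], 4)
depth = len(layers)
```

Within a layer all controls and targets are pairwise distinct, so the gates commute and the layer can be applied in one time step.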
Quantum Approximation: In this paper, without loss of generality, we only consider quantum circuits consisting of single qubit and two-qubit gates. In Section 5, we will use the following definitions of quantum circuit approximation.
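Definition 3 ($\epsilon$-close). For any $\epsilon > 0$, two vectors $u, v \in \mathbb{C}^n$ are said to be $\epsilon$-close iff $\|u - v\|_2 < \epsilon$.
Definition 4 ($\epsilon$-approximate). Given an n-qubit quantum circuit $C_1$ and an n-qubit m-ancilla quantum circuit $C_2$, we say $C_2$ $\epsilon$-approximates $C_1$ if for any $x \in \{0, 1\}^n$,
• $C_2$ maps $|x\rangle|0\rangle^{\otimes m}$ to $|\varphi_x\rangle|0\rangle^{\otimes m}$ for some $|\varphi_x\rangle \in \mathbb{C}^{2^n}$ with $\|\varphi_x\|_2 = 1$,
• $|\varphi_x\rangle$ and $C_1|x\rangle$ are $\epsilon$-close.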
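As a small numerical illustration of Definition 3, the sketch below computes the $\ell_2$ distance between two state vectors given as lists of amplitudes. The helper names `l2_distance` and `is_eps_close` are assumptions for illustration.

```python
import math


def l2_distance(u, v):
    """Euclidean (l2) distance between two complex vectors of equal length."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))


def is_eps_close(u, v, eps):
    """Definition 3: u and v are eps-close iff their l2 distance is below eps."""
    return l2_distance(u, v) < eps


ket0 = [1, 0]
ket1 = [0, 1]
orthogonal_distance = l2_distance(ket0, ket1)  # sqrt(2) for orthogonal unit vectors
```

Two orthogonal unit vectors are at distance $\sqrt{2}$, so for $\epsilon < \frac{\sqrt{2}}{2}$ no single state can be $\epsilon$-close to both of them, by the triangle inequality.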
Parallelizing CNOT-Circuit Synthesis Without Ancillae
We divide the proof of Theorem 1 into two parts. In this section, we prove the first part, which covers the case $m = o(n)$. We address the remaining case in Section 4. In fact, it is sufficient to prove the following Theorem 5 here, as it implies an $O(n^\omega)$-time algorithm to parallelize any n-qubit CNOT circuit to depth $O(n/\log n)$ with $o(n)$ ancillae.
Theorem 5 (Ancilla-Free Parallel CNOT Synthesis). There is an $O(n^\omega)$-time algorithm to parallelize any n-qubit CNOT circuit to depth $O(n/\log n)$ without ancillae.
Thanks to the connection between CNOT circuits and invertible Boolean matrices, we can reformulate Theorem 5 as the following:
Lemma 1 (Theorem 5 Reformulated). There is an $O(n^\omega)$-time algorithm such that given any $M \in GL(n, 2)$, it outputs parallel row-elimination matrices $R_1, \dots, R_d$ with $d = O(n/\log n)$ that transform M to the identity.
Proof. By Bunch and Hopcroft [31], we can factorize, in time $O(n^\omega)$, $PM = LU$, where P is a permutation matrix and L, U are, respectively, lower and upper triangular matrices. Besides, it follows from a result of Moore and Nilsson [1] that any permutation matrix P can be decomposed into six parallel row-elimination matrices. Lemma 1 then follows from the claim below, as we can handle lower triangular matrices similarly.
Claim 1. Lemma 1 holds for any upper triangular M.
Proof. Our algorithm applies a divide-and-conquer scheme. As in standard analyses of divide-and-conquer methods, we assume that n is sufficiently large (the details will become clear later in the proof). For simplicity, we first consider a randomized algorithm. We will then derandomize it using Lemma 2. The synthesis process is shown in Figure 1, which has five main steps.
Step 1 (Recursion):
, where A is of size 1 A, which can be computed in advance, independently from the recursion, in time $O(n^\omega)$ [14, 32].
Step 2 (Find Random Layby):
and each A i is in
. Our key observation here is: if $A_i$ is "close to random" (to be formally defined in Lemma 2), then it can be eliminated efficiently. To ensure the needed degree of randomness in our matrix structures, we use a classical idea from oblivious routing [17]: we generate a random $B_i$ of the same size for each $A_i$ as its layby and define $C_i = A_i \oplus B_i$. Note that, although they are correlated, both $B_i$ and $C_i$ by themselves are "random" $\frac{n}{\log n} \times \frac{n}{\log n}$ matrices with entries from $\mathbb{F}_2^{\frac{1}{2}\log n}$.
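The layby idea can be sketched as follows: each entry of A is masked by an independent uniform entry of B, so that C = A ⊕ B is itself uniform, and A can be cleared in two passes (add C to reach B, then add B to reach zero). This is the randomized version only; the derandomization of Lemma 2 is not reproduced. The function name `random_layby` and the encoding of each entry as a `bits`-bit integer are assumptions for illustration.

```python
import random


def random_layby(A, bits, rng):
    """Pick a uniformly random layby B for A and return (B, C) with C = A xor B.

    A is a matrix whose entries are integers below 2**bits, each encoding a
    Boolean vector of length `bits`.
    """
    B = [[rng.getrandbits(bits) for _ in row] for row in A]
    C = [[a ^ b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]
    return B, C


rng = random.Random(0)
A = [[0, 3], [1, 2]]
B, C = random_layby(A, bits=2, rng=rng)
# First traverse: adding C to A yields B. Second traverse: adding B to B yields 0.
after_first = [[a ^ c for a, c in zip(ra, rc)] for ra, rc in zip(A, C)]
after_second = [[x ^ b for x, b in zip(rx, rb)] for rx, rb in zip(after_first, B)]
```

Since B is uniform and independent of A, each of B and C is uniformly distributed on its own, which is the property the analysis of Steps 4 and 5 relies on.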
Step 3 (Generate Row-Traversal Sequence): We say matrix sequence X 1 , . . . , X T ∈ GL(c, 2) is row-traversal if
Let $c = \frac{1}{2}\log n$; we apply Lemma 3 to obtain a row-traversal sequence with $T = O(\sqrt{n})$.
Step 4 (First Traverse): In this step, we add the $C_i$'s to the $A_i$'s to obtain the $B_i$'s. View the bottom-right $I_{n/2}$ as $\frac{n}{\log n}$ identities of the same size and name them $Y_1, \dots, Y_{n/\log n}$. Let $X_0 = I$. For each time stamp $t \in [T]$, all $Y_j$'s simultaneously go from $X_{t-1}$ to $X_t$ using the original Gaussian elimination algorithm, then
-(j, k) was not selected in previous S i (i.e., S i in previous time stamps and previous repetitions),
• for all i and (j, k) ∈ S i , add row-i of Y k to row-j of A i as one parallel row elimination;
• repeat the two steps above until all S i = ∅.
The construction of $S_i$ is explained and justified later.
Step 5 (Second Traverse): Now, in the upper-right part, all $A_i$'s have reached the pre-decided laybys $B_i$. In this step, we do another round of traversal similar to Step 4; the only difference is that we use $B_i$ when constructing $S_i$. Thus, we add the $B_i$'s to the $B_i$'s as in Step 4, and the upper-right square finally becomes zero.
Now we explain the construction of S i in Step 4. For fixed t and i, although S i is found repeatedly in Step 4 for better description, it is actually implemented in a single shot. We justify this as well as its efficiency, where the random B i plays an essential role.
When $B_i$ is random, any vector in $\mathbb{F}_2^{\frac{1}{2}\log n}$ appears about $\frac{\sqrt{n}}{\log n}$ times in every row and column of $C_i$ with high probability. Then we enumerate all $(j, k) \in \left[\frac{n}{\log n}\right] \times \left[\frac{n}{\log n}\right]$ such that
and view them as the edges of a bipartite graph. Thus any valid $S_u$ is a matching in this graph, and the iterated construction is equivalent to a matching decomposition. Since any vertex has degree about $\frac{\sqrt{n}}{\log n}$, the bipartite graph can be factorized into about $\frac{\sqrt{n}}{\log n}$ matchings in linear time (hiding polylogarithmic terms) [19, 20].
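The matching decomposition can be illustrated with the classical alternating-path (Kőnig) edge-coloring procedure, which splits a bipartite graph of maximum degree Δ into exactly Δ matchings. This textbook procedure is used here only as a sketch; it is slower than the near-linear-time algorithms of [19, 20] that the analysis relies on. The function name `edge_color_bipartite` and the edge-list encoding are assumptions.

```python
def edge_color_bipartite(edges, n_left, n_right):
    """Split the edges of a bipartite (multi)graph into max-degree many matchings.

    edges: list of (u, v) with u in range(n_left) and v in range(n_right).
    Returns a list of matchings (lists of edges), one per color.
    """
    deg_l, deg_r = [0] * n_left, [0] * n_right
    for u, v in edges:
        deg_l[u] += 1
        deg_r[v] += 1
    delta = max(deg_l + deg_r, default=0)
    # at_l[u][c] is the right endpoint of the color-c edge at u (None if absent).
    at_l = [[None] * delta for _ in range(n_left)]
    at_r = [[None] * delta for _ in range(n_right)]
    for u, v in edges:
        a = at_l[u].index(None)  # a color missing at u
        b = at_r[v].index(None)  # a color missing at v
        if at_r[v][a] is not None:
            # Color a is taken at v: swap a and b along the alternating path
            # starting at v. The path cannot reach u, which lacks color a.
            path, x, c, on_right = [], v, a, True
            while True:
                y = (at_r if on_right else at_l)[x][c]
                if y is None:
                    break
                path.append((y, x, c) if on_right else (x, y, c))
                x, c, on_right = y, (b if c == a else a), not on_right
            for left, right, c in path:
                at_l[left][c] = at_r[right][c] = None
            for left, right, c in path:
                d = b if c == a else a
                at_l[left][d], at_r[right][d] = right, left
        at_l[u][a], at_r[v][a] = v, u
    return [
        [(u, at_l[u][c]) for u in range(n_left) if at_l[u][c] is not None]
        for c in range(delta)
    ]


edges = [(u, v) for u in range(3) for v in range(3)]
matchings = edge_color_bipartite(edges, 3, 3)
```

Each returned matching corresponds to one set of row additions that touch pairwise distinct rows, i.e., one parallel row-elimination matrix.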
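Hence Step 4 uses about $\frac{\sqrt{n}}{\log n}$ parallel row-elimination matrices for every time stamp. A similar analysis holds for Step 5, and we derandomize the choice of $B_i$ in Lemma 2. Thus the maximum number of parallel row-elimination matrices, denoted by $d(n)$, satisfies the recursion
$$d(n) = \underbrace{d(n/2)}_{\text{Step 1}} + 2 \times \underbrace{O(\sqrt{n}) \cdot O\!\left(\frac{\sqrt{n}}{\log n}\right)}_{\text{Steps 4, 5}} = d(n/2) + O\!\left(\frac{n}{\log n}\right),$$
which gives $d(n) = O(n/\log n)$.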
Using Lemma 2 and Lemma 3, the running time, if denoted as T (n), can be obtained by the following recursion
Step 2 + O(n)
Step 3
Now we give two essential components in the proof of Claim 1. Lemma 2 addresses the crucial property that B i , C i must have to make the matching decomposition and parallel row elimination efficient.
Note that the existence part of Lemma 2 can be obtained easily by a direct application of the Chernoff bound, but in that case it would be hard to derandomize in time $O(n^2)$.
Lemma 2. There is an $O(n^2)$-time algorithm such that for any sufficiently large n, given an $\frac{n}{\log n} \times \frac{n}{\log n}$ matrix A with entries from $\mathbb{F}_2^{\frac{1}{2}\log n}$, it outputs B of the same format such that any $v \in \mathbb{F}_2^{\frac{1}{2}\log n}$ appears at most $\frac{\sqrt{en}}{\log n}$ times in any row or column of $B$ and $A \oplus B$.
Proof. Pick the entries of B bit by bit uniformly at random. Let $C = A \oplus B$ and set $\epsilon = \frac{1}{2\log n}$ with foresight.
In the following, we prove by induction on the number of determined bits $t = 0, \dots, \frac{1}{2}\log n$ that no $w \in \mathbb{F}_2^t$ appears as a prefix more than $\left(\frac{1}{2} + \epsilon\right)^t \frac{n}{\log n}$ times in any row or column of B, C. Assume the first $k - 1$ bits are determined; we now randomly pick the k-th bit from $\{0, 1\}$. For any $u \in \mathbb{F}_2^k$ and $i \in \left[\frac{n}{\log n}\right]$, define four 0/1 bad event indicators:
) iff u appears as prefix more than 
log n and Bin(m, p) denotes the m-round Bernoulli trial with probability p. Thus, there exists an assignment of the k-th bit such that no u ∈ F k 2 appears as prefix more than log n n log n ≤ √ en log n .
We use $\cdot \overset{\ell}{=} \cdot$ to denote the relation that two vectors share the same first $\ell$ bits. Let the undetermined bits in the entries of B, C be $*$. Now we derandomize the choice of the k-th bit of $B[i, j]$ by the method of conditional expectation for some fixed i, j. Let the first $k - 1$ bits of $B[i, j]$, $C[i, j]$ be p, q respectively, and
Let s = f κ
Then we choose b B ∈ {0, 1} that decreases most, which must be non-negative.
To boost the selection, we pre-process and truncate $f(u, v)$ to the highest $\log^2 n$ significant bits.
Then even if the best $b_B$ increases the expectation, the fluctuation is $O\left(2^{-\log^2 n}\right)$ and accumulates as an insensitive $o(1)$ term.
Lemma 3 presents a simple way to construct the row-traversal sequence. Though its length can be further improved, the asymptotic order is already tight and sufficient for our purpose. Thus we do not particularly pursue the optimal parameter in it.
Lemma 3. There is an $O\left(\mathrm{poly}(k)\, 2^k\right)$-time algorithm to generate a row-traversal sequence of length $3 \times 2^{k-1} - k + 1$.
⊤ and rank(·) compute the rank of matrix over F k×k 2
. Observe that for any v ∈ F k 2 ,
since for any x = 0,
Also, any row of $I \oplus \mathbf{1}v^\top$ traverses all vectors in $\mathbb{F}_2^k$ as v goes through $\mathbb{F}_2^k$. Thus the output of Algorithm 1 gives the desired sequence.
Note that the technique to prove Lemma 1 can be extended to general M ∈ GL(n, q), which is stated as Theorem 3.
Proof of Theorem 3. The proof is almost identical except that every $\log$ is replaced with $\log_q$. Another difference is that the length of the row-traversal sequence in Lemma 3 becomes $(q + 1)q^{k-1} - k + 1$.
By the lemmas and theorem above, we have parallelized any n-qubit CNOT circuit to $O(n/\log n)$ depth without ancillae. A fundamental problem in parallel CNOT-circuit synthesis, when no ancillae are given, is to characterize the impact of circuits' topological structures on the size-depth trade-off. Unlike in the asymptotic space-depth trade-off, where CNOT circuits are essentially compressed as an invertible matrix in $GL(n, 2)$, the circuit details are part of the input to synthesis algorithms.
While this problem remains an on-going research subject, in the following we use a basic family of CNOT circuits to illustrate that the topological details of CNOT circuits can be effectively used. This family of the CNOT circuits has tree structures: Given a proper binary tree T with n leaves, in which each leaf has a unique label from [n] and each internal node has a label from {L, R}, we can define an n-qubit CNOT circuit (with variables {x 1 , . . . , x n }) as the following.
We use postorder-traversal to define the CNOT circuit M T by first defining for each node v in T , its qubit index i(v) (and the gate M v it describes if v is internal):
• For a leaf v, i(v) is its label in T ;
• for an internal node v with label L and children
). Suppose the postorder-traversal projection of the internal nodes of T is v 1 , . . . , v n−1 . Then,
An example of a CNOT tree can be seen in Figure 2.
Figure 2: An example of CNOT trees, where the right tag above an internal node is its qubit index.
The following theorem gives an equivalent $O(\log^2 n)$-depth CNOT circuit for any n-qubit CNOT circuit $M_T$.
Theorem 6 (Parallel Synthesis of CNOT Trees). For any proper binary tree T with n leaves, the n-qubit CNOT circuit $M_T$ can be parallelized to $O(\log^2 n)$ depth without ancillae.
Theorem 6 can be obtained by applying Miller and Reif's parallel-tree-contraction technique [33, 34, 35]. See Appendix C for the proof. Theorem 6 can be generalized to the following corollary.
Corollary 2. If an n-qubit CNOT circuit can be expressed as the product of k CNOT trees, then it can be parallelized into a CNOT circuit with $O(k\log^2 n)$ depth without ancillae.
Parallelizing CNOT circuits with ancillae
In this section, we prove Theorem 1 for the $m = \Omega(n)$ part, i.e., $m = sn$ with $s = \Omega(1)$. For any $s = \Omega\left(\frac{n}{\log^2 n}\right)$, the bound in Theorem 1 is always $O(\log n)$; thus it suffices to consider $s = O\left(\frac{n}{\log^2 n}\right)$. We restate Theorem 1 in this case as follows.
Theorem 7. There is an $O(n^\omega)$-time algorithm to parallelize any n-qubit CNOT circuit into $O\left(\frac{n}{s\log n}\right)$ depth with $(3s + 1)n$ ancillae, where
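We use a standard technique in reversible computation to simplify the problem. Given an arbitrary $M \in GL(n, 2)$, Theorem 7 aims to construct a CNOT circuit for M with ancillae. We first construct two 2n-qubit 3sn-ancilla CNOT circuits $C_1, C_2$ for $\begin{pmatrix} I & \\ M & I \end{pmatrix}$ and $\begin{pmatrix} I & \\ M^{-1} & I \end{pmatrix}$ respectively, i.e., for any $x, y \in \mathbb{F}_2^n$,
$$C_1 |x\rangle|y\rangle|0\rangle^{\otimes 3sn} = |x\rangle|y \oplus Mx\rangle|0\rangle^{\otimes 3sn}, \qquad C_2 |x\rangle|y\rangle|0\rangle^{\otimes 3sn} = |x\rangle|y \oplus M^{-1}x\rangle|0\rangle^{\otimes 3sn}.$$
Starting with $|x\rangle|0\rangle^{\otimes n}|0\rangle^{\otimes 3sn}$ and applying $C_1$, $C_2$, where $C_2$ takes the second n qubits as control and the first n qubits as target, we get
$$|x\rangle|0\rangle^{\otimes n}|0\rangle^{\otimes 3sn} \xrightarrow{C_1} |x\rangle|Mx\rangle|0\rangle^{\otimes 3sn} \xrightarrow{C_2} |x \oplus M^{-1}Mx\rangle|Mx\rangle|0\rangle^{\otimes 3sn} = |0\rangle^{\otimes n}|Mx\rangle|0\rangle^{\otimes 3sn}.$$
Then, we permute the first and second n qubits, which can be done in depth 6 by [1], to get the final circuit. Based on the observation above as well as the equivalence between CNOT circuits and invertible Boolean matrices, to prove Theorem 7 it suffices to construct a circuit for $\begin{pmatrix} I & \\ M & I \end{pmatrix}$, as Lemma 4 states.
By Lemma 4, the time complexity to construct $C_1$, $C_2$ is $O(n^2)$. On the other hand, it takes $O(n^\omega)$ time to compute $M^{-1}$ by [32, 14]; thus the overall time cost is $O(n^\omega)$ for Theorem 7.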
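The compute, uncompute, and permute pattern described above can be simulated classically on bit vectors. The sketch below tracks only the two n-bit registers and treats $C_1$ and $C_2$ as black boxes that XOR a matrix-vector product into the target register; the 3sn work ancillae and the internal structure of the circuits are not modelled. The helper names are assumptions for illustration.

```python
def mat_vec(M, x):
    """Matrix-vector product over F_2."""
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in M]


def xor(u, v):
    return [a ^ b for a, b in zip(u, v)]


def in_place_via_copy(M, M_inv, x):
    """Map (x, 0) to (Mx, 0) using out-of-place circuits for M and M^{-1}."""
    first, second = list(x), [0] * len(x)
    second = xor(second, mat_vec(M, first))     # C1: |x>|0>  -> |x>|Mx>
    first = xor(first, mat_vec(M_inv, second))  # C2: |x>|Mx> -> |0>|Mx>
    first, second = second, first               # permute the two registers
    return first, second


M = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
M_inv = [[0, 1, 1], [1, 1, 1], [1, 0, 1]]
out, ancilla = in_place_via_copy(M, M_inv, [1, 0, 1])
```

The second register returns to the all-zero state for every input, which is exactly the restoration property required of ancillae.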
Lemma 4. There is an
it outputs parallel row-elimination matrices $R_1, \dots, R_d$ where $d = O\left(\frac{n}{s\log n}\right)$ such that
We defer the detailed proof of Lemma 4 to the end of the section and prove several key lemmas first. The key point here is that we can construct $s\log^2 n$ columns of M using $O(\log n)$ parallel Gaussian eliminations with the help of the last $3sn$ rows. Then we simply construct the columns of M sequentially in groups of $s\log^2 n$. We begin our proof with the base case $s = 1$ as Lemma 6, which calls Lemma 5 as a subprocedure to construct a slim sparse matrix.
Lemma 5. There is an O(
log n 2 which has at most t one's, it outputs R 1 , . . . , R d where d = O(log n) such that
Proof. For simplicity, we write $e_j$, $j \in \left[\frac{1}{2}\log n\right]$, for the vector whose entries are 0 except for the j-th. Let $t_j = \#\{i \mid Y[i, j] = 1\}$; then $t = \sum_j t_j$. Now we describe how to make $t_1$ copies of $e_1$ on $t_1$ rows by $O(\log t_1)$ parallel row eliminations. Let the $t_1$ rows be $a_1, \dots, a_{t_1}$. We add the first row (the original $e_1$) to the $a_1$-th row; then double the number of copies of $e_1$ by adding the first and $a_1$-th rows to the $a_2$-th and $a_3$-th rows simultaneously; and keep doubling until the number reaches $t_1$.
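The doubling procedure can be written out as a schedule of parallel row additions. In this sketch each layer is a list of (from_row, to_row) pairs with pairwise distinct rows, so it corresponds to one parallel row-elimination matrix; target rows are assumed to be all-zero initially, so that adding a row to them creates a copy. The function name `doubling_schedule` is hypothetical.

```python
def doubling_schedule(source, targets):
    """Layers of (from_row, to_row) additions that copy `source` to all targets.

    In every layer, each row that already holds a copy feeds one new target,
    so the number of copies doubles until all targets are covered.
    """
    have = [source]
    todo = list(targets)
    layers = []
    while todo:
        batch = todo[:len(have)]
        todo = todo[len(have):]
        layers.append(list(zip(have, batch)))
        have.extend(batch)
    return layers


layers = doubling_schedule(0, [1, 2, 3, 4, 5])
```

For t targets the schedule has about $\log_2(t + 1)$ layers, and running the layers in reverse order erases the copies again, since each row addition over $\mathbb{F}_2$ is its own inverse.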
Since
√ n log n, we make t i copies of e i independently for all i on the last t rows with O(log n) parallel row-elimination matrices.
Then, we construct Y on the middle At last, we restore the last t rows by reversing the copy process.
In the following lemma, we use $O(\log n)$ parallel row-elimination matrices to construct any given $\log^2 n$ columns, which corresponds to the case $s = 1$.
, it outputs parallel row-elimination matrices R 1 , . . . , R d where d = O(log n) such that
Proof. The rows of Y can be seen as a set of Boolean vectors of length $\log^2 n$. In the algorithm, we first synthesize an additive basis for these vectors, then add them together to obtain Y. The main process is depicted in Figure 3.
Figure 3: The process (Steps 1-3) to construct Y (only the first $\log^2 n$ columns are drawn).
Step 1 (Construct $P_i$'s): There are $\sqrt{n}$ vectors in $\mathbb{F}_2^{\frac{1}{2}\log n}$. With arbitrary ordering, we write them as $y_1, \dots, y_{\sqrt{n}}$. Let $P \in \mathbb{F}_2^{\sqrt{n} \times \frac{1}{2}\log n}$ be the matrix where $P[\ell, *] = y_\ell$. In the following, we will construct several copies of P in parallel, as shown in Figure 3.
Specifically, by Lemma 5, we use O(log n) parallel row-elimination matrices and ancillary rows from log 2 n + √ n log n + 1 -th row to log 2 n + 3 2 n + j+1 2 √ n log n -th row, to construct P in the intersection of log 2 n + n + j √ n + 1 -th row to log 2 n + n + (j + 1) √ n -th and j 2 log n + 1 -th column to j+1 2 log n -th column for 0 ≤ j ≤ 2 log n − 1. For simplicity, we write them as P 1 , P 2 , . . . , P 2 log n . Notice that P i can be constructed simultaneously, thus the total cost is still O(log n) parallel row-elimination matrices.
Step 2 (Copy rows in P i 's):
log n 2 for all k.
Suppose y ℓ appears s k,ℓ times in Y k , i.e.,
Similar to Lemma 5, if we make $s_{k,\ell}$ copies of $y_\ell$ from column $\frac{k-1}{2}\log n + 1$ to column $\frac{k}{2}\log n$, then we can construct Y in $O(\log n)$ parallel row-elimination matrices. However, this fails as it needs $2n\log n$ ancillary rows. We overcome the problem by slightly modifying the method.
We make
2 log n copies of y ℓ in columnk−1 2 log n + 1 to columnk 2 log n parallel for all k, ℓ by O(log n) parallel row-elimination matrices (The rows in the original P i is regarded as the first copy of corresponding y ℓ ). Thus the ancillary rows being used during this step are at most 2n log n 2 log n = n. Since for all k, ℓ, we have s k,ℓ ≤ n; then Step 2 can be achieved by O(log n) parallel row-elimination matrices.
Step 3 (Construct Y): For $i \in [n]$, define $u_i$ as the $(\log^2 n + i)$-th row. For $j \in [2n]$, define $w_j$ as the $(\log^2 n + n + j)$-th row, which corresponds to a copy of a certain $y_\ell$. Then we add the proper $w_j$ to $u_i$ to form Y. We say $w_j$ is used t times if $w_j$ is supposed to be added to t different $u_i$'s. By Step 2, we ensure that each $w_j$ is used at most $2\log n$ times. Besides, notice that $u_i$ is supposed to be the sum of $2\log n$ corresponding $w_j$'s. In other words, if we draw a bipartite graph $G = (V_1, V_2, E)$, where $V_1 = \{u_i\}_{i \in [n]}$, $V_2 = \{w_j\}_{j \in [2n]}$, and $E = \{(u_i, w_j) \mid w_j \text{ will be added to } u_i\}$, then all vertices in G have degree at most $2\log n$. Thus it can be factorized into $2\log n$ matchings in linear time (hiding polylogarithmic terms) [19, 20]. Hence Y can be constructed from these copies with $O(\log n)$ parallel row-elimination matrices.
The steps above are summarized in Figure 3.
Step 4 (Restore): In this step, we erase the copies of y ℓ 's and P i 's by the inverse of Step 2 and Step 1. The cost is the same as before.
To sum up, the overall time complexity is O √ n
+ O(n)
Step 2
Step 4 , and the total number of parallel row-elimination matrices is O (log n)
+ O (log n)
Step 4
.
Corollary 3.
There is an O(sn)-time algorithm such that given
Proof. Different from Lemma 6, we have more ancillary rows, namely 3sn. Thus we can parallelize computation more efficiently. In general, we execute Lemma 6 in parallel with 3sn ancillary rows. The whole process is depicted in Figure 4 .
By Lemma 6, we construct all $Y_i$ simultaneously with $O(\log n)$ parallel row-elimination matrices, using another $2sn$ ancillary rows. Specifically, view the upper-left $I_{s\log^2 n}$ as $s$ blocks of $I_{\log^2 n}$. We use the $i$-th $I_{\log^2 n}$ and $3n$ rows to construct $Y_i$, $n$ rows of which are for writing down $Y_i$; i.e., we put $Y_i$ in the $((i-1)n+1)$-th to $in$-th rows, as Figure 4 shows. We denote the whole process in this step by $R_a$.

Figure 4: The process to construct $Y$ (we omit the $2sn$ ancillary rows and corresponding columns in the construction of $Y_k$, $k \in [s]$, for simplicity).
Secondly, we add Y 1 , . . . , Y s from the corresponding rows to the first n ancillary rows to construct Y. This can easily be done with O(log s) parallel row-elimination matrices, since we can add s rows to a given row with O(log s) parallel steps, and restore all but the target rows by corresponding inverse operations.
Finally, we erase $Y_1, \ldots, Y_s$ between the $(n+1)$-th and $sn$-th rows by $R_a^{-1}$.
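The claim that $s$ rows can be added to a given row in $O(\log s)$ parallel steps, with all other rows restored afterwards, can be sketched as follows. This is an illustrative sketch, not the paper's exact construction: each block $Y_i$ is modeled as a single bit-vector row (a Python integer), row 0 plays the role of the target, and the names `sum_layers` and `apply_layers` are hypothetical. A layer is a list of `(source, destination)` row additions over GF(2) acting on pairwise disjoint rows.

```python
def sum_layers(s):
    """Layers that XOR rows 1..s into row 0 and then restore rows 1..s (s >= 1)."""
    up = []
    stride = 1
    while stride < s:
        # Pairwise tree reduction: row i absorbs row i + stride.
        up.append([(i + stride, i)
                   for i in range(1, s + 1, 2 * stride) if i + stride <= s])
        stride *= 2
    # After the tree, row 1 holds the XOR of rows 1..s; copy it into row 0,
    # then undo the tree (each layer is its own inverse over GF(2)).
    return up + [[(1, 0)]] + up[::-1]

def apply_layers(rows, layers):
    """Apply layers of (source, destination) additions to a list of int rows."""
    rows = list(rows)
    for layer in layers:
        touched = [r for gate in layer for r in gate]
        assert len(touched) == len(set(touched)), "gates in a layer must be disjoint"
        for src, dst in layer:
            rows[dst] ^= rows[src]
    return rows
```

For $s$ source rows this uses $2\lceil \log_2 s \rceil + 1$ layers, which matches the $O(\log s)$ count stated above.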
In the following, we give the proof of Lemma 4. For the time complexity of the whole process, it is easy to check it is O(sn × n t ) = O(n 2 ).
Proof of Lemma 4. Let
5 Lower bound and hardness result
Lower bound
In this section, we give a lower bound on the depth of any circuit that approximately implements a CNOT circuit. It shows that our result remains asymptotically tight even when arbitrary two-qubit gates are allowed to approximate the given CNOT circuit.
Theorem 8 (Theorem 2 Restated). For a $1 - o(1)$ fraction of n-qubit CNOT circuits, any $\epsilon$-approximate n-qubit m-ancilla quantum circuit has depth $\Omega\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$, where $\epsilon < \frac{\sqrt{2}}{2}$ is a constant.
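The $\Omega(\log n)$ term reflects a light-cone effect: in a circuit of two-qubit gates, an output qubit of a depth-$d$ circuit can depend on at most $2^d$ inputs, so depth $o(\log n)$ leaves it influenced by only $n^{o(1)} = o(n)$ inputs. The sketch below computes this set of inputs for a layered circuit. The representation (a list of layers, each a list of qubit pairs acting on disjoint qubits) and the name `light_cone` are assumptions for illustration.

```python
def light_cone(layers, output):
    """Return the set of input qubits that can influence `output`.

    `layers` is a list of layers; each layer is a list of pairs (a, b)
    standing for arbitrary two-qubit gates on pairwise disjoint qubits.
    """
    cone = {output}
    # Walk backwards from the output: a gate touching the cone pulls in
    # both of its qubits, so the cone at most doubles per layer.
    for layer in reversed(layers):
        for a, b in layer:
            if a in cone or b in cone:
                cone |= {a, b}
    return cone

# A depth-3 binary fan-in tree on 8 qubits (0-indexed).
tree = [
    [(0, 1), (2, 3), (4, 5), (6, 7)],
    [(1, 3), (5, 7)],
    [(3, 7)],
]
```

Here `light_cone(tree, 7)` contains all 8 qubits, which meets the $2^d$ bound with equality for $d = 3$.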
Proof of the first lower bound Ω(log n). Consider the following n-qubit CNOT circuit C:
Suppose $C'$ is an $\epsilon$-approximate n-qubit m-ancilla quantum circuit whose depth is $o(\log n)$. This means the n-th qubit in the output of $C'$ is influenced by only $o(n)$ inputs. Assume w.l.o.g. that $|x_1\rangle$ is not one of them. By the assumption of $\epsilon$-approximation,
⊗m and $(C|1\rangle|x_2\cdots x_n\rangle)|0\rangle^{\otimes m}$ will be nearly orthogonal. By a finer analysis, we have
For other CNOT circuits, this argument fails only when every output is affected by $o(n)$ inputs. It is easy to see that the number of such circuits is upper bounded by
while the number of different CNOT circuits is
Thus the fraction is upper bounded by
Proof of the second lower bound $\Omega\left(\frac{n^2}{(n+m)\log(n+m)}\right)$. Set
with foresight. Now we discretize all two-qubit gates into finitely many possibilities. Any two-qubit gate $t$ can be uniquely described by $T_t \in \mathbb{C}^{4\times 4}$ and expanded to a unitary matrix $U_t \in \mathbb{C}^{2^{n+m}\times 2^{n+m}}$. We define the $\delta$-discretization of $U_t$ as $U_t^{\delta}$, where
$\le 2\delta$. Moreover, although there are infinitely many two-qubit gates, there are at most $(2/\delta)^{32}$ distinct $\delta$-discretizations of a two-qubit gate.
For any n-qubit m-ancilla quantum circuit of depth $d$, there are $s < (n+m)d$ two-qubit gates. Also, the circuit can be viewed as a linear transform that sequentially multiplies $U_{g_1}, \ldots, U_{g_s}$ to the input vector. Let $U_g = U_{g_s}\cdots U_{g_1}$ and $U_g^{\delta} = U_{g_s}^{\delta}\cdots U_{g_1}^{\delta}$. Thus for any input vector $v \in \mathbb{C}^{2^{n+m}}$ with $\|v\|_2 = 1$, we have
Assume two quantum circuits have gates {h i } i∈ [s] , {t i } i∈ [s] respectively, and U
. Write C h and C t as the circuits corresponding to U h and U t . Suppose C h and C t are ǫ-approximate to different n-qubit CNOT circuits C 1 and C 2 . Then there exists x ∈ {0, 1} n that two circuits act differently on |x . That is, .
Plugging in the parameters, the fraction is at most
= o(1).
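The flavor of the counting argument can be seen in a simplified setting. The sketch below handles only exact circuits built from CNOT gates, not the discretized two-qubit gates of the proof above, so it is an illustration under that assumption rather than the argument itself. With $N = n + m$ qubits, each qubit in a layer is either idle, a control with one of $N-1$ targets, or a target with one of $N-1$ controls, so there are at most $(2N)^N$ layers. Depth-$d$ circuits therefore realize at most $(2N)^{Nd}$ distinct operators, while there are $|GL(n,2)| = \prod_{i=0}^{n-1}(2^n - 2^i)$ CNOT operators on $n$ qubits. The function names are hypothetical.

```python
import math

def gl2_order(n):
    """Number of invertible n x n matrices over GF(2)."""
    order = 1
    for i in range(n):
        order *= (1 << n) - (1 << i)
    return order

def counting_depth_bound(n, m):
    """Depth d below which (2N)^(N d) < |GL(n,2)|, with N = n + m.

    Some n-qubit CNOT operator therefore needs depth at least this value
    (simplified: exact CNOT-only circuits with m ancillae).
    """
    total = n + m
    bits = gl2_order(n).bit_length() - 1       # floor(log2 |GL(n,2)|)
    return bits / (total * math.log2(2 * total))
```

Since $\log_2 |GL(n,2)| = n^2 - O(1)$, the returned value grows as $n^2 / ((n+m)\log(n+m))$, the same shape as the second lower bound.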
Hardness result
Now we prove the hardness of optimizing CNOT circuits in different scenarios by reduction from the r-Bounded Set Cover problem. Since every NP-hardness result in this section holds under polynomial-time many-to-one reductions, we omit this qualifier in the following for simplicity.
Problem (r-Bounded Set Cover (rBSC)). For any integer r, the problem is defined as follows:
• Input: Positive integers $r, p, q$ and sets $W_1, \ldots, W_q \subseteq [p]$ with $|W_i| \le r$ for all $i \in [q]$.
• Output: The minimum integer $k \ge 1$ such that there exists $V \subseteq [q]$ with $|V| = k$ and $\bigcup_{i \in V} W_i = [p]$.
For this problem, an inapproximability result is known.
Theorem 9 ([36, 37]). It is NP-hard to approximate the solution of the r-Bounded Set Cover problem within a factor of $\ln r - O(\ln\ln r)$.
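To make the problem concrete, the sketch below solves a small rBSC instance exactly by brute force and also runs the standard greedy heuristic, which is known to achieve an approximation ratio of $H_r \le \ln r + 1$ when every set has at most $r$ elements. The instance and the function names are made up for illustration; sets are indexed from 0 in the code.

```python
from itertools import combinations

def min_cover_size(p, sets):
    """Exact rBSC answer by exhaustive search (exponential time)."""
    universe = set(range(1, p + 1))
    for k in range(1, len(sets) + 1):
        for chosen in combinations(range(len(sets)), k):
            if set().union(*(sets[i] for i in chosen)) == universe:
                return k
    return None  # the sets do not cover [p]

def greedy_cover(p, sets):
    """Greedy heuristic: repeatedly take the set covering most new elements."""
    uncovered = set(range(1, p + 1))
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover [p]")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Hypothetical instance with p = 6, q = 5, r = 3.
example_sets = [{1, 2, 3}, {4, 5, 6}, {1, 4}, {2, 5}, {3, 6}]
```

Theorem 9 says that, unless P = NP, no polynomial-time algorithm can improve on the greedy ratio by more than lower-order terms.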
Although there is an equivalence between CNOT circuits and GL(n, 2), we introduce some notation to represent CNOT circuits directly and to address the topological constraints.
Let $G_n = \{(i, j) \in [n] \times [n] \mid i \ne j\}$ be the set of all CNOT gates over $n$ qubits, where $(i, j) \in G_n$ means the $i$-th qubit controls the $j$-th. For any $S \subseteq G_n$, let $L(S)$ be the set of all possible CNOT layers under $S$, where $T \in L(S)$ is a subset of $S$ and any distinct (i, j),
For any sequence g 1 , . . . , g s ∈ G n , C(g 1 , . . . , g s ) is the CNOT circuit wiring g i as the i-th gate. Similarly, for ℓ 1 , . . . , ℓ d ∈ L (G n ), C(ℓ 1 , . . . , ℓ d ) is the CNOT circuit taking ℓ i as the i-th layer.
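The notation above can be exercised with a small sketch. Gates are pairs $(i, j)$ with 1-based indices as in $G_n$, a circuit is simulated as a matrix over GF(2) stored as bitmask rows, and two circuits are compared through these matrices. Comparing full matrices is a simplification assumed here; it ignores any special treatment of ancillae. A layer is taken to be a set of gates on pairwise disjoint qubits. All function names are hypothetical.

```python
def circuit_matrix(n, gates):
    """Rows of the GF(2) matrix of C(g_1, ..., g_s); bit k of row i is set
    when input qubit k+1 is XORed into output qubit i+1."""
    rows = [1 << i for i in range(n)]
    for control, target in gates:
        rows[target - 1] ^= rows[control - 1]
    return rows

def equivalent(n, gates_a, gates_b):
    """True when both gate sequences implement the same linear map."""
    return circuit_matrix(n, gates_a) == circuit_matrix(n, gates_b)

def is_layer(gates):
    """True when no qubit is used by two gates, so all can run in parallel."""
    qubits = [q for gate in gates for q in gate]
    return len(qubits) == len(set(qubits))

def layers_to_gates(layers):
    """Flatten C(l_1, ..., l_d) into a gate sequence."""
    return [gate for layer in layers for gate in layer]
```

For example, `equivalent(2, [(1, 2), (2, 1), (1, 2)], [(2, 1), (1, 2), (2, 1)])` holds because both sequences swap the two qubits.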
Problem (Global Constrained Minimization (GCM)). The problem is defined for depth and size respectively as follows:
• Input: Positive integers n, m, s, S ⊆ G n+m and g 1 , . . . , g s ∈ S such that C(g 1 , . . . , g s ) is an n-qubit m-ancilla CNOT circuit.
• Output (size version): The minimum integer u ≥ 0 such that there exist h 1 , . . . , h u ∈ S and C(g 1 , . . . , g s ) ∼ = C(h 1 , . . . , h u ).
• Output (depth version): The minimum integer v ≥ 0 such that there exist
Theorem 10. For GCM, it is NP-hard to approximate the solution of either the depth version or the size version within any constant factor.
Proof. Given input r, p, q, W 1 , . . . , W q of rBSC, construct the input for GCM as follows:
• Set n = p + c, m = q, s = (c + 2)p, where c is a positive integer to be determined later.
•
• For i ∈ [p], let t = (c + 2)(i − 1) and choose arbitrary j ∈ [q] such that i ∈ W j . Then set
If we denote the input as |x 1 · · · x p |y 1 · · · y c |0 ⊗m and define s x = x 1 ⊕ · · · ⊕ x p , the output will be
Then the theorem follows naturally from the following claim as r is an arbitrary constant and we can set c = 1 for the depth version, while c = p for the size version. The proof of Claim 2 is deferred to Appendix A.
The NP-hardness of GCM relies heavily on the topological constraints. This is justified because current real-life implementations of quantum gates indeed have such constraints [27, 28]. Nevertheless, to connect with the CNOT circuits considered in previous sections, and with the long-run development of quantum devices in mind, we introduce a "local" version of CNOT-circuit minimization and prove a hardness result in this scenario.
Problem (Local Size Minimization (LSM)). The problem is defined as follows:
• Input: Positive integers n, m, s, L, R with 1 ≤ L ≤ R ≤ s and g 1 , . . . , g s ∈ G n+m such that C(g 1 , . . . , g s ) is an n-qubit m-ancilla CNOT circuit.
• Output: The minimum integer w ≥ 0 such that there exist h 1 , . . . , h w ∈ G n+m and
. . , g s ).
Theorem 11. For LSM, it is NP-hard to approximate the solution within any constant factor.
Proof. Given input r, p, q, W 1 , . . . , W q of rBSC, construct the input for LSM as follows:
• Set n = p + 1, m = b and L = c + 1, R = c + p, s = 2c + p.
• Denote the m ancillae as
• The first c gates: For i ∈ [q], U ⊆ W i and j ∈ U , construct gate (j, n + ind(i, U )).
• The middle p gates: For i ∈ [p], construct gate (i, n).
• The last c gates:
If we denote the input as |x 1 · · · x p |y |0 ⊗m , the output will be
The theorem follows immediately from the following claim, the proof of which is in Appendix B.
Claim 3. k = w.
Conclusion and Open Questions
In this paper, we design efficient algorithms for parallelizing CNOT circuits with or without ancillae. We further present lower bounds showing that the space-depth trade-off of our constructions is asymptotically tight. To be more relevant to real-world usage, we also give a simpler algorithm (see Proposition 1 in Appendix D) for parallelizing any n-qubit CNOT circuit to depth ≤ 4n without ancillae, which, though not asymptotically optimal, could be useful for small circuits without worrying about the large overhead hidden in the big-O notation. In addition, we present strong evidence that CNOT-circuit optimization, both for size and depth, could be intractable: (1) we prove that constructing an equivalent CNOT circuit with minimum depth (or size) under a specified topologically constrained qubit structure is NP-hard to approximate within any constant factor; (2) we show that the local size minimization of CNOT circuits resists any efficient constant-ratio approximation as well.

While our algorithms are asymptotically optimal for parallel CNOT-circuit synthesis, many fundamental questions pertinent to theoretical and practical quantum logic synthesis remain open, starting with the following basic decision problems (with parameters d and s):
Problem. Decide whether a given CNOT circuit can be
• parallelized to depth d, or
• reduced to size s.
Are they NP-complete, or is there any non-trivial algorithm that produces a CNOT circuit with optimal depth (or size), given its matrix representation?
Meanwhile, we are also interested in the following:
• Given CNOT circuits with specific structures, such as the CNOT trees in Section 3, can they be parallelized to O(log n) depth? More specifically, can we characterize the class of all O(log n)-depth CNOT circuits?
• When there are topological constraints among addressable qubits, is there any efficient algorithm to parallelize CNOT circuits? Towards this, a recent work [13] offers an algorithm that reduces the size to $O(n^2)$ under such constraints, but how to decrease the depth is still unknown.
• Can our algorithms be realized as true parallel algorithms without the time-consuming preprocessing phase?
A Proof of Claim 2
The v ≤ kc + 2r, u ≤ kc + 2p Part:
and define
Then we explicitly construct a circuit C as follows:
, the (r + k(t − 1) + j)-th layer is simply one gate (j + n, p + t).
• For d ∈ [r], the (kc + r + d)-th layer consists of gates (t i d , i + n) if d ≤ |T i |.
Thus C has kc + 2r (possibly empty) layers and kc + 2p gates, and C is essentially the same circuit as the input one, without violating the constraints.
The v ≥ k Part: Assume $C$ is a shallowest desired circuit, and let $V = \{t \in [q] \mid (t + n, p + 1) \in C\}$.
Since these gates have the same target, we have $v \ge |V|$. On the other hand, $x_i$ can be added to $y_1$ only if $i \in \bigcup_{j \in V} W_j$. Thus $|V| \ge k$ and $v \ge k$ as desired.
The u ≥ kc + 2p Part: Assume $C$ is a smallest desired circuit; then for any $j \in [c]$, let $V_j = \{t \in [q] \mid (t + n, p + j) \in C\}$. A similar analysis shows $|V_j| \ge k$. On the other hand, for any $i \in [p]$, $x_i$ must be added below in order to appear in $y_1, \ldots, y_c$, and then it must be added below again to restore the ancilla. Thus $u \ge \sum_{j=1}^{c} |V_j| + 2p \ge kc + 2p$ as desired.
B Proof of Claim 3
After the first c gates, the values of the n qubits remain the same, and $y_{i,U} = \bigoplus_{t \in U} x_t$ for $i \in [q]$, $U \subseteq W_i$. Then during the middle p gates, the first $n-1$ qubits and the m ancillae are unchanged, and the n-th qubit becomes $x_1 \oplus \cdots \oplus x_n$. The last c gates are only meant to restore the ancillae.
Since CNOT gates are reversible, the validity of $h_1, \ldots, h_w$ is equivalent to $C(g_1, \ldots, g_R) \cong C(g_1, \ldots, g_{L-1}, h_1, \ldots, h_w)$.
The w ≤ k Part: Assume w.l.o.g Then we explicitly construct h i = (n + ind(i, T i ), n) for i ∈ [k]. It can be verified these k gates serve the same purpose of g L , . . . , g R .
The w ≥ k Part: We abstract C(h 1 , . . . , h w ) as a matrix M in GL(n+m, 2); and assume the value of the n-th qubit after h w is the summation of the value of the i-th before h 1 for i ∈ I ⊆ [n + m], i.e., I is the set of non-zero entries of M[n, * ]. Since x n only appears in the n-th qubit before h 1 , n must be in I. Meanwhile, any row in M has at most w + 1 non-zero entries, thus |I| ≤ w + 1. Let E = I\ {n} = {e 1 , . . . , e w } and for any i ∈ [w],
• if e i ∈ [p], choose arbitrary j ∈ [q] such that e i ∈ W j , then set v i = j;
• if e i = n + ind(j, U ) for some j ∈ [q], U ⊆ W j , set v i = j.
, as the value of the n-th qubit accumulates x 1 , . . . , x n after h w . Hence w ≥ k as desired.
C Topological Structures in CNOT-Circuit Synthesis
We first present a specific CNOT circuit, the prefix-⊕ summation circuit, which can be parallelized to O(log n) depth and serves the proof of Theorem 6, as shown in Claim 4.

Proof. We give $2\log n - 1$ parallel row-elimination matrices for the CNOT circuit above when $n$ is a power of 2; the result can be generalized to any integer $n$.
• In layer $j \in [\log n]$, perform $R(2^j k + 2^{j-1},\ 2^j k + 2^j)$ for $0 \le k < n/2^j$.
• In layer $(2\log n - j)$ with $j \in [\log n - 1]$, perform $R(2^j k,\ 2^j k + 2^{j-1})$ for $0 < k < n/2^j$.
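The two families of layers above can be checked mechanically. The sketch below assumes that $R(a, b)$ adds row $a$ to row $b$ over GF(2) (rows are 1-indexed, as in the text) and that the circuit should leave row $i$ equal to $x_1 \oplus \cdots \oplus x_i$. Both are assumptions, since neither definition appears in this section. Rows are stored as bitmasks, and the function names are hypothetical.

```python
def prefix_xor_layers(n):
    """The 2*log2(n) - 1 layers of row additions (a, b) for n a power of 2."""
    log_n = n.bit_length() - 1
    assert 1 << log_n == n, "n must be a power of 2"
    layers = []
    # Layers 1 .. log n: combine blocks of doubling size.
    for j in range(1, log_n + 1):
        layers.append([(2**j * k + 2**(j - 1), 2**j * k + 2**j)
                       for k in range(n // 2**j)])
    # Layers log n + 1 .. 2 log n - 1, i.e. layer 2 log n - j for j = log n - 1 .. 1.
    for j in range(log_n - 1, 0, -1):
        layers.append([(2**j * k, 2**j * k + 2**(j - 1))
                       for k in range(1, n // 2**j)])
    return layers

def apply_layers(n, layers):
    """Apply the layers to the identity; row i-1 is a bitmask of the inputs
    XORed into qubit i."""
    rows = [1 << i for i in range(n)]
    for layer in layers:
        touched = [q for gate in layer for q in gate]
        assert len(touched) == len(set(touched)), "layer is not parallel"
        for a, b in layer:
            rows[b - 1] ^= rows[a - 1]
    return rows
```

Under these assumptions, row $i$ of the result is the mask of the first $i$ inputs, i.e. `rows[i - 1] == (1 << i) - 1`.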
In the following, we give the proof of Theorem 6.
Proof of Theorem 6.
We use the following classic result of Miller and Reif in parallel algorithm design, known as parallel tree contraction [33, 34, 35]. Miller and Reif introduced two abstract parallel operations on trees:
• Rake: removing all leaves from the tree.
• Compress: replacing every maximal tree path by a tree path of one-half the length.
They proved that O(log n) rounds of Rake and Compress reduce any rooted tree to its root. By [38], there exists a classical O(log n)-time parallel algorithm that computes all prefixes with n/2 processors. With a slight modification, this algorithm generalizes to an O(log n)-depth CNOT circuit whose outputs are exactly the prefixes of the inputs, as shown in Claim 4. We first cast Claim 4 in the language of CNOT trees, where the CNOT operator computes the prefix-⊕ of n bits, i.e., . . .
Note that the prefix-⊕ can be expressed as a proper binary path-tree of depth n − 1. Claim 4 in fact shows that this special CNOT tree can be transformed into an equivalent CNOT circuit of O(log n) depth. Using this claim, we can apply the following modification of parallel tree contraction to build a low-depth CNOT circuit for any CNOT tree.
We use two operations:
