We focus on optimizing the depth and size of CNOT circuits under topological connectivity constraints. We prove that any $n$-qubit CNOT circuit can be parallelized to $O(n)$ depth with $n^2$ ancillas on a 2-dimensional grid. For a high-dimensional grid in which every qubit connects to $2\log n$ other qubits, we achieve the asymptotically optimal depth $O(\log n)$ with only $n^2$ ancillas. We also consider synthesis without ancillas. We propose an algorithm that uses at most $2n^2$ CNOT gates for an arbitrary connected graph, considerably better than previous works. Experiments confirm the performance of our algorithm. We also design an algorithm for dense graphs, which is asymptotically optimal for regular graphs. All these results can be applied to stabilizer circuits.
Introduction - Quantum circuit synthesis is the process of constructing a quantum circuit that implements a desired unitary operator while optimizing its depth and size in terms of a given gate set; it is an important task in quantum computation and quantum information [19, 24, 26]. CNOT circuits are indispensable building blocks for synthesizing general circuits [1, 5, 22, 26]. By Refs. [4, 7, 25], CNOT together with some single-qubit gates is universal for quantum computing. Specifically for stabilizer circuits, Aaronson et al. [1] proved that any stabilizer circuit has a canonical form H-C-P-C-P-C-H-P-C-P-C, where H and P are single layers of Hadamard gates and phase gates respectively, and C is a block of CNOT gates. Optimizing CNOT circuits is therefore an important topic, and many researchers have worked on parallelizing CNOT circuits with no constraints on control and target qubits [8, 10, 13, 16, 17], i.e., where a CNOT gate can act on any pair of qubits. For instance, Moore and Nilsson [17] proposed an algorithm that parallelizes any CNOT circuit to $O(\log n)$ depth with $O(n^2)$ ancillas, which matches the lower bound $\Omega(\log n)$. Recently, Jiang et al. [13] presented an algorithm that optimizes the depth to $O\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ for CNOT circuits with $n$ input qubits and $m$ ancilla qubits, and also proved that this depth is asymptotically optimal for any number of ancillas.
However, current quantum devices are limited by the connectivity constraints of qubits [3, 11, 27] and by environmental noise [6]. Superconducting processors arrange their qubits so that only neighbouring qubits interact. The connectivity constraints of a quantum processor can be represented as a topological graph: each vertex represents a qubit, and a two-qubit gate can be performed between two qubits only when the corresponding vertices are connected.
In this manuscript, we mainly consider how to optimize the depth and size of CNOT circuits under topological structure constraints, with and without ancillas.
Firstly, we aim to optimize the depth of CNOT circuits on topological structures similar to near-term devices [6, 11] with some ancillas. We present a construction on a $d$-dimensional grid in which every qubit connects to at most $2d$ other qubits. On a $d$-dimensional grid with each dimension of size at most $2n^{2/d}$, our construction parallelizes any $n$-qubit CNOT circuit to $O(dn^{2/d})$ depth with at most $2n^2$ ancillas. Specifically, if $d = \Omega(\log n)$, the depth is $O(\log n)$ with $n^2$ ancillas, which implies that an $O(\log n)$-regular graph can be used instead of a complete graph, in which every pair of qubits is connected, to match the lower bound $\Omega(\log n)$. The size of our construction is $O(n^2)$ for any CNOT circuit.
Next, we aim to optimize the size of CNOT circuits under arbitrary topological constraints without ancillas. There are two related works [14, 18] on optimizing the size of CNOT circuits under topological constraints without ancillas. Kissinger et al. [14] proposed an algorithm that gives an equivalent CNOT circuit of size $2n^2$ for any CNOT circuit on a topological graph that has a Hamiltonian path. Their main idea is to eliminate each column along the Hamiltonian path. For a topological graph without a Hamiltonian path, their generalized algorithm is not efficient and may produce $O(n^3)$ size in the worst case. Nash et al. [18] concurrently proposed a similar algorithm that gives an equivalent CNOT circuit of size $4n^2$ for any CNOT circuit on any connected graph.
We put forward an algorithm that achieves size $2n^2$ in the worst case, regardless of the topological structure. According to our numerical results, our algorithm saves on average 41.4% and 9.4% of the CNOT gates compared to the algorithms proposed by Kissinger et al. [14] and Nash et al. [18], respectively. Furthermore, our algorithm uses fewer CNOT gates than the other algorithms on 82.3% of the random test circuits. For the $d$-dimensional grid, our algorithm and these two algorithms all give $O(dn^{1+1/d})$ parallelized depth.
Our third algorithm, which can serve as a guideline for the qubit-connectivity layout of superconducting processors, achieves an $O(n^2/\log r)$ worst-case size bound for $r$-regular graphs. Furthermore, we prove that this bound is tight, i.e., the lower bound on the size for $r$-regular graphs is also $\Omega(n^2/\log r)$.
Optimizing the depth of CNOT circuits on the $d$-dimensional grid graph - Due to decoherence, a quantum computing task must be finished in a short time, which means it is essential to make the quantum circuit as shallow as possible so that it can be run on near-term quantum devices. Due to technological restrictions, one qubit cannot connect to too many other qubits in the topological structure. We propose a CNOT circuit construction on the $d$-dimensional grid, in which every qubit connects to at most $2d$ other qubits, and whose parallel depth is the asymptotically optimal $O(\log n)$ when $d = \Omega(\log n)$.
By the theorem in Refs. [17, 22], any $n$-qubit CNOT circuit can be represented as an invertible binary matrix $M \in GL(n, 2)$. Moreover, by Patel et al. [22], a CNOT gate with control qubit $i$ and target qubit $j$ is equivalent to a Gaussian row-elimination step that adds row $i$ to row $j$ of $M$. Thus optimizing the size of a CNOT circuit is equivalent to minimizing the number of row additions needed to transform an invertible matrix into the identity, and optimizing the depth of a CNOT circuit is equivalent to minimizing the number of rounds of parallel row additions. The sequence of row additions corresponds to the inverse matrix, so the reduced circuit can be obtained from this sequence.
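As a minimal sketch of this correspondence (plain Gauss-Jordan elimination, not the synthesis procedure of this paper), the following Python snippet reduces an invertible matrix over $\mathbb{F}_2$ to the identity and records each row addition as a (control, target) pair. The function name `gaussian_cnot_synthesis` and the list-of-lists matrix layout are illustrative assumptions; connectivity constraints and ancillas are ignored here.

```python
def gaussian_cnot_synthesis(matrix):
    """Return row additions (control, target) that reduce an invertible
    0/1 matrix to the identity over F_2. Each pair means row[target] ^= row[control],
    i.e. one CNOT with the given control and target."""
    m = [row[:] for row in matrix]
    n = len(m)
    ops = []

    def add(control, target):
        m[target] = [a ^ b for a, b in zip(m[target], m[control])]
        ops.append((control, target))

    for col in range(n):
        if m[col][col] == 0:
            # Bring a 1 onto the diagonal using a lower row (exists if M is invertible).
            pivot = next(r for r in range(col + 1, n) if m[r][col] == 1)
            add(pivot, col)
        for r in range(n):
            if r != col and m[r][col] == 1:
                add(col, r)  # clear every other 1 in this column
    return ops


# Hypothetical 3-qubit example.
M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
ops = gaussian_cnot_synthesis(M)
```

Replaying `ops` on `M` yields the identity, so the recorded sequence realizes the inverse of `M`, in line with the remark above.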
We say that a circuit $C_{n,m}$ with $n$ input qubits and $m$ ancilla qubits on topological graph $G(V, E)$ implements a unitary operator $U$ if, for any input $|x\rangle$ with ancillas $|0\rangle^{\otimes m}$, the output qubits and ancillas of $C_{n,m}$ are in the states $U|x\rangle$ and $|0\rangle^{\otimes m}$. We first propose an algorithm that optimizes any $n$-qubit CNOT circuit to $O(n)$ depth with $n^2$ ancillas on the 2-dimensional grid graph, and then generalize the algorithm to any $d$-dimensional grid with $d \ge 3$.
Lemma 1. Any $n$-qubit CNOT circuit on the $n \times n$ grid graph can be parallelized to $O(n)$ depth with at most $n(n-1)$ ancillas.
Proof. Consider an $n \times n$ grid in which every vertex has a coordinate $(i, j)$ with $1 \le i, j \le n$. We place the input qubits $|x\rangle = |x_1, \cdots, x_n\rangle$ on the vertices of the first column, with $|x_i\rangle$ at $(i, 1)$, and use the other vertices as ancilla qubits $|0\rangle$, as depicted in Fig. 1 (a). We denote the CNOT gate with control qubit at $(i_1, j_1)$ and target qubit at $(i_2, j_2)$ by $\mathrm{CNOT}^{(i_1,j_1)}_{(i_2,j_2)}$. For the invertible matrix $M \in GL(n, 2)$, suppose $y = Mx$; then $y_i = \bigoplus_{j\in[n]} M_{ij} x_j$. The following algorithm implements the CNOT circuit of $M$ with $n(n-1)$ ancillas $|0\rangle^{\otimes n(n-1)}$.
(1) Make $n$ copies of $x_i$ for each $i \in [n]$ in row $i$ of the grid in parallel by performing $\mathrm{CNOT}^{(i,1)}_{(i,2)}, \mathrm{CNOT}^{(i,2)}_{(i,3)}, \cdots$ sequentially for each row, as in Fig. 1 (b).
(2) Construct $y_i$ with the qubits in column $i$ for $i \in [n]$ in parallel: perform (a) $\mathrm{CNOT}^{(i,j)}_{(i,j+1)}$ for $j$ such that $M_{i,j} = 0$, ranging from $n$ to 1; (b) $\mathrm{CNOT}^{(i,1)}$ … sequentially along column $i$ to give $y_i$; (c) the inverse of (b) and (a), except $\mathrm{CNOT}^{(i,n)}_{(i,n+1)}$, to restore the qubits between $(i, 1)$ and $(i, n)$, as in Fig. 1 (c).
(3) Restore the remaining ancillas to $|0\rangle$, except the last row, which holds $y$, by performing the inverse of step (1), as in Fig. 1 (d).
(4) Transform $x$ into $|0\rangle^{\otimes n}$ by applying $M^{-1}$ with $y$ as input and with $x$ and the other qubits as ancillas, as in Fig. 1 (e), and then use swap operations to move the outputs $y$ to the correct positions, as in Fig. 1 (f).
Step (4) is correct since $(Mx, x, 0) \to (Mx, x \oplus M^{-1}(Mx), 0) = (Mx, 0, 0)$.
By Steps (1)-(4), the algorithm parallelizes any CNOT circuit to $O(n)$ depth and $O(n^2)$ size with $n(n-1)$ ancillas, and a careful analysis gives a parallelized depth of $13n$ in the worst case.
Lemma 1 can be generalized to the $d$-dimensional grid for any constant $d$, as stated in the following theorem.
Theorem 1. Any $n$-qubit CNOT circuit can be parallelized to $O(dn^{2/d})$ depth with at most $2n^2$ ancillas on a $d$-dimensional grid in which the first $d-1$ dimensions all have size $n^{2/d}$ and the last dimension has size $2n^{2/d}$. Specifically, when $d = \log n$, the depth is $O(\log n)$ with $n^2$ ancillas.
More precisely, Theorem 1 is obtained by first copying each $x_i$ for $i \in [n]$ into a $\frac{d}{2}$-dimensional grid, and then constructing $y_i$ from the copies of $x$ as in Lemma 1. Specifically, when $d$ is even, the number of ancillas can be reduced to $n^2$, as in Lemma 1. We postpone the proof of Theorem 1 to Appendix A.
Furthermore, by eliminating a block of columns rather than a single column at each step, we can save a $\log n$ factor of ancillas compared to Theorem 1. Related work can be found in Refs. [13, 22]. To achieve $O(n^2/\log n)$ ancillas, we first enumerate all configurations for each block and then copy the required configurations to the proper column. Finally, we add the copies of all blocks to obtain the output $y$ and restore all ancillas. The details of the results and proofs are in Appendix B.
Any $n$-qubit CNOT circuit can be optimized to $O(n^2)$ size on the specific $d$-dimensional grid by Theorem 1. In the following, we give the optimized size for the ancilla-free case on any connected graph.
Algorithm without ancilla - Qubits are usually a scarce resource, so ancilla qubits may not be available. We therefore study the optimization of circuits without ancillas. Superconducting processors arrange their qubits so that only neighbouring qubits interact, which means that two-qubit gates can only be performed on two adjacent qubits of the corresponding topological graph. Therefore, in the elimination process on a topological graph, we cannot eliminate a row with another row if their corresponding vertices are not adjacent. In this case we need to find a path between the two vertices and eliminate along the path. To improve the efficiency of elimination, we eliminate multiple rows together via the minimum Steiner tree of the vertices of these rows; see also [14, 18].
Consider the topological constraint graph $G$ and the matrix $M \in GL(n, 2)$ corresponding to a CNOT circuit. The techniques in Refs. [14, 18] both first reduce the given invertible matrix to an upper triangular matrix, and then reduce the triangular matrix to the identity. In contrast, we propose an algorithm that eliminates the $i$-th row and $i$-th column simultaneously for a vertex $i \in V$ that is not a cut vertex, then deletes $i$ from $G$, and repeats the process on the remaining invertible matrix of gradually decreasing size. Recall that a cut vertex is a vertex whose removal disconnects the connected graph. Let $e_1 := (1, 0, \cdots, 0)^T$, $e_2 := (0, 1, 0, \cdots, 0)^T$, $\cdots$, $e_n := (0, \cdots, 0, 1)^T$. Steps (1-3) make the $i$-th column equal to $e_i$ and Steps (4-6) make the $i$-th row equal to $e_i^T$. All arithmetic operations are over the binary field $\mathbb{F}_2$. The algorithm works as follows:
(1) Create the terminal set $S$ consisting of the nodes of the rows whose entry in column $i$ equals 1, additionally including $i$. Find a minimum Steiner tree $T$ for the set $S \subseteq V$ in graph $G$.
(2) To make all entries of column $i$ associated with $T$ become 1, traverse the Steiner tree $T$ in postorder starting from node $i$; when reaching $j$ with parent $k$, add row $j$ to row $k$ if $M_{ji} = 1$ and $M_{ki} = 0$.
(3) To make all entries of column $i$ in the rows associated with $T$, except row $i$, become 0, traverse the Steiner tree $T$ in postorder starting from $i$ and add every row to its children when reached.
(4) Create the terminal set $S'$ consisting of the nodes of the rows whose sum equals row $i$ except in column $i$, additionally including $i$. Find a minimum Steiner tree $T'$ for the set $S' \subseteq V$ in graph $G$.
(5) To make the sum of the rows of the nodes in $T'$ become $e_i^T$, traverse the Steiner tree $T'$ in preorder starting from $i$; when reaching $j \notin S'$, add the $j$-th row to its parent.
(6) To make the $i$-th row of $M$ become the sum of the rows of the nodes in $T'$, traverse the Steiner tree $T'$ in postorder starting from $i$ and add every row to its parent when reached.
(7) Delete $i$ from graph $G$ and repeat steps (1-7) until the matrix is the identity.
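The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: a BFS tree pruned to the terminals stands in for the minimum Steiner tree (so gate counts may be larger than with a true minimum tree), vertices are eliminated in reverse BFS order so that the vertex removed is never a cut vertex of the remaining graph, and the terminal set of step (4) is read off row $i$ of the inverse matrix. All function names and the bitmask row encoding are hypothetical.

```python
from collections import deque

def bfs_tree(adj, alive, root):
    """BFS over the vertices in `alive`; returns (parent map, visit order)."""
    parent, order, queue = {root: None}, [], deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v in alive and v not in parent:
                parent[v] = u
                queue.append(v)
    return parent, order

def spanning_subtree(adj, alive, root, terminals):
    """Union of BFS paths from the terminals to `root` (assumed stand-in for a
    minimum Steiner tree). Returns the parent map and a parents-first order."""
    parent, order = bfs_tree(adj, alive, root)
    keep = {root}
    for t in terminals:
        while t not in keep:
            keep.add(t)
            t = parent[t]
    return parent, [v for v in order if v in keep]

def gf2_inverse(rows):
    """Gauss-Jordan inverse over F_2; row r is an int whose bit c is entry (r, c)."""
    n = len(rows)
    a, inv = list(rows), [1 << r for r in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if (a[r] >> c) & 1)
        a[c], a[p] = a[p], a[c]
        inv[c], inv[p] = inv[p], inv[c]
        for r in range(n):
            if r != c and (a[r] >> c) & 1:
                a[r] ^= a[c]
                inv[r] ^= inv[c]
    return inv

def synthesize(rows, adj):
    """Row additions (control, target) along edges of `adj` reducing `rows` to identity."""
    n, m, gates = len(rows), list(rows), []
    def add(c, t):                       # CNOT(control=c, target=t): row t += row c
        m[t] ^= m[c]
        gates.append((c, t))
    alive = set(range(n))
    _, order = bfs_tree(adj, alive, 0)
    for i in reversed(order):            # the last BFS vertex is not a cut vertex
        # Steps (1)-(3): make column i equal to e_i.
        S = {j for j in alive if (m[j] >> i) & 1} | {i}
        parent, top_down = spanning_subtree(adj, alive, i, S)
        for j in reversed(top_down):     # children before parents
            k = parent[j]
            if k is not None and (m[j] >> i) & 1 and not (m[k] >> i) & 1:
                add(j, k)
        for j in reversed(top_down):     # each node receives its parent's row
            if parent[j] is not None:
                add(parent[j], j)
        # Steps (4)-(6): make row i equal to e_i^T.
        x = gf2_inverse(m)[i]            # rows whose sum is e_i^T
        S2 = {j for j in alive if (x >> j) & 1} | {i}
        parent, top_down = spanning_subtree(adj, alive, i, S2)
        for j in top_down:               # parents before children
            if j not in S2:
                add(j, parent[j])
        for j in reversed(top_down):
            if parent[j] is not None:
                add(j, parent[j])
        alive.remove(i)                  # step (7)
    return gates

# Hypothetical example: claw (star) graph on 4 qubits with centre 1.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
rows = [1 << r for r in range(4)]
for c, t in [(0, 1), (1, 2), (3, 1), (2, 0)]:
    rows[t] ^= rows[c]
gates = synthesize(rows, adj)
```

Each of the four passes per vertex emits fewer gates than there are remaining vertices, so the sketch stays within the $2n^2$ bound discussed next even though its trees are not minimum.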
For any connected graph, there always exists a vertex that is not a cut vertex, and the graph remains connected after that vertex is deleted. Therefore, this algorithm can be applied to any connected graph. In each of Steps (2), (3), (5) and (6), the number of CNOT gates is less than the number of remaining nodes, so the total size is at most $4(n-1) + 4(n-2) + \cdots + 4 \times 1 \le 2n^2$. Meanwhile, the depth of this algorithm depends on the diameter $D$ and maximum degree $\Delta$ of the topological graph $G$. When we apply CNOT gates on the Steiner tree in Steps (2), (3), (5) and (6), CNOT gates on different layers of the tree can be parallelized, so the depth needed for each step is at most $D\Delta$. Thus the total depth of the elimination process is at most $\min(2n^2, 4nD\Delta)$. To sum up, we have the following theorem.
Theorem 2. Given a connected topological graph $G(V, E)$ and an $n$-qubit CNOT circuit, there is a polynomial-time algorithm that constructs an equivalent CNOT circuit with at most $2n^2$ gates and depth at most $\min(2n^2, 4nD\Delta)$, where $D$ and $\Delta$ are the diameter and maximum degree of $G$, respectively.
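The size bound can be checked by summing the per-vertex cost stated above. When $k+1$ vertices remain, each of the four gate-emitting steps uses at most $k$ CNOT gates, so:

```latex
\[
  \sum_{k=1}^{n-1} 4k \;=\; 4 \cdot \frac{n(n-1)}{2} \;=\; 2n(n-1) \;\le\; 2n^2 .
\]
```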
Our algorithm has an obvious advantage compared to the existing algorithms [14, 18] in theoretical aspects. We next show the algorithm is experimentally advantageous as well.
Numerical results - We mainly compare the optimized CNOT circuit size of our algorithm with that of the algorithms proposed by Kissinger et al. [14] and Nash et al. [18] on several NISQ devices. We generate a set of circuits by applying CNOT gates to uniformly random pairs of connected qubits, and optimize these circuits on different topological graphs using the different algorithms.
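One plausible way to generate such test inputs is sketched below. It is an assumption about the sampling procedure, not the script used for the reported experiments: each gate picks an edge uniformly at random and then a direction uniformly at random, and the function name, bitmask row encoding and `seed` parameter are illustrative.

```python
import random

def random_cnot_matrix(edges, n, num_gates, seed=0):
    """Apply `num_gates` random CNOTs along `edges` to the n x n identity over F_2.
    Row r is an int whose bit c is entry (r, c). Returns (rows, gates)."""
    rng = random.Random(seed)
    rows = [1 << r for r in range(n)]
    gates = []
    for _ in range(num_gates):
        u, v = rng.choice(edges)
        # Assumed: control/target orientation is chosen uniformly.
        control, target = (u, v) if rng.random() < 0.5 else (v, u)
        rows[target] ^= rows[control]    # CNOT adds the control row to the target row
        gates.append((control, target))
    return rows, gates


# Hypothetical example on a 4-qubit claw graph.
claw_edges = [(0, 1), (1, 2), (1, 3)]
matrix, circuit = random_cnot_matrix(claw_edges, 4, 30, seed=7)
```

Because every gate is an invertible row addition, the resulting matrix is always in $GL(n, 2)$ and can be handed directly to a synthesis routine.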
To show the performance of our algorithm as the number of qubits in the CNOT circuit increases, we classically simulate the above three algorithms on the IBM Q20-Tokyo device and on the claw graph of Figure 2, and compare the average optimized size over 200 random input CNOT circuits, as depicted in Figure 3. Our algorithm is clearly superior to the algorithm of Nash et al. on all physical devices, since their algorithm has a larger constant factor. Even though Kissinger's algorithm appears close to ours, our algorithm performs better on 82.3% of the test circuits. Figure 3 (b) compares our algorithm with the two existing algorithms on the graph of Figure 2 with $n$ ranging from 5 to 22, which indicates that the optimized size of the algorithm proposed by Kissinger et al. [14] gradually gets worse compared to our algorithm on this non-Hamiltonian topological graph.
Further, we compare the above three algorithms on more topological structures to show our generally better performance, namely IBMQx4, IBMQx5, IBMQ20-Tokyo, the 4 × 5 grid and the 2 × 12 grid (which is implemented by Ye et al. [27]). Similarly, we also run the classical simulation … Table 1 of Appendix D. The results are consistent with the left side of Figure 3, which implies that our algorithm may be more suitable for large and complex structures in the future.
We also compare the performance on graphs of different sparseness. We randomly generate many connected graphs with different numbers of edges, and 200 random invertible matrices in $GL(n, 2)$. We then output the average optimized size of the 200 random circuits on the different graphs, as depicted in Figure 5. In Figure 5 (a), the graph is generated uniformly at random with 20 vertices. In Figure 5 (b), we uniformly randomly select an edge in the 4 × 5 grid. The numerical results indicate that the number of optimized CNOTs is proportional to the sparseness of the graph.
Algorithm suitable for near-future device - With the progress of quantum technologies such as superconducting qubits [5, 6, 11, 27] and quantum dots [12, 15], processors that allow denser qubit connectivity will arise in the near future. We propose an algorithm aimed at optimizing the CNOT size on such denser topological superconducting processors.
Our algorithm is a generalization of the algorithm of Patel et al. [22], which optimizes the CNOT size for the complete graph. The most significant difference between this algorithm and the techniques in [14, 18] is that we eliminate several columns simultaneously, instead of a single column at a time as in [14, 18].
For a graph $G(V, E)$ with $n$ vertices, without loss of generality, we assume the degree of vertices are denoted as … an equivalent $O\left(\frac{n^2}{\log(n/k)}\right)$-size CNOT circuit for any $n$-qubit CNOT circuit on topological graph $G$, and at least $\Omega\left(\frac{n^2}{\log d_n}\right)$ CNOT gates are needed for some invertible matrix.
Theorem 3 implies that for an $r$-regular graph $G(V, E)$, any CNOT circuit can be optimized to a CNOT circuit of size $O(n^2/\log r)$ on $G$, which also matches the lower bound.
Here we give the main steps for optimizing the size of CNOT circuits to $O\left(\frac{n^2}{\log(n/k)}\right)$, and leave the whole proof to Appendix C.
As observed by Moore et al. [17], any CNOT circuit can be represented as a matrix $M \in GL(n, 2)$, and the synthesis of a CNOT circuit is equivalent to transforming $M$ to $I$ by performing Gaussian row eliminations [22].
The following algorithm uses $O\left(\frac{n^2}{\log(n/k)}\right)$ row operations to transform $M$ to $I$.
Let $I_n = I_{n,1} \cdots I_{n,\frac{2n}{\log(n/k)}}$. Then transform $M_j$ to $I_{n,j}$ for $j \in \{1, \cdots, \frac{2n}{\log(n/k)}\}$.
Steps (1)-(3) describe how to transform $M_1$ to $I_{n,1}$.
(1) Eliminate the first $\frac{\log(n/k)}{2}$ rows to $I_{\frac{\log(n/k)}{2}}$.
(2) Among the remaining rows, find the row whose corresponding vertex has the maximum degree, denoted as row $l$. Traverse all binary vectors in $\mathbb{F}_2^n$ for row $l$ in Gray-code order, by adding one of the rows of $I_{\frac{\log(n/k)}{2}}$ to row $l$ at each step.
(3) For each Gray code of row $l$, use row $l$ to eliminate all the rows that have the same value as row $l$, eliminating $k$ rows simultaneously each time if more than $k$ such rows remain, and otherwise eliminating all the remaining rows simultaneously.
The process of transforming $M_j$ to $I_{n,j}$ for $j > 1$ is almost the same as the process of transforming $M_1$ to $I_{n,1}$, except that in step (1) we need to eliminate the corresponding rows of the $j$-th block to $I_{\frac{\log(n/k)}{2}}$. The above steps are summarized in Figure 4.
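The Gray-code traversal in step (2) works because consecutive codewords differ in exactly one bit, so moving row $l$ from one pattern to the next costs a single addition of a unit row. A small sketch follows, assuming the standard reflected Gray code and a block of $b$ bits that starts at the zero pattern; both function names are illustrative.

```python
def gray_code_row_updates(b):
    """For each of the 2**b - 1 Gray-code steps, the index of the unit row
    (row of the identity block) to add to row l. The flipped bit at step k
    of the reflected Gray code is the lowest set bit of k."""
    return [(k & -k).bit_length() - 1 for k in range(1, 1 << b)]

def visited_patterns(b):
    """Patterns taken by the b tracked bits of row l, starting from zero."""
    value, seen = 0, [0]
    for pos in gray_code_row_updates(b):
        value ^= 1 << pos          # adding unit row `pos` flips exactly one bit
        seen.append(value)
    return seen


patterns = visited_patterns(2)     # visits 0b00, 0b01, 0b11, 0b10 in this order
```

If row $l$ starts from a nonzero pattern instead, the same sequence of unit-row additions still visits every pattern exactly once, since XOR with a fixed vector is a bijection.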
All eliminations in the above algorithm, in steps (1)-(3), use our first algorithm (the algorithm for NISQ structures) as a sub-process to ensure that the elimination process does not change other rows. The analysis of the optimized size of this algorithm is in Appendix C.
It is widely known that the best lower bound for unrestricted CNOT circuit synthesis is $\Omega(n^2/\log n)$ size [22]. This lower bound is obtained by counting the number of distinct CNOT circuits with a fixed number of CNOT gates. For CNOT circuit synthesis restricted to a NISQ structure, the above naive counting method gives only the same lower bound $\Omega(n^2/\log n)$. Nevertheless, inspired by the much more detailed counting technique of Christofides [9], we prove a tighter lower bound $\Omega(n^2/\log D)$, where $D$ is the maximum degree of the topological graph, for the synthesis of topological CNOT circuits. The main idea of the proof is still to count the number of distinct CNOT circuits with a certain number of CNOT gates, but using a more detailed counting method. A key observation is that if $\mathrm{CNOT}_{i,j}$ does not conflict with $\mathrm{CNOT}_{k,l}$, they can be put into the same level and executed simultaneously. We defer the proof to Appendix C.
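For reference, the naive counting argument mentioned above can be written out as follows. This is the standard textbook version rather than the refined argument of Appendix C; $s$ denotes the number of gate slots, each holding one of the $n(n-1)$ CNOT gates or the identity.

```latex
\[
  |GL(n,2)| \;=\; \prod_{i=0}^{n-1} \left(2^n - 2^i\right) \;\ge\; \left(2^{n-1}\right)^{n} \;=\; 2^{n(n-1)},
\]
\[
  \#\{\text{circuits with at most } s \text{ gates}\} \;\le\; \bigl(n(n-1)+1\bigr)^{s} \;\le\; n^{2s},
\]
\[
  n^{2s} \;\ge\; 2^{n(n-1)}
  \;\Longrightarrow\;
  s \;\ge\; \frac{n(n-1)}{2\log_2 n} \;=\; \Omega\!\left(\frac{n^2}{\log n}\right).
\]
```

On a connected graph with $|E|$ edges only $2|E|+1$ choices are available per slot, but since $|E| \ge n-1$ the same computation still yields $\Omega(n^2/\log n)$, which is why a finer count is needed to bring the degree into the bound.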
Summary and outlook - Optimizing the size and depth of topological quantum circuits is one of the main challenges in near-term quantum computing [20, 21, 23]. We propose an algorithm for optimizing the depth of CNOT circuits on the $d$-dimensional grid, a topological structure achievable in the near term. Physical devices with a limited number of qubits on a 2-dimensional grid have been implemented by several groups [11, 27], and any CNOT circuit can be parallelized to $O(n)$ depth with $n^2$ ancillas on such a grid by our algorithm. Furthermore, when $d = \log n$, we can parallelize any CNOT circuit to $O(\log n)$ depth with $n^2$ ancillas, achieving the asymptotically optimal depth.
We also propose two algorithms that optimize the size of CNOT circuits without ancillas, on near-term and near-future physical devices respectively. The numerical results show that our first algorithm reduces the size of CNOT circuits significantly compared to existing techniques [14, 18]. Our second algorithm can serve as theoretical guidance for the design of denser physical devices.
More generally, we may consider the problem of optimizing quantum circuits consisting of CNOT and single-qubit gates. We can naively divide such a circuit into blocks of single-qubit gates and blocks of CNOT gates, and optimize the CNOT blocks using our algorithm. However, how to optimize quantum circuits more efficiently is still an open problem worth studying.
[27] Yangsen Ye,
A The parallelized depth for the $d$-dimensional grid.
In this section, we prove the $O(dn^{2/d})$ parallelized depth of CNOT circuits on the $d$-dimensional grid, as a generalization of the 2-dimensional grid.
Proof of Theorem 1. Suppose we have a $d$-dimensional grid in which each dimension has size $n^{2/d}$. For simplicity, we first suppose that $d$ is even; for odd $d$ the analysis is similar.
Consider an $n^{2/d} \times \cdots \times n^{2/d}$ $d$-dimensional grid. We place each input qubit $|x_i\rangle$ on a vertex of a disjoint $n^{2/d} \times \cdots \times n^{2/d}$ $\frac{d}{2}$-dimensional grid, and use all the other vertices as ancillas $|0\rangle$.
(1) Make $n$ copies of $x_i$ for each $i \in [n]$ on the vertices of the $\frac{d}{2}$-dimensional grid.
(2) Construct $y_j$ by combining all the essential copies of the input $x$ in $\frac{d}{2}$ different $\frac{d}{2}$-dimensional sub-grids, similar to the algorithm for the 2-dimensional grid.
(3) Restore the remaining ancillas to $|0\rangle$, except $y$ and the input qubits $x$.
(4) Transform $x$ into $|0\rangle^{\otimes n}$ by applying $M^{-1}$ with $y$ as input and with $x$ and the other qubits as ancillas (similar to steps (1-3)), and then swap $y$ to the correct positions.
Steps (1-2) have depth $O(dn^{2/d})$ and size $O(n^2)$, thus the total depth is $O(dn^{2/d})$ and the total size is $O(n^2)$, and the total number of ancillas is exactly $n^2$.
For odd $d$, since $\frac{d}{2}$ is not an integer, we construct the copies of a block of $n^{1/d}$ inputs $x_i$ in an $n^{2/d} \times \cdots \times n^{2/d}$-size $\frac{d+1}{2}$-dimensional grid; more precisely, we construct $n$ copies of $x_i$ in an $n^{2/d} \times \cdots \times n^{2/d} \times n^{1/d}$-size $\frac{d+1}{2}$-dimensional grid. Constructing $y_j$ requires:
(a) Combining all the essential copies of the input $x$ in $\frac{d-1}{2}$ different $\frac{d+1}{2}$-dimensional grids into the last $\frac{d+1}{2}$-dimensional grid.
(b) Constructing $y_j$ by combining the corresponding $n^{1/d}$ different blocks of outputs of the last $\frac{d+1}{2}$-dimensional grid. Since the combining process contains collisions in only one dimension, we can avoid the collisions by copying the colliding part to an additional $n^{2/d} \times \cdots \times n^{2/d} \times n^{1/d}$-size $d$-dimensional grid.
The restoring processes are the same as steps (3-4) of the even case. By step (b), this requires in total
ancillas.
B Reducing the ancillas of Theorem 1 to $O(n^2/\log n)$.
In this section, we show how to achieve depth $O(dn^{2/d})$ with $O(n^2/\log n)$ ancillas on the $d$-dimensional grid for $d \ge 3$. We need $d \ge 3$ because our algorithm does not seem to work when $d = 2$. For simplicity, we only give the proof for the 3-dimensional grid; a similar analysis gives the results for $d > 3$.
Proof. For the $(n^{2/3} + a\log n) \times (n^{2/3} + n^{1/3+a}) \times \left(\frac{n^{2/3}}{\log n} + n^{1/3}\right)$ grid, where $a < \frac{1}{3}$, the $n$ inputs $|x\rangle$ are laid out in the first $a\log n$ rows of a $(n^{2/3} + a\log n) \times 2n^{1/3}$ grid, as in the rightmost lattice of Figure 6.
The construction of $y_j$ is depicted in Figure 7 and Figure 8, and is similar to the construction in Lemma 1. Let the set $X_j := \{x_{ja\log n+1}, \cdots, x_{(j+1)a\log n}\}$ for $0 \le j < \frac{n}{\log n}$, and
$v_{s,j} := \bigoplus_{ja\log n < i \le (j+1)a\log n} M_{s,i} x_i$, then $y_s = \bigoplus_{j} v_{s,j}$.
Meanwhile, $v_{s,j} \in U_j$ for any $s \in [n]$. Our algorithm is as follows.
(1) Make $n^a$ copies of $x_i$ for each $i \in [n]$ in parallel.
(2) Generate $n^{2/3}$ copies of $U_j$ in parallel from the $n^a$ copies of the set $X_j$ for $0 \le j < \frac{n}{\log n}$.
(3) For $rn^{2/3} < s \le (r+1)n^{2/3}$, where $0 \le r < n^{2/3}$, generate $v_{s,j}$ from the $n^{2/3}$ copies of $U_j$ in parallel, and at the same time transfer the output $v_{s,j}$ to the leftmost nonzero qubits by SWAP operations (each of which can be implemented by three CNOTs).
(4) Add all $v_{s,j}$ with the same $j$ in different slices sequentially (from the first slice to the $\frac{an^{2/3}}{\log n}$-th slice).
(5) For the last $n^{1/3}$ layers of slices, (i) copy $v_{s,j}$ to the $j$-th layer for $2 \le j \le n^{1/3}$, and (ii) add all $v_{s,j}$ for different $s$ from right to left sequentially in the $j$-th layer, as in Figure 8.
The total number of ancillas is $O(n^2/\log n)$, and the depth is $O\left(a\log n + n^a + n^{2/3} + \frac{an^{2/3}}{\log n} + n^{2/3}\right)$, which equals $O(n^{2/3})$ for some small $a$.
C Proof of Theorem 3
Let $d_G(S)$ be the size of the minimum Steiner tree of $S \subseteq V$ in $G(V, E)$, let $d_k(G)$ be the maximum of $d_G(S)$ over all $S$ such that $|S| = k$, and let $d_S$ denote the average degree of the vertices in $S$.
Figure 7: The process for constructing each small block of the first $\frac{n^{2/3}}{\log n}$ layers in Figure 6.
This lemma can be obtained by applying the technique in Theorem 2.4 of [2]. For completeness, we give the proof of the lemma as follows. The main idea of the proof is that if two vertices are at distance 3, then they share no common neighbors.
Proof of Lemma 2. Let $S := \{u_1, \cdots, u_k\} \subseteq V$ and let $A$ be an empty set. First put $u_1$ into the set $A$, and then put into $A$ every $v_i$ for which $d(A, v_i) = 3$. That is,
Let $A$ be a set such that the element $a_j \in A$ is a vertex in set $A$ and closest to $u_i$ in graph $G$. By the construction of $A$, we have $d_G(\{a_1, \cdots, a_k\}) \le 3(|A| - 1)$
This contradicts the assumption, thus $|A| \le k$. Therefore, we have
We give the following lemma, which is used in the proof of Theorem 3.
Proof. Our second algorithm serves as the polynomial-time algorithm that gives an $O\left(\frac{n^2}{\log(n/k)}\right)$-size CNOT circuit for any CNOT circuit on topological graph $G$. We now prove that the optimized size of our algorithm is $O\left(\frac{n^2}{\log(n/k)}\right)$ in the worst case. Since $d_k(G) \le ck$, each time we can eliminate any $k$ rows with at most $ck$ CNOTs by our first algorithm or the techniques in [14, 18]. Thus the size for steps a)-c) to transform $M_1$ to $I_{n,1}$ is
$$\underbrace{O\left(k\log^2\frac{n}{k}\right)}_{\text{Step a)}} + \underbrace{O\left(k\sqrt{n/k}\right)}_{\text{Step b)}} + \underbrace{O(n)}_{\text{Step c)}} = O(n)$$
since $k \le n$. Thus the total size is $\frac{2n}{\log(n/k)} \times O(n) = O\left(\frac{n^2}{\log(n/k)}\right)$.
Proof of Theorem 3. The upper bound of Theorem 3 holds by Lemma 2 and Lemma 3. We now prove the lower bound $\Omega\left(\frac{n^2}{\log d_n}\right)$, where $d_n$ is the maximum degree.
We denote the elementary row operation that adds the $i$-th row to the $j$-th row by $R(i, j)$, and we call $\{i, j\}$ its index set.
The main idea of the proof is still to count the number of distinct CNOT circuits with $k$ CNOT gates, but using a more detailed counting method. A key observation is that $R(i, j)$ commutes with $R(k, l)$ if $\{i, j\} \cap \{k, l\} = \emptyset$. In other words, if $\mathrm{CNOT}_{i,j}$ does not conflict with $\mathrm{CNOT}_{k,l}$, they can be put into the same level and executed simultaneously. Therefore, we can rearrange the order of the CNOT gates using commutativity so that the depth of the circuit is minimized, which we call the 'canonical' form of the CNOT circuit. For any CNOT circuit with $k$ gates represented as a sequence of elementary row operations $R_1, R_2, \ldots, R_k$, the specific transformation process is as follows.
1. First, we partition the matrix sequence $R_1, R_2, \ldots, R_k$ into several blocks $G_1, G_2, \ldots, G_s$. The index sets of the matrices in each block should be pairwise disjoint. In other words, matrices in the same block commute.
2. Starting from $i = 2$, for every matrix in block $G_i$, we check whether its index set is disjoint from that of every matrix in the previous block, and if so, we move this matrix into block $G_{i-1}$ using commutativity. We then repeat this process for the matrix until it is in the first block or some matrix in the previous block conflicts with its index set.
3. Let $i = i + 1$ and execute step 2 repeatedly until $i = s$. At last, we have the modified partitioned blocks $G_1, G_2, \ldots, G_s$ of the matrix sequence $R_1, R_2, \ldots, R_k$ satisfying the following properties, and we call this the canonical form.
(a) The index set of matrix in each block is disjoint with each other;
(b) From the second block, for every matrix in the block, there exists at least one element of its index set belonging to the index set of some matrix in the previous block.
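A compact way to reach a layering with properties (a) and (b) is greedy as-soon-as-possible scheduling, sketched below in Python. This is a simplified alternative to the block-by-block procedure above, not a transcription of it; the function name and the (control, target) tuple encoding are assumptions.

```python
def canonical_layers(gates):
    """Place each gate (control, target) in the earliest layer after the last
    layer that touches one of its qubits. Gates sharing a qubit keep their
    relative order, so the circuit's action is unchanged."""
    layers = []
    last_layer = {}                      # qubit -> index of the last layer using it
    for control, target in gates:
        level = max(last_layer.get(control, -1), last_layer.get(target, -1)) + 1
        if level == len(layers):
            layers.append([])
        layers[level].append((control, target))
        last_layer[control] = last_layer[target] = level
    return layers


# Hypothetical 4-qubit example.
layers = canonical_layers([(0, 1), (2, 3), (1, 2), (0, 3)])
```

Within a layer no two gates share a qubit, giving property (a), and every gate outside the first layer shares a qubit with some gate in the layer immediately before it, giving property (b).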
It is easy to show that every CNOT circuit has a canonical form and can be transformed into it in a finite number of steps by the above procedure. Thus, given the constraint graph $G = (V, E)$ of a NISQ device and the maximum degree $\Delta := d_n$ of $G$, we can prove the lower bound $\Omega(n^2/\log \Delta)$ by counting the number of distinct CNOT circuits in canonical form.
Table 1: Performance of different algorithms running on different architectures. The first column lists the architectures we test. Because Kissinger's algorithm and our algorithm depend on the ordering of the vertices, we choose some Hamiltonian path in the 4*5q-grid and test the performance of each algorithm. The second column shows the original number of random CNOT gates (random means that we test with a random legal 0/1 matrix). We test each algorithm on each graph and each size of original circuit; the results are listed in the remaining columns.
