Optimization of CNOT circuits on topological superconducting processors by Wu, Bujiao et al.
Bujiao Wu 1,2, Xiaoyu He1,2, Shuai Yang1,2, Lifu Shou1,2, Guojing Tian1,2, Jialin Zhang1,2, and
Xiaoming Sun1,2
1Institute of Computing Technology, CAS, Beijing, China
2University of Chinese Academy of Sciences, Beijing, China
sunxiaoming@ict.ac.cn
Abstract
We focus on optimizing the depth and size of CNOT circuits under topological connectivity constraints. We prove that any $n$-qubit CNOT circuit can be parallelized to $O(n)$ depth with $n^2$ ancillas on a 2-dimensional grid. For the high-dimensional grid topology in which every qubit connects to $2\log n$ other qubits, we achieve the asymptotically optimal depth $O(\log n)$ with only $n^2$ ancillas. We also consider synthesis without ancillas. We propose an algorithm that uses at most $2n^2$ CNOT gates on an arbitrary connected graph, considerably fewer than previous works. Experiments also confirm the performance of our algorithm. We also design an algorithm for dense graphs, which is asymptotically optimal for regular graphs. All these results can be applied to stabilizer circuits.
Introduction — Quantum circuit synthesis is the process of constructing a quantum circuit that implements a desired unitary operator while optimizing its depth and size with respect to a given gate set. It is an important task in quantum computation and quantum information [19, 24, 26]. CNOT circuits are indispensable building blocks for synthesizing general circuits [1, 5, 22, 26]. By Refs. [4, 7, 25], CNOT together with some single-qubit gates is universal for quantum computing. For stabilizer circuits specifically, Aaronson and Gottesman [1] proved that any stabilizer circuit has a canonical form H-C-P-C-P-C-H-P-C-P-C, where H and P are a layer of Hadamard gates and a layer of Phase gates respectively, and C is a block of CNOT gates. Optimizing CNOT circuits is therefore an important topic, and many researchers have worked on reducing the depth of CNOT circuits with no constraints on the control and target qubits [8, 10, 13, 16, 17], i.e., where a CNOT gate can act on any pair of qubits. For instance, Moore and Nilsson [17] proposed an algorithm that parallelizes any CNOT circuit to $O(\log n)$ depth with $O(n^2)$ ancillas, which matches the depth lower bound $\Omega(\log n)$. Recently, Jiang et al. [13] presented an algorithm that optimizes the depth to $O\left(\max\left\{\log n, \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ for CNOT circuits with $n$ input qubits and $m$ ancilla qubits, and proved that this depth is asymptotically optimal for any number of ancillas.
However, current quantum hardware is limited by the connectivity constraints of its qubits [3, 11, 27] and by environmental noise [6]. Superconducting processors arrange their qubits so that only neighbouring qubits interact. The connectivity constraints of a quantum processor can be represented as a topological graph: each vertex represents a qubit, and a 2-qubit gate can be performed between two qubits only when the corresponding vertices are connected by an edge.
In this manuscript, we mainly consider how to optimize the depth and size of CNOT circuits under topological constraints, with and without ancillas.
First, we aim to optimize the depth of CNOT circuits on topological structures similar to near-term devices [6, 11], using some ancillas. We present a construction on the $d$-dimensional grid, in which every qubit connects to at most $2d$ other qubits. On a $d$-dimensional grid with each dimension of size less than $2n^{2/d}$, our construction parallelizes any $n$-qubit CNOT circuit to $O\left(dn^{2/d}\right)$ depth with at most $2n^2$ ancillas. In particular, if $d = \Omega(\log n)$, the depth becomes $O(\log n)$ with $n^2$ ancillas, which implies that an $O(\log n)$-regular graph can be used instead of a complete graph (in which every pair of qubits is connected) to match the lower bound $\Omega(\log n)$. The size of our construction is $O(n^2)$ for any CNOT circuit.
Next, we aim to optimize the size of CNOT circuits under arbitrary topological constraints without ancillas. Two related works [14, 18] optimize the size of CNOT circuits under topological constraints without ancillas. Kissinger et al. [14] proposed an algorithm that gives an equivalent CNOT circuit of size $2n^2$ for any CNOT circuit on a topological graph that has a Hamiltonian path. Their main idea is to eliminate each column along the Hamiltonian path. For topological graphs with no Hamiltonian path, their generalized algorithm is not efficient and may produce circuits of size $O(n^3)$ in the worst case. Nash et al. [18] simultaneously proposed a similar algorithm, which gives an equivalent CNOT circuit of size $4n^2$ for any CNOT circuit on any connected graph.
arXiv:1910.14478v1 [quant-ph] 31 Oct 2019
We put forward an algorithm that achieves size $2n^2$ in the worst case, regardless of the topological structure. In our numerical results, our algorithm saves on average 41.4% and 9.4% of the CNOT gates compared to the algorithms of Kissinger et al. [14] and Nash et al. [18], respectively. Furthermore, our algorithm uses fewer CNOT gates than the other algorithms on 82.3% of the random test circuits. For the $d$-dimensional grid, our algorithm and these two algorithms all give $O(dn^{1+1/d})$ parallel depth.
Our third algorithm, which can serve as a forward-looking recommendation for the qubit-connectivity layout of topological superconducting processors, achieves an $O\left(\frac{n^2}{\log r}\right)$ worst-case size bound on $r$-regular graphs. Furthermore, we prove that this bound is tight, i.e., the size lower bound for $r$-regular graphs is also $\Omega\left(\frac{n^2}{\log r}\right)$.
Optimize the depth of CNOT circuits on d-dimensional grid graph — Due to decoherence, a quantum computing task must be finished in a short time, so it is essential to make the quantum circuit as shallow as possible to ensure that it can be run on near-term quantum devices. Due to technological restrictions, one qubit cannot connect to too many other qubits in the topological structure. We propose a CNOT circuit construction on the $d$-dimensional grid, in which every qubit connects to at most $2d$ other qubits, whose parallel depth is the asymptotically optimal $O(\log n)$ when $d = \Omega(\log n)$.
By the results in Refs. [17, 22], any $n$-qubit CNOT circuit can be represented as an invertible binary matrix $M \in GL(n, 2)$. Moreover, a CNOT gate with control qubit $i$ and target qubit $j$ is equivalent to a Gaussian row elimination step that adds row $i$ to row $j$ of $M$, as shown by Patel et al. [22]. Thus optimizing the size of a CNOT circuit is equivalent to minimizing the number of row additions needed to transform an invertible matrix into the identity, and optimizing the depth is equivalent to minimizing the number of rounds of parallel row additions. The sequence of row additions corresponds to the inverse matrix, so the reduced circuit can be read off from this sequence.
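The correspondence between CNOT gates and row additions can be illustrated with a minimal Python sketch. It ignores connectivity constraints (any pair of qubits may interact) and uses plain Gauss-Jordan elimination over GF(2), not any of the optimized methods from the cited works. The helper names `apply_cnot`, `synthesize` and `random_invertible` are illustrative assumptions and do not come from the paper.

```python
import random

def apply_cnot(M, control, target):
    """Add row `control` to row `target` over GF(2), in place."""
    M[target] = [a ^ b for a, b in zip(M[target], M[control])]

def synthesize(M):
    """Reduce an invertible 0/1 matrix to the identity by row additions.

    Returns the additions as (control, target) pairs in the order applied.
    """
    n = len(M)
    A = [row[:] for row in M]  # work on a copy
    ops = []
    for col in range(n):
        # Ensure a 1 on the diagonal by adding a lower row that has a 1 here.
        if A[col][col] == 0:
            pivot = next(r for r in range(col + 1, n) if A[r][col] == 1)
            apply_cnot(A, pivot, col)
            ops.append((pivot, col))
        # Clear every other 1 in this column.
        for r in range(n):
            if r != col and A[r][col] == 1:
                apply_cnot(A, col, r)
                ops.append((col, r))
    return ops

def random_invertible(n, steps, rng):
    """Build an invertible matrix by applying random row additions to I."""
    M = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        control, target = rng.sample(range(n), 2)
        apply_cnot(M, control, target)
    return M

rng = random.Random(0)
M = random_invertible(5, 40, rng)
ops = synthesize(M)  # applied in this order, the additions realize M^{-1}

# Replaying the additions in reverse order on the identity rebuilds M.
replay = [[int(i == j) for j in range(5)] for i in range(5)]
for control, target in reversed(ops):
    apply_cnot(replay, control, target)
assert replay == M
```

Each recorded pair is one CNOT gate, so `len(ops)` is the size of the synthesized circuit; this naive elimination uses at most $n^2$ row additions.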
We say that a circuit $C_{n,m}$ with $n$ input qubits and $m$ ancilla qubits on a topological graph $G(V,E)$ implements a unitary operator $U$ if, for any input $|x\rangle$ with ancillas $|0^{\otimes m}\rangle$, the output qubits and ancillas of $C_{n,m}$ are $U|x\rangle$ and $|0^{\otimes m}\rangle$, respectively.
Figure 1 (panels (a)-(f)): The algorithm process of CNOT circuits on 2-dimensional grid graph.
We first propose an algorithm that optimizes any $n$-qubit CNOT circuit to $O(n)$ depth with $n^2$ ancillas on the 2-dimensional grid, and then generalize it to any $d$-dimensional grid with $d \ge 3$.
Lemma 1. Any $n$-qubit CNOT circuit on an $n \times n$ grid graph can be parallelized to $O(n)$ depth with at most $n(n-1)$ ancillas.
Proof. Consider an $n \times n$ grid in which every vertex has a coordinate $(i, j)$ with $1 \le i, j \le n$. We place the input qubits $|x\rangle = |x_1, \cdots, x_n\rangle$ on the vertices of the first column, with $|x_i\rangle$ at $(i, 1)$, and use the other vertices for ancilla qubits $|0\rangle$, as depicted in Fig. 1 (a). We denote the CNOT gate with control qubit at $(i_1, j_1)$ and target qubit at $(i_2, j_2)$ by $\mathrm{CNOT}^{(i_1,j_1)}_{(i_2,j_2)}$.
For the invertible matrix $M \in GL(n, 2)$, suppose $y = Mx$; then $y_i = \sum_{j\in[n]} M_{ij}x_j$. The following algorithm implements the CNOT circuit of $M$ with $n(n-1)$ ancillas $|0^{\otimes n(n-1)}\rangle$.
(1) Make $n$ copies of $x_i$ for each $i \in [n]$ in row $i$ of the grid, in parallel, by performing $\mathrm{CNOT}^{(i,1)}_{(i,2)}, \mathrm{CNOT}^{(i,2)}_{(i,3)}, \cdots$ sequentially for each row, as in Fig. 1 (b).
(2) Construct $y_i$ with the qubits in column $i$ for $i \in [n]$ in parallel: perform (a) $\mathrm{CNOT}^{(i,j)}_{(i,j+1)}$ for each $j$ with $M_{i,j} = 0$, with $j$ ranging from $n$ down to 1; (b) $\mathrm{CNOT}^{(i,1)}_{(i,2)}, \mathrm{CNOT}^{(i,2)}_{(i,3)}, \cdots, \mathrm{CNOT}^{(i,n-1)}_{(i,n)}$ sequentially in column $i$ to give $y_i$; (c) the inverse of (b) and (a), except $\mathrm{CNOT}^{(i,n)}_{(i,n+1)}$, to restore the qubits between $(i, 1)$ and $(i, n)$, as in Fig. 1 (c).
(3) Restore the remaining ancillas to 0, except the last row, which holds $y$, by performing the inverse of step (1), as in Fig. 1 (d).
(4) Transform $x$ into $|0^{\otimes n}\rangle$ by performing $M^{-1}$ with $y$ as input and with $x$ and the other qubits as ancillas, as in Fig. 1 (e), and then use swap operations to move the outputs $y$ to the correct positions, as in Fig. 1 (f).
Step (4) is correct since $(Mx, x, 0) \xrightarrow{M^{-1}} (Mx, x \oplus M^{-1} \cdot Mx, 0) = (Mx, 0, 0)$ [17]. After Steps (1)-(4), the algorithm parallelizes any CNOT circuit to $O(n)$ depth and $O(n^2)$ size with $n(n-1)$ ancillas, and a careful analysis gives a parallel depth of $13n$ in the worst case.
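The uncomputation identity behind Step (4) can be checked numerically. The following Python sketch works on bit vectors rather than on a grid layout, and the helpers `gf2_inverse` and `gf2_matvec` as well as the example matrix are illustrative assumptions, not part of the paper's construction.

```python
def gf2_inverse(M):
    """Invert a 0/1 matrix over GF(2) by Gauss-Jordan elimination on [M | I]."""
    n = len(M)
    A = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        # Raises StopIteration if M is singular.
        pivot = next(r for r in range(col, n) if A[r][col] == 1)
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col and A[r][col] == 1:
                A[r] = [a ^ b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

def gf2_matvec(M, x):
    """Matrix-vector product over GF(2)."""
    return [sum(m & v for m, v in zip(row, x)) % 2 for row in M]

M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
x = [1, 0, 1]

y = gf2_matvec(M, x)                       # the output register holds Mx
M_inv = gf2_inverse(M)
# XOR M^{-1} y into the register that still holds x: it returns to all zeros.
cleared = [a ^ b for a, b in zip(x, gf2_matvec(M_inv, y))]
assert cleared == [0, 0, 0]
```

Because $M^{-1}Mx = x$, the register holding $x$ is always reset to zero, whatever the input.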
Lemma 1 can be generalized to the $d$-dimensional grid for any constant $d$, as stated in the following theorem.
Theorem 1. Any $n$-qubit CNOT circuit can be parallelized to $O\left(dn^{2/d}\right)$ depth with at most $2n^2$ ancillas on a $d$-dimensional grid in which the first $d-1$ dimensions all have size $n^{2/d}$ and the last dimension has size $2n^{2/d}$. In particular, when $d = \log n$, the depth can be parallelized to $O(\log n)$ with $n^2$ ancillas.
More precisely, Theorem 1 can be obtained by first copying each $x_i$ for $i \in [n]$ into a $\lceil d/2 \rceil$-dimensional grid, and then constructing $y_i$ from the copies of $x$ as in Lemma 1. In particular, when $d$ is even, the number of ancillas can be reduced to $n^2$, as in Lemma 1. We defer the proof of Theorem 1 to Appendix A.
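To see why $d = \log n$ gives logarithmic depth in Theorem 1, note that $n^{2/\log_2 n} = 4$, so $dn^{2/d} = 4\log_2 n$. The short Python sketch below evaluates the quantity $dn^{2/d}$ for a few dimensions; the function name `grid_depth_bound` is an illustrative assumption, the hidden constant of the $O(\cdot)$ is ignored, and the logarithm is taken to base 2.

```python
import math

def grid_depth_bound(n, d):
    """The quantity d * n^(2/d) from Theorem 1, without the hidden constant."""
    return d * n ** (2.0 / d)

n = 2 ** 10
for d in (2, 4, int(math.log2(n))):
    print(d, grid_depth_bound(n, d))
```

For $n = 2^{10}$ the value drops from $2n = 2048$ at $d = 2$ to about $4\log_2 n = 40$ at $d = 10$.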
Furthermore, if we eliminate a block of columns rather than a single column in each step, we can save a $\log n$ factor in the number of ancillas compared to Theorem 1. Related work can be found in Refs. [13, 22]. To achieve $O\left(\frac{n^2}{\log n}\right)$ ancillas, we first enumerate all configurations of each block, and then copy the required configurations to the proper columns. Finally, we add the copies of all blocks to give the output $y$ and restore all ancillas. The details of the results and proofs are in Appendix B.
By Theorem 1, any $n$-qubit CNOT circuit can be optimized to $O(n^2)$ size on the specific $d$-dimensional grid. In the following, we give the optimized size for the ancilla-free case on any connected graph.
Algorithm without ancilla — Qubits are usually a scarce resource, so ancilla qubits may not be available. We therefore study the optimization of circuits without ancillas. Superconducting processors arrange their qubits so that only neighbouring qubits interact, which means that a two-qubit gate can only be performed on two qubits that are adjacent in the corresponding topological graph. Therefore, in the elimination process on a topological graph, we cannot eliminate one row with another if their corresponding vertices are not adjacent. In this case we need to find a path between the two vertices and eliminate along the path. To improve the efficiency of elimination, we eliminate multiple rows together via the minimum Steiner tree of the vertices of these rows; see also [14, 18].
We are given the topological constraint graph $G$ and the matrix $M \in GL(n, 2)$ corresponding to a CNOT circuit. The techniques in Refs. [14, 18] both first reduce the given invertible matrix to an upper triangular matrix, and then reduce the triangular matrix to the identity. In contrast, we propose an algorithm that eliminates the $i$-th row and the $i$-th column simultaneously for a vertex $i \in V$ that is not a cut vertex, then deletes $i$ from $G$, and repeats the process on the remaining invertible matrix of gradually decreasing size. Recall that a cut vertex is a vertex whose removal disconnects a connected graph. Let $e_1 := (1, 0, \cdots, 0)^T$, $e_2 := (0, 1, 0, \cdots, 0)^T$, $\cdots$, $e_n := (0, \cdots, 0, 1)^T$. Steps (1-3) make the $i$-th column equal to $e_i$, and Steps (4-6) make the $i$-th row equal to $e_i^T$. All arithmetic operations are over the binary field $\mathbb{F}_2$. The algorithm works as follows:
(1) Pick a vertex $i \in V$ that is not a cut vertex. Create the terminal set $S$ consisting of the nodes whose rows have a 1 in column $i$, together with $i$ itself. Find a minimum Steiner tree $T$ for the set $S \subseteq V$ in graph $G$.
(2) To make all entries of column $i$ associated with $T$ equal to 1, traverse the Steiner tree $T$ in postorder starting from node $i$; when reaching $j$ with parent $k$, add row $j$ to row $k$ if $M_{ji} = 1$ and $M_{ki} = 0$.
(3) To make all entries of column $i$ in the rows associated with $T$, except row $i$, equal to 0, traverse the Steiner tree $T$ in postorder starting from $i$, and add every row to its children when it is reached.
(4) Create the terminal set $S'$ consisting of the nodes whose rows sum to row $i$ except in column $i$, together with $i$ itself. Find a minimum Steiner tree $T'$ for the set $S' \subseteq V$ in graph $G$.
(5) To make the sum of the rows of the nodes in $T'$ equal to $e_i$, traverse the Steiner tree $T'$ in preorder starting from $i$; when reaching $j \notin S'$, add the $j$-th row to its parent.
(6) To make the $i$-th row of $M$ equal to the sum of the rows of the nodes in $T'$, traverse the Steiner tree $T'$ in postorder starting from $i$, and add every row to its parent when it is reached.
(7) Delete i from graph G and repeat steps (1-7) until
the matrix is identity.
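Steps (1)-(3) above, the column part of one round, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: instead of a minimum Steiner tree it uses the union of BFS-tree paths from $i$ to every terminal (a valid tree whose leaves are terminals, but generally not a minimum one), and the names `bfs_parents` and `eliminate_column` are hypothetical. The row part, steps (4)-(6), is not shown.

```python
from collections import deque

def bfs_parents(adj, root):
    """Parent pointers of a BFS tree of the graph `adj`, rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def eliminate_column(M, adj, i):
    """Make column i of M equal to e_i using row additions along tree edges."""
    parent = bfs_parents(adj, i)
    # Step (1), simplified: union of BFS paths from i to each terminal
    # (assumption: a stand-in for the minimum Steiner tree).
    tree = {i}
    for t in [r for r in adj if M[r][i] == 1]:
        while t not in tree:
            tree.add(t)
            t = parent[t]
    children = {u: [v for v in sorted(tree) if parent[v] == u] for u in tree}
    post = []
    def visit(u):  # postorder traversal starting from the root i
        for c in children[u]:
            visit(c)
        post.append(u)
    visit(i)
    ops = []
    def add_row(src, dst):  # one CNOT: control src, target dst
        M[dst] = [a ^ b for a, b in zip(M[dst], M[src])]
        ops.append((src, dst))
    # Step (2): make every tree node carry a 1 in column i.
    for j in post:
        k = parent[j]
        if k is not None and M[j][i] == 1 and M[k][i] == 0:
            add_row(j, k)
    # Step (3): in postorder, add each row to its children to clear them.
    for k in post:
        for c in children[k]:
            add_row(k, c)
    return ops

# Path graph 0 - 1 - 2 - 3; vertex 3 is a leaf, hence not a cut vertex.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
M = [[1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
ops = eliminate_column(M, adj, 3)
assert [row[3] for row in M] == [0, 0, 0, 1]
```

Every recorded operation acts on a tree edge, so each one is a CNOT between adjacent qubits of the topological graph.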
For any connected graph, there always exists a vertex that is not a cut vertex, and the graph remains connected after that vertex is deleted. Therefore, this algorithm can be applied to any connected graph.
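One simple way to find such a vertex, sketched below in Python, is to take the last vertex discovered by a breadth-first search: it is a leaf of the BFS spanning tree, so deleting it leaves the rest of the tree, and hence the graph, connected. The function names and the example graph are illustrative assumptions; the paper does not prescribe how the non-cut vertex is chosen.

```python
from collections import deque

def bfs_order(adj, root, removed=None):
    """Vertices reachable from `root` in BFS discovery order, skipping `removed`."""
    order, seen = [root], {root, removed}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    return order

def non_cut_vertex(adj):
    """The last vertex discovered by BFS is a leaf of the BFS tree."""
    root = next(iter(adj))
    return bfs_order(adj, root)[-1]

def stays_connected(adj, v):
    """Check that the graph without vertex v is still connected."""
    rest = [u for u in adj if u != v]
    return not rest or len(bfs_order(adj, rest[0], removed=v)) == len(rest)

# A small tree-shaped graph: 0 - 1 - 2, 1 - 3 - 4.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
v = non_cut_vertex(adj)
assert stays_connected(adj, v)
assert not stays_connected(adj, 1)  # vertex 1 is a cut vertex
```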
When we apply CNOT gates in Steps (2-3, 5-6), the number of CNOT gates in each step is less than the number of remaining nodes, so the total size is at most $4(n-1) + 4(n-2) + \cdots + 4 \times 1 \le 2n^2$. Meanwhile, the depth of this algorithm depends on the diameter $D$ and the maximum degree $\Delta$ of the topological graph $G$. When we apply CNOT gates on the Steiner tree in Steps (2-3, 5-6), CNOT gates on different layers of the tree can be parallelized, so the depth needed for each step is at most $D\Delta$. Thus the total depth of the elimination process is at most $\min(2n^2, 4nD\Delta)$. To sum up, we have the following theorem.
Theorem 2. Given a connected topological graph $G(V,E)$ and an $n$-qubit CNOT circuit, there is a polynomial-time algorithm that constructs an equivalent CNOT circuit with at most $2n^2$ gates and depth at most $\min(2n^2, 4nD\Delta)$, where $D$ and $\Delta$ are the diameter and the maximum degree of $G$, respectively.
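The two bounds of Theorem 2 are simple arithmetic and can be evaluated directly. In the Python sketch below, the function names are illustrative assumptions; the example values $n = 20$, $D = 7$ and $\Delta = 4$ are those of a $4 \times 5$ grid.

```python
def worst_case_size(n):
    """Sum 4(n-1) + 4(n-2) + ... + 4*1: four tree passes per eliminated vertex."""
    return sum(4 * remaining for remaining in range(1, n))

def depth_bound(n, diameter, max_degree):
    """The depth bound min(2n^2, 4 n D Delta) from Theorem 2."""
    return min(2 * n * n, 4 * n * diameter * max_degree)

n = 20
assert worst_case_size(n) == 2 * n * (n - 1)   # closed form of the sum
assert worst_case_size(n) <= 2 * n * n
print(worst_case_size(n), depth_bound(n, 7, 4))
```

The sum equals $2n(n-1)$, which is where the $2n^2$ size bound comes from; for small, low-diameter graphs the $2n^2$ term is also the smaller of the two depth bounds.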
In theory, our algorithm has a clear advantage over the existing algorithms [14, 18]. We next show that it is advantageous in experiments as well.
Numerical results — We mainly compare the optimized CNOT circuit size of our algorithm with that of the algorithms proposed by Kissinger et al. [14] and Nash et al. [18] on several NISQ devices. We generate a set of circuits by applying CNOT gates to pairs of connected qubits chosen uniformly at random, and optimize these circuits on different topological graphs using the different algorithms.
To show how our algorithm performs as the number of qubits in the CNOT circuit increases, we classically simulate the three algorithms on the IBM Q20-Tokyo device and on the claw graph of Figure 2, and compare the average optimized size over 200 random input CNOT circuits, as depicted in Figure 3. Our algorithm is clearly superior to the algorithm of Nash et al. on all of the physical devices, since their algorithm has a larger constant factor. Even though the algorithm of Kissinger et al. appears close to ours, our algorithm performs better on 82.3% of the test circuits. Figure 3 (b) compares our algorithm with the two existing algorithms on the graph of Figure 2 with $n$ ranging from 5 to 22, which indicates that the optimized size of the algorithm of Kissinger et al. [14] gets gradually worse relative to ours on this non-Hamiltonian topological graph.
Further, we compare the three algorithms on more topological structures to show our generally better performance, namely IBMQx4, IBMQx5, IBMQ20-Tokyo, the $4 \times 5$ grid and the $2 \times 12$ grid (which is implemented by Ye et al. [27]). Similarly, we run the classical simulation on 200 initial random CNOT circuits and report the average optimized number of CNOT gates in Table 1 of Appendix D. The results are consistent with the left side of Figure 3, which suggests that our algorithm may be more suitable for large and complex structures in the future.
Figure 2: A claw graph, which is the topology of ibmq_vigo and ibmq_ourense when $n = 5$.
We also compare the performance on graphs of different sparseness. We randomly generate many connected graphs with different numbers of edges, and 200 random invertible matrices in $GL(n, 2)$. We then report the average optimized size of the 200 random circuits on the different graphs, as depicted in Figure 5. In Figure 5 (a), the graph is generated uniformly at random on 20 vertices. In Figure 5 (b), the graph is generated by uniformly randomly selecting edges in the $4 \times 5$ grid. The numerical results indicate that the number of optimized CNOTs grows with the sparseness of the graph.
Algorithm suitable for near-future device — With the advance of quantum technologies such as superconducting qubits [5, 6, 11, 27] and quantum dots [12, 15], processors that allow denser qubit connectivity will arise in the near future. We propose an algorithm that aims to optimize the CNOT circuit size on such denser topological superconducting processors.
Our algorithm generalizes the algorithm of Patel et al. [22], which optimizes the CNOT size on the complete graph. The most significant difference between this algorithm and the techniques in [14, 18] is that we eliminate several columns simultaneously, instead of a single column at a time as in [14, 18].
For a graph $G(V,E)$ with $n$ vertices, we assume without loss of generality that the vertex degrees satisfy $d_1 \le d_2 \le \ldots \le d_n$.
Theorem 3. Given a connected graph $G(V,E)$ with $\sum_{i \le k} d_i \ge n$, there is a polynomial-time algorithm that constructs an equivalent CNOT circuit of size $O\left(\frac{n^2}{\log(n/k)}\right)$ for any $n$-qubit CNOT circuit on the topological graph $G$, and at least $\Omega\left(\frac{n^2}{\log d_n}\right)$ CNOT gates are needed for some invertible matrix.
Figure 3: Comparison of the optimized size of our algorithm with the algorithms proposed by Kissinger et al. [14] and Nash et al. [18]. Panel (a) is on the physical device IBM Q20-Tokyo, with different initial numbers of CNOT gates in the circuit to optimize (horizontal axis: number of CNOT gates in input circuit; vertical axis: number of CNOTs). Panel (b) is on a claw graph, with a totally random matrix over GF(2) (horizontal axis: size of graph; vertical axis: number of CNOTs).
Data labels of panel (a) for 20, 50, 100, 200, 400, 600 and 800 input CNOT gates, with the series listed in legend order:
Our Algorithm: 21.7, 63.0, 133.9, 229.8, 289.9, 296.8, 295.5
Kissinger et al., arXiv 2019: 22.0, 69.1, 149.2, 249.8, 308.1, 319.2, 315.9
Nash et al., arXiv 2019: 26.5, 99.0, 247.8, 449.5, 560.7, 568.1, 568.6
Data labels of panel (b) for graph sizes 5, 10, 15, 20 and 22, with the series listed in legend order:
Our Algorithm: 19.1, 106.3, 267.9, 506.2, 632.4
Kissinger et al., arXiv 2019: 22.9, 144.1, 382.9, 731.4, 907.5
Nash et al., arXiv 2019: 28.0, 235.5, 709.9, 1476.6, 1854.8
Figure 4: The elimination process of the algorithm suitable for near-future device.
Theorem 3 implies that for an $r$-regular graph $G(V,E)$ (take $k = \lceil n/r \rceil$), any CNOT circuit can be optimized to a CNOT circuit of size $O(n^2/\log r)$ on $G$, which matches the lower bound. Here we give the main steps for optimizing the size of CNOT circuits to $O\left(\frac{n^2}{\log(n/k)}\right)$, and leave the full proof to Appendix C.
As observed by Moore and Nilsson [17], any CNOT circuit can be represented as a matrix $M \in GL(n, 2)$, and synthesizing a CNOT circuit is equivalent to transforming $M$ to $I$ by Gaussian row eliminations [22].
The following algorithm uses $O\left(\frac{n^2}{\log(n/k)}\right)$ row operations to transform $M$ to $I$.
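First, divide $M$ into $\frac{2n}{\log(n/k)}$ blocks $\left[M_1 \cdots M_{\frac{2n}{\log(n/k)}}\right]$, where $M_i \in \mathbb{F}_2^{n \times \frac{\log(n/k)}{2}}$. Let $I_n = \left[I_{n,1} \cdots I_{n,\frac{2n}{\log(n/k)}}\right]$. Then transform $M_j$ to $I_{n,j}$ for $j \in \left\{1, \cdots, \frac{2n}{\log(n/k)}\right\}$. Steps (1)-(3) state how to transform $M_1$ to $I_{n,1}$.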
(1) Eliminate the first $\frac{\log(n/k)}{2}$ rows to $I_{\frac{\log(n/k)}{2}}$.
(2) Among the remaining rows, find the row whose corresponding vertex has the maximum degree, denoted as row $l$. Traverse all binary vectors in $\mathbb{F}_2^n$ for row $l$ in Gray code order, by adding one row of $I_{\frac{\log(n/k)}{2}}$ to row $l$ at each step.
(3) For each Gray code value of row $l$, use row $l$ to eliminate all rows that have the same value as row $l$, eliminating $k$ rows simultaneously each time if more than $k$ such rows remain, and otherwise eliminating all of the remaining rows simultaneously.
The process of transforming $M_j$ to $I_{n,j}$ for $j > 1$ is almost the same as that of transforming $M_1$ to $I_{n,1}$, except that in step (1) we need to eliminate the corresponding rows of the $j$-th block to $I_{\frac{\log(n/k)}{2}}$. The above steps are summarized in Figure 4.
All eliminations in the above algorithm, in steps (1)-(3), use our first algorithm (the algorithm for NISQ structures) as a sub-process to ensure that the elimination process does not change other rows. The analysis of the optimized size of this algorithm is in Appendix C.
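The Gray code traversal in step (2) works because consecutive Gray code words differ in exactly one bit, so moving row $l$ to the next vector costs a single row addition with one row of the identity block. The Python sketch below illustrates this on a block of width `m`; the names `gray_code_flips` and `traverse` are illustrative assumptions, and the sketch assumes the traversal ranges over the columns of the current block only.

```python
def gray_code_flips(m):
    """For k = 1 .. 2^m - 1, yield the bit flipped between Gray codes k-1 and k."""
    for k in range(1, 2 ** m):
        # The reflected Gray code flips the bit at the lowest set bit of k.
        yield (k & -k).bit_length() - 1

def traverse(m):
    """Walk row l through all 2^m block values, one identity-row addition per step."""
    row = [0] * m
    visited = [tuple(row)]
    for bit in gray_code_flips(m):
        row[bit] ^= 1  # adding row `bit` of the identity block to row l
        visited.append(tuple(row))
    return visited

visited = traverse(3)
assert len(visited) == len(set(visited)) == 8  # every vector appears exactly once
```

With this ordering, visiting all $2^m$ values of the block costs $2^m - 1$ row additions on row $l$.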
It is widely known that the best lower bound for unrestricted CNOT circuit synthesis is $\Omega(n^2/\log n)$ size [22]. This lower bound is obtained by counting the number of distinct CNOT circuits with a fixed number of CNOT gates.
Figure 5: Performance of graphs with different sparseness. The vertical axis (number of CNOTs) is the average optimized size of 200 random circuits on a random graph; the horizontal axis is the number of edges in a general graph in (a) and the number of edges in the $4 \times 5$ grid in (b). The graph is generated by uniformly randomly selecting an edge from the 20-vertex clique in (a), and by uniformly randomly selecting an edge in the $4 \times 5$ grid in (b).
For CNOT circuit synthesis restricted to NISQ structures, the naive counting method above yields only the same lower bound $\Omega(n^2/\log n)$. Nevertheless, inspired by the much more detailed counting technique of Christofides [9], we prove a tighter lower bound $\Omega(n^2/\log D)$ for the synthesis of topological CNOT circuits, where $D$ here denotes the degree of the topological graph.
The main idea of the proof is still to count the number of distinct CNOT circuits with a given number of CNOT gates, but with a more refined counting method. A key observation is that if the CNOT gate $\mathrm{CNOT}_{i,j}$ does not conflict with $\mathrm{CNOT}_{k,l}$, then the two gates can be placed in the same level and executed simultaneously. We defer the proof to Appendix C.
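The naive counting argument can be made concrete with a small Python sketch. It assumes that each gate slot is either empty or holds a CNOT in one of the two directions on one of the graph's edges, so circuits with at most $s$ slots realize at most $(2|E|+1)^s$ matrices, while $|GL(n,2)| = \prod_{i<n}(2^n - 2^i)$ matrices must be reachable. The function names are illustrative assumptions, and this is the naive bound only, not the refined counting of Appendix C.

```python
import math

def log2_gl_order(n):
    """log2 of |GL(n, 2)| = prod_{i < n} (2^n - 2^i)."""
    return sum(math.log2(2 ** n - 2 ** i) for i in range(n))

def naive_size_lower_bound(n, num_edges):
    """Smallest s with (2|E| + 1)^s >= |GL(n, 2)|."""
    choices = 2 * num_edges + 1  # a CNOT in either direction on an edge, or no gate
    return math.ceil(log2_gl_order(n) / math.log2(choices))

n = 20
print(naive_size_lower_bound(n, n * (n - 1) // 2))  # complete graph
print(naive_size_lower_bound(n, n - 1))             # a tree on n vertices
```

Since $\log_2 |GL(n,2)|$ is within 2 of $n^2$, the bound is roughly $n^2/\log_2(2|E|+1)$; sparser graphs offer fewer gate choices and therefore get a somewhat larger bound.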
Summary and outlook — Optimizing the size and depth of topological quantum circuits is one of the main challenges in near-term quantum computing [20, 21, 23]. We propose an algorithm for optimizing the depth of CNOT circuits on a topological structure achievable in the near term, the $d$-dimensional grid. Physical devices with a limited number of qubits on a 2-dimensional grid have been implemented by several groups [11, 27], and our algorithm parallelizes any CNOT circuit to $O(n)$ depth with $n^2$ ancillas on such a grid. Furthermore, when $d = \log n$, we can parallelize any CNOT circuit to $O(\log n)$ depth with $n^2$ ancillas, achieving the asymptotically optimal depth.
We also propose two algorithms to optimize the size of CNOT circuits without ancillas, one for near-term physical devices and one for near-future devices. The numerical results show that our first algorithm reduces the size of CNOT circuits significantly compared to existing techniques [14, 18]. Our second algorithm can serve as theoretical guidance for the design of denser physical devices.
More generally, we may consider the problem of optimizing quantum circuits consisting of CNOT and single-qubit gates. We can naively divide such a circuit into blocks of single-qubit gates and blocks of CNOT gates, and optimize the CNOT blocks using our algorithms. However, how to optimize quantum circuits more efficiently is still an open problem worth further study.
References
[1] Scott Aaronson and Daniel Gottesman. Improved
simulation of stabilizer circuits. Physical Review A,
70(5):052328, 2004.
[2] Patrick Ali, Peter Dankelmann, and Simon Muk-
wembi. Upper bounds on the steiner diame-
ter of a graph. Discrete Applied Mathematics,
160(12):1845–1850, 2012.
[3] Frank Arute, Kunal Arya, Ryan Babbush, Dave
Bacon, Joseph C. Bardin, Rami Barends, Rupak
Biswas, Sergio Boixo, Fernando G. S. L. Brandao,
David A. Buell, Brian Burkett, Yu Chen, Zijun
Chen, Ben Chiaro, Roberto Collins, William Court-
ney, Andrew Dunsworth, Edward Farhi, Brooks
Foxen, Austin Fowler, Craig Gidney, Marissa
Giustina, Rob Graff, Keith Guerin, Steve Habeg-
ger, Matthew P. Harrigan, Michael J. Hartmann,
Alan Ho, Markus Hoffmann, Trent Huang, Travis S.
Humble, Sergei V. Isakov, Evan Jeffrey, Zhang
Jiang, Dvir Kafri, Kostyantyn Kechedzhi, Julian
Kelly, Paul V. Klimov, Sergey Knysh, Alexander
Korotkov, Fedor Kostritsa, David Landhuis, Mike
Lindmark, Erik Lucero, Dmitry Lyakh, Salvatore
Mandrà, Jarrod R. McClean, Matthew McEwen,
Anthony Megrant, Xiao Mi, Kristel Michielsen, Ma-
soud Mohseni, Josh Mutus, Ofer Naaman, Matthew
Neeley, Charles Neill, Murphy Yuezhen Niu, Eric
Ostby, Andre Petukhov, John C. Platt, Chris
Quintana, Eleanor G. Rieffel, Pedram Roushan,
Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger,
Vadim Smelyanskiy, Kevin J. Sung, Matthew D.
Trevithick, Amit Vainsencher, Benjamin Villalonga,
Theodore White, Z. Jamie Yao, Ping Yeh, Adam
Zalcman, Hartmut Neven, and John M. Martinis.
Quantum supremacy using a programmable super-
conducting processor. Nature, 574(7779):505–510,
2019.
[4] Adriano Barenco, Charles H Bennett, Richard
Cleve, David P DiVincenzo, Norman Margolus, Pe-
ter Shor, Tycho Sleator, John A Smolin, and Harald
Weinfurter. Elementary gates for quantum compu-
tation. Physical review A, 52(5):3457, 1995.
[5] Rami Barends, Julian Kelly, Anthony Megrant, An-
drzej Veitia, Daniel Sank, Evan Jeffrey, Ted C
White, Josh Mutus, Austin G Fowler, Brooks
Campbell, et al. Superconducting quantum circuits
at the surface code threshold for fault tolerance. Na-
ture, 508(7497):500, 2014.
[6] Sergio Boixo, Sergei V Isakov, Vadim N Smelyan-
skiy, Ryan Babbush, Nan Ding, Zhang Jiang,
Michael J Bremner, John M Martinis, and Hart-
mut Neven. Characterizing quantum supremacy in
near-term devices. Nature Physics, 14(6):595, 2018.
[7] P Oscar Boykin, Tal Mor, Matthew Pulver, Vwani
Roychowdhury, and Farrokh Vatan. A new univer-
sal and fault-tolerant quantum basis. Information
Processing Letters, 75(3):101–107, 2000.
[8] Anne Broadbent and Elham Kashefi. Paralleliz-
ing quantum circuits. Theoretical computer science,
410(26):2489–2510, 2009.
[9] Demetres Christofides. The asymptotic complexity
of matrix reduction over finite fields. arXiv preprint
arXiv:1406.5826, 2014.
[10] Richard Cleve and John Watrous. Fast parallel cir-
cuits for the quantum fourier transform. In Pro-
ceedings 41st Annual Symposium on Foundations of
Computer Science, pages 526–536. IEEE, 2000.
[11] IBM. Quantum experience. 2017.
[12] A Imamoğlu, David D Awschalom, Guido Burkard,
David P DiVincenzo, Daniel Loss, M Sherwin,
A Small, et al. Quantum information processing
using quantum dot spins and cavity qed. Physical
review letters, 83(20):4204, 1999.
[13] Jiaqing Jiang, Xiaoming Sun, Shang-Hua Teng, Bu-
jiao Wu, Kewen Wu, and Jialin Zhang. Optimal
space-depth trade-off of cnot circuits in quantum
logic synthesis. arXiv preprint arXiv:1907.05087,
2019.
[14] Aleks Kissinger and Arianne Meijer-van de Griend.
Cnot circuit extraction for topologically-
constrained quantum memories. arXiv preprint
arXiv:1904.00633, 2019.
[15] Thaddeus D Ladd, Fedor Jelezko, Raymond Laflamme, Yasunobu Nakamura, Christopher Monroe, and Jeremy Lloyd O'Brien. Quantum computers. Nature, 464(7285):45, 2010.
[16] Dmitri Maslov, Gerhard W Dueck, D Michael
Miller, and Camille Negrevergne. Quantum circuit
simplification and level compaction. IEEE Transac-
tions on Computer-Aided Design of Integrated Cir-
cuits and Systems, 27(3):436–444, 2008.
[17] Cristopher Moore and Martin Nilsson. Parallel
quantum computation and quantum codes. SIAM
Journal on Computing, 31(3):799–815, 2001.
[18] Beatrice Nash, Vlad Gheorghiu, and Michele Mosca.
Quantum circuit optimizations for nisq architec-
tures. arXiv preprint arXiv:1904.01972, 2019.
[19] Michael A Nielsen and Isaac Chuang. Quantum
computation and quantum information, 2002.
[20] Alexandru Paler, Simon J Devitt, and Austin G
Fowler. Synthesis of arbitrary quantum circuits to
topological assembly. Scientific reports, 6:30600,
2016.
[21] Alexandru Paler, Simon J Devitt, Kae Nemoto, and
Ilia Polian. Mapping of topological quantum circuits
to physical hardware. Scientific reports, 4:4657,
2014.
[22] Ketan N Patel, Igor L Markov, and John P Hayes.
Optimal synthesis of linear reversible circuits. Quan-
tum Information & Computation, 8(3):282–294,
2008.
[23] John Preskill. Quantum computing in the nisq era
and beyond. Quantum, 2:79, 2018.
[24] Vivek V. Shende, Stephen S. Bullock, and Igor L.
Markov. Synthesis of quantum-logic circuits. IEEE
Trans. on CAD of Integrated Circuits and Systems,
25(6):1000–1010, 2006.
[25] Yaoyun Shi. Both toffoli and controlled-not need
little help to do universal quantum computation.
arXiv preprint quant-ph/0205115, 2002.
[26] Juha J Vartiainen, Mikko Möttönen, and Martti M Salomaa. Efficient decomposition of quantum gates. Physical review letters, 92(17):177902, 2004.
[27] Yangsen Ye, Zi-Yong Ge, Yulin Wu, Shiyu Wang,
Ming Gong, Yu-Ran Zhang, Qingling Zhu, Rui
Yang, Shaowei Li, Futian Liang, et al. Propaga-
tion and localization of collective excitations on a
24-qubit superconducting processor. Physical Re-
view Letters, 123(5):050502, 2019.
A The parallelized depth for d-dimensional grid.
In this section, we prove the $O\left(dn^{2/d}\right)$ parallel depth of CNOT circuits on the $d$-dimensional grid, as a generalization of the 2-dimensional grid.
Proof of Theorem 1. Suppose we have a $d$-dimensional grid in which each dimension has size $n^{2/d}$. For simplicity, we first suppose that $d$ is even; for odd $d$ the analysis is similar.
Consider an $\left(n^{2/d} \times \cdots \times n^{2/d}\right)$ $d$-dimensional grid. We place each input qubit $|x_i\rangle$ at a vertex of a disjoint $\left(n^{2/d} \times \cdots \times n^{2/d}\right)$ $\frac{d}{2}$-dimensional grid, and let all other vertices be ancillas $|0\rangle$.
(1) Make $n$ copies of $x_i$ for each $i \in [n]$ on the vertices of the $\frac{d}{2}$-dimensional grid.
(2) Construct $y_j$ by combining all essential copies of the input $x$ in $\frac{d}{2}$ different $\frac{d}{2}$-dimensional sub-grids, similar to the algorithm for the 2-dimensional grid.
(3) Restore the remaining ancillas to $|0\rangle$, except $y$ and the input qubits $x$.
(4) Transform $x$ into $|0^{\otimes n}\rangle$ by performing $M^{-1}$ with $y$ as input and with $x$ and the other qubits as ancillas (similar to steps (1-3)), and then swap $y$ to the correct position.
Steps (1-2) have depth $O\left(dn^{2/d}\right)$ and size $O(n^2)$; thus the total depth is $O\left(dn^{2/d}\right)$, the total size is $O(n^2)$, and the total number of ancillas is exactly $n^2$.
For odd $d$, we construct the copies of a block of $n^{1/d}$ inputs $x_i$ in an $\left(n^{2/d} \times \cdots \times n^{2/d}\right)$-size $\frac{d+1}{2}$-dimensional grid; more precisely, we construct $n$ copies of $x_i$ with an $\left(n^{2/d} \times \cdots \times n^{2/d} \times n^{1/d}\right)$-size $\frac{d+1}{2}$-dimensional grid, since $\frac{d}{2}$ is not an integer. Constructing $y_j$ requires:
(a) Combine all essential copies of the input $x$ in $\frac{d-1}{2}$ different $\frac{d+1}{2}$-dimensional grids into the last $\frac{d+1}{2}$-dimensional grid.
(b) Construct $y_j$ by combining the corresponding $n^{1/d}$ different blocks of outputs of the last $\frac{d+1}{2}$-dimensional grid. Since the combining process only has collisions in one dimension, we can avoid collisions by copying the colliding part to an additional $\left(n^{2/d} \times \cdots \times n^{2/d} \times n^{1/d}\right)$-size $d$-dimensional grid.
The restore processes are the same as steps (3)-(4) of the even
case. By step (b), the construction needs in total
$$\left(n^{2/d}\right)^{d-1} \times \left(n^{2/d} + n^{1/d}\right) = n^2 + n^{2-\frac{1}{d}} < 2n^2$$
ancillas.
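The ancilla count for odd d can be checked by expanding the product, reading the second factor as $n^{2/d} + n^{1/d}$ (the reading that matches the stated result) and using $(n^{2/d})^{d-1} = n^{2-2/d}$:

```latex
\begin{align*}
\left(n^{2/d}\right)^{d-1}\left(n^{2/d} + n^{1/d}\right)
  &= n^{2-2/d}\cdot n^{2/d} + n^{2-2/d}\cdot n^{1/d} \\
  &= n^{2} + n^{2-1/d} \\
  &< 2n^{2} \qquad \text{for } n > 1 .
\end{align*}
```

For instance, with $d = 3$ and $n = 27$ the sides are $n^{2/d} = 9$ and $n^{1/d} = 3$, so the count is $9^2 \cdot (9 + 3) = 972 = 729 + 243$, which is below $2n^2 = 1458$.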
B Optimizing the ancillas of Theorem 1 to $O\left(\frac{n^2}{\log n}\right)$
In this section, we show how to achieve depth $O(dn^{2/d})$ with
$O\left(\frac{n^2}{\log n}\right)$ ancillas on the d-dimensional grid
for $d \ge 3$. We require $d \ge 3$ because our algorithm does not
appear to work when $d = 2$.
Corollary 1. Any CNOT circuit can be parallelized to
$O(dn^{2/d})$ depth with $O\left(\frac{n^2}{\log n}\right)$ ancillas on
the d-dimensional grid, where $d \ge 3$.
For simplicity, we only give the proof for the 3-dimensional
grid; a similar analysis yields the results for the
d-dimensional grid with d > 3.
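Proof. Consider the
$$\left(n^{2/3} + a\log n\right) \times \left(n^{2/3} + n^{1/3+a}\right) \times \left(\frac{n^{2/3}}{\log n} + n^{1/3}\right)$$
grid, where $a < \frac{1}{3}$. The n inputs $|x\rangle$ are laid out in
the first $a\log n$ rows of an $(n^{2/3} + a\log n) \times 2n^{1/3}$
grid, shown as the rightmost lattice of Figure 6.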
The construction of $y_j$ is depicted in Figure 7 and
Figure 8, and is similar to the construction in Lemma 1.
Let $X_j := \{x_{ja\log n+1}, \ldots, x_{(j+1)a\log n}\}$ for
$0 \le j < \frac{n}{\log n}$, and define
$$U_j := \left\{ \sum_{i \le l} x'_i \;\middle|\; 0 < l \le a\log n,\ x'_i \in X_j \right\},$$
$$v_{s,j} := \sum_{ja\log n < i \le (j+1)a\log n} M_{s,i}\, x_i .$$
Then
$$y_s = \sum_i M_{s,i}\, x_i = \sum_j v_{s,j} .$$
Meanwhile, $v_{s,j} \in U_j$ for any $s \in [n]$. Our algorithm is
as follows.
Figure 6: The CNOT circuit construction process of the
algorithm with $O(n^2/\log n)$ ancillas on the 3-dimensional
grid.
(1) Make $n^a$ copies of $x_i$ for each $i \in [n]$ in parallel.
(2) Generate $n^{2/3}$ copies of $U_j$ in parallel from the $n^a$
copies of the set $X_j$, for $0 \le j < \frac{n}{\log n}$.
(3) For $rn^{2/3} < s \le (r+1)n^{2/3}$, where $0 \le r < n^{2/3}$,
generate $v_{s,j}$ from the $n^{2/3}$ copies of $U_j$ in parallel,
and at the same time transfer the outputs $v_{s,j}$ to the
leftmost nonzero qubits by SWAP operations (each of which
can be implemented by three CNOTs).
(4) Add all $v_{s,j}$ with the same j in different slices
sequentially (from the first slice to the
$\frac{an^{2/3}}{\log n}$-th slice).
(5) For the last $n^{1/3}$ layers of slices, (i) copy $v_{s,j}$ to the
j-th layer for $2 \le j \le n^{1/3}$, and (ii) add all $v_{s,j}$
for different s from right to left sequentially in the
j-th layer, as in Figure 8.
(6) Restore all ancillas and move y to the position symmetric
to the input x, then transform x into 0 by regarding y as
the input and performing the inverse operation $M^{-1}$.
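The identity $y_s = \sum_j v_{s,j}$ that steps (2)-(4) exploit can be checked classically. The following is a minimal sketch over GF(2); the block size `b` stands in for $a\log n$, and the helper names `block_partial_sums` and `output_bit` are our own illustrative choices rather than anything defined in the paper.

```python
import random


def block_partial_sums(M_row, x, b):
    """Return [v_{s,0}, v_{s,1}, ...]: the GF(2) partial sums of one output row.

    M_row : list of 0/1 entries of row s of the matrix M
    x     : list of 0/1 input bits
    b     : block size (plays the role of a*log n in the text)
    """
    sums = []
    for start in range(0, len(x), b):
        v = 0
        for i in range(start, min(start + b, len(x))):
            v ^= M_row[i] & x[i]    # addition over GF(2) is XOR
        sums.append(v)
    return sums


def output_bit(M_row, x):
    """Direct evaluation of y_s = sum_i M_{s,i} x_i over GF(2)."""
    y = 0
    for m, xi in zip(M_row, x):
        y ^= m & xi
    return y


random.seed(0)
n, b = 16, 4
x = [random.randint(0, 1) for _ in range(n)]
M_row = [random.randint(0, 1) for _ in range(n)]

y_from_blocks = 0
for part in block_partial_sums(M_row, x, b):
    y_from_blocks ^= part
print(y_from_blocks == output_bit(M_row, x))  # True
```

Each $v_{s,j}$ depends only on the b inputs of block j, so it takes one of at most $2^b$ possible linear combinations of that block. With $b = a\log n$ this is $n^a$ combinations, which is why copies of $U_j$ can be prepared once and reused for many outputs s.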
The total number of ancillas is $O\left(\frac{n^2}{\log n}\right)$, and
the depth is
$$O\left(a\log n + n^a + n^{2/3} + \frac{an^{2/3}}{\log n} + n^{2/3}\right),$$
which equals $O(n^{2/3})$ for some small a.
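Both bounds can be verified term by term. The following is a sketch under the assumptions $0 < a < \frac{1}{3}$ and n large enough that $a\log n \le n^{2/3}$ and $\log n \le n^{1/3}$; the constants 8 and 5 are only what this crude estimate gives and are not claimed in the paper.

```latex
\begin{align*}
n^{2/3} + a\log n &\le 2n^{2/3}, \\
n^{2/3} + n^{1/3+a} &\le 2n^{2/3}, \\
\frac{n^{2/3}}{\log n} + n^{1/3} &\le \frac{2n^{2/3}}{\log n},
\intertext{so the number of qubits in the grid is at most}
2n^{2/3}\cdot 2n^{2/3}\cdot\frac{2n^{2/3}}{\log n}
  &= \frac{8n^{2}}{\log n} = O\!\left(\frac{n^{2}}{\log n}\right),
\intertext{and, since each of the five depth terms is at most $n^{2/3}$,}
a\log n + n^{a} + n^{2/3} + \frac{an^{2/3}}{\log n} + n^{2/3}
  &\le 5n^{2/3} = O\!\left(n^{2/3}\right).
\end{align*}
```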
C Proof of Theorem 3
Let $d_G(S)$ be the size of the minimum Steiner tree
of $S \subseteq V$ in $G(V,E)$, and let $d_k(G)$ be the maximum of
$d_G(S)$ over all S such that $|S| = k$. Let $\bar{d}_S$ denote the
average degree of the vertices in S.
Lemma 2. Given an integer k and a connected graph $G(V,E)$
with $\sum_{i \le k} d_i \ge n$, we have $d_k(G) \le 5k$.
Figure 7: The process for construction of each small
block of the first $\frac{n^{2/3}}{\log n}$ layers in Figure 6.
Figure 8: The process for construction of the last $n^{1/3}$
layers in Figure 6.
This lemma can be obtained by applying the technique of
Theorem 2.4 in [2]. For completeness, we give the proof of the
lemma below. The main idea of the proof is that if two vertices
are at distance at least 3, then they share no common neighbors.
Proof of Lemma 2. Let $S := \{u_1, \ldots, u_k\} \subseteq V$ and let A be
an empty set. First put $u_1$ into A, and then repeatedly put into
A every vertex $v_i$ for which $d(A, v_i) = 3$. The resulting set
satisfies:
• $d(v_i, v_j) \ge 3$ if $v_i, v_j \in A$.
• $d(A, v_i) \le 2$ if $v_i \notin A$.
Let $A'$ be the set whose element $a_j$ is the vertex of A that is
closest to $u_j$ in the graph G. By the construction of A, we have
$d_G(\{a_1, \ldots, a_k\}) \le 3(|A| - 1) + 1 \le 3|A|$.
If $|A| > k$, then $\bar{d}_A \ge \frac{n}{k}$. Since vertices at pairwise
distance at least 3 have disjoint closed neighborhoods,
$$|A|(\bar{d}_A + 1) \le n \;\Rightarrow\; |A| \le k,$$
which contradicts the assumption; thus $|A| \le k$. Therefore,
we have
$$d_G(S) \le d_G(\{a_1, \ldots, a_k\}) + \sum_{j=1}^{k} d_G(a_j, u_j) \le 3|A| + 2(k-1) \le 5k .$$
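The greedy construction of the set A in the proof can be sketched with breadth-first search. This is only an illustration under our own assumptions: the graph is given as an adjacency dictionary, ties between candidate vertices are broken by dictionary order, and the names `bfs_distances` and `build_spread_set` are hypothetical helpers.

```python
from collections import deque


def bfs_distances(adj, sources):
    """Multi-source BFS: distance of every vertex to the nearest source."""
    dist = {v: None for v in adj}
    queue = deque()
    for s in sources:
        dist[s] = 0
        queue.append(s)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] is None:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def build_spread_set(adj, start):
    """Grow A from `start` by adding a vertex at distance exactly 3 from A
    until no such vertex exists (the graph is assumed connected)."""
    A = [start]
    while True:
        dist = bfs_distances(adj, A)
        candidates = [v for v in adj if dist[v] == 3]
        if not candidates:
            return A
        A.append(candidates[0])


# Path graph on 10 vertices: 0 - 1 - 2 - ... - 9
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
A = build_spread_set(adj, 0)
print(A)  # [0, 3, 6, 9]
# Closed neighborhoods of A are disjoint, so their sizes sum to at most n.
print(sum(len(adj[a]) + 1 for a in A) <= len(adj))  # True
```

In a connected graph, the loop stops only when every vertex outside A is within distance 2 of A, which gives the two bullet properties used in the proof.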
We state the following lemma, which is used in the proof of Theorem 3.
Lemma 3. Given an integer k and a connected graph $G(V,E)$
with $d_k(G) \le ck$, where c is a constant, there is a
polynomial-time algorithm that constructs an equivalent CNOT
circuit of size $O\left(\frac{n^2}{\log(n/k)}\right)$ for any CNOT circuit
on the topological graph G.
Proof. Our second algorithm serves as the polynomial-time
algorithm that gives a CNOT circuit of size
$O\left(\frac{n^2}{\log(n/k)}\right)$ for any CNOT circuit on the
topological graph G. We now prove that the size produced by our
algorithm is $O\left(\frac{n^2}{\log(n/k)}\right)$ in the worst case.
Since $d_k(G) \le ck$, each time we can eliminate any k rows with
at most ck CNOTs by our first algorithm or the techniques in
[14, 18]. Thus the size of steps a)-c), which transform $M_1$ to
$I_{n,1}$, is
$$\underbrace{O\left(k\log^2\frac{n}{k}\right)}_{\text{Step a)}} + \underbrace{O\left(k\sqrt{n/k}\right)}_{\text{Step b)}} + \underbrace{O(n)}_{\text{Step c)}} = O(n)$$
since $k \le n$. Thus the total size is
$$\frac{2n}{\log(n/k)} \times O(n) = O\left(\frac{n^2}{\log(n/k)}\right).$$
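The $O(n)$ bound for steps a)-c) can be seen term by term. The following is a short derivation under our own reading of the step a) cost as $k\log^2(n/k)$, writing $t = n/k \ge 1$ and using the elementary inequality $\log_2 t \le 2\sqrt{t}$ (our choice; any bound of the form $\log^2 t = O(t)$ would do):

```latex
\begin{align*}
k\log^{2}\frac{n}{k} &= k\log^{2} t \;\le\; k\cdot 4t \;=\; 4n, \\
k\sqrt{n/k} &= \sqrt{kn} \;\le\; n \qquad (\text{since } k \le n), \\
\text{hence}\quad
O\!\left(k\log^{2}\tfrac{n}{k}\right) + O\!\left(k\sqrt{n/k}\right) + O(n) &= O(n).
\end{align*}
```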
Proof of Theorem 3. The upper bound of Theorem 3 holds
by Lemma 2 and Lemma 3. We now prove the lower bound
$\Omega\left(\frac{n^2}{\log d_n}\right)$, where $d_n$ is the maximum
degree.
We denote the elementary row operation that adds the
i-th row to the j-th row by $R(i, j)$, and we call $\{i, j\}$ its
index set.
The main idea of the proof is still to count the number of
distinct CNOT circuits with k CNOT gates, but with a more
detailed counting method. A key observation is that $R(i, j)$
commutes with $R(k, l)$ if $\{i, j\} \cap \{k, l\} = \emptyset$. In other
words, if the gate $\mathrm{CNOT}_{i,j}$ does not conflict with
$\mathrm{CNOT}_{k,l}$, they can be put into the same level and executed
simultaneously. Therefore, we can rearrange the order of the
CNOT gates using commutativity so that the depth of the circuit
is minimized, which we call the 'canonical' form of the CNOT
circuit. For any CNOT circuit with k gates represented as a
sequence of elementary row operations $R_1, R_2, \ldots, R_k$, the
transformation proceeds as follows.
1. First, we partition the matrix sequence $R_1, R_2, \ldots, R_k$
into several blocks $G_1, G_2, \ldots, G_s$. The index sets of the
matrices in each block should be pairwise disjoint. In other
words, matrices in the same block commute.
2. Starting from i = 2, for every matrix in block $G_i$, we check
whether its index set is disjoint from that of every matrix
in the previous block; if so, we move this matrix into block
$G_{i-1}$ using commutativity. We then repeat this process for
the matrix until it is in the first block or some matrix in
the previous block conflicts with its index set.
3. Let i = i + 1 and repeat step 2 until i = s. Finally, we
obtain the modified blocks $G'_1, G'_2, \ldots, G'_s$ of the matrix
sequence $R_1, R_2, \ldots, R_k$, which satisfy the following
properties; we call this the canonical form.
(a) The index sets of the matrices in each block are
pairwise disjoint;
(b) From the second block onward, for every matrix in the
block, at least one element of its index set belongs to
the index set of some matrix in the previous block.
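The procedure above can be illustrated with a one-pass greedy layering (as-soon-as-possible scheduling). This is a sketch of an equivalent formulation rather than a literal transcription of steps 1-3: each gate is placed in the earliest layer that comes after every earlier gate sharing one of its qubits. The names `canonical_layers` and `parity_rows` are our own.

```python
def canonical_layers(gates):
    """Place each CNOT (control, target) in the earliest layer it can reach.

    A gate may move past an earlier gate only if their index sets are disjoint,
    so its layer is one more than the latest layer touching either of its qubits.
    """
    last_layer = {}          # qubit -> index of the last layer that used it
    layers = []
    for c, t in gates:
        layer = max(last_layer.get(c, -1), last_layer.get(t, -1)) + 1
        if layer == len(layers):
            layers.append([])
        layers[layer].append((c, t))
        last_layer[c] = last_layer[t] = layer
    return layers


def parity_rows(gates, n):
    """GF(2) matrix of a CNOT sequence as row bitmasks; CNOT(c, t) adds row c to row t."""
    rows = [1 << i for i in range(n)]
    for c, t in gates:
        rows[t] ^= rows[c]
    return rows


gates = [(0, 1), (2, 3), (1, 2), (4, 5), (0, 1)]
layers = canonical_layers(gates)
print(layers)  # [[(0, 1), (2, 3), (4, 5)], [(1, 2)], [(0, 1)]]

# Reading the layers in order gives a reordering that implements the same matrix.
reordered = [g for layer in layers for g in layer]
print(parity_rows(reordered, 6) == parity_rows(gates, 6))  # True
```

Gates within one layer act on disjoint qubits (property (a)), and every gate outside the first layer shares a qubit with some gate in the layer directly before it (property (b)).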
It is easily shown that every CNOT circuit has a canonical
form and can be transformed into it in finitely many steps by
the above procedure. Thus, given the constraint graph
$G = (V,E)$ of a NISQ device, with maximum degree $\Delta := d_n$, we
can prove the lower bound $\Omega(n^2/\log\Delta)$ by counting the number
of distinct CNOT circuits in canonical form.
For any CNOT circuit with k gates $R_1, R_2, \ldots, R_k$,
we denote its canonical form by $\{G_1, G_2, \ldots, G_s\}$, in
which the lengths of the blocks are $r_1, r_2, \ldots, r_s$
respectively. We first consider the number of different
partitions, i.e., different choices of $r_1, r_2, \ldots, r_s$. This
number is $2^{k-1}$, since every subset of $[k-1]$, taken as the set
of cut positions, gives exactly one partition of $[k]$ into
consecutive blocks.
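The count $2^{k-1}$ can be confirmed by brute force for small k. The sketch below enumerates every choice of cut positions among the $k-1$ gaps of a length-k sequence; the function name `block_lengths` is our own.

```python
from itertools import combinations


def block_lengths(k):
    """Yield every way to cut the sequence 1..k into consecutive non-empty blocks."""
    for m in range(k):                            # number of cuts: 0 .. k-1
        for cuts in combinations(range(1, k), m):
            bounds = (0,) + cuts + (k,)
            yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


for k in range(1, 8):
    assert sum(1 for _ in block_lengths(k)) == 2 ** (k - 1)

print(list(block_lengths(3)))  # [(3,), (1, 2), (2, 1), (1, 1, 1)]
```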
Next, we derive an upper bound on the number of
distinct canonical forms for a given partition
$r_1, r_2, \ldots, r_s$. For block 1, the index sets of the matrices
are required to be disjoint and the size of an edge independent
set is less than n; therefore, the number is at most
$2\binom{n}{r_1}$, where the factor 2 is due to the different orders
of the indices. Considering that matrices in the same block
commute, the number of distinct combinations is
$\binom{n}{r_1}/r_1!$. Subsequently, for block 2, each index set of
its matrices has at least one element in common with the index
sets of block 1, so we need to choose $r_2$ indices from the index
set of block 1. The number of possible combinations is
$\binom{2r_1}{r_2}$, and there are at most $\Delta$ choices for the other
index of each matrix, since CNOT gates can only act on
nearest-neighbour qubits and the maximum degree of the graph
is $\Delta$. All this leaves for block 2 at most
$2\Delta\binom{2r_1}{r_2}/r_2!$ possible combinations. For the same
reason, block i has at most $2\Delta\binom{2r_{i-1}}{r_i}/r_i!$ possible
combinations. In all, the upper bound on the number of distinct
canonical forms is
$$\frac{2^s \Delta^{s-1} \binom{n}{r_1}\binom{2r_1}{r_2}\cdots\binom{2r_{s-1}}{r_s}}{r_1!\, r_2! \cdots r_s!} \qquad (1)$$
Since $\binom{n}{r_1} < n^{r_1} < 2^{n\log n}$, $s < k$, and
$\binom{2r_i}{r_{i+1}} < 2^{r_{i+1}} r_i^{r_{i+1}}$, we can relax the upper
bound to
$$\frac{4^k \Delta^k 2^{n\log n}\, r_1^{r_2} r_2^{r_3} \cdots r_{s-1}^{r_s}}{r_1!\, r_2! \cdots r_s!} \qquad (2)$$
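Using Stirling's formula and the rearrangement inequality,
together with $r_1 + r_2 + \cdots + r_s = k$, we have the following
inequality
$$r_1!\, r_2! \cdots r_s! \;\ge\; \left(\frac{r_1}{e}\right)^{r_1}\left(\frac{r_2}{e}\right)^{r_2}\cdots\left(\frac{r_s}{e}\right)^{r_s} \;\ge\; e^{-k}\, r_1^{r_2} r_2^{r_3} \cdots r_{s-1}^{r_s} r_s^{r_1} \;\ge\; e^{-k}\, r_1^{r_2} r_2^{r_3} \cdots r_{s-1}^{r_s} \qquad (3)$$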
Therefore, we obtain the more relaxed upper bound
$$4^k e^k \Delta^k 2^{n\log n} \qquad (4)$$
Since the number of n-qubit CNOT circuits equals the number
of invertible matrices in $\mathbb{F}_2^{n\times n}$, there are at least
$2^{n(n-1)/2}$ distinct n-qubit CNOT circuits. If we want
to construct all CNOT circuits with at most k CNOT gates, k
must satisfy
$$k \ge \frac{n(n-1)/2 - n\log n}{\log\Delta + 2 + \log e} \qquad (5)$$
In other words, we need at least $\Omega(n^2/\log\Delta)$ CNOT
gates to construct all of the invertible matrices in GL(n, 2).
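To get a feel for the counting argument, the sketch below evaluates the exact order of GL(n, 2) and the right-hand side of inequality (5). It assumes base-2 logarithms throughout; the function names and the sample values n = 20 and maximum degree 4 (as in a 4*5 grid) are illustrative choices of ours.

```python
import math


def gl2_order(n):
    """Number of invertible n x n matrices over GF(2): prod_{i<n} (2^n - 2^i)."""
    order = 1
    for i in range(n):
        order *= (1 << n) - (1 << i)
    return order


def cnot_lower_bound(n, max_degree):
    """Right-hand side of inequality (5), with logarithms taken base 2."""
    numerator = n * (n - 1) / 2 - n * math.log2(n)
    denominator = math.log2(max_degree) + 2 + math.log2(math.e)
    return numerator / denominator


n = 20
# The exact count is at least 2^{n(n-1)/2}, as used in the text.
assert gl2_order(n) >= 2 ** (n * (n - 1) // 2)
print(cnot_lower_bound(n, 4))
```

The bound is asymptotic, so for small n the numerical value is weak; its purpose is to show the $n^2/\log\Delta$ scaling.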
D Performance of different algorithms on different architectures
The comparison of our algorithm with two existing algorithms
is shown in Table 1. The first column of the table lists the
architectures we tested. Because Kissinger's algorithm and our
algorithm depend on the ordering of the vertices, we choose
several Hamiltonian paths in the 4*5q-grid and test the
performance of each algorithm. The second column shows the
original number of random CNOT gates ("random" means we test
with a random legal 0/1 matrix). We test each algorithm on each
graph and each original circuit size; the results are listed in
the remaining columns.
Architecture                            #CNOTs  Nash et al., arXiv 2019  Kissinger et al., arXiv 2019  Our algorithm
4*5q-grid                               20      26.46                    22.04                         21.68
4*5q-grid                               50      99.02                    69.11                         62.99
4*5q-grid                               100     247.78                   149.15                        133.90
4*5q-grid                               200     449.50                   249.81                        229.75
4*5q-grid                               400     560.74                   308.12                        289.88
4*5q-grid                               600     568.10                   319.20                        296.81
4*5q-grid                               800     568.59                   315.88                        295.52
4*5q-grid (different Hamiltonian path)  random  -                        302.60                        278.86
4*5q-grid (different Hamiltonian path)  random  -                        356.26                        285.95
4*5q-grid (different Hamiltonian path)  random  -                        310.41                        287.45
4*5q-grid (different Hamiltonian path)  random  -                        353.06                        281.74
4*5q-grid (different Hamiltonian path)  random  -                        394.62                        382.13
20q-star                                20      58.97                    -                             39.05
20q-star                                50      172.44                   -                             110.94
20q-star                                100     289.91                   -                             190.04
20q-star                                200     352.08                   -                             230.84
20q-star                                400     358.95                   -                             234.87
20q-star                                800     357.05                   -                             233.70
IBM-Qx4                                 random  18.16                    15.40                         15.18
IBM-Qx5                                 random  461.79                   243.90                        234.48
IBM-Qx20-tokyo                          random  531.34                   302.60                        278.86
Yangsen Ye et al., PRL 2019             random  1258.40                  578.07                        571.99
Table 1: Performance of different algorithms running on different architectures. The first column lists the architectures
we tested. Because Kissinger's algorithm and our algorithm depend on the ordering of the vertices, we choose several
Hamiltonian paths in the 4*5q-grid and test the performance of each algorithm. The second column shows the original
number of random CNOT gates ("random" means we test with a random legal 0/1 matrix). We test each algorithm
on each graph and each original circuit size; the results are listed in the remaining columns.
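A "legal" 0/1 matrix here is an invertible matrix over GF(2). The paper does not spell out how the random test inputs were generated, so the following is only a sketch under our own assumptions: gates are drawn uniformly over ordered pairs of distinct qubits, the matrix is stored as a list of row bitmasks, and the function names are hypothetical.

```python
import random


def random_cnot_parity_matrix(n, num_gates, rng):
    """Apply `num_gates` random CNOTs to the identity and return the n x n
    GF(2) matrix as row bitmasks (bit i of rows[t] is the entry in row t, column i)."""
    rows = [1 << i for i in range(n)]         # identity matrix
    for _ in range(num_gates):
        c, t = rng.sample(range(n), 2)        # distinct control and target
        rows[t] ^= rows[c]                    # CNOT adds row c to row t
    return rows


def is_invertible_gf2(rows):
    """Gauss-Jordan elimination over GF(2) on row bitmasks."""
    rows = list(rows)
    n = len(rows)
    for col in range(n):
        pivot = next((r for r in range(col, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return False
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]
    return True


rng = random.Random(2019)
M = random_cnot_parity_matrix(20, 100, rng)   # 20 qubits, 100 original CNOTs
print(is_invertible_gf2(M))  # True
```

Any product of CNOTs is invertible, so a matrix generated this way is always a valid input for the synthesis algorithms compared in Table 1.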
