Quantum Supremacy Circuit Simulation on Sunway TaihuLight by Li, Riling et al.
Quantum Supremacy Circuit Simulation on Sunway
TaihuLight
1st Riling Li
Department of Computer Science & Technology
Tsinghua University
Beijing, China
rl-li16@mails.tsinghua.edu.cn
2nd Bujiao Wu
CAS Key Lab of Network Data Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China
wubujiao@ict.ac.cn
3rd Mingsheng Ying
Centre for Quantum Software and Information, University of Technology Sydney
Sydney, Australia
State Key Lab of Computer Science, Institute of Software, Chinese Academy of Sciences
Department of Computer Science & Technology, Tsinghua University
Beijing, China
Mingsheng.Ying@uts.edu.au
4th Xiaoming Sun
CAS Key Lab of Network Data Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China
sunxiaoming@ict.ac.cn
5th Guangwen Yang
Department of Computer Science & Technology
Tsinghua University
Beijing, China
ygw@tsinghua.edu.cn
Abstract— With the rapid progress made by industry and
academia, quantum computers with dozens of qubits or even
larger size are being realized. However, the fidelity of existing
quantum computers often sharply decreases as the circuit depth
increases. Thus, an ideal quantum circuit simulator on classical
computers, especially on high-performance computers, is needed
for benchmarking and validation. We design a large-scale simula-
tor of universal random quantum circuits, often called “quantum
supremacy circuits”, and implement it on Sunway TaihuLight.
The simulator can be used to accomplish the following two tasks:
1) Computing a complete output state-vector; 2) Calculating one
or a few amplitudes. We target the simulation of 49-qubit circuits.
For task 1), we successfully simulate such a circuit of depth 39,
and for task 2) we reach the 55-depth level. To the best of our
knowledge, both of the simulation results reach the largest depth
for 49-qubit quantum supremacy circuits.
Index Terms—quantum computing, quantum circuit simula-
tion, Sunway TaihuLight, quantum supremacy
I. INTRODUCTION
The concept of quantum computer was proposed almost
four decades ago [9]. But until recently it had been un-
known whether quantum computers can indeed exceed the
computing capability of their classical predecessors. Thanks
to the progress made by industry and academia in recent
years, practical quantum computing might become reality
soon. However, before a commercial quantum computer is
launched on market, many tests and verification need to be
done. One of the most important is to test the fidelity of
quantum circuit. One way to accomplish this is to simulate
quantum circuits by computing the ideal state amplitudes on a
classical computer. A quantum simulator on classical computer
could also be used to verify correctness of certain quantum
algorithms and benchmark the notion “quantum supremacy”.
Quite a few implementations of quantum circuit simulators
have been developed in last few years [14, 18, 17, 11, 13, 5].
To test the limitation of quantum circuit simulation on classical
computers, we design a simulator on Sunway TaihuLight, one
of the most powerful high-performance computer at present in
the world. We mainly aim at 49-qubit circuits, because such
circuits are hard to directly simulate on other supercomputers
due to the limited memory spaces, and it is widely believed
that near-term quantum device will first achieve quantum
supremacy at this scale. Our simulator can accomplish the
following:
• Task 1: Computing a complete output state-vector repre-
senting a quantum state output by the simulated quantum
circuit;
• Task 2: Sampling (i.e. calculating a small amount of)
the amplitudes of a quantum state output by the quantum
circuit.
For Task 1, we are able to solve the lattice of 7×7 qubits with
ar
X
iv
:1
80
4.
04
79
7v
3 
 [q
ua
nt-
ph
]  
13
 A
ug
 20
18
Reference Platform Qubits Depth Time
ETH[11] Cori II 45 25 10 minutes
IBM[13] Blue Gene/Q 49 27 2 days
Sunway Sunway TaihuLight 49 39 4.2 hours
TABLE I
Several quantum circuit simulators implemented on supercomputers in
recent years. Our simulator reaches the largest number of qubits and also
the largest depth for 49-qubit circuits in the case of solving all amplitudes
of the final state (Task 1).
depth 39 1 in 4.2 hours using 131072 core groups, which is
around 80% of the computing resource of Sunway. As to Task
2, our method can calculate one amplitude for 49-qubit circuits
of depth 55. Moreover, our method for Task 1 can also be
directly extended to the lattices of 7×8, 8×8, and 9×8 qubits
for calculating a few amplitudes, though the reachable depth
of the random circuit would be decreasing with the increasing
of the qubits. Figure 1 shows the maximal depth our simulator
can reach for different number of qubits.
A. Comparison with other quantum simulators
For Task 1, the simulator described in [11] can simulate a
quantum circuit with 5 × 9 qubits and depth 25 on the Cori
II supercomputer in less than 10 minutes, using 0.5 petabytes
and 8192 nodes. Reference [13] reported the simulation of a
7×7 grid of qubits with depth 27, which costs more than one
day on IBM Blue Gene/Q. In contrast, our method simulates
a 7 × 7 grid of qubits with depth 39 in much less time. The
task of simulating 7 × 8 grid of qubits with depth 23 was
not finished in [13] because 256 amplitudes are too many to
calculate. For this task, they only calculated 239 amplitudes.
In contrast, our simulator can calculate 237 ∼ 242 amplitudes
in a short time for 7 × 8-qubit circuits of depth 35. Table I
compares our implementation and several previous works.
Task 2 of sampling amplitudes can be used to estimate the
fidelity by computing the cross entropy, which usually needs
103 ∼ 106 samples [6]. The sampling target is to calculate an
amplitude αx, where 0 ≤ x ≤ 2n − 1 for an n-qubit circuit.
The method of calculating the state-vector can also be directly
applied to calculate a small number of amplitudes, though in
this way the reachable depth is not very large. However, we
can get one amplitude for 7 × 7-qubit circuits of depth 55
by calculating the inner product of two 49-qubit state-vector,
although this is very expensive. Our results of Task 1 and Task
2 shows that the bound of 49-qubit with depth 40 in [5, 6] is
no more out of reach.
B. Technical contributions of this paper
The first contribution of this paper is that we introduce a
novel partition scheme via a dynamic programming algorithm,
by which we can save more time and space than [13]. At the
same time, we propose several global optimizing techniques,
including some new optimizations adaptive to the structure of
2D grid circuits and several other optimizations for reducing
1The first layer of Hadamard gates are not counted as the circuit depth, we
treat it as layer 0. So the circuit we simulate is from layer 0 to layer 39. This
also holds for Task 2.
49 56 64 72
10
20
30
40
number of qubits
de
pt
h
 
 
Sunway
IBM
Fig. 1. The maximal circuit depth we can simulation in cases of different
number of qubits. Both our simulator and IBM [13] can calculate all 249
amplitudes for 49-qubit case. For 56- or more qubit cases, we only calculate
a slice of 232 amplitudes for purpose of demonstration.
the network communication amount when implementing our
method.
Our second contribution is some new local optimizing tech-
niques on a single node, which take the advantage of the many-
core heterogeneous processor of Sunway. Our optimization
greatly reduces the amount of memory access while it only
increases a small quantity of calculation. We also apply some
standard optimization such as vectorization. By all of these
techniques, we can simulate a 28-qubit circuit on one core
group very quickly, which is significant for improving the
performance of 49-qubit circuit simulation.
Sunway TaihuLight adopts a popular many-core heteroge-
neous system architecture, of which in one CPU chip there are
4 core groups (CGs) each with 1 management processing ele-
ments (MPEs) and 64 computing processing elements (CPEs)
[10]. Since in our implementation each core group corresponds
to an unique MPI process, we regard a core group instead of
a CPU chip as a single node in this paper to avoid some
messy and confusing description. That is, each node contains
only 1 master core (MPE) and 64 slave cores (CPEs) and
corresponds to only one MPI process.
C. Organisation of the paper
Section 2 describe the simulation methodologies and opti-
mization techniques. Section 3 presents the numerical results
of our implementation as well as verification of the results.
Section 4 draws a conclusion and discusses the problems to
be solved in the future work.
II. METHODOLOGIES AND OPTIMIZATIONS
Before describing our methods and optimizations, let us
recall some basic notions and notations.
A basic storage unit in a quantum computer is a quantum bit
(qubit). Generally, we can use a 2n-length complex vector to
describe an n-qubit state, such as |ψ〉 = (α0, α1, ...α2n−1)T . A
practical quantum circuit consists of single-qubit and 2-qubit
gates. For example, a single-qubit gate Uk =
(
a b
c d
)
on
qubit k can be treated as an n-qubit gate which acts as identity
on the other n − 1 qubits, as denoted by U = I⊗n−k−1 ⊗
Uk ⊗ I⊗k, where I is the identity operator on a single qubit.
In this paper, a superscript indicates the qubit that the gate is
performed on, and a subscript is used as a gate label or an
index of amplitude.
To perform Uk on a state |ψ〉 = (α0, α1, · · · , α2n−1)T , we
have: (
α′i
α′i+2k
)
=
(
a b
c d
)(
αi
αi+2k
)
for every i = in−1...i1i0 where ik = 0. And the resulting
state is then |ψ′〉 = (α′0, α′1, ...α′2n−1)T = Uk|ψ〉. The 2-qubit
gates used in this paper are mainly the controlled-gate: CZ.
A controlled-gate CU c,t act on qubits c and t, with the first
being the control bit and the second being the target bit. The
performance of a 2-qubit gate is similar to that of a single-
qubit gate with the only extra consideration of whether the
control qubit is in state |1〉; for more details, we refer to [12].
For a quantum circuit simulation, the initial state is usually
the product state:
|0〉⊗n = |0〉 ⊗ |0〉 ⊗ ...⊗ |0〉︸ ︷︷ ︸
n
.
If there is no 2-qubit gate in the circuit, the state will
persistently remain a product state and only O(n) space is
needed to describe the state. But as the number of two-
qubit gates increases, the quantum state may become highly
entangled. In this case, the storage of the state will require
O(2n) space. As n increases, the memory required becomes
too large even for the most powerful supercomputers. For
example, the maximal qubit number of a state vector that
could be stored in the memory of Sunway is 45 (46) using
double (float), which requires 0.5 petabytes of memory space.
However, for a 2-qubit gate CU c,t, we can decompose it
to CU c,t = P c0 ⊗ It + P c1 ⊗ U t, where P0 = |0〉〈0| and
P1 = |1〉〈1|, and go along two branching path with respect to
P0 and P1. deferring the entanglement it brings. Until an ap-
propriate stage we combine the branching paths, finishing the
deferred entanglement. Essentially, this idea comes from the
notion of “Feynman path integral” and appeared in [2, 13, 5].
Figure 2 gives a simple example showing how circuit partition
works.
A universal random quantum circuit has a 2D grid archi-
tecture [6]. The quantum gates used in this type of circuits
are H,X1/2, Y 1/2, T and CZ. The 2-qubit gate CZ only
appears between two adjacent qubits in the grid. Since in
each layer of an n × m-grid universal random circuit the
positions of CZ gates are fixed, we can find a concrete partition
scheme according to the scale of circuit to be simulated. Our
techniques for finding this partition scheme are described in
detail in the following subsection.
A. Method for computing the complete state-vector (Task 1)
Figure 2 describes the basic ideas of circuit partition. But
as the circuit depth increases, the number of decomposed 2-
qubit gates also increases. To increase the depth of circuits
that can be simulated, we propose two optimizing techniques,
which enable us to simulate 49-qubit circuits of depth 39,
computing the complete output state-vector:
• Technique 1: We analyze the structure of universal ran-
dom circuits, exploit the diagonal properties of CZ gates,
and propose a technique, called implicit decomposition,
which can decompose extra 7 CZ gates without requiring
too much extra memory space, so as to increasing the
depth of circuits that could be simulated by 8.
• Technique 2: We propose a dynamic programming (DP)
algorithm to find a good partition scheme for a given
simulation task. In contrast to the general heuristic search
method that usually takes a long time, this DP algorithm
is efficient and can find an optimal partition scheme
under certain constrains. It also makes the simulation
easier to optimize and parallelize and thus improves the
performance in time-to-solution.
1) Implicit decomposition: Our partition scheme divides
the target circuit into three parts, and each part requires less
memory than the memory needed to store the entire 49-qubit
state vector. Implicit decomposition balances the memory
requirement of these 3 parts, decomposes extra CZ gates and
increases the depth of circuits that could be simulated.
For a better understanding of Technique 1, let us first
consider a circuit example shown in Fig. 3. The partition
scheme for simulation is illustrated by the blue dotted line
and yellow dotted line. The dotted lines cut two CZ gates,
and partition the circuit into 3 parts: A,B,C. We use |φ〉
and |ξ〉 to describe the initial states in the two subsystems.
Because there are two CZ gates being decomposed (cut by
dotted lines), four branching paths in total are generated. Let
|φoutl1,l2〉 be the resulting state after performing the gates in part
A, l1 and l2 denote the indices of qubit 2 in two different time.
State |ξoutl1,l2〉 is similar to |φoutl1,l2〉. Then we have2:
|φoutl1,l2〉 = X0CZ0,1Y 1CZ2,1l2 T 1CZ
2,1
l1
H0H1|φ〉
where l1 and l2 denote the indices of qubit 2 when decompos-
ing the cutting CZ gates, thus CZ2,10 = I
1 and CZ2,11 = Z
1.
Furthermore, we have:
|ξoutl1,l2〉 = X3P 2l2T 2CZ2,3P 2l1H3H4|ξ〉
where P is a projection operator: Pi|i〉 = |i〉 and Pi|1−i〉 = 0
for i = 0, 1. The starting state of part C is:
|ψin〉 =
∑
l1,l2
|φoutl1,l2〉 ⊗ |ξoutl1,l2〉 (1)
We perform the remaining gates in part C, and the eventually
state |ψout〉 is:
|ψout〉 = Y 1X2CZ1,2X2|ψin〉 (2)
There are 24 = 16 amplitudes to calculate for |ψin〉 (or
|ψout〉). But we do not need memory space for 16 ampli-
tudes(not counting memory space for |φ〉 and |ξ〉) to accom-
plish the calculation. Note that in part C there are gates only
2The order of performing gates in circuit is from left to right, while the
matrix-vector multiplication is from right to left.
𝑈𝑈1
Input: |𝜓𝜓𝑖𝑖𝑖𝑖〉 |𝜓𝜓1〉 |𝜓𝜓2〉 |𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜〉
|𝜉𝜉〉
|𝜑𝜑〉
|𝜉𝜉𝑖𝑖𝑖𝑖=0′ 〉
𝜑𝜑𝑖𝑖𝑖𝑖=0
′
|𝜑𝜑𝑖𝑖𝑖𝑖=1′ 〉
|𝜉𝜉𝑖𝑖𝑖𝑖=1′ 〉 |𝜉𝜉𝑖𝑖𝑖𝑖=0
′′ 〉|𝜉𝜉𝑖𝑖𝑖𝑖=1′′ 〉
𝜑𝜑𝑖𝑖𝑖𝑖=0
′′
|𝜑𝜑𝑖𝑖𝑖𝑖=1′′ 〉
𝑨𝑨
𝑩𝑩
𝑈𝑈2
𝑈𝑈3
𝑈𝑈4𝑈𝑈
𝑖𝑖,𝑜𝑜
Fig. 2. An illustrative example of how gate decomposition works. The initial state is |ψin〉 = |0〉⊗n and the circuit is Ucircuit = (U3⊗U4)CUc,t(U1⊗U2).
The whole system can be regarded as a composition of two subsytems A and B, while the gate CUc,t (qubit c in A and qubit t in B) causes their entanglement.
Assume A has nA qubits and B has nB qubits, and the total qubit number is n = nA+nB . After performing U1 and U2, the whole state is |ψ1〉 = |ξ〉⊗|ϕ〉.
With the decompsing of CUc,t we get two branches: qubit c at |0〉 and qubit c at |1〉. After performing CUc,t, we have |ξ′ic 〉 and |ϕ
′
ic
〉, ic = 0, 1. Finally,
we perform U3, U4 and get |ξ′′ic 〉 = U3|ξ
′
ic
〉, |ϕ′′ic 〉 = U4|ϕ
′
ic
〉. The result is |ψout〉 =
∑
ic=0,1
|ξ′′ic 〉 ⊗ |ξ
′′
ic
〉 . Note that after performing CUc,t we need
to double the space to storage the two branches (qubit c at |0〉 or |1〉). So, the total space consumption is 2nA+1 + 2nB+1, in contrast to 2n, the space
consumption of directly calculating the resulting state of the composite system gate by gate.
H
H
H
H
Z T
T X
Z
Z
Z Y
X
X
|𝜙〉
|𝜉〉
X
|𝜓〉
𝑖0: |0〉
𝑖1: |0〉
𝑖2: |0〉
𝑖3: |0〉
𝑙1 𝑙2
Part A
Part B
Part C
Z
Y
Fig. 3. Example of a 4-qubit circuit. In this example, there are 2 CZ gates being decomposed. Note that the second decomposed CZ gate only doubles the
space consumption of part A, because after this CZ gate there is no gate in part B performed on qubit 2, thus the implicit decomposition.
performed on qubit 1 and qubit 2, and no gate on qubit 0 and
qubit 3. So |ψin〉 can be divided into 4 blocks by enumerating
the indices of qubit 0 and qubit 3. Let |ψq1,q2i0,i3 〉 denote the
reduced state of qubit 1 and qubit 2 with qubit 0 at |i0〉 and
qubit 3 at |i3〉. Then |ψin〉 =
⊕
i0,i3
|ψq1,q2i0,i3 〉. Equation (2)
turns into:
|ψout〉 =
⊕
i0,i3
|ψout,q1,q2i0,i3 〉
=
⊕
i0,i3
(Y 1X2CZ1,2X2|ψq1,q2i0,i3 〉) (3)
Now we know that if all the results of part A and B are
calculated and stored in memory, there only needs extra space
for 22 = 4 amplitudes in part C. From now on we set an
amplitude as a basic storage unit (8 + 8 = 16 bytes), and the
total space consumption is SA+SB+4 instead of SA+SB+16,
where SA and SB are the space consumption of part A and
B, respectively.
From the above analysis, we know SA = SB = 22+2 = 16.
However, SB can be halved without introducing extra compu-
tation. Note that after the second cut CZ (in red) gate being
decomposed, there is no any gate performed on qubit 2 in part
B, so for any amplitude of |ξoutl1,l2〉, let it be ξi2,i3,l1,l2 , where
i2 and i3 are the indices of qubit 2 and qubit 3. We have
ξi2,i3,l1,l2 6=i2 = 0, for l2 = 0, 1
Thus, the index l2 could be absorbed into index i2, and we
only need to store |ξoutl1 〉. This means that when we finish
the calculation of part B, only 21+2 = 8 amplitudes to be
stored, while in part A there are still 22+2 = 16 amplitudes
to be stored. Because the second cut CZ gate is decomposed
but the space consumption need not be doubled for control
Fig. 4. Implicit decomposition applied to 49-qubit universal random circuits.
In this and following figures, two adjacent blocks represent a CZ gate. And
the blocks in red represent the CZ gate that could be implicit decomposed.
Because part A has 21 qubits and part B has 28 qubits, without implicit
decomposition SB = 228+7,SA = 221+7. After we use implicit decompo-
sition on these seven cut CZ gates at the second and third layer, the ultimate
space consumption of part B is s′B = SB/2
7, and now SA = S′B .
part (part B in this example), we call this technique implicit
decomposition.
The implicit decomposition is useful when the space con-
sumption of part A and B is not balanced, say there need space
SA to store part A and some fewer space SB to store part B,
we could apply implicit decomposition to part A, and only
making the space of part A to S′A while leaving the space
of B unchanged, until S′A and SB are almost in the same
magnitude.
For a universal random quantum circuit of m× n grid, the
implicit decomposition works well when m or n is odd. Our
main target is 7 × 7-qubit circuits. At first, the two cutting
lines are at the same position and partition the circuit into two
parts: part A with 21 qubits, part B with 28 qubits. Then we
apply this technique to part B, as shown in Fig. 4.
2) Dynamic programming: To reduce the total space con-
sumption and make the simulator easier to optimize, we avoid
decomposing CZ-gates between part C and parts A, B. At first
two splitting line overlap. When the two splitting lines sepa-
rate, they start to walk around the CZ gates and will not cut-off
any one of them. A feasible partition scheme also requires the
total space consumption less than the memory space. To find
an optimal partition scheme under these constrains, we design
a dynamic algorithm for state compression.
Let f(t, i1, i2, i3, i4, i5, i6, i7) denote the minimal space
consumption (exponential in f ) of part A when the partition
scheme reaches layer t and the position of upper splitting line
is (i1, i2, ..., i7). Here, (i1, i2, ..., i7) is used to indicate the
distance between the current position and the initial position.
Initially, the upper splitting line is beneath the third row of
qubits, and f(1, 0, ..., 0) = 21. Note that f(7, 0, 0, ..., 0) =
21 + 3 = 24 and f(8, 0, 0, ..., 0) = 21 + 3 + 4 = 28. Since
we have a restriction that no CZ gate between part A and B
could be decomposed, f(t, i1, i2, ..., i7) will be illegal when
(i1, i2, ..., i7) 6= (0, 0, ..., 0) and (i1, i2, ..., i7) splits some CZ
gate at layer t.
Similarly, let g(t, i1, ..., i7) denote the target function of part
B. Then g(1, 0, 0, ..., 0) = 28 and g(8, 0, 0, ..., 0) = 35. The
recursion from g(t − 1) to g(t) is similar to f , and the only
difference is that when (i1, i2, ..., i7) first leaves the initial
position it has a chance of applying implicit decomposition.
To find a good partition scheme for circuit of depth t, we
can traverse f(t) and g(t) to find an optimal combination of
Fig. 5. A illustrative example of legal transition between two adjacent layers.
From layer t to t+1, the position of upper splitting line (in blue) transforms
from (0, 1, 0, 1, 0, 1, 0) to (1, 1, ..., 1). Since (1, 1, ..., 1) does not cut any
CZ gate at layer t + 1, this is a legal transition. So f(t + 1, 1, 1, ..., 1) =
min{f(t+ 1, 1, 1, ..., 1), f(t, 0, 1, 0, 1, 0, 1, 0)}.
Algorithm 1: Pseudocode of the Dynamic Programming
Input: f(t)
Output: f(t+ 1)
1 for 0 ≤ i1, i2, ..., i7 ≤ 3 do
2 f(t+ 1, i1, i2, ..., i7) =∞;
3 if position (i1, i2, ..., i7) is not illegal then
4 continue;
5 end
6 cutnum =
number of CZ gates splitting by (i1, ..., i7);
7 for 0 ≤ j1 ≤ i1, 0 ≤ j2 ≤ i2, ..., 0 ≤ j7 ≤ i7 do
8 f(t+ 1, i1, i2, ..., i7) = min{f(t, j1, j2, ..., j7) +
cutnum, f(t+ 1, i1, i2, ..., i7)};
9 end
10 end
f(t, i1, i2, ..., i7) and g(t, j1, j2, ..., j7) which minimizes SA+
SB + SC , where SA = 2f(t,i1,i2,...,i7), SB = 2g(t,j1,j2,...,j7)
and SC = 2
∑
1≤k≤7(ik+jk). Several partition schemes are illus-
trated in Fig. 6 and Fig. 9. There need space SA = SB = 235,
SC = 2
28 to execute the simulation for depth 27, while
SA = SB = 2
42 for depth 35 and 39. The maximal number
of qubits that could be simulated on one node is 28. This
indicates that from depth 35 to depth 39, there will be a drop
in performance, which is shown in section III-C.
3) Summary: The two techniques proposed above not only
work for universal random circuit. For circuit of an arbitrary 2-
D grid structure, our method can find a proper partition scheme
to reduce the time and space complexity of simulation. Table
III gives an algorithmic comparison of our partition scheme
and [13].
Obviously, 7× 7-qubit circuits of depth 39 and of depth 40
have different difficulties in physical preparation and opera-
tion. We provide an evidence showing that our method might
reach depth 40 if the target circuit is a “lucky circuit” (i.e.,
with enough T gates in special positions at layer 40). For
example, using the tensor slice technique in [13], if there is
Depth 27 35 39
SA 2
35 242 242
SB 2
35 242 242
SC 2
28 228 242
TABLE II
The space consumption of parts A,B,C for 49-qubit circuit of different depth.
Reference Depth Space Computation amount
IBM[13] 27 64 TB 249(210 + nC)
Sunway 27 1 TB 249(27 + nC)
IBM[13] 39 N/A N/A
Sunway 39 256 TB 249(214 + nC)
TABLE III
An algorithmic comparison of our partition scheme and the partition scheme
in [13] for 49-qubit circuits. Space means the least memory needed to
execute the simulation. nC is the number of gates in part C, usually several
hundred. The computation amount means the number of float-point
operations. No partition scheme for depth 39 is given in [13]. This table
shows that the combination of techniques 1 and 2 produces a more efficient
algorithm.
Fig. 6. Partition scheme for 49-qubit circuit of depth 39. The scheme for
depth 35 is exactly the front 35 layers of this figure, except for that at layer
35 two splitting lines are still straight as in layer 34 and cut 6 extra CZ gates
(See the last layer of partition scheme for 49-circuit of depth 27 in Fig. 9).
Note that the CZ gates at the last layer would not impact the probabilities and
could be removed because they diagonal. Thus the number of cut CZ gates
being cut (the implicit decomposed CZ gates and CZ gates at the last layer
not included) is 14 for both depth 35 and 39, but in the case of depth 35
SC = 2
28 and in the case of depth 39 SC = 242.
at least one T gate in the four single-qubit gates on qubit
0, 2, 4, 6 at layer 40, the size of a slice is at most 245,
which could be directly simulated on Sunway, though the
performance will further drop from depth 39 to depth 40.
Because X1/2, Y 1/2, T appears randomly at the positions for
single-qubit gates, the probability that a 49-qubit and 40-depth
universal random circuit could be simulated on Sunway is:
p = 1− (2/3)4 = 65/81
Fig. 7. Tensor slicing technique in [13]. This is a example that two T gates
appears in the four positions, thus 5 qubits in total could be sliced.
Fig. 8. Partition scheme for 56-qubit circuits of depth 35. The method for
solving a part of amplitudes from this circuit is exactly the same as computing
a complete state-vector for 49-qubit circuits. Note that SA = SB = 248 here,
which already exceed the memory limit of Sunway. Thus a little space-time
tradeoff [2] is needed. The space-time tradeoff, which is also needed for
64-qubit circuits with depth 30 and 72-qubit circuits with depth 27, can be
achieved by simply enumerating the first several decomposed CZ gates [15].
B. Calculating one or a few amplitudes (Task 2)
The method in Section II-A can be used to compute the
complete state-vector for 49-qubit circuits. For circuits of
56 qubits or larger size, it is difficult to calculate all the
amplitudes due to limited time and space. However, to test the
fidelity of a real quantum circuit, one only needs to sample (i.e.
calculate a small number of) amplitudes, usually ranging from
103 to 106[6]. Our method can finish this task very efficiently
because all the amplitudes of eventually states in part A and
B are stored in memory. For example, in the cases of 7 × 8
qubits with depth 35, or 8 × 8 qubits with depth 30 is easy
to calculate a large amount of (e.g. ≥ 232) amplitudes (in less
than 1 hour). Figure 8 shows the partition scheme for 56-qubit
circuits with depth 35. The schemes for 64-qubit and 72-qubit
circuit are similar.
When focusing on a 7 × 7-qubit circuit, we will introduce
a special and straight method to calculate an amplitude of the
final state. The target is to calculate
αx = 〈x|UcircuitH⊗49|00...0〉
for 0 ≤ x ≤ 249 − 1. Since we could calculate the complete
state-vector for a 7× 7-qubit circuit of depth 27, we can also
sample one amplitude for a circuit of depth 55. Let Ucircuit =
U2U1 in which U1 has 27 layers and U2 has 28 layers, |ψ〉 =
U1H⊗49|00...0〉 and |ϕ〉 = U†2 |x〉, we have:
αx = 〈ϕ|ψ〉
Thus, we calculate |ϕ〉 and |ψ〉 simultaneously in memory,
during the calculation we computing the inner product of
every two corresponding blocks of 228 amplitudes of |ϕ〉 and
|ψ〉. Sum all 221 inner products and we get αx. Because the
least space consumption of simulating a circuit of depth 27
is O(235), this method can be parallelized to calculate more
amplitudes.
Fig. 9. Partition scheme for calculating one amplitude for 49-qubit circuit of
depth 55.
C. Optimization for reducing communication amount
The method presented above concerns mainly the memory
space limitation of Sunway. But if the network communication
amount is too large in implementing this method, it will
be impractical since the network bandwidth of Sunway is
limited. In this subsection we describe a method to reduce the
communications of a key step in our simulation. We will first
explain why the communication is needed and then describe
how to optimize it.
Recall that equation (3) is the final step to compute the
complete state-vector, and this step can be divided into 2n/SC
subtasks which could be paralellized, where n is the number
of qubits in the whole circuit and SC is the space consumption
of part C. For 49-qubit circuits of depth 27 and depth 35, (3)
can be rewritten as:
|ψout〉 =
⊕
i0,i1,...,i13,i42,i43,...,i48
|ψout,q14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉
=
⊕
i0,i1,...,i13,i42,i43,...,i48
(UC |ψq14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉) (4)
where UC denotes the unitary operation represented by part
C. Note that calculating |ψout,q14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉 in equation
(4) is essentially the same as simulating a 28-qubit circuit,
which could be finished in a single node. Now the remaining
problem is to efficiently prepare |ψq14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉 for
each set of possible values of (i0, i1, ..., i13, i42, i43, ..., i48),
where ik ∈ {0, 1}. Note that
|ψq14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉 =
∑
l1,l2,...,lt,i21,i22,...,i27
(|φout,q14,q15,...,q20i0,i1,...,i13,l1,l2,...,lt,i21,i22,...,i27〉⊗
|ξout,q21,q22,...,q42i42,i43,...,i48,l1,l2,...,lt,i21,i22,...,i27〉) (5)
where t is the number of decomposed CZ gates,
t = 7 for depth 27, and t = 14 for depth 35.
|φout,q14,q15,...,q20i0,i1,...,i6,l1,l2,...,lt,i21,i22,...,i27〉 is a 27-length state-
vector, where the indices of qubit 0-6 are i0, i1, ..., i6,
the indices of control qubits of t decomposed CZ gates
are l1, l2, ..., lt, and the indices of control qubits of 7
implicit decomposed CZ gate are i22, ..., i27. Similarly,
|ξout,q21,q22,...,q41i42,i43,...,i48,l1,l2,...,lt,i21,i22,...,i27〉 is a 221-length state-vector.
Because of implicit decomposition, for each value set of
i21, i22, ..., i27, |ξout,q21,q22,...,q42i42,i43,...,i48,l1,l2,...,lt,i21,i22,...,i27〉 has only
214 non-zero amplitudes. Thus, (5) can be rewritten as:
|ψq14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉 =
∑
l1,l2,...,lt
(
⊕
i21,i22,...,i27
|φout,q14,q15,...,q20i0,i1,...,i6,l1,l2,...,lt,i21,i22,...,i27〉⊗
|ξout,q28,q29,...,q42i42,...,i48,l1,...,lt,q21=i21,...,q27=i27〉) (6)
where |φout,q14,q15,...,q20i0,i1,...,i6,l1,l2,...,lt,i21,i22,...,i27〉(|φ〉 for short) is still of
27-length but |ξout,q28,q29,...,q42i42,...,i48,l1,...,lt,q21=i21,...,q27=i27〉(|ξ〉 for short)
is of 214-length.
Now we consider how to realize equation (6). The data
from part A have 2t+7 complex numbers (amplitudes), and
the data from part B have 2t+14 complex numbers. Thus, for
the case of depth 35, the data from part B have 214+21 = 235
complex numbers. Directly calculating (6) needs 27 nodes to
communicate (e.g. an MPI Gather is feasible). This is very
inefficient, because there are 249−28 = 2, 097, 152 entities of
(6) to calculate, and each entity needs an MPI Gather in 128
nodes. Assume that we have 215 = 32, 768 free nodes for
this work. One round can proceed 215−7 = 256 MPI Gathers
and calculate the same amount of entities. Then we need
2,097,152
256 = 8192 rounds of MPI Gather to finish all the jobs.
However, we can slightly change the form of (6) to:
|ψq14,q15,...,q41i0,i1,...,i13,i42,i43,...,i48〉 =
⊕
i21,i22,...,i27
(
∑
l1,l2,...,lt
|φout,q14,q15,...,q20i0,i1,...,i6,l1,l2,...,lt,i21,i22,...,i27〉⊗
|ξout,q28,q29,...,q42i42,...,i48,l1,...,lt,q21=i21,...,q27=i27〉) (7)
Note that for each element in the bracket, the data from part A
and B have 221 and 228 complex numbers, respectively. This
can be finished in a single node and the length of result is
221. We further append |φ〉 with qubit 7-13 so that |φ〉 also
becomes a 214-length vector:
|φ〉 = |φout,q7,q9,...,q20i0,i1,...,i6,l1,l2,...,lt,i21,i22,...,i27〉
Now |φ〉 ⊗ |ξ〉 is of 228-length and can still be calculated
in a single node. Actually, it can be regarded as a 28-qubit
state:|ψq7,...,q20,q28,...,q42〉. Our target is |ψq14,...,q27,q28,...,q42〉.
Note that for every group of 128 nodes, they store
|ψq8,...,q20,q28,...,q42〉 for (i21, i22, ..., i28) = (0, 0, ..., 0) to
(1, 1, ..., 1). Thus, by performing an MPI Alltoall on these
128 nodes, we get 128 entities of (6). Assume again that we
have 215 free nodes to work for part C. Then we can get 215
entities in a single round. Furthermore, in only 221−15 = 64
rounds, the whole task can be finished.
Solving (6) for 49-qubit circuit of depth 39 is essentially the
same as the case of depth 35. Table IV shows the improvement
in network communication that the optimization brings.
Optimization Main work Overall time Speedup
without 221 MPI Gathers 1021.9 min 1
with 214 MPI Alltoall 17.8 min 57.4
TABLE IV
The comparison of network communication amounts for computing Eq. (6)
with and without optimization. Every single MPI Gather or MPI Alltoall is
within 128 nodes, so they can be executed in parallel. The overall time is
under the condition of using 32768 nodes for network communication. For
network communication without optimization the overall time is only an
estimation since we only executed 215 MPI Gathers, and multiplies by 64
the execution time, which is 15.96 minutes.
D. Single node optimizations
In this subsection, we introduce our optimizations for the
quantum circuit simulation at the single node (single core
group) scale. The optimizations for every single node is very
important for improving the performance of our method.
As agreed in Section I, one node represents a core group in
Sunway, and it has 1 master core and 64 slave cores. Each node
has 8GB DDR memory, shared by both master and slave cores.
The maximum qubit number that could be simulated on one
node is 28 if we use two doubles to represent an amplitude,
since 228 × 16B = 4GB. To obtain full power of Sunway
TaihuLight, one must distribute most of the computing tasks to
slave cores. Each slave core has a private and separate memory
unit called local data memory (LDM) (also known as scratch
pad memory) of 64kB size. To execute high-speed calculations
a slave core must fetch data from the main memory to its
own LDM and keeps it in LDM for calculation as long as
possible. This fetching-data behavior is usually called direct
memory access (DMA). To get high DMA bandwidth, it
usually requires the data fetched consecutive.
If LDMs fetch data from the main memory for every gate
performed, and put the data back to main memory when the
calculation is finished, the simulation will be inefficient due to
low flop-to-Byte ratio3. In our experiments, the average execu-
tion time is 0.32s per gate in such way, thus the performance
is bounded by DMA speed4. However, we can take advantage
of the data locality in a better way to reduce the DMA
amount. For example, if each slave core has fetched from the
main memory 16KB data in its LDM, that is, 214−4 = 210
amplitudes. For any gate performed on those qubits with their
ranks lower than 10 (0 is the lowest rank), the calculation can
be executed in this 16 KB data, i.e. αi and αi+2k are in the
same LDM. Thus we can perform a bunch of gates acting on
low-rank qubits once instead of performing the circuit gate by
gate. We call these 10 qubits with lowest ranks local qubits,
and other 18 qubits are global qubits5.
This idea is similar to the gate fusion techniques in [14],
which deals with the gates on low-rank qubits in cache.
3A 2 × 2 complex matrix multiplying a 1 × 2 complex vector needs 14
float-point operations. For a 28-qubit circuit, each non-diagonal gate requires
228−1 × 14 float-point operations and 4G DMA get and put
4 The DMA get bandwidth and put bandwidth are both less than 25GB/s
in our experiments
5this is different to [8, 11]. Their distinction of ’global’ and ’local’ is about
the multi-node and single-node level. While our ’global’ corresponds to main
memory, and ’local’ corresponds to LDM
Depth 25 30 34 38
Swaps 6 7 8 9
TABLE V
Frequency of swaps for 28-qubit universal random circuits with different
depth, in the case that the number of local qubits is 10.
LDM 0 LDM 1 LDM 7……
LDM 8 LDM 9 LDM 15……
LDM 56 LDM 57 LDM 63……
…
…
ro
w
 co
m
m
un
ica
tio
n 
bu
s 
…
…
…
…
Fig. 10. Register communication. There are 8 row communication buses
and 8 column communication buses in a core group. In this figure only
row communication buses are plotted, since in our experiments only row
communication is used. In fact, the row communication is used to achieve
the swap between qubits of rank 8-10 and qubits of rank 11-13. Register
communication has very high bandwidth, more than 200GB/s in total for 8
row communication buses. So the cost of this step is very small, and we can
treat 8 LDMs in a row as new a composite LDM. Thus the number of local
qubits turns into 14.
However, gate fusion is one-off, which can only be applied
at the start of the circuit.
To make the above procedure repeatable, we also adopt the
qubit reordering method. Qubit reordering are used in [8] and
[11] to reduce the network communications. Here, our aim is
to maintain the data locality and reduce the amount of memory
access. That is, only diagonal gates, or non-diagonal gates on
local qubits are calculated. To accomplish this, we need to
execute two types of qubit rank swaps:
• Swap the qubits of rank 0-9 and qubits of rank 11-20
• Swap the qubits of rank 14-20 and qubits of rank 21-27
With a gate scheduling preprocessing program, which also
utilizes the diagonal properties of gate T and CZ, we get the
amount of swaps for 28-qubit quantum supremacy circuit: We
achieve fast swaps of qubit rank with the help of slave cores.
Swapping the qubits of rank 0-9 and qubits of rank 10-19 is
essentially a transpose of a complex matrix. The dimension
of this matrix is 210 × 210, and 228−20 = 256 matrices in
total need to be transposed. Swapping the qubits of rank 14-
20 and qubits of rank 20-27 is similar, which is equivalent to
a transpose of a 28×28-dimensional matrix, but each element
of this matrix contains 212 amplitudes.
Register communication is a unique function in Sunway CPU,
designed for fast data transmission between LDMs. The qubit
reordering can be further optimized using this feature. Because
the slave cores in a row can send/receive messages in a
communication bus, we can treat a row of 8 separate LDMs
as a composite LDM, while the data exchange between these
8 LDMs is fulfilled by register communication. If each slave
core fetches 32kB consecutive data from the main memory to
its LDM, there will be 11 local qubits. But when considering
a composite LDM formed by a row of LDMs, the number of
local qubits turns into 14. Thus we only need to execute one
type of qubit swap:
• Swap the qubits of rank 0-13 and qubits of rank 14-27
This is simply a transpose of a 214×214-dimensional complex
matrix, which can be quickly accomplished by slave cores.
Because of the high bandwidth of register communication, the
time consumed on register communication is very small, so as
to improving the overall performance of single node case.
E. Other standard optimizations
In this subsection we briefly introduce some other standard
optimizations provided by Sunway TaihuLight, which can be
exploited for our simulation of quantum circuits.
1) Vectorization: Sunway TaihuLight provides many 256-bit
data types. In our simulation the type doublev4 is adopted for
vectorization. The data stored in LDM is a complex number
array with one double as the real part of a complex number
and another double as its imaginary part. To vectorize the
double-precise float-point calculation, we put four amplitudes
into two doublev4 registers vr, vi once. However, the real parts
of these amplitudes are not consecutive, the imaginary parts
neither. We use the instruction vshuffle to solve this problem.
2) Instruction Reordering: Another optional optimization is
instruction pipeline. The put and store operations, together
with the multiply-add operations, can form a pipeline to further
reduce the calculation time, especially when the data depen-
dency between adjacent instructions is little. This optimization
further improves the computational efficiency.
III. NUMERICAL EXPERIMENTS AND RESULTS
A. Setup of Experiments
Sunway TaihuLight is one of the most powerful supercomputer
with over 100 Pflops computing capacity [1]. To test the
limitation of our simulator on Sunway, we used 131072
nodes (32768 cpu chips), which is around 80% of computing
resource of the whole machine with nearly 1PB main memory
in total. We implemented our simulator in C++ for master core
managing programs and C for slave core computing programs.
We use MPI for inter-node communications. To facilitate the
most of computing capacity of Sunway we have used the
athread library [10].
According to previous analysis, a simulation task in our
implementation has two stages:
• stage 1: computing the results for part A and B and store
them in the memory;
• stage 2: using the the results in stage 1 to generate the
input and compute the results for part C, so as to solving
all the amplitudes;
Stage 1 can be finished in 10 minutes if there is enough space
to store the results for part A and B. Stage 2 thus is the
bottleneck of the whole task. As illustrated in section II-C,
stage 2 could be evenly divided into 64 rounds, making it
convenient to parallelize and providing good strong scalability.
Depth 18 26 34 42 50
Gates 258 375 492 609 726
Swaps 3 4 5 6 7
Time 15.4s 22.6s 29.6s 36.7s 44.3s
Speedup 7.04 6.95 6.96 6.95 6.88
TABLE VI
Performance of simulating 28-qubit circuits with different depth on a single
node. The number of global qubits is 14. Speedup is the speed-up ratio to
method of performing gate by gate without any optimization but using the
slave cores to accelerate.
Stage 2 has two computing kernels: the first one is generating
the input for part C, more precisely, computing the entities of
eq(6); the second one is simulating a 28-qubit circuit on single
nodes. We call the first kernel tensor because it calculates the
tensor product of two complex vectors and sums them. We
call the second kernel sim.
B. Performance Measurement
The performance is usually computed in two ways:
• Manually counting all double-precision arithmetic in-
structions in the assembly code;
• Using the hardware performance monitor of Sunway,
PERF, to get the amount of double-precision arithmetic
instructions retired on the CPE cluster.
Both ways provide similar results of counting the arithmetic
operations. We employ the second way (PERF) in our study.
And we obtain the sustained performance of two kernels:
Kernel tensor achieves 92.8 GFlops per core group; kernel
sim achieves 37.1 GFlops per core group. The performance of
these two kernels fits the overall performance when simulating
a 49-qubit circuit of depth 39.
Table VI shows the performance of kernel sim and speedup
under cases of 28-qubit circuits with different depth. The
number of global qubits is 14. Speedup is the speed-up
ratio to the method of performing gate by gate without any
optimization but using the slave cores to accelerate.
C. Time-to-solution
For the task of simulating a 49-qubit circuit of depth 35 and
computing the complete state-vector, it takes around 3.7 hours.
The bottleneck is (6), because solving 221 entities of (6) needs
221+7+35 = 263 times of complex number multiplication. This
step occupied around 90% of the run-time in the simulation of
35-depth circuit. For the task of simulating a 49-qubit circuit
of depth 39, it takes around 4.2 hours. The reason for causing
this drop in performance is that part C has 42 qubits in the
case of depth 39, while part C only has 28 qubits in the case of
depth 35. Thus in the case of depth 35, the calculation in part
C are all within single nodes. While calculating a block of part
C in the case of depth 39 is essentially simulating a 42-qubit
circuit of depth 15, which needs one all-to-all communication
on 214 nodes [11]. Because there are 249−42 = 128 blocks to
calculate, the amount of communication increases a lot.
The sustained performance is 4.92 PFlops for the case of
depth 35 and 4.3 PFlops for the case of depth 39. 4.3 PFlops is
around 3.44% of the peak performance of Sunway. There are
two reasons for this low efficiency: 1) we only use 131072
nodes, which is just around 80% of the whole computing
resource of Sunway; 2) for the cases of depth 35 and depth
39, the kernel tensor is the bottleneck. However, during the
simulation only half of the nodes are used for kernel tensor
due to the limited memory space of each node6. If we can
find a better method for memory allocation we might let all
131072 nodes work for kernel tensor, doubling the overall
performance roughly.
D. Improvement over previous works
To show the improvement that our method brings, we first
compare the overall performance between our work and the
IBM’s work [13], in the case of 49 qubits with depth 27.
In principle, simulating the circuit with such a depth only
needs 1024 nodes using our method. Increasing the nodes
will decrease the time-to-solution. For a fair comparison, the
performance of machine should be taken into consideration.
We choose the result using 16384 node for comparison. It is
10.24% of Sunway. And the improvement is shown in Table
VII. The main reason for such improvement is also given.
Another reason for the speedup in Table VII is our single
node optimizations, which make better use of the machine
performance. To make this claim more convincing, we com-
pare our single node optimizations with the ETH’s work
[11], in which their single node case is highly optimized too.
Again, to make comparison fair, we should consider a rather
similar benchmark. Because the bottleneck of quantum circuit
simulation in single node case is memory access, we consider
the memory bandwidth of one node in either machine. See
Table VIII.
The comparison with the ETH’s work is not absolutely fair,
but at least demonstrates that our single node optimizations are
also very efficient to deal with the large amount of memory
access even in nodes without high memory bandwidth.
E. Scalability
Figure 11 shows the strong scaling behavior for circuits of
different depth. As the stage 1 of computing results for part
A and B usually takes a few minutes, this causes the drop
of parallel efficiency especially when each node executes few
rounds of stage 2. Note that the time consumption of stage 1
for the case of depth 27 is much less than that for the case of
depth 35 and depth 39, so the parallel efficiency for the case
of depth 27 is slightly better the other two cases. Moreover,
simulating a circuit of depth 35 or depth 39 requires at least
65536 nodes.
IV. CONCLUSION AND FUTURE WORKS
This paper describes our method and implementation of
quantum circuit simulator on Sunway TaihuLight. The results
indicate that for current universal random circuits, 49 qubits
with depth 39 is reachable. To find a proper bound of quantum
6In our implementation at current stage, 1/4 of the nodes need to store the
results for part A, another 1/4 of the nodes need to store the entities of eq(6),
they can not participate in the computation of kernel tensor.
1024 2048 4096 8192 16384 32768 65536 131072
1
2
4
8
16
32
number of process
sp
ee
du
p
 
 
ideal
depth 27: 99.7%
depth 35: 95.6%
depth 39: 94.7%
Fig. 11. Strong scaling behaviour for the cases of depth 27, 35 and 39. The
parallel efficiency under these three cases is also illustrated in the figure.
−35 −30 −25 −20 −15 −10 −5 0 5
10−20
10−15
10−10
10−5
100
Log(Nq)
Pr
−35 −30 −25 −20 −15 −10 −5 0 5
10−20
10−15
10−10
10−5
100
Log(Nq)
Pr
Fig. 12. Histograms of log-transformed outcome probabilities for 49-qubit
circuits, compared to theoretical Porter-Thomas distribution [6]. The left is
the result of simulating a circuit of depth 35. The right is the result of circuit
of depth 39. Red lines mean the theoretical Porter-Thomas distribution, and
blue lines respresent the distribution of our experimental results. Both results
fit the theoretical distribution well.
supremacy in terms of universal random circuits, one might
1) increase the depth or qubits of the circuits; or 2) modify
the structure of current universal random circuits. Whatever,
classical computers have their limits on simulating quantum
circuits. We believe there will be one day that quantum com-
puter can solve certain problems which classical computers
cannot. Before that day comes, simulating quantum circuits
on classical computers is crucial to understand the power and
limit of quantum computers. Even after that day, a simulator of
quantum circuits on a classical computer will still be helpful
for design, synthesis, testing and verification of quantum
circuits.
Follow-up work of this paper includes further optimizations
of our simulator and adding some new functions to it, e.g. (1)
quantum circuit testing and verification; (2) simulation of real
circuits with quantum noise, and (3) simulation, debugging
and verification of more sophisticated quantum algorithms and
quantum programs (with control flows) [19]. Another line of
research is to develop more efficient simulation method using
new mathematical and/or algorithmic tools like tensor network
[3] or QuIDD [4]. We hope that our simulator can be extended
to serve as a useful tool in the design, testing and validation
of future quantum computer hardware and software.
ACKNOWLEDGEMENT
We are very grateful to Gan Lin, Yu Haining, Zhang Wei,
Shi Shupeng, Meng Hongsong, Yu Hongkun, Zhao Wenlai and
the whole team at the National Super Computing Center in
Wuxi for their kind helps. Special thanks go to Liu zhao, who
Work Qubits Depth Rmax (TFlop/s) Time-to-solution Speedup
IBM 49 27 17,173.2 ≥ 24 hrs 1
Sunway 49 27 9,524.7 (16384 nodes) 1.49 hrs ≥ 29
TABLE VII
The performance comparison between our work and IBM simulator [13]. RMAX means the maximal sustained performance [1]. 10.24% of Sunway has the
Rmax of 9, 524.7 TFlop/s. The speedup is calculated by: speedup = IBM time×IBM Rmax
Sunway time×Sunway Rmax . The main reason for such speedup is that in [13],
10 CZ gates are decomposed in the case of 49 qubits with depth 27. While in our method there are only 7 decomposed CZ gates benefiting from implicit
decomposition and dynamic programming techniques.
Platform Local qubits Depth Gates Time-to-solution Time per gate Memory access per gate Single node bandwidth Speedup
Cori II 30 25 369 9.58 s 0.026 s 16 GB 460 GB/s 1
Sunway 28 26 375 22.6 s 0.060 s 4 GB 27 GB/s 1.84
TABLE VIII
Analysis of two highly optimized simulator in the single node cases. We compare their average performance per unit memory bandwidth (say 1 GB/s). We
reemphasize that in this article, one single node only means one core group of Sunway. While a SW26010 cpu chip has 4 core groups, its memory
bandwidth quadruples, which is more than 100GB/s. Counting in the memory access and average time per gate, we get an approximate speedup:
sppedup =
(ETH time per gate×ETH MA per gate)/Cori single node bandwidth
(Sunway time per gate×Sunway MA per gate)/Sunway single node bandwidth = 1.84
has given us a lot of useful suggestions and assistance. This
work is partially supported by the National Natural Science
Foundation of China and the National Supercomputing Center
in Wuxi.
REFERENCES
[1] “top500 list june 2017”.
https://www.top500.org/lists/2017/06/.
[2] Scott Aaronson and Lijie Chen. Complexity-theoretic
foundations of quantum supremacy experiments. arXiv
preprint arXiv:1612.05903, 2016.
[3] Markov I L, Shi Y. Simulating quantum computation
by contracting tensor networks[J]. SIAM Journal on
Computing, 2008, 38(3): 963-981.
[4] George F Viamontes, Igor L Markov, John P Hayes
Quantum circuit simulation, 2009
[5] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, and H. Neven.
Simulation of low-depth quantum circuits as complex
undirected graphical models. ArXiv e-prints, December
2017.
[6] Sergio Boixo, Sergei V Isakov, V N Smelyanskiy, Ryan
Babbush, Nan Ding, Zhang Jiang, John M Martinis, and
Hartmut Neven. Characterizing quantum supremacy in
near-term devices. arXiv: Quantum Physics, 2016.
[7] Davide Castelvecchi et al. Quantum cloud goes commer-
cial, 2017.
[8] K De Raedt, K Michielsen, De Hans Raedt, B Trieu,
G Arnold, M Richter, Th Lippert, Hiroshi Watanabe, and
Nobuyasu Ito. Massively parallel quantum computer sim-
ulator. Computer Physics Communications, 176(2):121–
136, 2007.
[9] Richard P Feynman. Simulating physics with computers.
International Journal of Theoretical Physics, 21:467–
488, 1982.
[10] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang,
Zhenya Song, Xiaomeng Huang, Chao Yang, Wei
Xue, Fangfang Liu, Fangli Qiao, Wei Zhao, Xunqiang
Yin, Chaofeng Hou, Chenglong Zhang, Wei Ge, Jian
Zhang, Yangang Wang, Chunbo Zhou, and Guangwen
Yang. The sunway taihulight supercomputer: system
and applications. Science China Information Sciences,
59(7):072001, Jun 2016.
[11] Thomas Haner and Damian S Steiger. 0.5 petabyte
simulation of a 45-qubit quantum circuit. arXiv preprint
arXiv:1704.01127, 2017.
[12] Michael A Nielsen and Isaac Chuang. Quantum compu-
tation and quantum information, 2000.
[13] Edwin Pednault, John Gunnels, Giacomo Nannicini,
Lior Horesh, Thomas Magerlein, Edgar Solomonik, and
Robert Wisnieff. Breaking the 49-qubit barrier in the
simulation of quantum circuits. 10 2017.
[14] Mikhail Smelyanskiy, Nicolas Sawaya, and Alan As-
puruguzik. qhipster: The quantum high performance
software testing environment. arXiv: Quantum Physics,
2016.
[15] Chen Z, Zhou Q, Xue C, et al. 64-qubit quantum circuit
simulation[J]. Chinese Science Bulletin, 2018.
[16] Chao Song, Kai Xu, Wuxin Liu, Chui-ping Yang, Shi-
Biao Zheng, Hui Deng, Qiwei Xie, Keqiang Huang, Qiu-
jiang Guo, Libo Zhang, Pengfei Zhang, Da Xu, Dongning
Zheng, Xiaobo Zhu, H. Wang, Y.-A. Chen, C.-Y. Lu,
Siyuan Han, and Jian-Wei Pan. 10-qubit entanglement
and parallel logic operations with a superconducting
circuit. Phys. Rev. Lett., 119:180511, Nov 2017.
[17] Krysta M Svore, A Geller, M Troyer, J Azariah,
C Granade, B Heim, V Kliuchnikov, M Mykhailova,
A Paz, and Dave and Roetteler, Martin Wecker. Q#:
Enabling scalable quantum computing and development
with a high-level dsl. Proceedings of the Real World
Domain Specific Languages Workshop, 2018.
[18] Dave Wecker and Krysta M Svore. Liqui|〉: A software
design architecture and domain-specific language for
quantum computing. arXiv: Quantum Physics, 2014.
[19] Mingsheng Ying. Foundations of quantum programming,
2016.
