Benchmarking 50-Photon Gaussian Boson Sampling on the Sunway TaihuLight by Li, Yuxuan et al.
1Benchmarking 50-Photon Gaussian Boson
Sampling on the Sunway TaihuLight
Yuxuan Li, Mingcheng Chen, Yaojian Chen, Haitian Lu,
Lin Gan, Chaoyang Lu, Jianwei Pan, Haohuan Fu, and Guangwen Yang
Abstract—Boson sampling is expected to be one of an important milestones that will demonstrate quantum supremacy. The present
work establishes the benchmarking of Gaussian boson sampling (GBS) with threshold detection based on the Sunway TaihuLight
supercomputer. To achieve the best performance and provide a competitive scenario for future quantum computing studies, the
selected simulation algorithm is fully optimized based on a set of innovative approaches, including a parallel scheme and
instruction-level optimizing method. Furthermore, data precision and instruction scheduling are handled in a sophisticated manner by
an adaptive precision optimization scheme and a DAG-based heuristic search algorithm, respectively. Based on these methods, a
highly efficient and parallel quantum sampling algorithm is designed. The largest run enables us to obtain one Torontonian function of a
100× 100 submatrix from 50-photon GBS within 20 hours in 128-bit precision and 2 days in 256-bit precision.
Index Terms—Boson sampling simulation, parallel computing, Sunway TaihuLight supercomputer
F
1 INTRODUCTION
THE extended Church-Turing (ECT) thesis states that aclassical computer can efficiently simulate any physical
process with only polynomial overheads [1]. In the 1980s,
Feynman observed that many-body quantum problems can-
not be efficiently solved by classical computers due to
the exponential size of quantum state Hilbert space [2].
This observation inspired Feynman to envision a quantum
computer to solve quantum problems. Efficient quantum
algorithms were proposed for classically hard problems,
such as integer factoring [3], [4]. Today, achieving quantum
supremacy based on quantum computers is anticipated to
be an important milestone in the post-Moore era.
Recently, it was found that quantum computers can
simulate quantum sampling problems in polynomial time,
while classical computers need exponential time unless
the polynomial hierarchy (PH) collapses [5]. Therefore,
the quantum sampling problem has become a feasible
way to demonstrate the quantum supremacy on noisy
intermediate-scale quantum (NISQ) devices [6]. According
to numerical estimation, quantum sampling with 50 ∼ 100
quantum particles is beyond the computational capabilities
of the state-of-the-art supercomputers towards achieving
quantum supremacy [7].
Candidates for quantum sampling problems include the
instantaneous quantum polynomial time circuit [8], random
circuit sampling (RCS) [9], [10], boson sampling [5], and
Gaussian boson sampling (GBS) [11], [12]. For the physical
• Y. Li, Y. Chen, L. Gan, H. Fu, and G. Yang are from Department of
Computer Science and Technology, Tsinghua University, Beijing 100084,
China.
• M. Chen, C. Lu, and J. Pan are from 1Hefei National Laboratory for
Physical Sciences at Microscale and Department of Modern Physics,
University of Science and Technology of China, Hefei 230026, People’s
Republic of China; and 2Shanghai Branch, CAS Center for Excellence in
Quantum Information and Quantum Physics, University of Science and
Technology of China, Shanghai 201315, People’s Republic of China.
• H. Lu is from the National Supercomputing Center in Wuxi, China.
• *Corresponding author, Gan Lin (Email address: lingan@tsinghua.edu.cn)
implementation, instantaneous quantum polynomial time
circuit and RCS are based on quantum bits, while boson
sampling and GBS are based on bosons, such as single pho-
tons. Since the proposal of GBS greatly simplifies its quan-
tum implementation, quantum sampling based on bosons
is expected to be an important milestone in demonstrating
quantum supremacy [1], [13].
This work establishes the quantum supremacy frontier
for the boson-based quantum sampling problem. To achieve
the best performance and provide a competitive scenario
for quantum computers, the selected algorithm is fully
optimized based on the Sunway TaihuLight supercomputer,
which is one of the most powerful classical computers in the
world. Our major contributions are as follows:
• an effective parallel scheme is proposed to reduce
the requirements of cache-level storage and obtain
an almost ideal load balance of the selected boson
sampling algorithm.
• an instruction-based and multiple-precision optimiz-
ing scheme that enables customizable data precision
modes is designed to guarantee sufficient accuracy
of the simulation; an adaptive framework based on
upper- and lower-bound estimation is further ap-
plied to determine the best precision of the config-
uration.
• in terms of instruction-level optimization, a DAG-
based heuristic search algorithm is adopted to
achieve optimal instruction scheduling for all ker-
nels.
2 RELATED WORKS
In 2019, Google first claimed the quantum supremacy by
using 53 superconducting quantum bits in an RCS experi-
ment [14]. The quantum device took 200 seconds to sample
one instance of a quantum circuit a million times, while
classical benchmarking was expected to need 10,000 years
ar
X
iv
:2
00
9.
01
17
7v
1 
 [c
s.D
C]
  2
 Se
p 2
02
0
2on the Summit supercomputer. To confirm and establish this
quantum supremacy, there have been many developments
in the classical benchmarking algorithm [15], [16], [17], [18],
[19], [20] before and after the experiments. The latest result
found by IBM indicates that the calculation can be run on
the same supercomputer in less than two and a half days—
rather than 10,000 years [21], [22]. Furthermore, by leverag-
ing secondary storage of supercomputer, the sampling prob-
abilities of all the 253 states can be parallelly computed. For
large number of samples, the average classical calculation
time for a single sample will be 253 times faster. Therefore,
a more reliable state-of-the-art benchmarking on classical
computers is crucial to prove the quantum supremacy.
Today, many studies have been based on the RCS prob-
lem. In general, the classical simulators of RCS problems are
categorized into three classes. The first class and the second
class are the direct evolutions of the quantum state [15],
[23], [24], [25], [26] and the perturbation of stabilizer circuits
[27], [28], [29], respectively. The tensor network contraction
class [18], [19], [30] is the most suitable method for current
flop-oriented architectures such as on the Fugaku [31] and
Summit [32] supercomputers. In addition, several hybrid
algorithms [16], [21], [33] show their great performance in
achieving a simulation or benchmark with ∼ 50 quantum
bits. There are also some efforts [24], [34] that attempt
to build up the benchmarking of the quantum supremacy
based on the RCS problem on the Sunway TaihuLight su-
percomputer.
The boson sampling problem, which was introduced by
Aaronson and Arkhipov [35], is historically the first protocol
to conclusively demonstrate the quantum supremacy. Sev-
eral algorithms have been designed and experiments have
been conducted referring to such approaches [36], [37], [38],
[39], [40]. However, experimentally, it is actually difficult
to scale the boson sampling to high photon numbers ow-
ing to its intrinsic photon cost. GBS [41] and GBS with
threshold detection [12] were recently proposed to obtain
higher-order photon numbers than those of boson sampling.
There are several classical benchmarks for GBS. Based on
the Titan supercomputer, Brajesh Gupt et al. [42] proposed
a benchmark for GBS with threshold detection. However,
due to the limitations of the method, it can only handle
approximately 22 clicks (or photons) when using the entire
Titan supercomputer.
Different from previous works, we establish the first
classical benchmarking for GBS with threshold detection
based on calculation of the Torontonian function, which
has not been used in previous works. Our design is able
to calculate a single Torontonian for 50-photon GBS in less
than one day.
3 BACKGROUND
3.1 Selected Algorithm: Boson Sampling
In the standard boson sampling experiment, N indistin-
guishable single photons are sent to anM -port linear optical
network, and the output scattered photons are detected by
N single-photon detectors. The probabilities of these N -
photon events are related to the permanent function of an
N ×N sampling matrix, which is proven to be #P-hard for
classical computers. However, a large-scale experiment suf-
fers from the formidable challenge of preparing N perfect
single photons [36], [38], [39], [43], [44], [45], [46], [47].
Linear network
Linear network
Single-photon
detector
Single-photon
detector 
"0" or "1"
"0" or ">0"
Single-photon
source
Non-classical
light
(a)
(b)
Fig. 1: (a) Standard boson sampling; (b) GBS with threshold
detection.
A practical improvement was proposed to change the
input single photons to nonclassical Gaussian light and use
single-photon detectors without photon-number resolution
to register N -click events, which is called GBS with thresh-
old detection [11], [12], [48], [49]. In this case, the probability
of N -click events is the Torontonian function of a 2N × 2N
sampling matrix, which is expected to still be exponentially
hard for classical computers.
In a GBS experiment, before photon detection, the output
Gaussian state is described by a covariance matrix
∑
. The
sampling matrix A is then defined as A = I −∑−1, and
the probability p(S) of an N -click event S = s1, s2, ..., sM is
determined by
p(S) = 〈S|ρ∑|S〉 = Tor(A{S})√
det(
∑
)
(1)
, where ρ∑ is the output quantum state of a covariance
matrix
∑
, S is the threshold click pattern with sk ∈ 0, 1
and
∑M
k=1 = N , and A{S}is the 2N × 2N submatrix of
A by selecting the k-th and (M + k)-th row and column
when sk = 1. The matrix function Tor(·) is the Torontonian
function, which is defined as
Tor(A) =
∑
Z∈PN
(−1)N−|Z| 1√|det(I −AZ)| (2)
, where PN is the power set of 1, 2, ..., N , and hence, there
are 2N determinant terms in the summation [12]. The com-
putational complexity of the Torontonian function is O(2N ),
which is the same as the permanent function in standard
boson sampling with N photons.
In this work, our major task is to calculate the Toronto-
nian function—which has exponential complexity.
3.2 Sunway Architecture
To achieve the best simulation performance, this work se-
lects the Sunway TaihuLight supercomputer as the clas-
sical platform. The system consists of 40,960 many-core
processors called SW26010s and is able to provide a peak
performance of 125 PFlops and a sustainable performance
of 93 PFlops. Each SW26010 processor consists of four core
groups (CGs), which are shown in Figure 2. Each CG has a
memory controller (MC), a management processing element
(MPE) and 64 computing processing elements (CPEs).
3CPE
CPE
CPE
CPE
CPE
CPE
CPE
CPE
CPE
CPE
CPECPE
CPE
CPE
CPE
CPE
...
...
...
...
...
...
...
...8x8
M
P
E
M
C
L1
L2
C G
Main
Memory
SPM
Fig. 2: Block diagram of a CG.
TABLE 1: Main instructions in this work
Instruction Description Latency
umulqa 128-bit unsigned 6
complete multiplication
uaddo carry 256-bit unsigned addition 2
usubo carry 256-bit unsigned subtraction 2
sllow 256-bit logical left shifting 2
srlow 256-bit logical right shifting 2
vshff shuffle based on 1
two vectors and a mask
vsellt vector ”less than” 1
conditional selection
As shown in Figure 2, the MPE is a fully functional 64-
bit RISC general-purpose core that contains a 32-KB L1 data
cache and 256-KB L2 instruction/data cache. The CPE is
a simplified 64-bit RISC computing-oriented core and has
a private 16-KB L1 instruction cache and 64-KB scratch-
pad memory (SPM). Direct memory access (DMA) is the
major method for the CPEs to access data from global mem-
ory. Furthermore, a thread-level communication is enabled
based on register communication. Thus, inside one CG,
the CPE cores within the same row or column are able to
communicate with each other directly.
A distinctive feature of the Sunway architecture is its
256-bit large integer instruction sets, as shown in Table
1. The first five instructions (multiplication, addition, sub-
traction and shifting) belong to the 256-bit large integer
instruction sets. However, the multiplication instruction is
not a straightforward 256-bit version; instead, it performs an
unsigned multiplication of two 128-bit unsigned numbers
and obtains a full 256-bit result.
shuffle(a,b,{0,2,0,1})  
a0 a1 a2 a3
b0 b1 b2 b3
b1 b0 a2 a0
(a)
1 -1 -1 1
b0 b1 b2 b3
c0 c1 c2 c3
b1c0 b2 c3
vsellt(a,b,c)  
(b)
Fig. 3: (a) Example of vshff instruction; (b) example of vsellt
instruction
The rest of the instructions in the table are regular
vector instructions and are useful in optimizing the boson
sampling process. The parameters of instruction vshff is a,
b and mask are shown in Figure 3a. a and b are two 256-
bit registers that contain four 64-bit numbers. mask provides
information on how to construct a new register from a (the
first two digits of mask) and b (the last two digits of mask).
Figure 3a shows an example: based on the four digits of
mask, 0 and 2 of A (corresponding value a0 and a2), and
position 0 and 1 of B (corresponding value b0 and b1) are
selected to construct a new register.
The conditional selection instruction vsellt takes three pa-
rameters a, b, and c, which are all 256-bit registers composed
of four 64-bit numbers. a is the conditional register that
chooses between b and c. As shown in Figure 3b, if one of the
64-bit numbers in a is less than 0, then the corresponding 64-
bit number of the result comes from the corresponding bits
of b and vice versa.
4 PARALLEL SCHEME DESIGN
This section introduces the parallel scheme of solving Equa-
tion 2, including partition and storage strategies for load
balancing and the LDM capacity issues.
4.1 Partition Strategy
For Equation 2, the best dimension to partition is the
enumeration of Z due to its embarrassing parallelism. For
convenience, we map each Z to a number mask ”maskz”
ranging from 1 to 2N − 1. If i ∈ Z , the i-th bit in the binary
representation of maskz is ”1” and vice versa.
A naive partition strategy can be directly obtained by
evenly dividing the 1 ∼ 2N − 1 mask into every CPE
from different CGs. However, while two continuous num-
bers must have different numbers of ”one” in the binary
representation, the matrix size and computing load of each
CPE are different. Therefore, the simple and naive strategy
brings a severe thread-level load imbalance.
Algorithm 1 Calculate Torontonian(A)
Require: Matrix A
Ensure: A is Hermitian positive definite
1: function Torontonian(A)
2: result← 0
3: for i|Z| = 1→ N do
4: maskz ← GET KTH MASK(N, i|Z|,
( N
i|Z|
)
∗rank
nproc
)
5: maskEnd← GET KTH MASK(N, i|Z|,
( N
i|Z|
)
∗(rank+1)
nproc
)
6: whilemaskz 6= maskEnd do
7: AZ ← GEN AZ(maskz)
8: det← GET DETERMINANT(I −AZ)
9: result← result+ (−1)N−i|Z| 1√|det|
10: maskz ← GET NEXT MASK(maskz)
11: end while
12: end for
13: return result
14: end function
To avoid this situation, we design a new partition strat-
egy with high load balancing. We first divide the workloads
of matrices into N parts according to the matrix size |Z|.
In each part, we divide all masks evenly and continuously
into every process. Algorithm 1 describes the Torontonian
function with the new partition strategy. Line 3 indicates
the enumerating of N parts; in Line 4 and Line 5, the
masks assigned to a process are determined; in Line 7, the
calculating matrixAZ is generated by the current mask; Line
8 is the determinant calculation that is the most expensive
4part; Line 9 is the update of the result; and in Line 10, the
next mask is derived from the current mask, and then, the
next loop begins. In a certain part, each CPE processes a
matrix of the same size, obtaining the best load balance.
Algorithm 2 Get Matrix Mask
1: function GET KTH MASK(N, |Z|, k)
2: l← −1
3: maskz ← 0
4: for i = 0→ |Z| − 1 do
5: for j = l + 1→ N − |Z|+ i+ 1 do
6: cnt← C(N-j-1,—Z—-i-1)
7: if cnt > k then
8: l← j
9: maskz ← maskz or 2N−j−1
10: end if
11: k ← k − cnt
12: end for
13: end for
14: returnmaskz
15: end function
16: function GET NEXT MASK(mask)
17: cto← COUNT TAIL ONE(mask)
18: x← mask − (2cto − 1)
19: ctz ← COUNT TAIL ZERO(x)
20: masknext ← x− 2ctz−cto−1
21: returnmasknext
22: end function
23:
In the naive strategy, we only need to calculate maskz =
2N ∗rank/nproc to obtain the first mask of a certain process
(or obtain the k-th mask) and to calculatemaskz+1 to obtain
the next mask. nproc denotes the total number of processes,
and rank denotes the rank of a certain process.
However, the two functions for obtaining the mask,
the pseudocodes of which are shown in Algorithm 2, are
rather complicated in the new strategy. The get kth mask
function is based on a bit-by-bit algorithm that determines
the position of every ”1” from high to low in binary repre-
sentation. In each round of the outer loop, the position of
a ”1” is determined. In the inner loop, we try each possible
position one by one according to the comparison results of
the combinatorial enumeration ”cnt” and the current ”k”.
If cnt is greater than ”k”, then we set the position to ”1”
and begin the next round of the outer loop; otherwise, we
subtract cnt from ”k”. The main task of the get next mask
function is to move the consecutive ”1”s at the tail of the
binary representation to a proper position. For instance,
consider that the current mask is ”010011”. To obtain the
next mask ”001110”, we need to move the last two ”1”s to
the fourth position (beginning with 1) and move the first ”1”
to the next position. Therefore, a bit operation algorithm is
designed. cto and ctz represent the number of consecutive
ones and zeros at the tail, respectively. cto is used to count
the consecutive ”1”s at the tail, while ctz is used to indicate
the proper position.
Such a partition scheme is also applicable to thread-level
parallelisms with trivial modifications.
4.2 Storage Strategy
Neither the input nor the calculating matrix can be fit to
the 64 KB LDM space if the scale N is too large. For
instance, when N is 45 based on 256-bit multiple precision,
the storage requirement of a matrix is (2N) × (2N) ×
sizeof(complex) = 90 ∗ 90 ∗ 64 = 506.25 KB, and far exceeds
the LDM capacity. In addition, when reserving some LDM
space to store the input matrix to avoid extra DMA copies,
the insufficiency of capacity can be even worse. Therefore,
we propose a storage strategy to lower the LDM require-
ment of both the input matrix A and calculating matrix AZ .
For A, we use lossy compression and a matrix symmet-
ric. As scientific measurement data, there are only three to
five significant digits in each element of matrixA. Therefore,
a 16-bit number (with an error of±1.525×10−5) is sufficient
to describe the measurement data. Additionally, in our
implementation, 32-bit lossy compression is supported for
higher-precision requirements.
Another reason that brings storage improvements is
the usage of a matrix symmetric. Owing to the physical
properties of the matrix A, if we represent matrix A in the
form of a block matrix with four blocks as
A =
[
A00 A01
A10 A11
]
(3)
then the following three symmetries are satisfied:
1) A is Hermitian
2) A01 and A10 is symmetric
3) A00 = A11
Now, only half of A00 and A01 need to be stored, and thus,
only approximately 1/4 of the LDM space is required.
CPE 0 CPE 1
CPE 0 CPE 1
Register Comm.
Register Comm.
Next round
Fig. 4: Gaussian Elimination based on register communi-
cation. The upper panel and lower panel represent two
adjacent rounds of reduction. Each block indicates a row of
matrix. Since the matrix is an upper triangular matrix, the
lower block has fewer elements than the upper block. The
block with a red box indicates the row where the pivot is
located. The green arrow indicates the direction of register
communication. The green block indicates the buffer for
receiving the row with pivot from the other CPE.
For calculating the matrix AZ , we use the Hermitian
symmetric and distribute the matrix into several CPEs to
save the LDM space. Since A is Hermitian positive definite,
it is not difficult to prove that I − AZ is also Hermitian
positive definite, and in each round of row reduction, the
matrix is always Hermitian. Thus, we can save half the space
of the matrix when performing calculations, i.e., the upper
triangular part of the matrix, which saves half of the LDM
space.
It is still not enough to put the whole calculating matrix
into LDM. The matrix must be scattered into different CPEs,
which brings an extra cost when a CPE needs data from
5other CPEs. Fortunately, the register communication mech-
anism of the Sunway architecture provides an effective and
efficient way to exchange data among different CPEs.
Figure 4 illustrates our distribution strategy and the us-
age of register communication when the matrix is scattered
to two CPEs. Each row of the matrix is alternately assigned
to two CPEs. In the upper panel, the row with pivot is
located at CPE 0. In this round, CPE 0 sends the row with
pivot to CPE 1 via register communication. Then, each CPE
multiplies this row by a proper scalar and adds the scaled
row to other rows to finish reduction. In the next round
of row reduction, as shown in the lower panel, the row
with pivot comes to CPE 1, and thus, the data need to be
transferred from CPE 1 to CPE 0. The process of reduction
is the same as in the previous round.
Owing to the two strategies for the calculating matrix,
the LDM is now enough to store all the required data.
When N is 45 and the matrix is scattered to eight CPEs,
each CPE is occupied 506.25 KB/2/8 = 31.64 KB LDM
space, where 506.25 KB is mentioned in the beginning of
this subsection, the division factor 2 is due to Hermitian
symmetry, and 8 is due to the utilization of eight CPEs.
In addition, when using 32-bit compression, the size of
the input matrix is (2N) ∗ (2N) ∗ 2 ∗ 4/4 = 16.17 KB.
Therefore, the total LDM requirement is approximately
31.64 KB + 16.17 KB = 47.81 KB, which is less than
the capacity limit of 64 KB.
5 ADAPTIVE MULTIPLE-PRECISION DESIGN
5.1 Motivation
For real-world data, the result of the Torontonian function
is a very small decimal, which results in a high-precision
requirement. For instance, when N is 45, the result is
1.44 × 10−25. According to Equation 2, the result Tor(A)
is accumulated from the summation of massive reciprocal
square roots of the determinant. We can prove that the value
of each reciprocal square root is greater than 1. Thus, at least
25 decimal digits (roughly equivalent to 83 binary bits) is
required to achieve adequate accuracy in the determinant
calculation. According to the IEEE 754 standard [50], the
double-precision type, which has only 53 binary bits of sig-
nificant precision, is not competent to represent these num-
bers. Therefore, a customized multiple-precision type must
be designed for the Torontonian function. To design such a
precision type, we first consider the upper and lower bound
analysis of the absolute value of intermediate results. The
actual upper bound depends on the summation order of the
Torontonian function. If all positive numbers are summed
first and all negative numbers are added afterward, then the
intermediate results will be very large and proportional to
the times of summation (i.e., O(2N )). However, the result
of the Torontonian function is guaranteed to be a very
small number. Thus, the overflow of the summation can be
omitted. In other words, the part of the intermediate results
greater than 1 can be truncated.
The upper bound, which cannot be omitted, is reached
when calculating the reciprocal of pivot in the Gaussian
elimination. The value of pivot is difficult to obtain, but
fortunately, it can be proved to be smaller than its corre-
sponding determinant, so we can replace all of them with
the value of the determinant. Furthermore, it can be shown
that the determinant of I−A is less than any of its principal
minors, which means that the upper bound can be estimated
by the reciprocal of the determinant of I − A. To make
an estimation without precise computation, I − A can be
calculated with this approximation with |Z| = N :
det(I −AZ) ≈ ((1− a)− k
1− a |Z|)
|Z| · (1− a)|Z| (4)
, where k is a correction factor depending on matrix A that
has an observed range of 0.0009 ∼ 0.0035, and a is the
representative value of the diagonal element aii. In real-
world experiments, the range of a is 0.16±0.06. In addition,
the value of the determinant when N = 60 will reach 10−6,
which means that the upper bound can be estimated to be
107.
The lower bound is reached at the final result of the
Torontonian function. Equation 5 is an approximation of the
final result based on the theorem of matrix analysis and the
result of det(I −A):
Tor(A) ≈ (((1− a)− kN
1− a )
−1 − 1)N (5)
, where k and a are the same as in Equation 4. When k, a,
and N are not too large (k < 0.0035, a < 0.2, N < 60),
we have kN/(1− a) 1− a. Therefore, the approximation
is close to (1/(1− a)− 1)N , which can be guaranteed to
be smaller than the actual result. Then, our approximation
can be trusted to be a lower bound of the actual result.
With different k and a, there will be some variances in the
value of the Torontonian function. Taking the range of these
parameters into consideration, the lower bound can reach
10−61 at N = 60.
According to the bound analysis, the range of interme-
diate results falls into a range of 10−61 to 107, which can
be represented by a large fixed-point number. Therefore, we
choose a fixed-point number as our precision type, which is
much faster than a floating-point number.
5.2 Multiple-Precision Design
The large integer instruction set in the Sunway architecture
is very suitable for implementing fixed-point numbers. As
mentioned in Section 3.2, the Sunway processor supports
256-bit large integer operations, which implies that up
to 256-bit fixed-point calculations can be implemented by
hardware instructions directly or almost directly. According
to the bound analysis, 256-bit fixed-point precision is suffi-
cient for up to N = 60, which is an impossible task for a
classical computer.
However, it must be noted that there is no 256-bit large
integer multiplication in the architecture. The largest multi-
plication instruction is the 128-bit unsigned multiplication.
Constructing the 256-bit signed multiplication requires nu-
merous instructions based on the supported 128-bit multi-
plication. It is too expensive to perform the 256-bit signed
multiplication.
To decrease the cost of multiplication, we propose an
adjustable multiple-precision design. In addition to the 256-
bit precision, we mainly provide a precision mode of 128-
bit fixed-point numbers, which is much less expensive in
terms of multiplication. The 128-bit precision mode is not
6competent for all cases but is much faster than the 256-bit
precision mode when it works.
For the scaling factor, we just use a global factor for every
fixed-point number. Unifying the scaling factor is beneficial
to the performance efficiency.
5.3 Operation Design
In the determinant solver of the Torontonian function,
O(2NN3) times of multiplications are required, which is
the bottleneck of the whole calculation. Constructing signed
multiplication based on the unsigned multiplication is the
crucial challenge in multiple-precision design. A simple
method is to convert negative numbers to positive numbers
by using if statements. However, the if statements may
result in a failure of branch predictions, which is a hazard to
performance. Thus, avoiding if statements is considered the
best way to enhance performance.
Take a 128-bit multiplication between 128-bit signed
numbers a and b as an example. For convenience, we
assume that a and b are both negative. The complements
of a and b are expressed as (2128 − |a|) and (2128 − |b|). The
unsigned multiplication is represented below:
(2128−|a|)∗(2128−|b|) = 2256−(|a|+|b|)∗2128+|a|∗|b| (6)
Therefore, the 256-bit result produced by unsigned mul-
tiplication is |a| ∗ |b| − (|a| + |b|) ∗ 2128, in which the term
of 2256 is truncated. The result is not equal to the correct
one, which is |a| ∗ |b|. It worth noting that if a and b are
256-bit signed numbers, then the second term of Equation 6
is (|a| + |b|) ∗ 2256, which can be truncated. This indicates
that the 256-bit results of the signed multiplication can be
taken from the unsigned multiplication between two 256-bit
signed numbers.
However, as mentioned above, there is no 256-bit mul-
tiplication instruction in the Sunway architecture. Instead,
three times of 128-bit multiplication are required, i.e., a∗b =
a[128,256) ∗ b[0,128)+a[0,128) ∗ b[128,256)+a[0,128) ∗ b[0,128). Be-
cause a and b are essentially 128-bit numbers, a[128,256) and
b[128,256) are either 0 or −1; i.e., a[128,256) ∗b[0,128) is equal to
0 or−b[0,128). Therefore, there is no need to actually perform
multiplication for a[128,256) ∗ b[0,128) and a[0,128) ∗ b[128,256).
In contrast, vectorized conditional selection instructions are
competent.
Algorithm 3 128-bit signed multiplication
Require: 128-bit signed number a and b
1: function MUL128(a, b)
2: c← UMULQA(a, b)
3: asign ← VSHFF(0, a, 0x55) . SHUFFLE(0, a, {1, 1, 1,
1})
4: bsign ← VSHFF(0, b, 0x55) . SHUFFLE(0, b, {1, 1, 1,
1})
5: a128 ← VSELLT(bsign, a, 0)
6: b128 ← VSELLT(asign, b, 0)
7: c← (c− a128 − b128) >> scaling factor
8: return c
9: end function
Algorithm 3 describes our 128-bit signed multiplication
design. The first line describes 128-bit unsigned multiplica-
tion. The second and third lines are prepared for conditional
selection instructions. By vshff instruction, the sign bit of a
is extracted and put into high 128 bits of asign. The low 128
bits of asign is 0. In the next line, according to asign, the
vectorized conditional selection instruction vsellt is applied
to make a choice between b and 0. Last, c is calculated and
scaled in Line 7.
Therefore, in our method, a total of eight instructions
are adequate for 128-bit signed multiplication. The 256-
bit signed multiplication is much more difficult but shares
the same idea. In our implementation, 19 instructions are
sufficient.
Division, which is used for calculating the scalar applied
in the elimination, is the second expensive operation. Al-
though only O(2NN2) times of divisions are required, the
time complexity of division is much higher than that of
multiplication. In our work, we use O(2NN) times of re-
ciprocal calculations and O(2NN2) times of multiplications,
instead of using division directly. In general, the reciprocal
optimization results in a decrease in precision; however,
our multiple precision design is enough to withstand the
loss of accuracy. We choose trial division to implement the
reciprocal. The time complexity is O(b), where b represents
the bit length of the multiple-precision fixed-point number.
The last operation required is the reciprocal square root.
It has the same time complexity with division but is only
called O(2N ) times, and thus, it is evidently not a bottleneck
in the Torontonian function. Similar to division, a bitwise
method is applied to the reciprocal square root.
5.4 Adaptive Precision Strategy
The final task of our precision design is to determine the
scaling factor and bit length of the multiple-precision type
(i.e., 128 bits or 256 bits). If the scaling factor is too small
or the bit length is too short, then the results may have an
overflow or insufficient significant digits. In contrast, if the
bit length is too long, then the performance efficiency will be
affected. In this subsection, we propose a strategy to select
a proper scaling factor and bit length automatically.
The scaling factor depends on the upper bound of all
intermediate results. As explained in Subsection 5.1, the
upper bound can be represented by the reciprocal of de-
terminant of I −A. As long as we calculate the determinant
of I − A in advance, the upper bound can be obtained,
hence the scaling factor. The choice of bit length is much
more complicated and depends on several factors including
both the upper bound and the lower bound. In our adaptive
precision strategy, Equation 5 is chosen to approximate the
trusted lower bound.
In addition, bits owing to the accumulated error should
be taken into account. Since each element has N2 times of
elementary arithmetic in a determinant calculation, we as-
sume that one time of determinant calculation will produce
an error of αN2, where α is a correction factor. Meanwhile,
2N times of determinant calculations are required. Thus, we
use α2NN2 to estimate the accumulated error, where α is
taken as 0.5 by some experiments.
7Last but not least, the significant digits required (usually
for physical experiments, this is three decimal digits) should
be considered.
The final bit length can be represented as follows:
B = Blower +Bupper +Bsgn +Baccum (7)
, where B denotes the total bit length required, Blower
denotes the part of the bit length derived from the upper
bound (or scaling factor), Bupper denotes the part derived
from the upper bound, Bsgn denotes the part derived from
the significant digit requirement, and Baccum denotes the
part derived from the accumulated error. According to the
relationship between B and 128, the 128-bit precision mode
or 256-bit precision mode is chosen.
6 OPTIMAL INSTRUCTION SCHEDULING
6.1 Framework
To obtain the optimal instruction scheduling, the cor-
responding hardware features must be considered care-
fully. For each CPE from the Sunway processor, there are
two hardware pipelines with different functions. The first
pipeline mainly takes charge of floating-point or vector
instructions, and the second pipeline mainly takes charge
of memory access and branch instructions. In addition,
integer scalar instructions are processed by both pipelines.
In our multiple-precision design, most instructions are large
integer instructions, which belong to vector instructions and
are thereby taken charge of by the first pipeline. Only about
15% of the instructions belong to the second pipeline in each
kernel.
The performance (or the pipeline stall) of the first
pipeline determines the overall performance. Two hazards
affect the pipeline efficiency in the Sunway architecture. One
is the read-after-write (RAW) data hazard; the other is the
writeback stage structural hazard, which conflicts with the
writeback of general-purpose registers. Given both hazards,
it is hard to find an optimal instruction scheduling directly.
Thus, we turn to optimization methods to find the optimal
instruction scheduling.
For the instructions assigned to the second pipeline,
we manually insert them into the scheduled sequence of
instructions that belong to the first pipeline. To eliminate
computation dependence and avoid data hazards, software
pipeline technology is applied. As a result, these inserted
instructions do not increase the number of total clocks in
our case. In other words, all time cost comes from the first
pipeline.
Consequently, our instruction scheduling framework is
as follows:
Stage 1 Distinguish which pipeline to which
each instruction belongs.
Stage 2 Search an optimal scheduling from in-
structions of the first pipeline.
Stage 3 Manually insert instructions of the sec-
ond pipeline into the scheduled instruction se-
quence.
umulqa ar, br, crr
umulqa ar, bi, cri
umulqa ai, br, cir
umulqa ai, bi, cii
vsellt 𝑏𝑟!"#$, ar, $zero, drr
vsellt 𝑏𝑟!"#$, ai, $zero, dri
…
vsellt 𝑎𝑟!"#$, br, $zero, err
…
vshff ar, $zero, 0x55, 𝑎𝑟!"#$
vshff ai, $zero, 0x55, 𝑎𝑖!"#$
vshff br, $zero, 0x55, 𝑏𝑟!"#$
vshff bi, $zero, 0x55, 𝑏𝑖!"#$
subo_carry crr, drr, crr
subo_carry cri, dri, cri
…
subo_carry crr, err, crr
…
…
…
Fig. 5: Part of the instruction DAG for the 128-bit reduction
of Gaussian elimination.
6.2 Heuristic Search Based on DAG Constraints
In our instruction scheduling framework, Stage 2 is the crit-
ical stage that determines the final performance. A heuristic
search method is proposed to address this stage.
We choose the A-Star search algorithm [51] as the frame-
work. To search in heuristic order, the A-Star algorithm
selects the next state that minimizes f(n) = g(n) + h(n),
where n is the next state be selected, g(n) is the cost
of the selected state set, and h(n) is a heuristic function
that estimates the lowest cost from the current state set to
the goal. In our work, g(n) is the execution clocks of all
selected instructions, and h(n) is the number of remaining
instructions.
However, only an A-Star search is insufficient with re-
gard to the instruction scheduling with a scale of more than
30 instructions. More constraints to prune are necessary. Us-
ing a directed acyclic graph (DAG) to represent constraints
is very suitable to the instruction scheduling. In general, a
node in a DAG represents an instruction; an edge a → b
in a DAG indicates that b has data dependence on a. With
regard to the heuristic methods based on DAGs, a previous
work [52] surveyed three DAG construction algorithms and
twenty-six proposed heuristics methods and evaluated their
performance exhaustively.
Different from existing methods, a node in a DAG repre-
sents a sequence of instructions in our method. Due to the
properties of complex operations, instructions in the kernel
are symmetrical to some extent. For example, computing
of the real and imaginary parts is symmetrical. Moreover,
if the operation is fused multiply-add (FMA), then the
instructions can be divided into 4 symmetrical parts. For
each symmetrical part, the corresponding instructions are
the same instructions with different operands. We assign
the corresponding instructions from different parts into the
same DAG nodes. Then, we can set an order for these
corresponding instructions distinguished by the operands,
and all nodes share the same order. Setting such an order
does not lead to a loss of the optimal solution because (1)
swapping corresponding instructions in some DAG nodes
8TABLE 2: Clocks of first pipeline based on different instruc-
tion scheduling methods.
Method 128 Outer 128 Inner 256 Outer 256 Inner
Origin 21(13) 48(34) 64(32) 111(64)
Manual 15(13) 37(31) 39(32) 79(58)
Optimal 15(13) 33(31) 35(32) 65(58)
cannot affect the structural hazard in the writeback stage
since all corresponding instructions have the same clocks,
and (2) data hazards cannot be reduced by swapping since
the best choice is for all DAG nodes to share the same order.
Figure 5 shows the main part of the instruction DAG of
a 128-bit reduction. In the top left yellow panel, the order
within the node is determined as ”crr”, ”cri”, ”cir”, and
”cii”, which are four different combinations of the real and
imaginary parts of two complex numbers. The remaining
DAG nodes all maintain the same internal order. In addition,
in the middle orange panel, we can put eight instructions
into this node due to the symmetry between ”b, arsign”
and ”a, brsign” rather than the complex symmetry. Through
the optimization, the efficiency of the heuristic search is
increased many times, and thus, we can obtain the optimal
results.
Table 2 lists the clocks of different methods in different
computing parts. ”128 Outer” represents the clocks of the
instruction block in the outer loop, which is called O(N2)
times (while the inner loop (or the reduction of Gaussian
elimination) is called O(N3) times) when using 128-bit
precision mode. For the three different methods, ”Origin”
represents results without any scheduling; ”Manual” rep-
resents results by a redundant removal and a relatively
careful manual scheduling; and ”Optimal” represents the
optimal results achieved by heuristic search described in
this section. The number in brackets indicates the number
of instructions, and the number outside of the brackets
indicates the clocks.
We can see that the optimal results are much more effi-
cient than the results of the previous two methods, showing
our brilliant pipeline performance of approximately 90%
efficiency. This result implies that for our multiple-precision
design, we achieve 90% of the theoretical optimal floating-
point performance.
7 EVALUATION
7.1 Performance Results
This part demonstrates the performance results. Unless oth-
erwise specified, we use 39,286 nodes (157,144 processes)
of Sunway TaihuLight for experimentation, which is almost
the entire capacity of the supercomputer (a total of 40,960
nodes).
7.1.1 Time to Solution
Figure 6 shows the execution time of both precision modes
based on a log-transformed distribution. In the beginning,
the execution time is nonlinear because the workload occu-
pancy is insufficient. Then, the time is almost linear in the
end due to the high scalability.
To obtain a sample, it usually needs to calculate about
100 probabilities of the candidate samples using Markov
0 10 20 30 40 50
Number of Photons
10 5
10 3
10 1
101
103
105
Ti
m
e 
(s
)
128-bits mode
256-bits mode
Fig. 6: Execution time for obtaining a single Torontonian.
Two precision modes are tested with different number of
photons on the entire Sunway TaihuLight supercomputer.
0 10 20 30 40 50
Number of Photons
10 3
10 1
101
103
105
107
109
1011
CP
U
 X
 T
im
e 
(s
)
New GBS Algorithm by BJ Wu
GBS with TD by N. Quesada
Hafnian Computing by A. Bjorklund
GBS with TD on Titan by B. Gupt
Torontonian Computing on SW with 256 bits
Fig. 7: CPU × Time results of various GBS classical bench-
marks. Red line indicates this work when using 256-bit
precision mode. Blue, yellow, green, and pink lines indicate
the works of [53], [54], [55], and [42], respectively.
Chain Monte Carlo (MCMC) sampling method [40]. To
calculate one Torontonian probability for a 50-click photon
detecting event, the execution time in our benchmark is
73,773s (about 20 hours) for 128-bit mode and 170,891s
(about 2 days) for 256-bit mode, respectively.
Figure 7 shows a comparison with other works that
handle GBS. In the figure, we can see that except for our
work (red line) and the work of [55] (green line), most
works cannot process up to 50 photons. Compared with
work represented by the green line, our work is hundreds of
times faster. The time-to-solution results show the excellent
performance of our method.
7.1.2 FLOPS
For FLOPS counting, we count all arithmetic operations in
the Torontonian function. As a standard, solving a linear
equation needs 2/3n3 + O(n2) FLOPS with a size n ma-
trix [56]. However, we need to make adjustments to the
determinant calculation of the Torontonian function. First,
as mentioned above, the input matrix is guaranteed to be an
Hermitian positive definite complex matrix. Thus, the cost
of calculating a complex number should be considered, and
half of the operations can be saved by Chomsky decompo-
sition (or a modified Gaussian elimination). In addition, the
9matrix size n is usually small in a Torontonian function, so
the term O(n2) should not be omitted. Therefore, we count
the operations of the linear solver accurately, as shown in
Formula 8.
FLOPS =
4
3
n3 + n2 − 4
3
n (8)
Bringing this formula into the Torontonian function and
using some combinatorial math skills, we obtain the total
FLOPS:
FLOPS =
N∑
i=1
(
N
i
)
(
32
3
i3 + 4i2 − 8
3
i) (9)
Figure 8 shows the performance results of two different
precisions based on the final formula. The highest sustained
performance of 128-bit mode and 256-bit mode are 2.78
PFLOPS when N = 45 and 1.27 PFLOPS when N = 37;
the highest peak performance is 3.67 PFLOPS when N = 45
and 1.85 PFLOPS when N = 41, respectively.
The highest FLOPS did not occur at N = 50. This is
because when N becomes larger, the LDM space require-
ment becomes larger. Thus, as mentioned in Subsection 4.2,
a calculating matrix is scattered to more CPEs, which results
in additional overhead (explained in Subsection 7.1.5). This
issue is especially severe in 256-bit mode.
0 10 20 30 40 50
Number of Photons
0
1
2
3
4
5
PF
LO
PS 2.78 PFLOPS
3.67 PFLOPS
Sustained Performance
Peak Performance
(a)
0 10 20 30 40 50
Number of Photons
0.0
0.5
1.0
1.5
2.0
2.5
PF
LO
PS 1.27 PFLOPS
1.85 PFLOPS
Sustained Performance
Peak Performance
(b)
Fig. 8: Peak and Sustained Performance of (a) 128-bit preci-
sion; (b) 256-bit precision.
7.1.3 Scalability
For strong scalability, we choose the data scale N = 36
and use 4,096 to 131,072 processes. Figure 9 shows the
strong scalability results. As the figure shows, both precision
modes achieve an almost linear strong scalability.
4096 8192 16384 32768 65536 131072
Number of processes
1
2
4
8
16
32
Sp
ee
du
p
99.96%
ideal
128-bits
(a)
4096 8192 16384 32768 65536 131072
Number of processes
1
2
4
8
16
32
Sp
ee
du
p
99.90%
ideal
256-bits
(b)
Fig. 9: Strong scalability of (a) 128 bits; (b) 256 bits.
TABLE 3: Load balancing results
Level Average Maxinum Mininum
Thread level 55.6G 55.6G 55.6G
Process level 55.6G 55.6G 55.6G
7.1.4 Load Balance
This part reports the load balance performance with regard
to the partition strategy described in Subsection 4.1. Table 3
shows the result in 128-bit precision mode when N = 40.
Regardless of the average, maximum, and minimum values,
they are all 55.6G clocks. Thus, as long as the data scale is
large enough, our partition strategy can achieve an almost
completely balanced load.
7.1.5 Kernel Performance
Figure 10 shows the percentage of time cost of each kernel
when N = 50. As shown in Algorithm 1, GENAZ cor-
responds to Line 6, which is used to generate the matrix
AZ based on the current mask. INV, SYNC, REGCOM,
and ELIMI all correspond to Line 7, i.e., the function
GENAZ
INV SYNC REGCOM
ELIMI
SQRTINV
0
20
40
60
80
Pe
rc
en
ta
ge
 o
f T
im
e
128 bits
256 bits
Fig. 10: Percentage of time cost of each kernel
10
TABLE 4: FLOPS compared with other architectures
Hardware Library Kernel Prec. Peak. Result
SW26010 - Tor(A) 128 3.06T 93.5G
SW26010 - Tor(A) 256 3.06T 46.5G
E5-2680 v3 DD ADD 106 0.96T 15.9G
E5-2680 v3 GMP ADD 128 0.96T 0.9G
KP 920-4826 DD ADD 106 1.00T 0.89G
KP 920-4826 GMP ADD 128 1.00T 4.19G
Tesla C2050 CAMPARY GEMM 106 0.51T 14.8G
Tesla C2050 CAMPARY GEMM 212 0.51T 0.98G
get determinant. INV is related to the reciprocal calculation,
while REGCOM is related to the register communication.
SYNC represents the time cost of synchronization with
others’ CPEs. This mainly occurs when several CPEs are
working together to compute an identical matrix. Only one
CPE needs to calculate the reciprocal of the pivot, and the
other CPEs simply wait. From Figure 10 we can see the time
cost of SYNC in 256 bits is much higher than that for 128 bits
since the matrix is scattered to more CPEs in 256 bits. This is
the main factor that affects performance when N becomes
larger.
ELIMI is related to the inner loop of the linear solver,
which is composed of O(2NN3) times of arithmetic and
thus is the bottleneck that takes up 70% ∼ 90% of the time.
Last, SQRTINV corresponds to Line 8, which computes
the reciprocal square root.
7.2 Comparisons with Other Architectures
We use the FLOPS metric to compare the performance with
other architectures. For the x86 architecture (Xeon E5-2680
v3) and the ARM architecture (Kunpeng 920-4826 ARM
v8), we directly run a series of additions without data de-
pendence (called kernel ”ADD”) based on several efficient
libraries such as DD and GMP. For the GPU platform (Tesla
C2050), we choose the CAMPARY library as our counter-
part, which is one of the best multiple precision libraries
in GPU. We directly use data supported by work [57] as
the GPU results. The work [58] proposes a new multiple
precision algorithm that is faster than CAMPARY when
requiring greater than 106 bits of precision. However, in
this work, all libraries are tested based on a single-precision
version on the NVIDIA GTX architecture, which is much
slower than the double-precision version.
Although the kernels used by each architecture are dif-
ferent, we guarantee that the Torontonian function used in
our work is the most difficult one.
Table 4 lists the FLOPS results among all architectures.
Only the GPU architecture is comparable to our algorithm
on the Sunway architecture. Since the Tesla C2050 is a rela-
tively old version of the GPU, we also choose the Tesla V100
GPU (7.066 TFLOPS peak performance). For this purpose,
we convert the performance results based on the ratio of
the peak performance of two GPUs. The results of 106 bits
and 212 bits are 204 GFLOPS and 13.5 GFLOPS, respectively.
Although the 106-bit mode is much better now, the 212-bit
mode, which must be applied when the data scale is very
large, is over three times slower than our algorithm.
Consequently, our algorithm based on the Sunway archi-
tecture has an advantage compared with other architectures.
8 DISCUSSION
In addition to introducing a state-of-the-art benchmark
based on GBS with threshold detection, our work provides
some new insights from the computing perspective. Our
partition strategy is applicable to other platforms. Our
heuristic search for instruction scheduling is suitable for
other applications with a hand-write assembly kernel. With
minor modifications, our idea can also be applied to other
platforms, especially for those platforms with a write-back
stage structural hazard.
The multiple-precision fixed-point design is one of our
major contributions. Based on a architecture-specific large
integer instruction set, a set of fast operators is achieved to
support the Torontonian function, which includes addition,
subtraction, multiplication, division, and reciprocal square
root. Our design can be easily applied to other scientific
applications requiring high accuracy. In addition to the Sun-
way architecture, any current or future machine with such
a specific instruction set can benefit. In our future work,
we may increase and improve the operators to form an
efficient and cross-platform fixed-point multiple-precision
library based on the hardware instructions.
Last, our program has good portability by replacing
our customized precision design with existing multiple-
precision libraries under other platforms.
9 CONCLUSION
GBS with threshold detection is one of the most feasible
ways to demonstrate quantum supremacy. Based on the
Sunway TaihuLight supercomputer, which contains suitable
architecture for this problem, we established a state-of-
the-art classical benchmark waiting for its future quantum
computing counterpart.
To simultaneously achieve almost optimal performance
and sufficient accuracy, a series of methods including a
partition strategy with excellent load balancing, optimal
instruction scheduling based on a heuristic search, multiple-
precision fixed-point design based on an architecture-
specific instruction set, and adaptive precision mode selec-
tion according to upper- and lower-bound estimates was
proposed and applied.
As a result, sustained performance of 2.78 PFLOPS for
128-bit precision and 1.27 PFLOPS for 256-bit precision was
achieved with a proper N . The largest run enabled us to
obtain one Torontonian function of a 100 × 100 submatrix
from 50-photon GBS within 20 hours with 128-bit precision
and 2 days in 256-bit precision.
REFERENCES
[1] Aram W Harrow and Ashley Montanaro. Quantum computational
supremacy. Nature, 549(7671):203–209, 2017.
[2] Richard P Feynman. Simulating physics with computers. Int. J.
Theor. Phys, 21(6/7), 1982.
[3] Peter W Shor. Polynomial-time algorithms for prime factorization
and discrete logarithms on a quantum computer. SIAM review,
41(2):303–332, 1999.
[4] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation
and Quantum Information: 10th Anniversary Edition. Cambridge
University Press, New York, NY, USA, 10th edition, 2011.
[5] Scott Aaronson and Alex Arkhipov. The computational complex-
ity of linear optics. In Proceedings of the forty-third annual ACM
symposium on Theory of computing, pages 333–342. ACM, 2011.
11
[6] John Preskill. Quantum computing in the nisq era and beyond.
Quantum, 2:79, 2018.
[7] Alexander M Dalzell, Aram W Harrow, Dax Enshan Koh, and
Rolando L La Placa. How many qubits are needed for quantum
computational supremacy? Quantum, 4:264, 2020.
[8] Michael J Bremner, Ashley Montanaro, and Dan J Shep-
herd. Average-case complexity versus approximate simulation
of commuting quantum computations. Physical review letters,
117(8):080501, 2016.
[9] Sergio Boixo, Sergei V Isakov, Vadim N Smelyanskiy, Ryan Bab-
bush, Nan Ding, Zhang Jiang, Michael J Bremner, John M Martinis,
and Hartmut Neven. Characterizing quantum supremacy in near-
term devices. Nature Physics, 14(6):595, 2018.
[10] Scott Aaronson and Lijie Chen. Complexity-theoretic foun-
dations of quantum supremacy experiments. arXiv preprint
arXiv:1612.05903, 2016.
[11] Craig S Hamilton, Regina Kruse, Linda Sansoni, Sonja Barkhofen,
Christine Silberhorn, and Igor Jex. Gaussian boson sampling.
Physical review letters, 119(17):170501, 2017.
[12] Nicola´s Quesada, Juan Miguel Arrazola, and Nathan Killoran.
Gaussian boson sampling using threshold detectors. Physical
Review A, 98(6):062322, 2018.
[13] Austin P Lund, Michael J Bremner, and Timothy C Ralph. Quan-
tum sampling problems, bosonsampling and quantum supremacy.
npj Quantum Information, 3(1):1–8, 2017.
[14] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C
Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL
Brandao, David A Buell, et al. Quantum supremacy using a
programmable superconducting processor. Nature, 574(7779):505–
510, 2019.
[15] Edwin Pednault, John A Gunnels, Giacomo Nannicini, Lior
Horesh, Thomas Magerlein, Edgar Solomonik, Erik W Draeger,
Eric T Holland, and Robert Wisnieff. Breaking the 49-qubit
barrier in the simulation of quantum circuits. arXiv preprint
arXiv:1710.05867, 2017.
[16] Zhao-Yun Chen, Qi Zhou, Cheng Xue, Xia Yang, Guang-Can Guo,
and Guo-Ping Guo. 64-qubit quantum circuit simulation. Science
Bulletin, 63(15):964–971, 2018.
[17] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch,
Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel
Michielsen. Massively parallel quantum computer simulator,
eleven years later. Computer Physics Communications, 237:47–61,
2019.
[18] Jianxin Chen, Fang Zhang, Cupjin Huang, Michael Newman, and
Yaoyun Shi. Classical simulation of intermediate-size quantum
circuits. arXiv preprint arXiv:1805.01450, 2018.
[19] Benjamin Villalonga, Dmitry Lyakh, Sergio Boixo, Hartmut Neven,
Travis S Humble, Rupak Biswas, Eleanor G Rieffel, Alan Ho, and
Salvatore Mandra`. Establishing the quantum supremacy frontier
with a 281 pflop/s simulation. Quantum Science and Technology,
5(3):034003, 2020.
[20] Ming-Cheng Chen, Riling Li, Lin Gan, Xiaobo Zhu, Guangwen
Yang, Chao-Yang Lu, and Jian-Wei Pan. Quantum-teleportation-
inspired algorithm for sampling large random quantum circuits.
Physical Review Letters, 124(8):080502, 2020.
[21] Edwin Pednault, John A Gunnels, Giacomo Nannicini, Lior
Horesh, and Robert Wisnieff. Leveraging secondary storage
to simulate deep 54-qubit sycamore circuits. arXiv preprint
arXiv:1910.09534, 2019.
[22] Cupjin Huang, Fang Zhang, Michael Newman, Junjie Cai, Xun
Gao, Zhengxiong Tian, Junyin Wu, Haihong Xu, Huanjun Yu,
Bo Yuan, et al. Classical simulation of quantum supremacy
circuits. arXiv preprint arXiv:2005.06787, 2020.
[23] Koen De Raedt, Kristel Michielsen, Hans De Raedt, Binh Trieu,
Guido Arnold, Marcus Richter, Th Lippert, Hiroshi Watanabe, and
Nobuyasu Ito. Massively parallel quantum computer simulator.
Computer Physics Communications, 176(2):121–136, 2007.
[24] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch,
Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel
Michielsen. Massively parallel quantum computer simulator,
eleven years later. Computer Physics Communications, 237:47–61,
2019.
[25] Mikhail Smelyanskiy, Nicolas PD Sawaya, and Ala´n Aspuru-
Guzik. qhipster: The quantum high performance software testing
environment. arXiv preprint arXiv:1601.07195, 2016.
[26] Thomas Ha¨ner and Damian S Steiger. 5 petabyte simulation
of a 45-qubit quantum circuit. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and
Analysis, pages 1–10, 2017.
[27] Scott Aaronson and Daniel Gottesman. Improved simulation of
stabilizer circuits. Physical Review A, 70(5):052328, 2004.
[28] Sergey Bravyi and David Gosset. Improved classical simulation
of quantum circuits dominated by clifford gates. Physical review
letters, 116(25):250501, 2016.
[29] Ryan S Bennink, Erik M Ferragut, Travis S Humble, Jason A
Laska, James J Nutaro, Mark G Pleszkoch, and Raphael C Pooser.
Unbiased simulation of near-clifford quantum circuits. Physical
Review A, 95(6):062337, 2017.
[30] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher
Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandra.
A flexible high-performance simulator for the verification and
benchmarking of quantum circuits implemented on real hardware.
2018.
[31] The Fugaku Supercomputer from RIKEN Center for Compu-
tational Science. https://www.r-ccs.riken.jp/en/fugaku/project,
2020. Accessed: 2020-08.
[32] The Summit Supercomputer from the Oak Ridge National Labora-
tory. https://www.olcf.ornl.gov/summit/, 2019. Accessed: 2020-
08.
[33] Igor L Markov, Aneeqa Fatima, Sergei V Isakov, and Sergio Boixo.
Quantum supremacy is both closer and farther than it appears.
arXiv preprint arXiv:1807.10749, 2018.
[34] Riling Li, Bujiao Wu, Mingsheng Ying, Xiaoming Sun, and Guang-
wen Yang. Quantum supremacy circuit simulation on sunway
taihulight. IEEE Transactions on Parallel and Distributed Systems,
31(4):805–816, 2019.
[35] Scott Aaronson and Alex Arkhipov. The computational complex-
ity of linear optics. Theory of Computing, 9(4):143–252, 2013.
[36] Matthew A Broome, Alessandro Fedrizzi, Saleh Rahimi-Keshari,
Justin Dove, Scott Aaronson, Timothy C Ralph, and Andrew G
White. Photonic boson sampling in a tunable circuit. Science,
339(6121):794–798, 2013.
[37] Max Tillmann, Borivoje Dakic´, Rene´ Heilmann, Stefan Nolte,
Alexander Szameit, and Philip Walther. Experimental boson
sampling. Nature photonics, 7(7):540–544, 2013.
[38] Justin B Spring, Benjamin J Metcalf, Peter C Humphreys, W Steven
Kolthammer, Xian-Min Jin, Marco Barbieri, Animesh Datta,
Nicholas Thomas-Peter, Nathan K Langford, Dmytro Kundys,
et al. Boson sampling on a photonic chip. Science, 339(6121):798–
801, 2013.
[39] Andrea Crespi, Roberto Osellame, Roberta Ramponi, Daniel J
Brod, Ernesto F Galvao, Nicolo Spagnolo, Chiara Vitelli, Enrico
Maiorino, Paolo Mataloni, and Fabio Sciarrino. Integrated multi-
mode interferometers with arbitrary designs for photonic boson
sampling. Nature photonics, 7(7):545–549, 2013.
[40] Alex Neville, Chris Sparrow, Raphae¨l Clifford, Eric Johnston,
Patrick M Birchall, Ashley Montanaro, and Anthony Laing. Classi-
cal boson sampling algorithms with superior performance to near-
term experiments. Nature Physics, 13(12):1153–1157, 2017.
[41] Craig S Hamilton, Regina Kruse, Linda Sansoni, Sonja Barkhofen,
Christine Silberhorn, and Igor Jex. Gaussian boson sampling.
Physical review letters, 119(17):170501, 2017.
[42] Brajesh Gupt, Juan Miguel Arrazola, Nicola´s Quesada, and
Thomas R Bromley. Classical benchmarking of gaussian boson
sampling on the titan supercomputer. Quantum Information Pro-
cessing, 19(8):1–14, 2020.
[43] Hui Wang, Yu He, Yu-Huai Li, Zu-En Su, Bo Li, He-Liang
Huang, Xing Ding, Ming-Cheng Chen, Chang Liu, Jian Qin, et al.
High-efficiency multiphoton boson sampling. Nature Photonics,
11(6):361–365, 2017.
[44] Hui Wang, Wei Li, Xiao Jiang, Y-M He, Y-H Li, Xing Ding, M-
C Chen, Jian Qin, C-Z Peng, Christian Schneider, et al. Toward
scalable boson sampling with photon loss. Physical review letters,
120(23):230502, 2018.
[45] Hui Wang, Jian Qin, Xing Ding, Ming-Cheng Chen, Si Chen, Xiang
You, Yu-Ming He, Xiao Jiang, L You, Z Wang, et al. Boson sampling
with 20 input photons and a 60-mode interferometer in a 1 0 14-
dimensional hilbert space. Physical review letters, 123(25):250503,
2019.
[46] Marco Bentivegna, Nicolo` Spagnolo, Chiara Vitelli, Fulvio Flamini,
Niko Viggianiello, Ludovico Latmiral, Paolo Mataloni, Daniel J
Brod, Ernesto F Galva˜o, Andrea Crespi, et al. Experimental
scattershot boson sampling. Science advances, 1(3):e1400255, 2015.
12
[47] Han-Sen Zhong, Yuan Li, Wei Li, Li-Chao Peng, Zu-En Su, Yi Hu,
Yu-Ming He, Xing Ding, Weijun Zhang, Hao Li, et al. 12-photon
entanglement and scalable scattershot boson sampling with op-
timal entangled-photon pairs from parametric down-conversion.
Physical review letters, 121(25):250505, 2018.
[48] Stefano Paesani, Yunhong Ding, Raffaele Santagati, Levon
Chakhmakhchyan, Caterina Vigliar, Karsten Rottwitt, Leif K Ox-
enløwe, Jianwei Wang, Mark G Thompson, and Anthony Laing.
Generation and sampling of quantum states of light in a silicon
chip. Nature Physics, 15(9):925–929, 2019.
[49] Han-Sen Zhong, Li-Chao Peng, Yuan Li, Yi Hu, Wei Li, Jian Qin,
Dian Wu, Weijun Zhang, Hao Li, Lu Zhang, et al. Experimental
gaussian boson sampling. Science Bulletin, 64(8):511–515, 2019.
[50] William Kahan. Ieee standard 754 for binary floating-point arith-
metic. Lecture Notes on the Status of IEEE, 754(94720-1776):11, 1996.
[51] Wikipedia contributors. A* search algorithm, 2020. [Accessed 10-
August-2020].
[52] Mark Smotherman, Sanjay Krishnamurthy, P. S. Aravind, and
David Hunnicutt. Efficient dag construction and heuristic cal-
culation for instruction scheduling. In Proceedings of the 24th
Annual International Symposium on Microarchitecture, MICRO 24,
page 93102, New York, NY, USA, 1991. Association for Computing
Machinery.
[53] Bujiao Wu, Bin Cheng, Fei Jia, Jialin Zhang, Man-Hong Yung, and
Xiaoming Sun. Speedup in classical simulation of gaussian boson
sampling. Science Bulletin, 2020.
[54] Nicola´s Quesada and Juan Miguel Arrazola. The classi-
cal complexity of gaussian boson sampling. arXiv preprint
arXiv:1908.08068, 2019.
[55] Andreas Bjo¨rklund, Brajesh Gupt, and Nicola´s Quesada. A faster
hafnian formula for complex matrices and its benchmarking on a
supercomputer. Journal of Experimental Algorithmics (JEA), 24(1):1–
17, 2019.
[56] The Top500 List. https://www.top500.org/, 2020. Accessed: 2020-
08.
[57] Mioara Joldes, Jean-Michel Muller, and Valentina Popescu. Imple-
mentation and performance evaluation of an extended precision
floating-point arithmetic library for high-accuracy semidefinite
programming. In 2017 IEEE 24th Symposium on Computer Arith-
metic (ARITH), pages 27–34. IEEE, 2017.
[58] Konstantin Isupov and Vladimir Knyazkov. Multiple-precision
blas library for graphics processing units. 2020.
