We report, in a sequence of notes, our work on the Alibaba Cloud Quantum Development Kit (AC-QDK). AC-QDK provides a set of tools for aiding the development of both quantum computing algorithms and quantum processors, and is powered by a large-scale classical simulator deployed on Alibaba Cloud. In this note, we report the computational experiments demonstrating the classical simulation capability of AC-QDK. We use as a benchmark the random quantum circuits designed for Google's Bristlecone QPU [1]. We simulate Bristlecone-70 circuits of depth 1 + 32 + 1 in 0.43 seconds per amplitude, using 1449 Alibaba Cloud Elastic Compute Service (ECS) instances, each with 88 Intel Xeon Platinum 8163 (Skylake) vCPU cores @ 2.5 GHz and 160 gigabytes of memory. By comparison, the previously best reported results for the same task are 104 and 135 seconds, using NASA's HPC Pleiades and Electra systems, respectively (arXiv:1811.09599). Furthermore, we report simulations of Bristlecone-70 with depth 1 + 36 + 1 and depth 1 + 40 + 1 in 5.6 and 580.7 seconds per amplitude, respectively. To the best of our knowledge, these are the first successful simulations of instances at these depths.
Building quantum computers and developing their applications are the two primary challenges for the field of quantum computing. We are at an early stage for quantum computing that is often likened to the development of classical computers in the early 20th century. However, there is a fundamental difference: for the design of quantum computers and applications, we now have at our disposal powerful classical computing capabilities that have been exponentially improving for decades. The Alibaba Cloud Quantum Development Kit (AC-QDK) aims to utilize Alibaba's massive classical computational resources for aiding the development of quantum applications and quantum computers themselves.
The computational engine of AC-QDK is at present our classical quantum circuit simulator Tai-Zhang, deployed on Alibaba Cloud. In [4], we described Tai-Zhang's algorithm and the computational experiments that deployed it on the computing facilities in Alibaba Group's Data Infrastructure and Search Technology Division. For migration to Alibaba Cloud and to adapt to the updated benchmark, we made several technical changes to Tai-Zhang, documented below. We refer the interested reader to [4] for the algorithm.
Our choice of benchmark, first proposed in the Google AI Blog [3], is motivated by the prospect of comparing our results with those of other groups and with a real quantum device, such as the Bristlecone quantum processor Google is working on [7]. Thus, simulating these circuits provides a direct comparison between quantum and classical implementations of the same task. This has motivated several other groups to simulate these circuits as well.
We recognize, however, that being able to simulate these circuits does not guarantee the ability to simulate other circuits of a similar scale. In particular, like all simulations based on tensor contraction, our approach requires computational resources that scale exponentially in the treewidth of the quantum circuit, and thus is limited fundamentally in the size it can simulate. Nevertheless, as we will report in subsequent notes, our system can be applied fruitfully despite this fundamental constraint.
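For context, the exponential dependence mentioned above is the standard tensor-network contraction bound (Markov and Shi); stated roughly, and not derived in this note, a circuit with $m$ gates whose underlying graph has treewidth $\mathrm{tw}$ can be simulated in time

$$T \;=\; m^{O(1)} \cdot \exp\!\big(O(\mathrm{tw})\big),$$

so the treewidth, rather than the raw qubit count, is the fundamental bottleneck for this approach.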
Benchmarking and the Experimental Setup
The circuits we simulate in this work are described in [3]. They are a modification of the random circuits defined in [2], and the circuit files are available for download [1]. The new circuit prescription is as follows.
1. Begin with a layer of all Hadamard gates.
2. Place t layers of CZ gates, alternating among 8 configurations.
3. Place single-qubit gates on these t layers at positions unoccupied by CZ gates, according to the following rules.
(a) Place a single-qubit gate chosen at random from the set $\{\sqrt{X}, \sqrt{Y}\}$ at a qubit if that qubit participates in a CZ gate in the previous layer.
(b) Place a T gate at a particular qubit if that qubit participates in a $\sqrt{X}$, $\sqrt{Y}$, or H gate in the previous layer.
4. End with a final layer of all Hadamard gates.
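As a reading aid, this prescription can be phrased as the short Python sketch below. The eight Bristlecone CZ configurations are abstracted behind a placeholder function cz_pairs, and the gate labels SX/SY stand for $\sqrt{X}$/$\sqrt{Y}$; the downloadable circuit files [1] remain the authoritative definition.

    # Hypothetical sketch of the prescription above; cz_pairs is a placeholder for
    # the eight CZ configurations on the Bristlecone lattice.
    import random

    def cz_pairs(layer, qubits):
        """Placeholder: the CZ pairs of configuration (layer % 8) on the lattice."""
        raise NotImplementedError

    def random_circuit(qubits, t, seed=0):
        rng = random.Random(seed)
        circuit = [[("H", q) for q in qubits]]             # 1. initial layer of Hadamards
        prev = {q: "H" for q in qubits}                    # gate seen by each qubit in the previous layer
        for layer in range(t):
            ops, current = [], {}
            for a, b in cz_pairs(layer, qubits):           # 2. one of the eight CZ configurations
                ops.append(("CZ", (a, b)))
                current[a] = current[b] = "CZ"
            for q in qubits:
                if q in current:
                    continue
                if prev.get(q) == "CZ":                    # 3(a): random sqrt(X) / sqrt(Y)
                    current[q] = rng.choice(["SX", "SY"])
                    ops.append((current[q], q))
                elif prev.get(q) in ("SX", "SY", "H"):     # 3(b): T gate
                    current[q] = "T"
                    ops.append(("T", q))
            circuit.append(ops)
            prev = current
        circuit.append([("H", q) for q in qubits])         # 4. final layer of Hadamards
        return circuit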
A series of papers address the simulation of these revised random quantum circuits [8, 9, 5, 6]. In [8], the authors estimated that the revised circuits should be about 1000× harder to simulate than those considered in [4]. The same work benchmarked Bristlecone-70 with depth 1 + 32 + 1 on NASA's HPC Pleiades and Electra systems, with reported runtimes of $2.89 \times 10^{-2}$ and $3.57 \times 10^{-2}$ hours respectively, or equivalently 104.04 and 128.52 seconds; that computation used all four node architectures available on the Pleiades system. In [5], the same circuit was argued to be simulable on supercomputers such as Tianhe-2, based on an analysis of the circuit complexity. Additionally, [5] benchmarked 72-qubit Bristlecone random circuits with depth 1 + 32 + 1, reporting a runtime of 14.1 minutes (846 seconds) to compute a single amplitude on 16,384 Sunway SW26010 nodes, with 256 cores each.
Before we present our simulation setup, we make a few clarifications. First, Bristlecone-70, the 70-qubit random quantum circuit family for this architecture, is equivalent for simulation purposes to Bristlecone-72, the 72-qubit version, since two qubits of the latter network can be easily contracted with their only neighbor. Therefore, we can compare the above reported runtimes with our results on Bristlecone-70 at depth 1 + 32 + 1, even though we use fewer CPU cores than the previously reported work. Second, we adopt a slightly different notation. In [1], the file bris_n.tar.gz contains circuits using n rows, and the files inside are named bris_n_maxcycle_id.txt; we use Bristlecone-m to refer to those circuits acting on m qubits. So the 70-qubit circuits that we benchmark in this paper, Bristlecone-70, correspond to the circuit description files in bris_11.tar.gz. In particular, we emphasize that these are the largest circuits available in [1].
We use 1449 Alibaba Cloud Elastic Compute Service (ECS) instances, each with 88 virtual CPU cores and 160 GB of memory. We first use a single node as an agent to split the large tensor network contraction task into many smaller tensor network contraction subtasks; this step is called 'preprocessing'. The agent node then uses Object Storage Service (OSS) as a data transmission hub to assign different subtasks to different nodes. When a node finishes its assigned subtask, it uploads the result to OSS, and the agent node repeatedly queries OSS until all the subtask results have been summed into the desired amplitude. The reported 'running time' is the total elapsed time measured on the agent node, excluding the preprocessing step, because preprocessing is done only once, independent of the number of amplitudes to be calculated.
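The coordination pattern can be summarized by the following hypothetical sketch; the helpers oss_put, oss_get, and oss_exists are placeholders for calls to the OSS SDK, and the subtask interface is illustrative rather than the actual AC-QDK code.

    # Hypothetical sketch of the agent/worker protocol described above.
    # oss_put / oss_get / oss_exists are placeholders for OSS SDK calls.
    import time

    def oss_put(key, value): ...        # placeholder: upload an object to OSS
    def oss_get(key): ...               # placeholder: download an object from OSS
    def oss_exists(key): ...            # placeholder: check whether an object exists

    def agent(subtasks, job_id):
        """Publish subtasks, then poll OSS until every partial result has arrived."""
        for i, subtask in enumerate(subtasks):                      # assign work
            oss_put(f"{job_id}/task/{i}", subtask)
        amplitude, pending = 0j, set(range(len(subtasks)))
        while pending:                                               # poll for results
            done = {i for i in pending if oss_exists(f"{job_id}/result/{i}")}
            for i in done:
                amplitude += complex(oss_get(f"{job_id}/result/{i}"))
            pending -= done
            time.sleep(1.0)
        return amplitude                                             # sum of partial contractions

    def worker(job_id, i, contract):
        """Fetch one subtask, contract its small tensor network, upload the result."""
        subtask = oss_get(f"{job_id}/task/{i}")
        oss_put(f"{job_id}/result/{i}", str(contract(subtask)))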
We refactored the source code of the simulator presented in [4] to obtain a better abstraction of the simulation task. A major change in the refactoring is to swap the roles of the edges and the nodes of the tensor network. In the previous version, we treated the running indices of a tensor network as nodes, while each tensor was regarded as a hyper-edge connecting several nodes. The refactored code is formulated in the opposite way: each node now holds a tensor, and each hyper-edge corresponds to a running index. This change provides a common interface for all tensor-valued objects and allows us to conveniently construct complex tensor networks. The algorithm in the refactored code is functionally similar to that in [4], and accordingly we do not observe a significant performance difference after the refactoring. For more details of the algorithm, please refer to [4]. Detailed benchmarking results are presented in Section 2.
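A minimal sketch of this representation (an assumed structure for illustration, not the AC-QDK source) could look as follows: each node stores a tensor plus the labels of the indices on its axes, and contracting two nodes sums over the labels they share.

    # Minimal illustration: a node holds a tensor and its index labels;
    # a hyper-edge is simply an index label shared by several nodes.
    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class Node:
        tensor: np.ndarray      # tensor held by this node
        indices: list           # one index (hyper-edge) label per tensor axis

    @dataclass
    class TensorNetwork:
        nodes: list = field(default_factory=list)

        def add(self, tensor, indices):
            self.nodes.append(Node(np.asarray(tensor, dtype=complex), list(indices)))

        def contract_pair(self, a, b):
            """Contract two nodes over every index label they share."""
            shared = [i for i in a.indices if i in b.indices]
            axes_a = [a.indices.index(i) for i in shared]
            axes_b = [b.indices.index(i) for i in shared]
            out = np.tensordot(a.tensor, b.tensor, axes=(axes_a, axes_b))
            out_idx = [i for i in a.indices if i not in shared] + \
                      [i for i in b.indices if i not in shared]
            return Node(out, out_idx)

    # Example: apply a Hadamard (indices "q0_out", "q0_in") to |0> on wire "q0_in".
    tn = TensorNetwork()
    tn.add([1.0, 0.0], ["q0_in"])
    tn.add(np.array([[1, 1], [1, -1]]) / np.sqrt(2), ["q0_out", "q0_in"])
    print(tn.contract_pair(tn.nodes[0], tn.nodes[1]).tensor)   # ~ [0.707, 0.707]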
Benchmarking Results
Several variables will affect the performance of our simulator:
• the number $N_a$ of amplitudes to calculate;
• the number $N_c$ of CPU cores;
• the number $N_s$ of subtasks for calculating a single amplitude.
In our algorithm, all subtasks have equal computational complexity. Thus, the naive way to balance the computational load is to assign $N_a N_s / N_c$ subtasks to each CPU core. There is no shared data access across these subtasks, so distributing subtasks equally among CPU cores or among nodes does not strongly affect the performance. Therefore, we can measure the execution time of the subtasks assigned to a single node of a cluster and predict the full execution time of the whole task on that cluster. In our experiment, we choose four nodes with 2, 4, 8 and 16 vCPU cores, all with a memory-to-vCPU ratio of 2, measure the execution time of the subtasks assigned to each node, and take the largest as the predicted execution time of the whole calculation on a cluster with 127,512 vCPU cores and $2 \times 127{,}512$ gigabytes of memory.

From these measurements, we observe that the more amplitudes we calculate, the lower the execution time per amplitude. This is due to a more balanced computational load for larger $\frac{N_a N_s \times \#\mathrm{vCPU}}{127{,}512}$. When $N_a$ is large enough, the run time per amplitude remains relatively stable. We also observe that, even when using the same total number of vCPU cores, the execution time is slightly reduced when we use a larger number of smaller ECS instances.

Let $|\psi\rangle$ denote the output state of a random quantum circuit. It is known that the distribution of the measurement probabilities $p(x_j) = |\langle x_j|\psi\rangle|^2$ approaches the exponential form $N e^{-Np}$, known as the Porter-Thomas distribution. Based on the 200,000 amplitudes we calculated for Bristlecone-70 circuits with depth 1 + 28 + 1, we plot the distribution of $Np$, which closely matches the Porter-Thomas form.
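To make the extrapolation concrete, the sketch below computes the share of subtasks that an equal split assigns to a test node and the implied full-cluster time; the timing value used is a made-up placeholder, not one of our measurements.

    # Illustrative only: extrapolating whole-cluster time from one test node,
    # using the share N_a * N_s * (#vCPU of the node) / 127,512 discussed above.
    TOTAL_VCPUS = 1449 * 88          # = 127,512 vCPU cores in the full cluster

    def predicted_time(n_amplitudes, n_subtasks, node_vcpus, seconds_per_subtask):
        # Subtasks that an equal split assigns to this test node ...
        node_share = n_amplitudes * n_subtasks * node_vcpus / TOTAL_VCPUS
        # ... run on the node's own cores; this equals the predicted cluster wall time.
        return node_share * seconds_per_subtask / node_vcpus

    # Placeholder numbers: 100 amplitudes, 2**12 subtasks each, 0.5 s per subtask
    # measured on a 16-vCPU node (all values hypothetical).
    print(predicted_time(100, 2**12, 16, 0.5))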
