With the current rate of progress in quantum computing technologies, 50-qubit systems will soon become a reality. To assess, refine and advance the design and control of these devices, one needs a means to test and evaluate their fidelity. This in turn requires the capability of computing ideal quantum state amplitudes for devices of such sizes and larger. In this study, we present a new approach for this task that significantly extends the boundaries of what can be classically computed. We demonstrate our method by presenting results obtained from a calculation of the complete set of output amplitudes of a universal random circuit with depth 27 in a 2D lattice of 7 × 7 qubits. We further present results obtained by calculating an arbitrarily selected slice of 2 37 amplitudes of a universal random circuit with depth 23 in a 2D lattice of 8×7 qubits. Such calculations were previously thought to be impossible due to impracticable memory requirements. Using the methods presented in this paper, the above simulations required 4.5 and 3.0 TB of memory, respectively, to store calculations, which is well within the limits of existing classical computers.
Introduction
In the last few years, significant technological progress has enabled quantum computing to evolve from a visionary idea [19] to reality [11, 12] . With existing devices now providing 10-20 qubits with controllable couplings, the topic of "quantum supremacy" -that is, the ability of a quantum device to perform a computational task beyond what any classical computer can currently perform -is receiving much attention [25, 35, 10, 1, 18] . Work in this area is raising many issues, two of which are particularly relevant to advancing quantum technology: (1) the ability to quantify circuit fidelity (e.g., [8, 36, 20, 10, 30] ), (2) the ability to assess correctness, performance and scaling behavior of quantum algorithms [21] . Fundamental to both is the ability to calculate quantum state amplitudes for measured outcomes.
It is important to distinguish the problem of computing ideal quantum state amplitudes for measured outcomes from the problem of calculating all amplitudes in a quantum state, and from the problem of randomly sampling from the implied probability distribution to simulate quantum computation. These problems become intractable very quickly, simply because the number of amplitudes grows exponentially with the number of qubits. By way of contrast, we present empirical evidence that suggests that the problem of calculating quantum state amplitudes for measured outcomes may not grow to intractable levels quite as quickly as one might expect, neither in terms of computational complexity nor memory requirements, a fortunate circumstance from an engineering standpoint.
This empirical evidence derives from simulations we have performed of two quantum circuits: one with depth 27 for a 7 × 7 grid of qubits, the other with depth 23 for an 8 × 7 grid of qubits. The depth of a circuit is the number of layers that the circuit can be partitioned into in such a way that the gates acting on the qubits at any given layer do no overlap. Both circuits are examples of universal random circuits constructed according to the rules described in [10] , where the first layer of Hadamard gates are not counted in the depth. The primary data structures employed in the 49-qubit simulation required just over 4.5 Terabytes of main memory to store results, while for the 56-qubit simulation they required just over 3.0 Terabytes. In contrast, existing state-ofthe-art techniques [24] would have required 8 Petabytes and 1 Exabyte, respectively. The 49-qubit circuit contained 761 gates and requires an average of 3.83 floating-point operations per gate per amplitude to compute results. The 56-qubit circuit contained 745 gates and requires an average of 3.90 floating-point operations per gate per amplitude. In contrast, a complex multiplication operation requires six floating-point operations: four multiplications and two additions.
Our approach to calculating quantum amplitudes abandons the conventional Dirac matrixvector representation of gates and quantum states [16, 24] . Instead, we employ a tensor representation [27] . This mathematical shift in representation enables the full power of the commutative, associative, and distributive laws of the field of complex numbers to be exploited to reorganize the equations that define quantum circuits without being encumbered by the more restrictive algebras that govern matrices and inner products. In particular, these laws enable:
• Quantum circuits to be arbitrarily partitioned into subcircuits.
• Subcircuits to be simulated in arbitrary orders, even independently.
• Simulation results of subcircuits to be combined in arbitrary orders.
Our approach combines this flexibility with tensor slicing methods to enable final quantum states of circuits to be calculated in slices instead of having to materialize entire quantum states in memory all at once.
The computations were performed on the Vulcan supercomputer at Lawrence Livermore National Laboratories, an IBM Blue Gene/Q system with 24576 nodes, each one equipped with 16 cores and 16GB of memory (in total, we required less than 5 Terabytes of memory for output). The tensor calculations for each circuit were implemented using Cyclops Tensor Framework [37] , a distributed-memory C++ library for algebraic transformations of multidimensional arrays.
We used the simulation on Vulcan to investigate the isotropic properties of the resulting distributions of quantum-state probabilities. Theory predicts that these probabilities should exhibit exponential (a.k.a. Porter-Thomas [34] ) distributions across the aggregate quantum state [10] . However, because the pattern in which these probabilities occur within the final quantum state is also chaotic, the same exponential distribution of probabilities should likewise be observed within small local slices of the state as well. Our simulation of the 49-qubit circuit confirms this phenomenon: the same exponential distribution of probabilities was observed in the tensor slices that were calculated as was observed in the overall quantum state, with only minor variations observed among tensor slices. We then simulated an arbitrarily selected slice of amplitudes for the 56-qubit circuit to further confirm this phenomenon. This slice of the 56-qubit quantum state likewise exhibited the same expected exponential distribution of probabilities.
Our simulations demonstrate that the computation of quantum amplitudes for measured outcomes has now entered the realm of the possible for quantum devices of 50 qubits or more. The extent to which the simulation methodology proposed in this paper can be advanced is yet to be fully explored; investigation of this aspect is left for future research.
This paper is organized as follows. Section 2 describes the mathematical foundations of our simulation methodology. Section 3 presents a summary of our numerical results. Section 4 concludes the paper and discusses future research directions. The appendix contains details on the analytical form of the outcome distribution, and more details on the decomposition of the two large circuits that we simulated.
Simulation methodology
We now describe our approach for simulating quantum circuits, thereby computing the probabilities needed to characterize circuit fidelity. As previously noted, our approach abandons the usual Dirac matrix-vector representation of circuit equations in favor of a tensor representation. To keep the exposition accessible, self-contained, and still relatable to Dirac notation, we use a tensor notation derived from linear algebra.
We define a circuit to be a sequence of d layers G 1 , . . . , G d , where each layer consists of an undirected graph on the same vertex set V := {1, . . . , n}; formally, G t = (V t , E t ), V t = {1, . . . , n}, E t ⊆ V t × V t . Every edge represents a two-qubit gate, and no vertex can appear in two edges on the same layer. At each layer, a single-qubit gate is applied to every vertex that does not appear in a two-qubit gate. This is done without loss of generality, as we can always think of the identity gate being applied. Gates involving more than two qubits are not considered, but in principle the representation could be extended to accommodate such gates as well. Alternatively, circuits containing such gates could be transformed into equivalent circuits containing only one-and two-qubit gates.
Let V 1 t := V t \ {j : (i, j) ∈ E t or (j, k) ∈ E t } be the qubits in layer t that are not involved in two-qubit gates. For j ∈ V 1 t , let U t,j be the gate applied to qubit j in layer t, where U t,j ∈ C 2×2 is a unitary matrix. For (i, j) ∈ E t , let U t,(i,j) be the two-qubit gate applied to qubits i and j in layer t, where U t,(i,j) ∈ C 4×4 is likewise a unitary matrix.
We use subscripts to denote elements of vectors and matrices, and in particular we use binary strings as indices as is customary in quantum mechanics. For example, given binary digits i, j, k, h ∈ {0, 1} and a vector ψ ∈ (C 2 ) ⊗3 ≡ C 8 , ψ ijk denotes the element in the position corresponding to the binary string ijk, and given U ∈ C 4×4 , U ij,kh denotes the matrix element whose row is indexed by the binary string ij and whose column is indexed by the binary string kh.
Let n be the number of qubits and N = 2 n . The quantum state is therefore a unitary vector in H ≡ C N . The initial state is denoted δ i 1 ...in , with:
Using the above notation, and given the state ψ t−1 after the (t − 1)-th layer of gates, we can express the state after the t-th layer as follows:
Together with the initial conditions on the state, we obtain the following recursion that fully describes the amplitudes of the final state:
The simulation of a quantum circuit is equivalent to the solution of this recursion. The recursion can be solved by dynamic programming [5] , and the computation can be accelerated in many possible ways; e.g., memoization to store intermediate results. Note that in principle (1) can be written as a single matrix-vector multiplication
This has the advantage of computing the entire state, but the matrixŪ t has size 4 n , which is often prohibitive even for moderate n: even with n = 25, a dense matrix would require 16 Petabytes of storage for complex numbers in double precision. The solution of (1)- (2) can be carried out efficiently in specific cases. For example, the computation becomes trivial when V 1 t = V t at every layer; i.e., there are no two-qubit gates in the circuit. Indeed, in this case the quantum state ψ t i 1 ...in can be expressed as a product of simpler terms ψ t i 1 , . . . , ψ t in . This is trivially true at t = 0 because ψ 0
, and the property propagates forward by induction following (1) because:
The equation above shows that the state at layer 1 can be expressed as a product of simpler terms that depend on a single qubit, and it is easy to see that the recursion continues in this product form until the last layer. This is a well-known fact: because the initial state is a so-called product state (tensor product of individual qubits) and because there are no entangling two-qubit gates, the quantum state remains a product state and can be computed qubit by qubit. It follows that computing a single amplitude of the final state in such cases requires linear time and space. Other cases in which (1)- (2) can be easily solved include stabilizer circuits [2] , match gates [39] , and circuits with polynomially bounded entanglement under a certain measure [40] . In general, however, the computation as stated may require exponential time and/or memory because of the exponential number of terms in the summation.
Overview of known simulation techniques
There are two widely used strategies to approach the recursion (1)- (2) . The first strategy consists of following the recursion layer by layer, fully computing the state ψ t for t = 0, . . . , d at each layer, and expanding the calculation (1) using tensor contractions. This approach, sometimes called the "Schrödinger algorithm" [1] , has the drawback of storing the entire state vector in memory at each layer, requiring O(2 n ) memory and O(dn2 n ) operations, which is prohibitive for ≈ 50 qubits.
Another strategy involves following all the "paths" from a final state to the initial state in the dynamic programming recursion. This approach requires O(2 nd ) time but linear space. See also the description of the "Feynman algorithm" given in [1] and [6, Thm. 8.6 ]. Clearly, other strategies are possible. When solving (1)- (2), we are only bound by ingenuity and the rules of linear algebra. [1] proposes a scheme for circuits that can be embedded into a given 2D grid; i.e., when the graphs G t are planar and there is an embedding of the vertex set onto a two-dimensional grid such that edges (two-qubit gates) only appear between adjacent vertices (qubits). If such an embedding is given, the 2D grid can be partitioned into two sets V 1 , V 2 such that max{|V 1 |, |V 2 |} ≤ 
] separately, and combine the results looping over the 16 |R| elements resulting from all possible combinations of the matrix elements of the gates corresponding to edges in R. In terms of (1) and assuming without loss of generality that V 1 = {1, . . . , m}, V 2 = {m + 1, . . . , n}, this corresponds to computing: A simulation methodology based on tensor network contractions is presented in [31] . This methodology relies on the construction of a tensor network in which nodes represent gates, and edges represent "common indices" between gates. To compute the outcome probability of a given state, one needs to perform a contraction of the entire network to a single node, where a contraction corresponds to a summation. It is shown in [31] that the contraction can be performed efficiently if the circuit graph has small treewidth. More specifically, the algorithm of [31] runs in exponential time in the treewidth of said graph. We remark that treewidth often occurs as a measure of complexity for problems that may be difficult in general: for example, the NP-hard problems of determining a minimum-sized vertex coloring, Hamiltonian cycle, or vertex cover of a graph are easy to solve by dynamic programming if the graph has bounded treewidth [7] . For the class of circuits considered in this paper, Fig. 1 suggests that the treewidth grows with both the number of qubits and the circuit depth. The figure depicts a lower bound to the treewidth of universal random circuits, computed using the MDD+ algorithm [9] (exact computation of the treewidth is an NP-hard problem). Although lower bounds may not accurately characterize the overall trend, the growth of these bounds indicates that this class of circuits may be nontrivial to simulate with the approach of [31] . We remark that tensor network representations have been a crucial component of other works as well, e.g., [15] ; in this paper we do not directly employ a tensor network representation.
From a more practical point of view, one can employ a strategy that combines the "Schrödinger algorithm" with circuit partitioning. Initially, V is partitioned into the n singletons V i := {i}, i = 1, . . . , n and each subcircuit G[V i ] is simulated independently if possible, just as in (4) . Whenever an entangling gate between two partitions is encountered, the two partitions are merged into a single subcircuit, and the simulation then proceeds in the same fashion. In other words, we initially consider each qubit as a separate subcircuit and we merge subcircuits as necessary whenever entanglement between subcircuits occurs. This approach is exemplified in Figure 2 . Of course, if the Figure 2 : A common simulation strategy: each qubit is initially a separate subcircuit whose simulation is carried out independently, and subcircuits are merged as entangling gates are encountered.
circuit being simulated is nontrivial (i.e., does not trivially yield a product state), this simulation strategy eventually has to compute the full state vector of size 2 n .
Building blocks for our simulation technique
We use a combination of several techniques to speedup and reduce the memory footprint of the computation (1)- (2). Many of the calculations can be carried out in multiple mathematically equivalent ways that yield different memory and time requirements; hence, for each circuit we consider several possibilities and choose one that fits the capacity of the available computing cluster. We will elaborate on this aspect after describing the building blocks of our methodology, which are the following:
1. Tensor slicing;
2. Circuit transformation;
3. Deferral of entangling gates.
Tensor slicing
The simplest technique that we consider is that of simplifying the recursion (1) in the presence of diagonal gates; i.e., quantum gates described by diagonal matrices. Examples of such gates include Z, S, T and CZ gates. Suppose that there exists a qubit q such that, in all layers from τ to d, qubit q is only involved in diagonal gates. Then we can write:
Writing the expression in this form has two advantages. First, by eliminating one index (i.e., j q ) from the summation in (5), we are effectively halving the memory requirements for simulation via matrix-vector multiplication: instead of a single matrix (3) of size 4 n we only require two matrices of size 4 n−1 . Second, if we are interested in the full state ψ t , we can loop over the two possible values for i q = 0, 1 in a very efficient embarrassingly-parallel manner, thereby slicing the computation as well. Notice that these two advantages hold for each index i q that can be sliced: if there are, in total, m indices that can be sliced because all the corresponding gates from layer τ to d are diagonal, then we save 2 m storage space for the matrix (3), and we can perform the computation in parallel over 2 m processors. 
Circuit transformation
It is well-known that there are several equivalent representations for a given quantum circuit. In principle, since each circuit can be simulated in a different way and therefore yield different memory and time requirements, one would have to consider all possible equivalent representations. However, in practice, only transformations that cater to the strength of a particular simulation methodology need be considered. Because we want to apply tensor slicing, we consider transformations that diagonalize gates. For example, for every CX in the circuit, we consider whether the transformation indicated in Fig. 3 is beneficial. The transformation involves a trade-off between the number of non-diagonal two-qubit gates with the total number of gates. The total number of gates increases, but instead of two-qubits with non-diagonal operators, we now have one qubit with diagonal gates and one with non-diagonal gates. This type of transformation can be beneficial if it allows tensor slicing to be applied earlier, so the trade-off has to be considered on a case-by-case basis.t
Deferral of entangling gates
The third and final building block of our simulation technique is the selection of some entangling gates for deferred computation in order to maintain a partitioning of a circuit into subcircuits that decrease memory usage. The idea is somewhat similar to [1] , but we are flexible as to which two-qubit gates are deferred. We describe the idea for a single two-qubit gate (m, m + 1) at layer τ . Assume (with a possible rearrangement of qubit indices) that there exists a partition V m := {1, . . . , m}, V m+1 := {m+1, . . . , n} of V such that up to layer τ , there are no two-qubit gates between V m and V m+1 . A commonly employed strategy in this case is to simulate the subcircuits
] induced by the two sets of qubits separately, and then perform a merge (with the corresponding multiplicative increase in memory requirements) only when necessary, in this case at layer τ . However, another possibility, and the one we advocate here, is to instead defer merging the quantum state information associated with each subcircuit by introducing additional indices to mathematically account for the entanglement of the two subcircuits. We can write:
This expression shows that we can still carry out computations separately for
we are willing to pay the bookkeeping price of extra indices m , m+1 . In particular, we have two choices, shown by the following equation that is easily seen to be equivalent to the last row of the expression above:
According to (6), we can assign the entangling gate (6) Gate Table 1 : Memory requirements and number of floating point of operations for some quantum gates, when applied to a n-qubit quantum state (2 n -dimensional complex vector).
we can impose the constraints i m = m and i m+1 = m+1 , so the extra bookkeeping work is cut down in half. If further entangling two-qubit gates are present at subsequent layers between the partition V m , V m+1 , we have two possible choices: we can resolve the entanglement by merging the partitions, or we can defer the additional entangling gates as well, increasing the amount of bookkeeping. The entanglement has to be resolved before outputting the full state ψ d .
An example of possible computation schemes
We now provide two small examples of the ideas discussed in this section. In our first example we work with the circuit given in Fig. 4 . The memory and floating point requirements for the operations that will be used in this example are given in Table 1 . Notice that these requirements assume that we want to compute the full state vector, and each complex number requires 16 bytes of memory (two double-precision floating point numbers). The simple partitioning scheme given in Fig. 4 carries out the first gate operations independently on each qubit, then merges the two subcircuits. The total cost is 48 operations and 128 bytes of memory. A different partitioning scheme is given in Fig. 5 : the computation of the entangling gate is deferred, and the entire CX gate is assigned to the bottom subcircuit. In this case, the computation requires 62 operations and 256 bytes of memory. The reason the memory occupation and number of operations is worse than the simple partitioning scheme of Fig. 4 is that we are only considering two qubits. In this case, the extra bookkeeping due to deferring the entangling gate is as much as the cost of merging the top and bottom subcircuits; therefore, there is no advantage. However, for larger circuits the benefit of keeping the circuit partitioned into subcircuits can be substantial.
A better approach to carry out the computation in Fig. 4 is described in Fig. 6 . In this computation scheme, the CX gate is transformed into a CZ gate and two H gates. Because of the transformation, the top qubit can be sliced in the subcircuit on the right, so we can carry out the computations for the top qubit equal to |0 and |1 independently, without expanding the tensor product of the two qubits. By doing so, the total cost is of 42 operations, and 96 bytes of memory.
A more elaborate example is a universal random circuit for a 4 × 1 grid of qubits with depth 15. It turns out that the generating rules produce many empty layers for this example and the circuit can be represented using 9 layers. Such a circuit is depicted in Fig. 7 , together with a possible partitioning scheme. Using the labeling indicated in the diagram, and assigning the first CZ gate to the bottom |ξ subcircuit, the full equations governing the state are the following (for simplicity, we use gate names to indicate the corresponding matrix, and omit the superscript that denotes the layer in (1)):
Note that, although the first CZ gate in the circuit is assigned to the bottom |ξ subcircuit, we have to keep track of the extra 1 index associated with that CZ gate in the top |φ subcircuit once the T gate is introduced. We remark that the indices i 0 , i 2 , i 3 in the final state |ψ can be sliced, because the corresponding indices appear only in diagonal gates. Therefore, we never have to explicitly materialize the full state |ψ in memory, and can work on slices. In this example, deferring the entangling gate CZ in the second layer and slicing the last subcircuit allows us to compute the final state without ever allocating memory to hold a full 2 4 -dimensional state vector. Figure 7 : Example of a random 4×1-qubit, depth 15 circuit generated according to [10] . The circuit can be contracted to 9 layers. The partitioning scheme is indicated in the figure. The entangling gate in the second layer is deferred and assigned to the bottom subcircuit.
Instead, the final state can be computed in 8 slices using a state vector of size 2 for the computation of a slice.
Optimizing the circuit simulation strategy
The general scheme that we follow in the computation of (1)- (2) is to keep the circuits partitioned into small subcircuits if possible, and decide on the deferral of entangling gates on a case-bycase basis. We apply tensor slicing whenever possible. For every CX gate, we have the choice of transforming it as indicated in Section 2.2.2; and for every entangling gate between qubits in different subcircuits, we have the choice of deferring the entanglement as indicated in Section 2.2.3. In this latter case, we have the choice of assigning the entangling gate to either of the two subcircuits. Alternatively, the subcircuits can be merged instead of deferring the entanglement. These options generate a space of possible computation schemes for any given circuit. Given a circuit and a computation scheme, we can compute the memory requirements and floating-point operations required simply by analyzing the gates, as is done in Section 2.2.4. In principle, we would like to run a bi-objective optimization over the space of possible computation schemes to generate the Pareto frontier that defines the optimal trade-off between memory usage and floating-point operations. However, even limiting ourselves to the schemes generated using the ideas of Section 2.2, the number of possible choices is exponentially large, so a brute force exploration strategy of all combinations is not viable for large circuits.
Nevertheless, the fact that this Pareto frontier has now been identified means that in order to rule out the possibility of simulating a given circuit within a given amount of memory and/or time, one must prove that there does not exist a computation scheme which satisfies the requirements of the stated claim. This could, in general, be a difficult task.
To obtain the simulation results presented in this paper, we employed both a heuristic search strategy that implemented some hand-crafted circuit partitioning criteria based on our intuitions, as well as an automated search that explores the choices described above in a time-bounded and largely depth-first manner. The hand-crafted strategies adopted the following principles:
• Diagonalize CX gates if all subsequent gates applied to the control qubit are diagonalized.
• Try to keep each circuit partitioned into two or three subcircuits.
• Defer entanglement gates if merging would create a large circuit; i.e., exceed a given memory threshold.
• Assign deferred entanglement gates to the smallest subcircuit, or to a subcircuit that can be sliced. Figure 13 in Appendix B illustrates how a 49-qubit, depth 27 random circuit and a 56-qubit, depth 23 random circuit were partitioned into subcircuits using the hand-crafted approach described above. These circuits and subcircuits were used in our simulation studies.
In addition, we used a time-bounded automated search method to explore the trade-off between memory usage and floating-point operations for various circuits. This search had the freedom to explore arbitrary circuit partitionings unconstrained by our heuristic intuitions. In each execution, the search method was given a circuit and a limit on the available memory that could be used, and it then attempted to find circuit partitionings that minimized the number of floating-point operations needed to perform a simulation without exceeding the specified memory limit. Fig. 8 illustrates the trade-off between memory and computation identified by the analysis for the 49-qubit circuit shown in Figure 13 . Note that because the search strategy employed was time bounded and it did not perform an exhaustive search, it is likely that this curve is suboptimal and circuit partitioning schemes that strictly dominate ours could be found.
We remark that the combination of the three building blocks of tensor slicing, circuit transformation, and deferral of entangling gates in a dynamic way yields a simulation approach with many differences with respect to those described in the existing literature. We list here the main differences as compared to [1, 31] , which are the most similar to our paper:
• We use diagonalization of gates and tensor slicing in order to efficiently parallelize operations and reduce the memory footprint.
• The partitioning of the graph into subcircuits is fixed in [1] , while it is dynamically determined at each layer using our approach. Furthermore, in [1] the gates that cross subcircuits are decomposed into components that are assigned to each subcircuit (the number of components grows exponentially in the number of gates that cross subcircuits), whereas in our approach such gates are assigned in their entirety to either subcircuit.
• In [31] , the tensor contractions are performed pairwise and in a fixed order (as determined by the tree decomposition of the circuit graph), while in principle our scheme allows arbitrary contractions and arbitrary order (but certain orders are preferred to others because they are more memory efficient). In addition, the deferral operations performed using our approach are not tensor contractions, so they have no counterpart in [31] .
Many aspects of the decomposition scheme that we propose have still to be thoroughly investigated, and their ramifications to be fully understood. We discuss some of the open questions in Section 4.
Numerical results
In this section, we discuss the technical details of how we carried out the simulation of the 49-and 56-qubit circuits on existing hardware, and we summarize the results obtained from analyzing the amplitudes computed by the simulation. We first describe the platform for the computation.
Hardware platform: Blue Gene/Q
Our experiments were executed on an IBM Blue Gene/Q supercomputer [38] , which is the third generation of high-performance computers in the Blue Gene series. A Blue Gene/Q node consists of 18 A2 PowerPC 64 bit cores, 16 accessible from application space. The floating-point performance of each core is augmented through the integration of a four-way SIMD floating-point engine, dubbed QPX, capable of delivering up to eight double-precision floating-point operations per processor cycle. Designed to be energy-efficient, each A2 core runs at 1.6 GHz, and is four-way (symmetrically) multi-threaded. Each thread is capable of executing instructions, in-order, in two pipelines. One pipeline executes all integer control and memory access instructions while the other pipeline executes all floating point arithmetic instructions. When multiple threads are used, instructions can execute on both pipelines simultaneously. Each core has a 16 KB private level 1 (L1) cache and a 2 KB prefetching buffer (L1P). All the cores on the same processor share a 32 MB level 2 (L2) cache and 16GB of main memory. The compute nodes are connected in a 5D torus network with a total network bandwidth of 44 GB/s per node.
Software platform: Cyclops Tensor Framework
We implemented the quantum circuit simulation approach described in Section 2 using Cyclops Tensor Framework (CTF) [37] , a distributed-memory C++ library for tensor summations and contractions. The library provides a domain-specific language for tensor operations using an Einsteinsummation syntax. CTF tensors are distributed over all processors with tensor summations and contractions performed in a data-parallel manner using MPI [22] , OpenMP [14] , and BLAS [29] . The library is capable of mapping data directly to the 5D torus network topology of the Blue Gene/Q architecture.
Code generation and tuning
We describe quantum circuits using a syntax similar to the QASM specification [13] used in the IBM Quantum Experience [26] , since it is widely adopted. We developed a code transformation and generation system in order to translate stylized QASM code, with ordered gates and subcircuits, into CTF code that could run on the Blue Gene/Q system at scale.
The massively parallel tensor contraction calculations enabled by CTF have been driven by applications in computational chemistry and physics [4, 33] that involve contractions of tensors with 4 to 8 dimensions. The quantum circuit models employed in our calculation use tensors of much higher dimension. In performing these contractions using CTF, the main new challenge is the need for higher-dimensional virtual topologies; i.e., the decomposition of tensors among more dimensions than CTF typically performs. CTF operates by mapping tensor dimensions onto dimensions of a processor grid, often redistributing tensors to new mappings at contraction time.
As all the dimensions of our tensors are small, mappings to high-dimensional processor grids are necessary. To achieve these, we increased the space of virtual topologies CTF considers, made improvements to the mapping logic in CTF, and enabled dynamic creation and destruction of MPI communicators (defined for each processor grid dimension). These changes make execution more memory-efficient and substantially improved load-balance, but prevent us from making effective use of rectangular collectives on Blue Gene/Q [17, 28] , as few of the communicators used for collectives would map to a contiguous hypercube or toroid.
We used the built-in profiling capabilities of CTF and the performance-counter libraries made available on Vulcan [32] , the hardware platform used in our experiments, to determine the bottlenecks within CTF incurred during circuit simulation contractions. For reasons of brevity, we do not describe all optimizations in this paper. Local copy, summation, and multiplication primitives were automatically replaced by appropriate optimized variants, chosen with knowledge of the parameters employed during the circuit simulations. These low-level changes substantially improved the performance with respect to the original implementation.
The bulk of our performance improvements were high-level changes applied to CTF or made at the application code level. As has been described, the circuit-simulation code resulting from our code generation system contains, in the performance-critical slicing loop (Section 2.2.1), interactions between high-dimensional tensors. There is a reasonably even mix between operations that contain contractions and those that do not (e.g., diagonal gates that involve only multiplications). The first performance optimization at the application level was the aggregation of the (small) gate operators that mutate the large state tensors in a manner similar to that described in [24] . Thus, instead of applying, for example, fifteen two-dimensional gates/tensors to a tensor of dimension 38, we can apply one 30-dimensional tensor. As the complexity of the application of each gate operator is generally linear in the size of the tensor of states, the flop-to-byte ratio is low (there are few computational operations for each memory access). Aggregation of the operators incurs a greater computational cost, but better amortizes the cost of transpositions and lowers the amount of memory traffic associated with the tensor of states. We chose the sizes both by considering the CTF performance model and through the analysis of experimental data.
We further optimized the performance of the CTF multiplications and contractions by tuning over possible ordering permutations of the tensor modes. A tensor whose dimensions are greater than that of the underlying (virtual) communication network is typically distributed by CTF over its leading modes; e.g., elements ψ ijk for all k may be owned by processor p ij . The initial distribution of the tensors plays a significant role in the performance characteristics of both CTF multiplications and contractions. Certain distributions may avoid the need for communication altogether in given multiplication or contraction operations. We considered these permutations in light of a variety of levels of aggregation of gate operators (e.g., aggregating multiple gates into a single gate), which we observed to be beneficial in terms of performance.
Results
In the case of the 49-qubit circuit, the final quantum state is calculated in slices, where each slice involves the calculation of 2 38 amplitudes and a total of 2 11 slices were calculated. In the case of the 56-qubit circuit, only one slice with 2 37 amplitudes was calculated out of the 2 19 such slices that defined the final quantum state.
All experimental results were obtained over the course of two days on Lawrence Livermore National Laboratory's Vulcan Blue Gene/Q supercomputer. This time included the time to setup the experiments described in this paper, as well as that to conduct additional experiments beyond the scope of what is described here. Memory usage required strong scaling to 4,096 nodes, with 64 TB of memory, for each (parallel) calculation. As reported in Section 1, the theoretical minimum memory footprint to hold a slice of the state vector is less than 5 TB for the 49-qubit circuit (2 38 complex doubles are used to store the 2 38 amplitudes); it is common in the literature to report just the memory required for the state vector, e.g., [24] . In practice, runs required more than 32 TB. The bulk of these operations require two tensors to compute and CTF internal buffering introduces approximately a factor of 4 in overhead. Memory was also needed to reformat contraction operands for efficiency and for buffer space as messages are passed between nodes. Other sources of memory consumption stem from the precise manner in which certain operations are performed in CTF and we plan to address some of these trade-offs in future work.
After computing the ideal state amplitudes, we analyzed the distributions of the corresponding outcome probabilities (i.e., the squared magnitudes of the amplitudes) in aggregate across slices as well as individually per slice. The ranges of these outcome probabilities vary with the number of possible outcomes N = 2 n , where n is the number of qubits. The values of these probabilities also vary over several orders of magnitude. For these reasons, histograms were calculated for the values of z = log(N p) for outcome probabilities p. This transformation normalizes the resulting histograms across varying numbers of qubits, simplifying their comparison. It also better reveals the full dynamic range of the outcome probabilities. The theoretical probability density of z = log(N p) for universal random circuits with circuit fidelity α is distributed according to the following equation, which describes a truncated Gumbel distribution in the case of a perfect fidelity of α = 1:
The full derivation of this equation is given in Appendix A. Fig. 9 shows the histograms of the log-transformed outcome probabilities (log(N p)) obtained for the 49-and 56-qubit circuits plotted together with their theoretical distributions given by Equation (7) for α = 1. The histogram for the 49-qubit circuit incorporates all 2 49 outcome probabilities, while the one for the 56-qubit circuit incorporates only the arbitrarily selected slice of 2 37 outcome probabilities that were calculated for that circuit. As these graphs illustrate, there is an extremely close agreement between the observed and predicted distributions in both cases. Fig. 10 shows a plot of the histograms obtained for an aggregation of the slices calculated for the 49-qubit circuit. Each histogram is an aggregation of 32 slices, and the 64 individual histograms have been stacked together and rendered as a surface plot to better reveal variations between the histograms. As can be seen in this figure, the variations are concentrated at the extreme tails of the distribution where very low probabilities and, hence, very low bin counts are observed in the histograms. Such variations are expected in histograms whenever one is dealing with low bin counts. The effects are magnified in this case because the vertical axis (corresponding to bin counts) is plotted on a log scale. The same magnification effect can likewise be observed in Fig. 9 . Except for such minor variations at low bin counts, Fig. 10 reveals that the distributions of outcome probabilities across slices are virtually identical, demonstrating that this distribution is essentially isotropic across the quantum state. 
Conclusions and future research
In this paper we addressed a fundamental question in the engineering of more powerful computing devices. Our key contribution is a new circuit simulation methodology that allowed us to break the 50-qubit barrier, previously thought to be the limit beyond which quantum computers cannot be simulated. Our results confirm the expected Porter-Thomas distribution as the distribution of the outcome probabilities for universal random circuits. This expectation was confirmed per slice across the entire quantum state produced by the 49-qubit circuit, and for a single slice of the quantum state resulting from the 56-qubit circuit. As mentioned in the introduction, the ability to calculate quantum amplitudes for measured outcomes is important to assess the correct operation of quantum devices; however, this application was not explored in the present study.
The implications of Fig. 8 should be self-evident. If a 49-qubit random circuit that was previously thought to be impossible to simulate on a classical computer can now be simulated in slices where the data structures fit comfortably within the memory available on high-end servers, then the possibility now exists to perform routine calculations of amplitudes for measured outcomes of such quantum devices without requiring access to supercomputers. We plan to fully explore the extent to which this is possible to do both in terms of numbers of qubits and depths of circuits. Because tensor slice calculations can also be embarrassingly parallelized, circuits whose output amplitudes can now be calculated on high-end servers can also, in principle, be simulated in their entirety by distributing the calculations across a sufficiently large number of high-end servers. No longer are these calculations limited by the resources available on specific supercomputers. One could, in principle, run such simulations on the Cloud. For this class of circuits, whose extent still needs to be determined, the question of whether they can be fully simulated has now become more a question of economics than one of a physical plausibility.
Beyond the aforementioned computational utility on a complete circuit scale, tensor slicing can be instrumental for algorithmic analysis, as well as in simulation of active control subcircuits, where measured slice outcomes can serve as dynamic input to the remaining part of a circuit. 
Future research
In terms of the optimization of the parallel computation of quantum states, we plan to explore the use of low-precision computation, leveraging CTF support for user-defined tensor element types. Dynamic (contraction-time) tensor unfoldings should make it possible to map to lower-dimensional topologies, expanding the space of available algorithms and making use of rectangular collectives again possible. In some cases we may be able to combine folding with improvements in performance prediction, especially as relates to the offload model, to exploit the capabilities of GPUs. Integration of optimized batched-BLAS routines into CTF should result in substantial speed-ups in many cases. The utility of these changes should be greatly amplified with a commensurate set of changes to the performance prediction models employed.
Characterizing the fidelity of quantum circuits is a crucial aspect for the evolution of the design of quantum devices, and is enabled by the computation of ideal state probabilities. There are several possible research directions in this area. For instance, we could consider a (sequential) optimal experimental design framework, in which the goal is to maximize the extraction of useful information regarding the fidelity of the system, while being conscious to considerations such as limited experimentation or simulation budgets. Such a framework would provide means to assess the diminishing-return nature of information gain measures, and their trade-off with experimentation and simulation budgets [23, 3] .
Finally, there are several open questions regarding the optimal partitioning of circuits for simulation purposes. The difficulty of determining an optimal partitioning for circuit graphs with an underlying 2D lattice structure is an important aspect that emerges from our paper: such difficulty should be characterized from both an empirical, and a theoretical point of view (we conjecture that the corresponding decision problem is NP-hard). Another topic for future research is the precise relationship between treewidth, tensor network contractions, and our proposed methodology.
Let the actual measurement probability with respect to the physical realization of the circuit, with circuit fidelity α be
The cumulative distribution of p is given by: Hence, the probability density of q is: In our own experiments working with 49-qubit and 56-qubit circuits, we found it convenient to plot the distribution of log(N q) to better visualize the shape of this distribution over several orders of magnitude of variation in the value of N q. We therefore first derive the distribution w.r.t. N q, and then w.r.t. log N q. Let x = N q. The cumulative distribution of x is: Thus, the probability density of x is: 
B Partitioning of the 49-and 56-qubit circuits
In this section we describe the 49-and 56-qubit circuits analyzed in Section 3, and the partitioning adopted to compute the final amplitudes. To indicate gates, we use the notation given in Fig. 11 .
To familiarize the reader with the notation, we start by representing the example discussed at the end of Section 2.2.4: this is given in Fig. 7 . We use colors to indicate subcircuits, and "S" to indicate qubits that can be sliced.
The two large circuits that we simulated are given in Fig. 13 , together with their partitioning. From the picture, it can be seen that we defer entanglement so that the subcircuits do not get too large, and we exploit slicing in the final layers to reduce memory usage. that are not involved in subcircuit 3 (i.e., those for which subcircuit 3 does not contain any gates applied to them).
