Near term quantum computers with a high quantity (around 50) and quality (around 0.995 fidelity for two-qubit gates) of qubits will approximately sample from certain probability distributions beyond the capabilities of known classical algorithms on state-of-the-art computers, achieving the first milestone of so-called quantum supremacy. This has stimulated recent progress in classical algorithms to simulate quantum circuits. Classical simulations are also necessary to approximate the fidelity of multiqubit quantum computers using cross entropy benchmarking. Here we present numerical results of a novel classical simulation algorithm to calculate output probabilities of universal random circuits with more qubits and depth than previously reported. For example, circuits with 5 × 9 qubits of depth 40, 7 × 8 qubits of depth 30, and 10 × (κ > 10)) qubits of depth 19 are all easy to sample by calculating around one thousand measurements in a single workstation. Cross entropy benchmarking with around one million measurements for these circuits is now also possible in a computer cluster. The algorithm is related to the "Feynman path" method to simulate quantum circuits. For low-depth circuits, the algorithm scales exponentially in the depth times the smaller lateral dimension, or the treewidth, as explained in Boixo et. al. [1] , and therefore confirms the bounds in that paper. In particular, circuits with 7 × 7 qubits and depth 40 remain currently out of reach. Follow up work on a supercomputer environment will tighten this bound.
I. INTRODUCTION
Quantum computers offer the promise to perform a number of computational tasks believed to be impossible for classical computers. Prominent examples are quantum simulations [2] [3] [4] [5] [6] [7] [8] [9] and factoring large numbers [10] . Given the progress in classical computation, surpassing the capabilities of state-of-the-art algorithms and computers is no small feat. Therefore, in the path to more practical "quantum supremacy" breakthroughs, it is reasonable to focus first on more specific milestones better suited for near-term quantum computers with a high quantity (around 50) and quality (around 0.995 two qubit gate fidelity) of qubits. Sampling from the output distribution of certain families of quantum circuits has been identified as a promising venue for a first milestone of quantum supremacy [11] [12] [13] [14] [15] [16] . Universal random circuits [1] (or more general chaotic quantum evolutions [17] ) are particularly well-suited for near term devices. On the one hand, we note that an important characteristic of these types of sampling problems is that they are hypersensitive to perturbations [18, 19] and, therefore, approximate classical algorithms are unlikely to work. On the other hand, this also implies that an experimental implementation will not be successful unless the qubits being used are very coherent and all operations have high fidelity (and very fine-tuned control). Nevertheless, the same hypersensitivity to perturbations makes this type of demonstration of quantum supremacy a good benchmark for multiqubit fidelity of universal computers using cross entropy benchmarking [1, 17] . A successful implementation of a first milestone of quantum supremacy would demonstrate the basic building blocks for a large-scale quantum computer within the operational range of the surface code [20, 21] .
The prospect of a near-term demonstration of quantum supremacy has also stimulated recent progress in classical algorithms to simulate quantum circuits [1, [22] [23] [24] [25] . The most efficient algorithms previously reported to simulate generic circuits (without symmetries that allow for faster compressions [23, 26] or "emulations") apply gates to a vector state, implemented as highly optimized matrixvector multiplications [1, 22, 24, 25] . Benchmarking results for a highly optimized algorithm of this type were reported in Ref. [1] for 6 × 7 qubits and depth 27 taking 989 seconds and performed in the Edison supercomputer.
More recently, Ref. [24] reported a simulation using 8192 nodes and 0.5 petabytes of memory for a quantum circuit (in the same ensemble) with 5 × 9 qubits and depth 25 on the new Cori II supercomputer in 552 seconds (and 6 × 7 qubits in 80 seconds). Finally, Ref. [25] recently reported simulations of a circuit with 7 × 7 qubits and depth 27 and another circuit of 7 × 8 qubits and depth 23, along two days in the IBM Blue Gene/Q supercomputer. Computing time in a supercomputer at that scale is scarce and expensive.
We generalize the simulation of a quantum circuit by mapping it onto an undirected graphical model, which we compute using the variable elimination algorithm, developed in the context of exact inference [27, 28] . In some cases a circuit amplitude can be mapped directly to the partition function of an Ising model with imaginary temperature [1] , which can be computed exactly with the same method. This is related to the Feynman path method [29] , the tensor network method [30] , and similar approaches [31] . The algorithm is explained in Sec. IV, and we provide pseudocode in App. A. We are mostly interested in sampling output probabilities, but the same method can be used to calculate many amplitudes at once or calculating observables exactly. The quantum circuits implemented with superconducting qubits, which is our main focus, use controlled-phase gates (and controlled-Z gates in particular) as the predominant two-qubit gate [20, 32] . We exploit the fact that controlled-phase gates are diagonal in the computational basis (this is also used in the algorithms referred above). The cost of this algorithm is exponential in min(O(d ), O(n)) for depth d, minimum lateral dimension and total number of qubits n. More optimally, the cost is exponential in the treewidth, which is upper bounded by min(O(d ), O(n)), and was estimated in Ref. [1] (see also Sec. IV B).
II. NUMERICAL RESULTS
In the simplest implementation, we compute one-byone the exact probabilities assigned by a quantum circuit U to a chosen set of outputs. One use case is cross entropy benchmarking [1, 17] , where we need to compute the probabilities assigned by U to the bit-strings measured in an experiment, see App. B. In other use cases, such as estimating the value of an observable, or simulating sampling from the output distribution, the error scales as t −1/2 , where t is the number of output probabilities computed (see App. C). If the time per amplitude is small enough (for instance, around 100 seconds or less), then we can obtain enough probabilities in a single workstation in a reasonable time (for instance, 1000 probabilities in around one day). The sampling is also trivially parallelizable across machines. The algorithm can be modified to sample sets of probabilities in a single run, see Eq. (6) . It can also be modified to calculate an observable exactly, but the computational cost might change. This algorithm is not useful in all cases. For instance, Ref. [1] numerically verified that the output distribution of low-depth universal random circuits approaches the entropy of the Porter-Thomas (or exponential) distribution to within its standard deviation of order 2 −n/2 . It would require to compute a number of output probabilities of order 2 n to verify this property. We also note that for enough depth a direct evolution of the wave vector is likely to become more efficient, as it is easier to optimize.
In what follows we report the time per probability (or amplitude) in a single workstation. We use as a reference a machine with 2× 14-core Intel E5-2690 V4 processors @ 2.6GHz, 35MB Cache and 128GB DDR4 2400MHz RAM. We use single precision complex numbers, as the relative error per probability is negligible for the sizes computed (less then 10 −5 ). The computation of a single probability can also be distributed among several machines as the size of the computation grows. We leave this and other optimizations (such as reusing part of the computation when calculating many probabilities) for follow up work.
We are interested in circuits in a 2D lattice, as they are available experimentally [20, 32] . We will use the following set of gates:
1. We use controlled-phase (CZ) gates as two-qubit gates.
FIG. 1. Time per output probability for a typical instance as a function of depth on a single workstation, using TensorFlow as the engine for the computation and a vertical elimination ordering (see Sec. IV A). Different colors corresponds to different circuit sizes, from 5 × 5 (and 5 × 9) qubits, to 10 × 10 (10 × 11). The green circle marker corresponds to the circuits simulated in Ref. [1] , black circles to Ref. [24] , and blue circles to Ref. [25] . The yellow circle is the circuit used in Fig. 3 .
2.
For a concrete example we use single qubit gates in the set {H,
is a π/2 rotation around the X (Y) axis of the Bloch sphere, and the nonClifford T gate is the diagonal matrix {1, e iπ/4 }.
We use the rules for the layout of gates described in [1] . We are also particularly interested in superconducting qubits, and it is currently not possible to perform two CZ gates simultaneously in two neighboring qubits [20, [33] [34] [35] . This restriction was used for the circuits. Figure 1 shows the time per output probability for typical instances of different circuit sizes as a function of depth on a single workstation.
1 Different colors correspond to different circuit sizes, from 5 × 5 (and 5 × 9) qubits, to 10 × 10 (10 × 11). Here we use a vertical variable elimination ordering, which in essence processes the "Feynman path" or worldline of each qubit in sequence (see Sec. IV A). The processing time (and required memory) of each computation grows exponentially with depth. The exponential growth is also faster for larger circuits. More specifically, we process the worldline of the qubits ordered first along the smaller lateral dimension . The cost of computing probabilities for circuits with 5 × 9 qubits, with = 5, is similar to the cost of circuits with 5 × 5 qubits. The same is true for circuits of sizes 7 × 7 and 7 × 8 with = 7, and also sizes FIG. 2. Timings per output probability for typical instances at different depths using different implementations of the graphical model variable elimination algorithm. We compare a TensorFlow implementation and a C++ implementation using the vertical variable elimination ordering (see Sec. IV A), and an implementation using the Dask python package using a variable ordering obtained by running QuickBB [37] for a day to find a treewidth decomposition at each depth (see Sec. IV B).
10 × 10 and 10 × 11 (or larger) with = 10. In this implementation the computational graph of the algorithm was constructed in python and processed with TensorFlow version 1.3.0 [36] . For each size depicted, the computation becomes memory limited (128 GB RAM) after the highest depth shown (depths around 31 to 33 for 7 × 7 qubits depending on the instance). Figure 2 compares the run times per output probability as a function of depth for three different implementations. One is the implementation with a TensorFlow [36] backend using the same vertical variable elimination ordering as in Fig. 1 . This implementation is efficient and portable to the different architectures in which TensorFlow is available, including distributed environments. A second implementation is a native C++ implementation. The C++ implementation is faster, although the run times are similar, and is probably less portable and harder to distribute across machines. Finally we also show an implementation using a variable ordering obtained by running QuickBB [37] for a day to find a treewidth decomposition at each depth. The tensors obtained by this variable elimination ordering are worse aligned than in the vertical ordering, and can not be processed by the off-the-shelf TensorFlow binary. Therefore we used the python module Dask [38] in this case (Dask was slower than TensorFlow in the vertical ordering for our current implementation). Not taking into account the time required to find a better variable elimination ordering, the treewidth ordering is faster. In all three cases further optimizations are possible, and the run times reported are merely indicative. All three implementations are memory (RAM) limited after the last depth shown in the figure.
As an illustration, we calculated 200 thousand output probabilities of a universal random circuit with 7 × 8 qubits and depth 30. We show in Fig. 3 that the distribution obeys the exponential or Porter-Thomas distribution. We also used these probabilities to estimate the entropy of the output distribution of this circuit. We obtain an entropy of log(2 7×8 ) − 0.422 ± 0.007, while the analytical value for a Porter-Thomas distribution is log(2 7×8 ) − 0.4228 [1] .
III. CIRCUIT SIMULATION AS AN UNDIRECTED GRAPHICAL MODEL
The basic principle of the algorithm can be seen as a mapping of a quantum circuit to an undirected graphical model with complex factors. This is equivalent to the interaction graph of the Ising model detailed in Ref. [1] . The evaluation is carried out using an algorithm developed in the context of exact inference for undirected graphical models [27, 28, 39] , but in the present case the factors are complex. A similar algorithm was used to find an exact ground state of a classical Ising model in [40] . We now explain in detail how the undirected graphical model is constructed.
A. Feynman path method
We represent a quantum circuit by a product of unitary matrices U (t) corresponding to different clock cycles t. We introduce the following notation for the amplitude of a particular bit-string after the final cycle of the circuit,
Here |b t = ⊗ . We refer to these as non-diagonal gates as they contain two nonzero elements in each row and column (unlike T and CZ). This observation allows us to rewrite the path integral representation in a more economic fashion.
Through the circuit, a sequence of non-diagonal gates are applied to qubit j. We denote the length of this sequence as d j . In a given path the qubit j goes through the sequence of Boolean states {b 
Therefore, an individual path in the path integral can be encoded by the set of
B. Undirected graphical model
Equation (1) is a sum of products of factors defined by the gates of the quantum circuit. A quantum circuit U is given as a sequence of gates U = U g · · · U 1 . In expression (1), each gate U will contribute factors defined by a complex pseudo-Boolean function ψ U . A diagonal one-qubit gate is expressed as
where ψ U (b) is a complex function of one Boolean variable. Next consider a non-diagonal one-qubit gate, for example H, X 1/2 , Y 1/2 , and so on,
where 
where
is also a function of two Boolean variables, with ψ CZ (1, 1) = −1 and all the other entries equal 1.
For circuits where the only two-qubit gate is a CZ, the expression (1) is a sum of products of factors defined by complex functions of one or two Boolean variables
where {b k j } are the classical Boolean variables introduced above. The delta functions set the values of the Boolean variables corresponding to the initial and final bit-strings. If we omit some or all the delta functions, we would compute sets of amplitudes at once. We show below how the method is extended to more general gates or gates acting on more qubits.
In order to compute expression (6), it is convenient to interpret it as an undirected graphical model by defining the graph G where each variable b k j corresponds exactly to one vertex in G and each function ψ results in a clique between the vertices corresponding to the variables of ψ. As explained above, the subscript j ∈ [1, n] enumerates the qubits, and the superscript k enumerates new variables introduced along each qubit worldline, one new variable for each qubit acted upon by a non-diagonal gate. After the vertices of the graph G have been defined in this way, gates do not introduce any further vertices, rather they (might) introduce edges between vertices. Multiple gates might be represented by the same edge or edges, and act upon the same vertex or vertices. Explicit rules on how gates are represented are given below.
A quantum circuit representation of a diagonal onequbit gate and the corresponding graphical model are
The quantum circuit and graphical model of a nondiagonal one-qubit gate are
The circuit and graphical model for a two-qubit diagonal gate are
Finally, the corresponding quantum circle and graphical model for a two-qubit non-diagonal gate are
As an example, consider a quantum circuit with two qubits such as
where the vertical line symbolizes a controlled-Z gate (which is diagonal) and the final boxes represent a measurement. The undirected graphical model corresponding to this circuit is .
This example is worked out explicitly in App. D. The same method can be applied to calculate the expectation value of an operator O given by 0| U † OU |0 . For a local operator we simplify this expression by writing the circuit unitary U in terms of gates U α , and canceling terms U † α U α = 1 whenever possible. The operator results in factors in an expression equivalent to Eq. (6), following exactly the same procedure. Alternatively, the expectation value of an observable can be estimated by sampling output probabilities and using Monte Carlo integration, see App. C.
IV. VARIABLE ELIMINATION ALGORITHM
The evaluation of a sum of products of factors such as expression (6) can be done directly by an algorithm developed in the context of exact inference for undirected graphical models, known as the bucket elimination algorithm [27, 41] or the variable elimination algorithm [28] . In our case, though, the factors have complex values, and it does not correspond to a probabilistic model. The graphical model of some circuits can also be interpreted directly as an Ising model at imaginary temperature [1] , see App. F, similar to the relation between undirected graphical models or random Markov fields and Ising models at real temperature.
The variable elimination algorithm is usually explained by first considering the case where the graphical model is one dimensional. This would correspond to the worldline of a quantum circuit with a single qubit. In this case we would have to evaluate expressions of the form
That is, we evaluate a one dimensional undirected graphical model by using the distributive property and eliminating variables from left to right (or right to left). A variable is eliminated by performing the corresponding sum and storing the corresponding factor in memory, as in the step from Eq. (12) to Eq. (13). The same algorithm applies to trees, and is known as belief propagation in the context of graphical models [42] . It is also closely related to the Bethe Peierls iterative method [43] or the cavity method in physics [44] . The cost of this algorithm is linear. In more general cases we obtain factors that grow in size after eliminating a variable. For example
The size of a factor or tensor stored in memory is exponential in the number of variables or indexes. The size of the factors obtained through variable elimination depends on the order of elimination. Consequently, the cost of this algorithm can vary exponentially for different variable elimination orders.
A. Vertical variable elimination ordering for low-depth circuits
For circuits with low depth and low dimension a simple strategy for the order of variable elimination is to process one qubit at a time, eliminating all variables in one worldline sequentially before moving to a neighboring qubit. As an example, we consider again the circuits of Ref. [1] , which are defined in a two dimensional lattice of qubits. As explained in Sec. III, the mapping of a circuit output amplitude to an undirected graphical model results in a graph defined on vertices corresponding to Boolean classical variables b along the so-called worldline of a qubit j in the time direction. We assume that the qubit index j is ordered so that sequential values correspond to neighboring qubits in the underlying two dimensional lattice. Processing the variables first along the worldline direction, which we call the vertical ordering of variable elimination, corresponds to eliminating variables in the lexicographical order of the pairs (j, k). That is, we eliminate all b k j variables corresponding to qubit j sequentially along the k index before moving to the variables corresponding to qubit j + 1. Figure 4 shows the size of the maximun tensor rank as a function of the circuit depth for an instance of each size 6×6, 6×7, and 7×7 using a vertical elimination ordering. More exactly, we plot the size of the tensor minus 1, to make it directly comparable to the standard definition of treewidth used in Fig. 5 . There are small variations between instances of the same size, which explains that in this particular case the instance of size 6 × 7 has a larger tensor size for some depths in this ordering than the instance of size 7 × 7.
B. Numerical estimation of the treewidth for some quantum circuits
The size of the factors obtained by variable elimination from a given ordering can be analyzed graphically. We start with the undirected graph corresponding to the original expression, where each variable corresponds to a vertex. To simulate the elimination of a variable, we add an edge between all vertices connected to the vertex (variable) being eliminated. The cliques obtained in this process correspond to factors obtained through variable elimination. The size of the largest clique is called the induced width. Note that different orderings in the vertex elimination process, which correspond to different orderings of variable elimination, can result in different cliques. The treewidth is defined as the minimum width over all variable orderings. Determining the treewidth is in general NP-complete, but heuristic algorithms, such as QuickBB [37] , can be used to obtain an approximation.
For a circuit in a 2D lattice of qubits with two-qubit gates restricted to nearest neighbors, as in Ref. [1] , the treewidth of the corresponding undirected graphical model is proportional to min(O(d ), O(n)), where is the minimum lateral dimension. Figure 5 shows numerical upper bounds for the treewidth as a function of depth (see also Ref. [1] ). The upper bounds were obtained by running the QuickBB algorithm [37] . This algorithm also returns the corresponding variable elimination order, which was used in Fig. 2 and also to calculate the probabilities plotted in Fig. 3 .
V. CONCLUSION
We generalize the simulation of a quantum circuit by mapping it onto an undirected graphical model, which we compute using the variable elimination algorithm, developed in the context of exact inference [27, 28, 39] . This is particularly convenient for circuits of low depth [1, 30, 31] , and with predominantly diagonal two-qubit gates. We are able to sample the output distribution of quantum circuits in a single workstation for cases that previously required a supercomputer. We also show how to perform cross entropy benchmarking and how to estimate the value of an observable. The computation of many amplitudes is embarrassingly parallelizable across machines. The cost of this algorithm is exponential in min(O(d ), O(n) ) for depth d, minimun lateral dimension and total number of qubits n. More optimally, the cost is exponential in the treewidth of the graph, as estimated in Ref. [1] . Follow up work will study potential improvements to this algorithm, such as implementation in a supercomputer and code optimization, and possible space-time tradeoffs [31] .
ACKNOWLEDGMENTS
We acknowledge Stephan Hoyer for conversations and suggesting the use of Dask.
Appendix A: Pseudocode
In this section we give pseudocode for the quantum circuit simulation algorithm presented in the main text [28] . As seen in Eq. (6), an output amplitude of a quantum circuit is expressed as a sum of products of factors. Each factor is a complex valued function or tensor defined in one or two Boolean variables or indexes. The variable elimination algorithm performs the sum of Eq. (6) Tensors implement two operations. One is a multiplication across tensors, tensor1 * tensor2, implemented with broadcasting semantics, as in the NumPy python library. This means that the product of tensors is done elementwise. If an index or variable in tensor_a is missing in tensor_b then, in effect, this index is added to tensor_b with dimension 1 (the value of tensor_b is determined by the pre-existing indexes). The other operation, tensor.reduce_sum(), computes the sum of elements across the first dimension of the tensor. For simplicity, we assume that the indexes of all tensors are initially ordered from low to high. This will also be an invariant of the algorithm.
The bucket elimination algorithm [27, 41] keeps track of the order of the operations involved. More explicitly, we use the bucket elimination algorithm to build a computational graph that will be processed by TensorFlow or Dask. A bucket is a list of tensors. The ordered list buckets has one bucket per variable. In this way each bucket is associated to a variable. An invariant of the algorithm is that each tensor in a bucket acts upon the variable associated with the bucket, and all other variables have higher order.
First we initialize each bucket with the tensors in Eq. (6), corresponding to the gates, initialization and measurement of a circuit output amplitude: Note that this obeys the invariant given above, because of the assumption that the variables or indexes in each tensor are ordered from low to high. Each bucket in the buckets list is processed in order. Processing a bucket implements the elimination of the variable associated with that bucket, as in Eq. (14) . As noted above, we assume that variables have initially been ordered with the chosen variable elimination order. Processing the buckets list one by one then implements the correct order. Processing a bucket can return a tensor without indexes, corresponding to a scalar, when all tensors in the bucket had only one variable (this has to be the variable associated with the bucket). Typically this only happens in the last bucket, and this scalar corresponds to the amplitude that we want to calculate. In some cases, for circuits of very small depth, the circuit can be decomposed into disjoint tensors, and the corresponding scalars are multiplied together. The processing of a bucket consists on two steps: first all tensors in the bucket are broadcasted together, and then the first variable is summed over, as in Eq. (14) . By the invariants of the algorithm, the variable summed over is the one associated to the bucket. We now review cross entropy benchmarking, a technique to estimate the fidelity of an experimental quantum circuit implementation using the exact probabilities amplitudes calculated by a quantum circuit simulation algorithm [1, 17] .
The cross-entropy between a probability distribution p A and the ideal output probability p U is defined as S(p A , p U ) = − x p A (x) log p U (x). As explained in Ref. [1] , the cross-entropy with an uncorrelated distribution p unc averaged over quantum circuits is almost constant, H 0 = E U [S(p unc , p U )]. In the following we make the averaging over circuits U implicitly.
We write the output of an experimental quantum circuit implementation with fidelity α as
The state U |0 corresponds to the ideal output of circuit U, without any errors, while σ U represents the output of the implementation of U after any experimental errors. The experimental output probability for a bit-string x is p exp (x) = x| ρ |x = α p U (x) + (1 − α) x| σ U |x . From the arguments presented in Ref. [1] , almost any error in a random universal circuit of sufficient depth results in an output probability which is uncorrelated with the ideal probability distribution, and therefore we assume that x| σ U |x is uncorrelated with p U (x). We have
Therefore
is an estimate of the experimental fidelity. Note that H 0 and H(p U ) can be calculated numerically, see Sec. II and App. C. Furthermore, for universal random circuits in the Porter-Thomas regime we have H 0 = log(2 n ) + γ and H(p U ) = log(2 n ) − 1 + γ, where γ is Euler's constant [1] . In order to estimate the fidelity α of an experimental implementation, we take a large sample S exp = {x 
In conclusion, using the algorithm presented in the main text to calculate the m ideal output probabilities {p U (x exp )}, we can use Eqs. (B7) and (B6) to obtain an estimate of α.
Appendix D: Simulation example
In this appendix we consider the specific example of the quantum circuit
We calculate the output amplitude for input |00 and output |00 . The corresponding graphical model is
Because the input and output is specified, we have b 
(D3)
The treewidth of this graph is 2, because it is a clique with two vertices. 2 In the next step we can eliminate the variables b 1 0 and b 1 1 in any order. More explicitly, the amplitude 00| C |00 , where C is the circuit in Fig. (D1) , is
The function ψ H corresponds to a Hadamard gate and is given by the table
The function ψ CZ corresponds to a controlled-Z gate and is given by the table We can rewrite Eq. (D4) as 
Appendix E: Relation to tensor network contraction
Variable elimination on the undirected graph representing a quantum circuit is related to the simulation of a quantum circuit by contracting tensor networks [30] . In a tensor network graphical representation, gates correspond to vertices (tensors) and qubits connecting gates (indexes in the tensors) are represented by edges. Therefore, a two-qubit gate is represented as a node with four edges, two for input and two for output. For instance, the tensor network corresponding to the example of Eq. (10) is .
The operation of contracting a network of tensors is analogous to the variable elimination in Sec. IV. The indexes in common are summed over or contracted, and the size of the resulting tensor in memory is exponential in the number of remaining indices. Graphically, the contraction of two tensors is represented by eliminating the edges (indexes) that the corresponding two vertices (tensors) have in common, and replacing both vertices with a single vertex with the remaining edges (indexes). As in variable elimination, the cost also depends in the order in which indexes are eliminated.
Reference [30] relates the minimum cost of the tensor contraction to the treewidth of the corresponding line graph. The line graph G corresponding to the graph G representing the tensor network is defined with a vertex for every edge in G. If two edges are incident in the same vertex (tensor) in G, then we add an edge in G between the corresponding two vertices of G . In this way a twoqubit gate (a tensor with four indexes) results in a clique between four vertices in the line graph. For instance, the line graph corresponding to the tensor network in (E1) is .
It follows that the line graph of the tensor network for a quantum circuit is analogous to the undirected graphical model detailed above. The main difference is that diagonal two-qubit (and single-qubit) gates are treated more efficiently in the undirected graphical model obtained directly from the quantum circuit factor representation. In the line graph obtained from a tensor network a diagonal gate results in a clique with four vertices, but in the original graphical model is represented only by an edge and does not introduce new vertices. This can be seen comparing the examples in Fig. D2 , with treewidth 2, and (E2), which has treewidth 4. Because the cost is exponential in the treewidth, this optimization results in a smaller exponent for the computation. Figure 6 shows numerical upper bounds for the treewidth of the line graph as a function of depth for the circuits in Ref. [1] . The upper bounds were also obtained by running the QuickBB algorithm [37] . Now that the treewidth has been reduced by almost a factor of two in the undirected graphical model which is drawn directly from the quantum circuit, Fig. 5 , instead of using the line graph of the tensor network, Fig. 6 . This is due to the fact that in these circuits all two-qubit gates are controlled-Z gates.
