Layout synthesis, an important step in quantum computing, processes quantum circuits to satisfy device layout constraints. In this paper, we construct QUEKO benchmarks for this problem, which have known optimal depth. We use QUEKO to evaluate the optimality of current layout synthesis tools, including Cirq from Google, Qiskit from IBM, t|ket⟩ from Cambridge Quantum Computing, and recent academic work. To our surprise, despite over a decade of research and development by academia and industry on compilation and synthesis for quantum circuits, we are still able to demonstrate large optimality gaps. Even combining the best of all four solutions we evaluated, the gap is still about 5x for circuits with depths suitable for state-of-the-art devices. This suggests substantial room for improvement. Finally, we also prove the NP-completeness of the layout synthesis problem for quantum computing. We have made the QUEKO benchmarks open source.
Introduction
Recently, the quantum processor "Sycamore" from Google was shown to have a clear advantage over classical supercomputers on the task of sampling random quantum circuits [1]. It is widely expected that in the near future, quantum computing (QC) will outperform its classical counterpart in solving more and more problems. Achieving this computational advantage, however, requires executing larger and larger quantum circuits.
A quantum circuit consists of quantum gates acting on qubits. It was shown that only gates acting on one or two qubits are required for universal quantum computing [2] . After a quantum circuit is designed, it needs to be mapped to a QC device. However, qubit connections required by two-qubit gates are often greatly constrained by QC device layouts. QC layout synthesis resolves this issue by producing an initial mapping from the qubits in the circuit to the physical qubits on the QC device, adjusting the mapping to legalize two-qubit gates by inserting some new gates, and scheduling all the gates. The resulting circuit preserves the original functionality and is executable on the QC device.
Quantum circuits in earlier experiments consisted of only dozens of gates on small devices, e.g., those with 5 or fewer qubits. In those cases, layout synthesis was usually realized by exhaustive enumeration. However, the task becomes increasingly intractable as the circuits get deeper and wider. Nowadays, a cutting-edge QC experiment requires the execution of a circuit of 53 qubits, 1113 single-qubit gates, and 430 two-qubit gates [1]. For a general circuit of this size, the number of possible initial mappings is 53!, and the subsequent scheduling and legalization steps have large solution spaces as well. Clearly, design automation is necessary. In addition, the size of the circuits that QC hardware is able to execute has been scaling exponentially in the past few years [3]. The fast increase of hardware capacity presents an even bigger challenge to layout synthesis.
Several layout synthesis tools are available and there are also benchmarks that help us to compare them. However, it is currently unknown how far these tools are from the optimal solutions. In this paper, we present QUEKO benchmarks (quantum mapping examples with known optimal), which are quantum circuits with known optimal depth for the given QC device. Then, we evaluate four existing QC layout synthesis tools with QUEKO, namely t|ket⟩ 1 , the greedy router included in Cirq 2 , DenseLayout plus StochasticSwap included in Qiskit 3 , and the recent academic work by Zulehner et al. [4]. To our surprise, rather large optimality gaps are discovered even for feasible-depth circuits: 1.5-12x on average on a smaller device and 5-45x on average on a larger device.
This result is somewhat surprising. In comparison, the VLSI circuit placement optimality study conducted more than 15 years ago using the PEKO benchmarks [5] revealed that the optimality gaps of the leading academic and industrial placers were 1.43-2.12x on average for circuits with over two million placeable objects. The quantum circuits used in this study have only up to 54 qubits and about 35 thousand quantum gates, yet show a much larger optimality gap.
The optimality gaps revealed in this study have a strong implication. If we can consistently halve the circuit depth by better layout synthesis, we effectively double the decoherence time of a QC device, which is equivalent to a large advancement in experimental physics and electrical engineering. Therefore, the gaps call for more research investments into QC layout synthesis. To draw a parallel, the VLSI placement optimality study using the PEKO benchmarks [5] spurred further research investment in circuit placement, resulting in wirelength reduction equivalent to two or more generations of Moore's Law scaling, but in a more cost-efficient way [6] .
The rest of this paper is organized as follows: Sec. 2 reviews relevant background of QC; Sec. 3 formulates the QC layout synthesis problem; Sec. 4 reviews related work; Sec. 5 provides the construction of QUEKO benchmarks; Sec. 6 evaluates the aforementioned tools with QUEKO; Sec. 7 proves the NP-completeness of QC layout synthesis; Sec. 8 concludes.
Background

Qubits
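A qubit is in a quantum state |ψ⟩ represented by a vector in a two-dimensional Hilbert space with L2-norm equal to 1,
\[ |\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1, \]
where the two basis vectors are |0⟩ and |1⟩, and α and β are complex amplitudes. A quantum state of multiple qubits lies in the tensor product of the individual Hilbert spaces. For instance, a general two-qubit state |φ⟩ is
\[ |\phi\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle, \qquad \sum_{i,j \in \{0,1\}} |\alpha_{ij}|^2 = 1, \]
where we omit the tensor product notation ⊗ between |·⟩s for convenience. A joint state of two individual qubits |ψ 1 ⟩ ⊗ |ψ 0 ⟩ is
\[ (\alpha_1|0\rangle + \beta_1|1\rangle) \otimes (\alpha_0|0\rangle + \beta_0|1\rangle) = \alpha_1\alpha_0|00\rangle + \alpha_1\beta_0|01\rangle + \beta_1\alpha_0|10\rangle + \beta_1\beta_0|11\rangle. \]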
Quantum Gates
A quantum gate transforms an input state to an output state. For example, some common single-qubit gates are X, H, and T. The symbol † denotes the conjugate transpose.
Two common two-qubit gates are CZ and CX (also named CNOT).
Another important gate is the CCX, or Toffoli, gate, which is universal for reversible logic and thus essential for QC logic synthesis [7]. It is a quantum gate on three qubits and can be decomposed into a set of single-qubit and two-qubit gates, as shown in the following subsection.
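For reference, the standard matrix forms of the gates named above can be written as follows. These are the usual textbook definitions rather than a reproduction of the original typeset equations; the two-qubit matrices assume the basis order |00⟩, |01⟩, |10⟩, |11⟩ with the control as the first qubit.

```latex
\[
X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}, \quad
T^{\dagger} = \begin{pmatrix} 1 & 0 \\ 0 & e^{-i\pi/4} \end{pmatrix},
\]
\[
CZ = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}, \quad
CX = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}.
\]
```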
Quantum Circuit
In QC, a circuit or program is usually input as a piece of QASM code [8] , e.g., Fig. 1a . The code is rather simple to read, merely specifying each gate sequentially like instructions in traditional assembly language. We thus define a quantum circuit to be a list of quantum gates g 1 g 2 ...g M .
It is important to note that the qubits in a quantum circuit are logical qubits denoted as q 1 , ..., q N . We do not use "logical" to indicate error correction in this paper. Additionally, we denote qubit count by N , gate count by M , single-qubit gate count by M 1 , and two-qubit gate count by M 2 . For instance, in Fig. 1 , N = 3, M = 15, M 1 = 9, M 2 = 6.
We treat a gate as the set of qubits it acts on. The cardinality |g| = 1 if g is a single-qubit gate and |g| = 2 if g is a two-qubit gate. g ∪ g′ denotes the set of qubits involved in g or g′, and g ∩ g′ denotes the set of qubits involved in both g and g′. For instance, in Fig. 1, q 1 , q 2 ∈ g 2 , q 0 ∈ g 13 , |g 4 | = |g 6 | = 2, |g 7 | = |g 9 | = 1, g 1 ∩ g 2 = {q 2 }, and g 8 ∪ g 9 = {q 0 , q 1 , q 2 }.
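The gate-as-qubit-set notation can be sketched in a few lines of Python. The gate list below is an assumption: it is one standard 15-gate decomposition of the Toffoli gate, chosen because it agrees with the counts and set examples quoted in the text, but its gate order may differ in places from Fig. 1a. The names `TOFFOLI`, `qubits`, and `circuit_stats` are illustrative only.

```python
# Assumed Toffoli decomposition: (gate name, logical qubits), in program order.
TOFFOLI = [
    ("h", (2,)), ("cx", (1, 2)), ("tdg", (2,)), ("cx", (0, 2)),
    ("t", (2,)), ("cx", (1, 2)), ("tdg", (2,)), ("cx", (0, 2)),
    ("t", (1,)), ("t", (2,)), ("h", (2,)), ("cx", (0, 1)),
    ("t", (0,)), ("tdg", (1,)), ("cx", (0, 1)),
]


def qubits(gate):
    """View a gate as the set of logical qubits it acts on."""
    return frozenset(gate[1])


def circuit_stats(circuit):
    """Return (N, M, M1, M2): qubit, gate, single-qubit and two-qubit counts."""
    all_qubits = set().union(*(qubits(gate) for gate in circuit))
    m1 = sum(1 for gate in circuit if len(qubits(gate)) == 1)
    m2 = sum(1 for gate in circuit if len(qubits(gate)) == 2)
    return len(all_qubits), len(circuit), m1, m2


# 1-indexed view g[1], ..., g[15], matching the notation g_1 ... g_M.
g = {i + 1: qubits(gate) for i, gate in enumerate(TOFFOLI)}

print(circuit_stats(TOFFOLI))   # N, M, M1, M2
print(sorted(g[1] & g[2]))      # qubits shared by g_1 and g_2
print(sorted(g[8] | g[9]))      # qubits touched by g_8 or g_9
```

With this representation, intersection and union of gates are ordinary set operations, which is all the later dependency and overlap checks need.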
A 1D circuit diagram can also represent the circuit, e.g., Fig. 1b . In such a diagram, each wire stands for a logical qubit. We draw the control qubit of CX gate as • and the target qubit as ⊕. The gates in the diagram are executed from left to right. Gates aligned vertically are executed simultaneously. The 1D diagram provides some primitive timing information, but not explicitly. Due to its 1D nature, the diagram cannot clearly show some features of the quantum circuit. For instance, g 8 and g 9 can be executed simultaneously, but we separate them by some horizontal distance to avoid overlapping in the diagram.
Quantum Computing Device Representation
We represent the layout of a QC device with a device graph G = (P, E), where each node p stands for a physical qubit and each edge e stands for a connection that enables two-qubit entangling gates, i.e., we can only perform such gates on two physical qubits that are connected. This graph is also called the coupling graph or qubit connectivity. Device graphs used in this paper are shown in Fig. 2.
Problem Formulation
Layout synthesis is divided into two sub-tasks: initial placement, which produces an initial mapping from logical qubits to physical qubits, and gate scheduling, which decides when and where to execute each input gate and inserts SWAP gates to make two-qubit gates legal on the device graph. In the whole QC compilation/mapping workflow, there are usually circuit optimization stages before or after layout synthesis. During the optimization stages, gate reduction and commutation are performed, e.g., [9]. We assume that these optimizations are already applied prior to layout synthesis, so that every input gate should be executed and the relative order of input gates should not change in the layout synthesis process.
Figure 1: Toffoli circuit. (c) Two-qubit gate set of Toffoli circuit. (Single-qubit gates are colored gray. Identical two-qubit gates applied at different times have the same color, e.g., g 2 and g 6 are orange because they are both CX(q 1 , q 2 ) but at different times.)
Initial Placement
During initial placement, we need to find a mapping µ 0 : Q → P from logical qubits in the quantum circuit to physical qubits on the device that benefits subsequent gate scheduling. If the two-qubit gate set, consisting of all the two-qubit gates, can be embedded in the device graph during initial placement (e.g., Fig. 1c can be embedded in Fig. 2c), gate scheduling does not necessarily need to insert any gates. However, the case is not so ideal in general. For example, if we want to map the Toffoli circuit shown in Fig. 1b onto the device Ourense shown in Fig. 2a, there must be some additional gates, since the device graph does not contain any triangles, but the two-qubit gates in Fig. 1c form a triangle. A valid initial placement in this case is given in Fig. 3a, where µ 0 (q 0 ) = p 0 , µ 0 (q 1 ) = p 3 , and µ 0 (q 2 ) = p 1 .
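The embedding argument above can be checked by exhaustive enumeration on a device this small. The sketch below is illustrative only: the Ourense edge list is an assumption inferred from the examples in the text (edges p0-p1, p1-p2, p1-p3, p3-p4), and `find_embedding` and `illegal_pairs` are hypothetical helper names, not part of any tool discussed here.

```python
from itertools import permutations

# Assumed device graph for Ourense (Fig. 2a), inferred from the text's examples.
OURENSE_EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]

# Two-qubit gate set of the Toffoli circuit (Fig. 1c): a triangle on q0, q1, q2.
TOFFOLI_PAIRS = [(1, 2), (0, 2), (0, 1)]


def find_embedding(pairs, num_logical, num_physical, edges):
    """Try every injective mapping; return one that puts all pairs on edges."""
    edge_set = {frozenset(e) for e in edges}
    for image in permutations(range(num_physical), num_logical):
        if all(frozenset((image[a], image[b])) in edge_set for a, b in pairs):
            return dict(enumerate(image))
    return None  # no SWAP-free placement exists


def illegal_pairs(mu, pairs, edges):
    """List the logical pairs that are not adjacent under the mapping mu."""
    edge_set = {frozenset(e) for e in edges}
    return [p for p in pairs if frozenset(mu[q] for q in p) not in edge_set]


# The triangle cannot be embedded in the triangle-free device graph.
print(find_embedding(TOFFOLI_PAIRS, 3, 5, OURENSE_EDGES))

# The placement of Fig. 3a leaves exactly one pair that needs a SWAP.
MU0 = {0: 0, 1: 3, 2: 1}
print(illegal_pairs(MU0, TOFFOLI_PAIRS, OURENSE_EDGES))
```

Enumeration like this is only practical for a handful of qubits, since the number of candidate mappings grows factorially with device size.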
Gate Scheduling
Given a quantum circuit g 1 ...g M , e.g., Fig. 1b, gate scheduling produces the spacetime coordinates (t j , x j ) for each gate. The coordinates specify when and where the gates are applied. We say that a gate is scheduled to cycle t if its time coordinate is t. For a single-qubit gate, the space coordinate is a physical qubit, i.e., x j ∈ P ; for a two-qubit gate, it is an edge in the device graph, i.e., x j ∈ E. SWAP gates may need to be inserted during gate scheduling to ensure that all two-qubit gates are executable. The input gates plus the inserted SWAP gates constitute the scheduled gate list g̃ 1 , ..., g̃ M̃ . Since only SWAP gates are inserted and all the input gates are contained in the scheduled gate list, the functionality of the input circuit remains unchanged after the layout synthesis process. Additionally, gate scheduling must respect dependencies in the quantum circuit. If gate g acts on qubit q, then g can only be executed after all prior gates that act on qubit q are executed.
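The dependency rule alone already fixes the earliest cycle at which each gate can run. The following sketch computes these as-soon-as-possible time coordinates while ignoring device connectivity, so the resulting depth is only a lower bound (the length of the longest dependency chain). The Toffoli gate order is an assumed standard decomposition, and `asap_cycles` is an illustrative name.

```python
# Assumed Toffoli decomposition, each gate given by its logical qubits only.
TOFFOLI = [
    (2,), (1, 2), (2,), (0, 2), (2,), (1, 2), (2,), (0, 2),
    (1,), (2,), (2,), (0, 1), (0,), (1,), (0, 1),
]


def asap_cycles(circuit):
    """Earliest cycle per gate: one after the latest gate on any shared qubit."""
    last_cycle = {}  # qubit -> cycle of the most recent gate on it
    cycles = []
    for gate_qubits in circuit:
        t = 1 + max(last_cycle.get(q, 0) for q in gate_qubits)
        for q in gate_qubits:
            last_cycle[q] = t
        cycles.append(t)
    return cycles


cycles = asap_cycles(TOFFOLI)
print(cycles)
print("longest dependency chain:", max(cycles))
```

Any schedule on a real device graph can only match or exceed this depth, because inserted SWAP gates add to the dependency chains.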
A valid but not necessarily optimal gate scheduling example is given in Fig. 3b. The time coordinates for all the gates are displayed at the bottom, e.g., t 1 = 1 and t 15 = 13. The space coordinates can be inferred from the mapping, e.g., x 1 = p 1 , x 2 = (p 3 , p 1 ), x 14 = (p 0 , p 1 ), and x 15 = p 3 . There is an injective map from the original gates to the scheduled gates: f (i) = i for i = 1 to 10 and f (i) = i + 3 for i = 11 to 15, such that g i = g̃ f (i) for i = 1 to 15. The three CX gates in the dashed box, g̃ 11 , g̃ 12 , and g̃ 13 , constitute a SWAP gate. The adjusted mapping is shown after them. The SWAP gate is inserted so that g̃ 14 and g̃ 18 are on connected qubits p 0 and p 1 , and thus executable.
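That three CX gates with alternating control and target implement a SWAP can be verified directly on 4x4 permutation matrices. The sketch below uses plain Python lists and assumes the basis order |00⟩, |01⟩, |10⟩, |11⟩ for the qubit pair (q1, q0); the names `CX_10`, `CX_01`, and `matmul` are illustrative.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# CX with control q1 and target q0: exchanges |10> and |11>.
CX_10 = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 0]]

# CX with control q0 and target q1: exchanges |01> and |11>.
CX_01 = [[1, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 0],
         [0, 1, 0, 0]]

# SWAP exchanges |01> and |10>.
SWAP = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]

product = matmul(matmul(CX_10, CX_01), CX_10)
print(product == SWAP)
```

Since the three CX gates act on the same pair of qubits, each inserted SWAP adds three cycles to the dependency chains through that pair.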
Formal Definition of (Depth-Optimal) Layout Synthesis Problem in Quantum Computing
Input A device graph G = (P, E) and a list of quantum gates g 1 ...g M acting on the logical qubit set Q. All the input gates are in the gate set of the device, e.g., a set from Table 1. The number of logical qubits is at most the number of physical qubits, i.e., |Q| ≤ |P |.
Output An initial mapping µ 0 : Q → P , and a scheduled quantum circuit consisting of a new list of gates g̃ 1 ... g̃ M̃ , including SWAP gates, where each gate has a spacetime coordinate (t j , x j ). From here on, we use a tilde to denote that a gate is scheduled.
Constraints
• Feasible two-qubit gates: all the two-qubit gates in the scheduled circuit must be on two qubits connected in the device graph. Formally, for j = 1 to M̃ , if |g̃ j | = 2, then x j ∈ E.
• Executing all gates: all input gates should be executed. Formally, there is an injective map f : {1, ..., M } → {1, ..., M̃ } such that g i = g̃ f (i) for i = 1 to M .
Objective Minimizing circuit depth T , which is the maximal time coordinate of all the scheduled gates, i.e., T ≡ max j=1,...,M̃ t j .
In this paper, we use depth as the default objective, but other objectives can be used as well, e.g., the number of additional gates M̃ − M , or the fidelity of the scheduled circuit. The output and the constraints of the problem are independent of the objective. However, with other objectives like fidelity, more input information may be required.
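A partial checker for the formulation above can be sketched as follows. It is a simplification under stated assumptions: it verifies only that two-qubit gates sit on device edges and that no physical qubit is used twice in the same cycle, then returns the depth T. It does not track how SWAP gates change the mapping, nor the injective map from input gates to scheduled gates. The Ourense edge list and the name `check_schedule` are assumptions for illustration.

```python
# Assumed device graph for Ourense, inferred from the examples in the text.
OURENSE_EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]


def check_schedule(scheduled, edges):
    """Validate spacetime coordinates (t, x) and return the circuit depth.

    x is an int (physical qubit) for a single-qubit gate,
    or a pair of ints (device edge) for a two-qubit gate.
    """
    edge_set = {frozenset(e) for e in edges}
    busy = set()  # (cycle, physical qubit) slots already taken
    for t, x in scheduled:
        physical = (x,) if isinstance(x, int) else tuple(x)
        if len(physical) == 2 and frozenset(physical) not in edge_set:
            raise ValueError(f"gate at cycle {t} is not on a device edge: {x}")
        for p in physical:
            if (t, p) in busy:
                raise ValueError(f"qubit {p} is used twice in cycle {t}")
            busy.add((t, p))
    return max(t for t, _ in scheduled)


# A three-gate chain on p0-p1, p1, p1-p2, one gate per cycle.
chain = [(1, (0, 1)), (2, 1), (3, (1, 2))]
print(check_schedule(chain, OURENSE_EDGES))
```

A full validator would additionally replay the inserted SWAP gates to recover the logical-to-physical mapping at every cycle and compare the result with the input gate order.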
Related Work
In the most general sense, the task of QC layout synthesis is generating a quantum circuit that satisfies QC device constraints and fulfills the functionality of the input circuit. Related works on this problem include [4, . These works may have some variations on the problem in mind. [10, 12-16, 25, 33] focus on multidimensional array device graphs (linear array for 1D, grid for 2D, and so on). [29] focuses specifically on SU(4) circuits and includes postsynthesis optimization. [35] focuses on adjusting the mapping after synthesis to improve fidelity. [20] considers some commutation rules. [26] [27] [28] consider the scheduling of QAOA circuits. The order of some two-qubit gates in QAOA circuit can be exchanged even if there are dependencies, since they commute, which is not applicable to general quantum circuits.
The produced quantum circuit should be not only executable, but also efficient. The efficiency can be measured with different metrics. The metric can be the additional "cost", which is usually proportional to the number of additional gates [4, 11-23, 25, 29]; or circuit depth, as in this paper, since the qubits can only function well within the decoherence time [10, 24, 26-28]; or circuit fidelity, since nowadays a common practice is executing a circuit multiple times and analysing the statistics of the results [31, 33-35]; or a mix of the above [30, 32].
Detailed discussions of the complexity of QC layout synthesis can be found in Sec. 7. These discussions indicate that large scale instances cannot be solved both exactly and efficiently. From the perspective of solution techniques, the current works can be divided into two categories. The first group focuses on deriving the exact solution for moderate-sized instances with the help of solvers [14, 17, 21, 24-28, 33, 34] . [17, 25] use a PBO (pseudo Boolean optimizer) to decide the SWAP insertion but do not explicitly schedule the gates. The same goes for [14, 21] , which use a MIP (mixed integer programming) solver and a SMT (satisfiability modulo theories) solver correspondingly. [26] [27] [28] use a temporal planner to schedule specifically QAOA circuits. The closest previous works concerning this paper are [24, 33, 34] . However, [33, 34] use a SMT solver to maximize fidelity. [24] splits circuits into "levels" and inserts gates to transform the mapping between the levels. This model of quantum circuit may not yield an optimal solution. Under this imperfect "levels" model, [24] aims to derive a depth-optimal solution with integer linear programming.
The second group of related works uses heuristic search techniques [4, 10-13, 15, 16, 18-20, 22, 23, 29-32]. We only discuss the works targeting general device layouts below [4, 11, 18-20, 22, 23, 29-32]. The general approach is splitting the circuit into small sub-circuits for which the layout synthesis can be done efficiently, and then searching for the mapping transformation between these sub-circuits. A sub-circuit can be a "level" or "layer" mentioned in the last paragraph [4, 19, 22, 23, 30-32], a set of several levels [11], several levels but for a few specific qubits [29], or individual gates [18, 20]. In order to find the mapping transformation, [18] inserts SWAP gates to move the two qubits that are required by the next two-qubit gate along the shortest path; [20, 23, 30] also consider distances between qubits of further two-qubit gates; [31] additionally considers fidelity in the qubit movements; [4, 29] use the sum-of-distances plus the number of SWAP gates as the cost function in A* search; [32] uses bidirectional search; [19] uses a 4-approximation algorithm; [22] exploits some existing approximate solution of token swapping; [11] recursively considers SWAP gates as cuts in the device graph.
The complexity also brings difficulty to the evaluation of these solutions. Currently, the benchmarks usually are quantum circuit libraries of some realistic functions, e.g., [9] and RevLib [36] , or certain random circuits, which are thought to be the worst-case scenario, e.g., SU(4) circuits [29] . So far, researchers can only compare against each other, but do not know how far they are from the optimum. This paper aims to fill in this gap.
The QC layout synthesis problem is still quite new to compiler and design automation communities, so the name of the problem varies. It can be placement [11, 14] , routing [23] , compiling quantum circuits [20, [26] [27] [28] [29] 33, 34] , quantum circuit transformation [19] , mapping circuits to QC architectures [4, 17, 21, 24, 25, 30, 32] , conversion [12] or optimization [13] of circuits in QC architecture, realization of quantum circuits [15, 16] , or qubit allocation [18, 22, 31, 35] .
QUEKO Benchmarks
This work is inspired by PEKO [5], placement examples with known optimal. Placement is a crucial step in classical integrated circuit design, where modules are placed on a chip with the objective of minimizing total wirelength. Although this problem is NP-hard, the PEKO algorithm is able to generate benchmarks with known optimal solutions.
Similarly, for a generic input quantum circuit and a generic device graph, finding the scheduled circuit with optimal depth is NP-complete, which will be proved in Sec. 7. However, it is feasible to construct some benchmarks with known optimal solutions. Given a target device graph G and a target depth T , we can construct a depth-optimal circuit. Then, by re-labelling the qubits, we derive a QUEKO benchmark.
Additionally, QUEKO can be customized for a given feature: the gate density vector (d 1 , d 2 ). The two components intuitively stand for the densities of single-qubit gates and two-qubit gates in the whole circuit. Suppose a circuit has n logical qubits, M 1 single-qubit gates, M 2 two-qubit gates, and a longest dependency chain of length l; then d 1 = M 1 /(n · l) and d 2 = 2M 2 /(n · l). For example, the Toffoli circuit in Fig. 1b has n = 3, M 1 = 9, and M 2 = 6, and its gate density vector is approximately (0.27, 0.36).
The construction of QUEKO, as shown in Algorithm 1, starts with checking the validity of the input data by calculating the numbers of single-qubit and two-qubit gates M 1 and M 2 . If M 1 + M 2 < T , then there would be too few gates to generate a circuit with depth T ; if M 1 + 2M 2 > N · T , then there would be too many gates for the given depth and device graph. We define the matching bound u of a graph G to be the minimal size of the maximal matchings of G. This means we can find at least u edges in G that pairwise share no vertices. If M 2 > u · T , then there could be too many two-qubit gates for the given depth and device graph. In short, if M 1 + M 2 < T , M 1 + 2M 2 > N · T , or M 2 > u · T , we return an error to reject the input data. Otherwise, we proceed to three phases: backbone construction, sprinkling, and scrambling.
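The validity check can be sketched as below. Two points are assumptions rather than statements from the text: the gate counts are obtained by inverting the density definition with a ceiling, M1 = ⌈d1·N·T⌉ and M2 = ⌈d2·N·T/2⌉, and the matching bound is found by brute force, which is only practical for small device graphs. All function names are illustrative.

```python
from itertools import combinations
from math import ceil


def gate_counts(d1, d2, N, T):
    """Assumed inversion of the density definition, rounded up."""
    return ceil(d1 * N * T), ceil(d2 * N * T / 2)


def matching_bound(edges):
    """Smallest size of a maximal matching, by brute force (small graphs only)."""
    edges = [frozenset(e) for e in edges]
    for size in range(1, len(edges) + 1):
        for subset in combinations(edges, size):
            used = set().union(*subset)
            is_matching = len(used) == 2 * size       # chosen edges are disjoint
            is_maximal = all(e & used for e in edges)  # no edge can be added
            if is_matching and is_maximal:
                return size
    return 0


def admissible(M1, M2, N, T, u):
    """The three rejection conditions described in the text."""
    return not (M1 + M2 < T or M1 + 2 * M2 > N * T or M2 > u * T)


# Densities (0.27, 0.36) on a 16-qubit device at depth 5.
print(gate_counts(0.27, 0.36, 16, 5))

# Assumed Ourense graph: the single edge p1-p3 is already a maximal matching.
print(matching_bound([(0, 1), (1, 2), (1, 3), (3, 4)]))
print(admissible(M1=3, M2=3, N=5, T=3, u=1))
```

The first call yields 22 single-qubit and 15 two-qubit gates, 37 in total, which agrees with the smallest benchmark size reported in Sec. 6.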
Backbone Construction Phase
This phase "grows" a sequence of T gates, each depending on the previous one, constituting a dependency chain of length T . This chain serves as the "backbone" of the circuit. For example, we start from the device graph as Fig. 4a (which is just Fig. 2a rotated) , and pick three executable gatesg 1 ,g 2 , andg 3 whose spacetime coordinates are (1, (p 0 , p 1 )), (2, p 1 ), (3, (p 1 , p 2 )). They constitute a dependency chain of length T = 3, since all of them act on p 1 . This is shown in Fig. 4b , where gates at different cycles are put on different "slices" from left to right. The "backbone" is colored green.
To be more rigorous, we first choose a random node or edge of G as x 1 . In every iteration afterwards, we randomly choose x k that overlaps with x k−1 . Thus, x k ∩ x k−1 ≠ ∅, which enforces t k > t k−1 = k − 1 by the dependency constraint. On the other hand, since g̃ k is executable, it takes only a single cycle, i.e., the optimal t k = t k−1 + 1 = k. The gate sequence g̃ 1 , g̃ 2 , ..., g̃ T constitutes a dependency chain of length T . Because of this "backbone", the final depth of the scheduled circuit cannot be lower than T . Note that we do not need to use any SWAP gates for backbone construction.
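Algorithm 1 QUEKO construction
Input: a device graph G = (P, E) with |P | = N and its matching bound u, a depth target T , and a gate density vector (d 1 , d 2 )
Output: QUEKO benchmark g 1 g 2 ...g M1+M2 , where M 1 and M 2 are the numbers of single-/two-qubit gates
M 1 ← ⌈d 1 · N · T ⌉, M 2 ← ⌈d 2 · N · T /2⌉
if M 1 + M 2 < T or M 1 + 2M 2 > N · T or M 2 > u · T then
    return error: input data not admissible
end if
m 1 ← 0, m 2 ← 0 // how many single-qubit gates and two-qubit gates we have used
// Backbone construction phase
for i = 1 to T do
    j ← rand({1, 2}) // randomly decide single-qubit or two-qubit gate
    if j = 2 and m 2 < M 2 then
        x i ← rand(E)
        while i > 1 and x i ∩ x i−1 = ∅ do
            x i ← rand(E)
        end while
        t i ← i, m 2 ← m 2 + 1
    else
        x i ← rand(P )
        while i > 1 and x i ∩ x i−1 = ∅ do
            x i ← rand(P )
        end while
        t i ← i, m 1 ← m 1 + 1
    end if
end for
// Sprinkling phase
for i = T + 1 to M 1 + M 2 do
    if m 2 < M 2 then
        (t i , x i ) ← rand({1, 2, ..., T } × E)
        while ∃l ∈ {1, ..., i − 1} such that t i = t l and x i ∩ x l ≠ ∅ do
            (t i , x i ) ← rand({1, 2, ..., T } × E)
        end while
        m 2 ← m 2 + 1
    else
        (t i , x i ) ← rand({1, 2, ..., T } × P )
        while ∃l ∈ {1, ..., i − 1} such that t i = t l and x i ∩ x l ≠ ∅ do
            (t i , x i ) ← rand({1, 2, ..., T } × P )
        end while
        m 1 ← m 1 + 1
    end if
end for
// Scrambling phase
τ ← a random mapping from P to Q
for i = 1 to M 1 + M 2 do
    if x i ∈ E then
        g i ← two-qubit gate on the two logical qubits τ (x i )
    else
        g i ← single-qubit gate on the logical qubit τ (x i )
    end if
end for
// Output
sort g i according to t i , i = 1 to M 1 + M 2
return g 1 g 2 ...g M1+M2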
Sprinkling Phase
The backbone construction phase uses T gates in total. We then randomly "sprinkle" the remaining M 1 + M 2 − T gates, e.g., g̃ 4 , g̃ 5 , g̃ 6 , g̃ 7 shown in Fig. 4c. We randomly select spacetime coordinates (t i , x i ) that do not overlap with any existing gate at time coordinate t i . After sprinkling, a circuit with gates g̃ 1 ... g̃ M1+M2 is created. Its gates are all executable; its depth is T ; its gate density vector approximates (d 1 , d 2 ). (There could be minor rounding errors in the ceiling function.)
It is worth noting that although only one longest dependency chain is explicitly generated in the backbone construction phase, the sprinkling phase may implicitly generate more. For example, g̃ 4 depends on g̃ 7 ; if we "sprinkle" a gate on p 4 at cycle 3, then another dependency chain of length 3 would exist in the output circuit. The higher the gate densities, the more likely it is that these implicit longest dependency chains are generated.
Scrambling Phase
As shown in Fig. 4d, we generate a random mapping τ from physical qubits to logical qubits and apply τ to the space coordinates of g̃ 1 g̃ 2 ... g̃ M1+M2 . For instance, x 1 = (p 0 , p 1 ), so the resulting gate g 1 is a two-qubit gate on logical qubits τ (p 0 ) = q 0 and τ (p 1 ) = q 2 ; x 7 = (p 3 , p 4 ), so g 7 is a two-qubit gate on τ (p 3 ) = q 1 and τ (p 4 ) = q 3 ; g 6 is a single-qubit gate on τ (p 2 ) = q 4 ; and so on. The specific types of single-qubit gates and two-qubit gates are not important, since QUEKO is only for layout synthesis, not for circuit optimization. We use X as the single-qubit gate and CX as the two-qubit gate.
Output
Sort the gates g 1 g 2 ...g M1+M2 according to the time coordinates to transfer the timing information originally in these time coordinates to the relative order inside the output gate list. The result is a QUEKO benchmark, as shown in Fig. 4e and Fig. 4f .
As we have proven, the depth of the output circuit is at least T because of the backbone. A QC layout synthesis tool can meet the optimum by finding the initial mapping that is the inverse of the scrambling mapping τ . Therefore, QUEKO circuits have known optimal depth T .
Note that QUEKO circuits also have known optimal gate count M 1 + M 2 . Since we assume that, in layout synthesis, all the input gates need to be executed, the result produced by the tools has at least as many gates as the QUEKO circuit. The optimal gate count is also met with the optimal initial mapping τ −1 , since no SWAP gates are needed in this case.
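The three phases can be put together in a short Python sketch. It is a simplified variant of the construction described in this section, not the released generator: instead of rejection sampling it enumerates the free spacetime slots and picks one at random, it switches gate type in the backbone when one type is exhausted, and it omits the matching-bound check. The function names, the `seed` parameter, and the Ourense edge list are assumptions made for illustration.

```python
import random


def queko(edges, num_qubits, T, M1, M2, seed=0):
    """Return (gates, tau): logical gates in time order and the scrambling map."""
    if M1 + M2 < T or M1 + 2 * M2 > num_qubits * T:
        raise ValueError("input data not admissible")
    rng = random.Random(seed)
    pools = {1: [(p,) for p in range(num_qubits)], 2: [tuple(e) for e in edges]}
    limit, count = {1: M1, 2: M2}, {1: 0, 2: 0}
    placed, busy = [], set()  # placed: (t, x); busy: (t, physical qubit)

    def place(t, x):
        placed.append((t, x))
        busy.update((t, p) for p in x)
        count[len(x)] += 1

    # Backbone: one gate per cycle, each overlapping the previous one.
    prev = None
    for t in range(1, T + 1):
        size = rng.choice((1, 2))
        if count[size] >= limit[size]:
            size = 3 - size  # that gate type is used up, take the other
        options = [x for x in pools[size] if prev is None or set(x) & set(prev)]
        if not options:
            raise ValueError("no gate of the chosen type overlaps the backbone")
        prev = rng.choice(options)
        place(t, prev)

    # Sprinkling: fill random free slots, two-qubit gates first.
    for size in (2, 1):
        while count[size] < limit[size]:
            free = [(t, x) for t in range(1, T + 1) for x in pools[size]
                    if all((t, p) not in busy for p in x)]
            if not free:
                raise ValueError("no free slot left for the remaining gates")
            place(*rng.choice(free))

    # Scrambling: random relabelling tau from physical to logical qubits.
    labels = list(range(num_qubits))
    rng.shuffle(labels)
    tau = dict(enumerate(labels))

    placed.sort(key=lambda item: item[0])  # time order becomes list order
    return [tuple(tau[p] for p in x) for _, x in placed], tau


def depth(gates):
    """Longest dependency chain of a gate list (as-soon-as-possible depth)."""
    last = {}
    for gate_qubits in gates:
        t = 1 + max(last.get(q, 0) for q in gate_qubits)
        for q in gate_qubits:
            last[q] = t
    return max(last.values())


OURENSE = [(0, 1), (1, 2), (1, 3), (3, 4)]  # assumed edge list
gates, tau = queko(OURENSE, 5, T=4, M1=5, M2=3)
print(len(gates), depth(gates))
```

Applying the inverse of `tau` to the output places every two-qubit gate back on a device edge, which is the SWAP-free optimal solution discussed above.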
Experiment
Experimental Setup
To evaluate QC layout synthesis tools with QUEKO, device graphs, depths and sizes, and gate density vectors are required. We specify the choice of these parameters and the choice of tools to evaluate in this subsection. All the experiments were run on an Ubuntu 16.04 server with two Intel Xeon E5-2699v3 CPUs and 128 GB of main memory. The QUEKO benchmarks are made open source 5 under the BSD license.
Device Graph
We used representative devices from three different QC hardware providers: Sycamore from Google [1], Tokyo and Rochester from IBM 6 , and Aspen-4 from Rigetti 7 . The graphs of these devices are shown in Fig. 2. Sycamore has 54 qubits, of which 53 are active; Rochester also has 53 qubits. Both of them are state-of-the-art devices, but Sycamore has richer connectivity. Aspen-4 has 16 qubits, and Tokyo has 20 qubits. They are both highly competitive devices, but Tokyo has greater connectivity.
Also, we have only listed superconducting devices because they are by far the most advanced QC devices. This does not mean that QUEKO cannot generalize to other technologies such as quantum dot 8 because our approach is valid as long as the basic quantum gates of this technology are single-qubit and two-qubit gates. 
Depth and Size
We constructed two sets of benchmarks with different depth ranges. The corresponding sizes of these benchmarks can be deduced from the depth and the gate density vector, as shown in Algorithm 1. The first set has depths from 5 to 45 and constitutes the near-term feasible benchmarks (B NTF ). In fact, one of the largest quantum circuits executed nowadays has depth 41 [1], which is about the same as the upper bound of B NTF . We intended to find out the layout synthesis performance within the current execution capacity. The sizes of B NTF benchmarks range from 37 to 1727 quantum gates. The second set of benchmarks, denoted as B SS , has depths from 100 to 900 and is intended for scaling study. B SS represents the performance of these tools when the decoherence time of QC devices improves in the future. The sizes of B SS benchmarks range from 1136 to 34506 quantum gates.
Gate Density Vector
We picked two special gate density vectors in the experiment: (0.51, 0.4) based on the quantum circuits used in Google's quantum supremacy experiment [1] , denoted "QSE" below, and (0.27, 0.36) based on the Toffoli circuit, denoted "TFL" below. It is beneficial to study QSE, since it is the only circuit so far with which experimental QC has shown a clear advantage. We chose the TFL because existing QC logic synthesis algorithms are based largely on reversible logic synthesis, which uses TFL as a fundamental element [7] . We also swept through possible gate density vectors and generated benchmarks for impact of gate density (B IGD ).
Layout Synthesis Tools
Currently, Google, IBM, and Rigetti are considered front-runners of superconducting QC. Inside their QC frameworks (Cirq, Qiskit, and pyQuil), there are tools for layout synthesis. Unfortunately, pyQuil does not provide options to break down the whole compilation into optimization and layout synthesis, so pyQuil was excluded from the experiments. We also included a recent academic work from Zulehner et al. [4].
We used the greedy router in Cirq version 0.6.0 as one of the layout synthesis tools, as shown in Fig. 5. So far, only one router, named greedy, has been released; it contains an initial placement policy and a SWAP insertion policy based on heuristic search. Note that the greedy router does not transform the gates into gates that are implementable on Google devices in Table 1, so the resulting circuit contains the original input gates and the SWAP gates it inserts. For a fair comparison of depth, in all the experiments, we decompose SWAP gates inserted by the tools into three CX gates, as in Fig. 3b.
Qiskit offers the most precise control over the so-called "transpiler". The transpilation is divided into individual passes, and users can define their own "pass manager" to make use of the various transpiling modules that are offered. For the layout synthesis problem, there are Layout modules generating the initial mapping and Swap modules inserting SWAP gates into the circuit to enable two-qubit gates. Among the various combinations, we chose DenseLayout and StochasticSwap, as shown in Fig. 5, which seemed to have the best overall performance. DenseLayout maps the logical qubits to an area on the device graph with dense connections. StochasticSwap perturbs the distance matrix of physical qubits and performs heuristic search for SWAP gates. The version of Qiskit in the experiments is 0.14.1.
Another highly competitive router, t|ket⟩, comes from Cambridge Quantum Computing. Graph Placement uses graph monomorphism to derive the initial mapping. Route performs heuristic search for SWAP gates. We used t|ket⟩ version 0.4.1 in the experiments.
Since all the tools evaluated use heuristics at some stages, sub-optimality is expected. Note that we only use default setup on all the modules in all the tools. Changing setup parameters in some of these modules specifically for QUEKO benchmarks may lead to better performance in the following experiments, but may lead to worse performance on other circuits.
Experimental Results
Performance on B NTF
In Fig. 6, the horizontal axis is the optimal depth and the vertical axis is the depth ratio, which is the depth of the layout synthesis result divided by the optimal depth T . In the case of a smaller device (Aspen-4) and sparser circuits (TFL), the optimality gap on average is about 12x for [4], 10x for Cirq, 5x for Qiskit, and 1.5x for t|ket⟩. In the case of a larger device (Sycamore) and denser circuits (QSE), the optimality gap on average is about 11x to 14x for Qiskit. The optimality gaps of Cirq and t|ket⟩ grow with depth, from 35x to 50x and from 1x to 5x, respectively. Zulehner et al. [4] is not in Fig. 6b because, for the larger device, it took so much memory that the operating system constantly killed it before it finished. This also happened sometimes in the smaller device experiments, so there are fewer blue data points than other types of points in Fig. 6a.
Performance on B SS
We studied further scaling of t|ket⟩ and Qiskit on different devices, as shown in Fig. 7. The optimality gaps on average by t|ket⟩ are about 5x to 7x for Rochester, 3x to 4x for Sycamore, 3x for Tokyo, and 2x for Aspen-4. Note that for a fixed depth and a fixed device, the optimality gaps by t|ket⟩ vary rather widely. t|ket⟩ managed to find the optimal mappings for some QUEKO benchmarks. In general, as the depth increases, the depth ratio by Qiskit decreases at first and then converges to a value. The reason for this phenomenon may be that as the circuit deepens, the influence of initial placement gets smaller than the influence of SWAP insertion. It can be seen that larger devices (Rochester and Sycamore) bring about larger optimality gaps. If the numbers of physical qubits are close, then richer connectivity (Sycamore versus Rochester) brings about smaller optimality gaps.
Performance on B IGD
To better understand the impact of gate density on layout synthesis performance, we fixed the device to Tokyo and the depth to 45, and swept through possible gate densities. The results are shown in Fig. 8. Within a fixed column, the single-qubit gate density increases as we go down. Qiskit seems rather insensitive to this change, which is sensible since single-qubit gates do not add difficulty to layout synthesis. However, t|ket is still sensitive to this change. Both tools are more sensitive to changes in the horizontal direction than in the vertical direction. Since the challenge of layout synthesis comes mainly, if not solely, from the two-qubit gates, this result is expected. The depth ratio of t|ket decreases when the two-qubit gate density is very high. This is because, when the circuit is dense with two-qubit gates, graph monomorphism algorithms can extract more information from the first few layers to narrow down better initial mappings.
Complexity
Given the large optimality gaps, it is natural to investigate the computational complexity of the depth-optimal QC layout synthesis problem, which was previously unknown. Several related results are known. For example, determining the minimal number of SWAP gates to insert is NP-complete, which [18] proves by reduction from the subgraph isomorphism problem. [11] proves the NP-completeness of depth-optimal initial placement where SWAP gates are not allowed. The NP-completeness of depth-optimal QC layout synthesis for QAOA circuits is proven in [37] by reduction from 3-SAT. In this section, we prove NP-completeness for general quantum circuits by reduction from the Hamiltonian cycle problem, as Theorem 1 states.
Theorem 1. Depth-optimal QC layout synthesis is NP-complete.
Proof. The original depth-optimal QC layout synthesis problem is no easier than its decision version, in which the input and constraints remain the same but the output is whether the depth of the scheduled circuit can be less than or equal to $T$. Inspired by [11], we show that the problem of determining whether a Hamiltonian cycle exists in a graph is reducible to the depth-decision QC layout synthesis problem. The former is NP-complete, so the latter is also NP-complete.
Suppose the graph for the Hamiltonian cycle problem is $G_H = (V_H, E_H)$, where $|V_H| = N$. We construct a depth-decision QC layout synthesis problem using $G_H$ as the device graph and $N$ as the target depth. The input circuit consists of $N$ "levels" of gates. Level $l$ contains a two-qubit gate $g_l = \mathrm{CX}(q_l, q_{(l+1) \bmod N})$. All the other logical qubits at level $l$ are fully occupied by single-qubit gates.
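The constructed circuit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: qubits and levels are indexed from 0 so that the modular index is always valid, the gate name "U" is a placeholder for an arbitrary single-qubit gate, and the function names are hypothetical.

```python
def build_reduction_circuit(n):
    """Return n levels of gates for a device graph with n vertices.

    Level l holds one CX on (q_l, q_{(l+1) mod n}); every other qubit
    gets a placeholder single-qubit gate "U" so that all qubits are busy.
    """
    levels = []
    for l in range(n):
        a, b = l, (l + 1) % n
        gates = [("CX", a, b)]
        gates += [("U", q) for q in range(n) if q not in (a, b)]
        levels.append(gates)
    return levels


def circuit_depth(levels, n):
    """Longest dependency chain, found by as-soon-as-possible scheduling."""
    last_cycle = [0] * n  # last occupied cycle of each qubit
    for gates in levels:
        for gate in gates:
            qubits = gate[1:]
            cycle = max(last_cycle[q] for q in qubits) + 1
            for q in qubits:
                last_cycle[q] = cycle
    return max(last_cycle)


levels = build_reduction_circuit(5)
print(circuit_depth(levels, 5))
```

Because every qubit is occupied at every level, the depth of the circuit before any SWAP insertion equals the number of levels, which is the target depth of the decision problem.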
If there exists a Hamiltonian cycle $(p_1, p_2, \ldots, p_N, p_1)$ in $G_H$, then let the initial mapping be $\mu_0 : q_i \mapsto p_i$ for $i = 1, \ldots, N$. It is easy to see that, with this mapping, all the gates in the constructed circuit can be executed.
On the other hand, if there exists a scheduled circuit with depth at most $N$, we first claim that the mapping cannot change during the execution of the circuit. Every gate in a level depends on some gate in the previous level, so every gate in level $l$ has a dependency chain of length $l$, and cycle $l$ is the earliest cycle in which it can be scheduled. This means that if any SWAP gate is inserted during gate scheduling, some dependency chain must lengthen, and the depth of the scheduled circuit becomes larger than $N$. Therefore, if a solution within $N$ cycles exists, each gate in level $l$ must be scheduled at exactly cycle $l$, and no SWAP gates can be inserted. It is also easy to see that the input CX gates cannot constitute any SWAP gates. Therefore, the mapping from logical to physical qubits remains $\mu_0$ throughout all the cycles in the scheduled circuit. Since the gates $\mathrm{CX}(q_l, q_{(l+1) \bmod N})$ are all executable, they are mapped to edges of $G_H$. This means that $\mu_0(q_1), \mu_0(q_2), \ldots, \mu_0(q_N), \mu_0(q_1)$ is a Hamiltonian cycle in $G_H$.
In conclusion, we have established the equivalence between the existence of a Hamiltonian cycle in $G_H$ and the existence of a QC layout synthesis solution within depth $N$. Thus, the Hamiltonian cycle problem is reducible to the QC layout synthesis problem. The latter is NP-complete, since the former is known to be NP-complete.
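The last step of the argument, reading a Hamiltonian cycle off a fixed mapping, can be checked by brute force on small graphs. The sketch below is illustrative only: it assumes 0-based indices, at least three vertices, and a fixed mapping `mu` given as a tuple with `mu[q]` the physical qubit of logical qubit `q`. The function names and the two example graphs are hypothetical.

```python
from itertools import permutations


def cx_pairs(n):
    """Logical qubit pairs of the CX gates, one per level."""
    return [(l, (l + 1) % n) for l in range(n)]


def cycle_from_mapping(mu, edges):
    """Return the cycle induced by mu if every CX lands on a device edge, else None."""
    n = len(mu)
    device = {frozenset(e) for e in edges}
    for a, b in cx_pairs(n):
        if frozenset((mu[a], mu[b])) not in device:
            return None
    return [mu[q] for q in range(n)] + [mu[0]]


def find_swap_free_mapping(n, edges):
    """Exhaustively search for a fixed mapping that needs no SWAP gates."""
    for mu in permutations(range(n)):
        if cycle_from_mapping(mu, edges) is not None:
            return mu
    return None


square = [(0, 1), (1, 2), (2, 3), (3, 0)]  # contains a Hamiltonian cycle
star = [(0, 1), (0, 2), (0, 3)]            # contains no cycle at all
print(cycle_from_mapping((0, 1, 2, 3), square))
print(find_swap_free_mapping(4, star))
```

The exhaustive search takes factorial time and serves only to illustrate the equivalence on toy instances.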
Conclusion and Future Work
In this paper, we formulated the problem of quantum computing layout synthesis and proved its NP-completeness. We constructed QUEKO benchmarks, each of which has a known optimal depth for the given device. With QUEKO, we examined four existing quantum computing layout synthesis tools: the greedy router in Cirq, t|ket, DenseLayout plus StochasticSwap in Qiskit, and Zulehner et al. [4]. The results are rather surprising. Despite over ten years of research and development efforts by both academia and industry, the current QC compilation flow is far from optimal. In fact, even combining the best performances of all tools evaluated, the optimality gaps range from 1.5x to 45x on average for circuits of feasible depth on existing devices. These gaps reveal substantial room for research into QC layout synthesis, potentially equivalent to an order-of-magnitude improvement in decoherence time, which would require much higher investment in quantum device technologies to achieve. We plan to use QUEKO benchmarks as a guide toward better layout synthesis tools. In addition, we plan to extend our research to construct examples with known optimal solutions for fidelity optimization, as multiple studies have shown that fidelity is a very important metric for quantum circuits in the NISQ era.
