As the effort to scale up existing quantum hardware proceeds, effective methods to schedule quantum gates and minimize the number of operations become increasingly necessary. Three constraints have to be taken into account: The order or dependency of the quantum gates in the specific algorithm, the fact that any qubit may be involved in at most one gate at a time, and the restriction that two-qubit gates are implementable only between connected qubits. The last aspect implies that the compilation depends not only on the algorithm, but also on hardware properties like its connectivity.
I. INTRODUCTION
Modern computers rely on optimized instruction scheduling to take full advantage of the computing capability of microchips. Harnessing more parallelism, which can be achieved by scheduling operations on multiple compute units at the same time, greatly contributes to realizing high performance in compute-intensive applications in various domains such as scientific computing, big data analytics, and machine learning. Compared to this mature field of classical computing, quantum computing is an area of active research that has only recently moved into technological relevance [1] [2] [3] [4] [5] [6]. While providing the list of (logical) instructions for machines with only a handful of qubits is a relatively simple task, for larger machines the scheduling problem needs to be addressed with systematic methods. While pioneering works have already addressed the synthesis and compilation of quantum circuits either as a part of extensive software frameworks [7] [8] [9] or as an independent problem [10] [11] [12] [13], the field is vastly unexplored.
Here we consider the three main constraints that a scheduler for quantum algorithms has to take into account. The first one is due to logical dependencies, i.e. constraints in the order of operations inherent in the algorithm. The other two are due to hardware constraints: No qubit can be involved in more than one gate at the same time and two-qubit gates can be implemented only between qubits that are physically connected. This last constraint, in particular, implies that routing operations are required to perform the logical gates described in the quantum algorithm, with larger overhead for less connected hardware. We propose a two-step approach in which the logical gates are initially ordered according to the Logical Data Precedence Graph (LDPG). This step neglects the connectivity constraint and focuses on executing the logical quantum gates with as much parallelism as possible. The necessary routing operations are added in the second phase, and their number is minimized by a heuristic that favors the gates that do not require routing operations.
In the following, we describe how the subtasks of gate scheduling can be recast in terms of graph problems such as edge-coloring and maximum subgraph isomorphism. The approach is illustrated by scheduling the Quantum Approximate Optimization Algorithm for hardware with linear connectivity, arguably the most limiting geometry that still allows the scaling of the hardware.
II. REQUIREMENTS FOR PRACTICAL QUANTUM CIRCUITS AND TWO-STEP APPROACH
Finding the optimal schedule, in both the classical and quantum case, is a hard problem that cannot be solved exactly with polynomial effort [14, 15]. At the same time, schedulers based on heuristic methods have effectively harnessed the parallelism in classical computers. It is reasonable to expect that a similar benefit is achievable by scheduling quantum algorithms.
In this Section, we describe three kinds of constraints in practical implementations of quantum algorithms. We propose a two-step approach to deal with these constraints as separate subproblems and show how they can be tackled by specialized schedulers.
According to the gate model of quantum computation, the information is stored in the quantum state of the qubit register. To manipulate such information in any conceivable way, it is sufficient to act with operations (called gates) that involve one or, at most, two qubits at a time. At the end of the computation, the information is retrieved by measuring the qubit register. Implementations of algorithms on specific hardware may differ based on which set of gates is available, on the number of gates required to decompose multi-qubit operations into gates acting on at most two qubits, and on how efficiently those gates can be error corrected. These aspects are of fundamental importance in the pre-compilation phase of quantum algorithms, but the focus of our work is on compilation tasks that still lie ahead.
In fact, the pre-compilation phase provides the sequence of single- and two-qubit logical gates that has to be performed in order to execute the algorithm. Once such a sequence is available, those operations have to be scheduled as a list of physical instructions. At this level, there are three types of constraints that have to be taken into account:
Logical dependency: Certain operations have to be performed after the completion of previous gates. The dependencies are explicitly provided by the quantum algorithm.
Exclusive activation: A qubit can only be involved in a single operation at a time. This also applies to quantum operations that formally commute: While their temporal order may not matter, they still cannot be executed at the same time if they share a qubit.
Physical connectivity: Two-qubit gates are possible only between qubits that are connected according to the hardware topology. In physical terms, a connection means that it is possible to control a strong enough interaction between the two qubits.
Notice that the exclusive activation constraint can be expressed differently for different hardware realizations. While we focus on the case described above, an alternative formulation may, for example, require that no connected qubits can be involved in different gates at the same time (as seems to be the case for the hardware under development at Google [16]).
We propose a two-step approach. In the first step the temporal order and exclusive activation constraints determine a "Logical Data Precedence Graph" (LDPG) that is used to assign a priority value to each and every gate. In the second step, the connectivity constraint determines the necessary routing operations that are added to complete the "Physical Data Precedence Table" (PDPT).
Using concepts and notation from the standard literature on schedulers [14, 17], the LDPG can be visualized as a graph $G = (N, E)$ in which a node $n \in N$ represents a gate and a directed edge $e = (n_i, n_j) \in E$ represents a dependency between the quantum operations (in this case, $n_j$ depends on $n_i$). Only explicit dependencies are shown with edges, meaning that, if gate $n_2$ has to follow $n_1$ and gate $n_3$ has to follow $n_2$, we do not explicitly indicate the transitive dependency that $n_3$ has to follow $n_1$.
Additional rules are required to properly take into account that certain quantum gates commute and, therefore, their order can be relaxed. In the next Section, we discuss the rules to create the LDPG and visualize the result in FIG. 2. From the LDPG, one can assign a priority value to each gate: The basic idea is that the priority of a certain gate increases if more gates logically depend on it. Large priorities indicate the gates that are along the critical path (i.e. the most time consuming sequence of logically dependent gates) and, therefore, should be scheduled to be executed as early as possible.
The PDPT is a table with as many rows as qubits in the hardware and one column for each sequential set of gates. The total number of columns corresponds to the circuit depth, and the attribute "physical" indicates that connectivity constraints are included. We fill the PDPT starting from the gates with the highest priority. If multiple gates have equal priority, we prioritize gates that satisfy the connectivity constraint under the current mapping between physical and logical qubits. If this is not possible, we propose to derive the order from the solution of the maximum subgraph isomorphism between the connectivity graph and the interaction graph (the latter describing the gates that need to be executed). Operations that exchange two connected qubits (also called SWAP gates in the following) are added at this time. Details are provided in Section IV.
III. CONSTRUCTION OF THE LOGICAL DATA PRECEDENCE GRAPH WITH PRIORITY VALUES
The input is a quantum circuit composed of one- and two-qubit gates belonging to the specified set of gates available in the hardware. To preserve generality, the set of gates must be capable of universal quantum computation. A preliminary step may be needed to associate the quantum algorithm with a circuit of such form: For example, multi-qubit gates or arbitrary single-qubit rotations must be decomposed in terms of the available gates. The output is the LDPG together with a priority value associated with each node (here representing a gate).
In the following we refer to two gates as "consecutive" if they are not separated by another operation and act on at least one common qubit. Notice, however, that quantum gates may commute, meaning that their order of execution has no effect on the combined operation. For this reason, the definition of consecutive gates must be generalized to include gates that appear separated by commuting operations but that can effectively be brought in direct sequence through gate reordering/commutation only. FIG. 1 clarifies this definition with an example.
FIG. 1. An example of a quantum circuit with three groups of pair-wise commuting gates (identified by the same color). Each line corresponds to a qubit, each box to an operation, and the execution order is from left to right. While it is clear that gates $n_1$ and $n_7$ are consecutive to gate $n_5$, it is less obvious that gate $n_8$ is also consecutive to $n_5$. This is due to the fact that $n_7$ commutes with $n_8$: Had we chosen another order in the picture, $n_5$ and $n_8$ would be in a direct sequence also visually. As a second example, gate $n_6$ is consecutive to gates $n_1$, $n_4$, $n_8$, $n_3$, and $n_9$.
We call "parents" (respectively "children") of a certain gate the consecutive and non-commuting gates that logically precede (respectively follow) the operation. In FIG. 1, gate $n_4$ has two parents $n_1$, $n_2$ and one child $n_9$. While $n_6$ and $n_7$ are consecutive gates for $n_4$, they commute with it and so they do not qualify for parent/child relationships.
Without consecutive gates that commute, each two-qubit operation has at most 2 parents and at most 2 children, while each single-qubit operation has at most one parent and one child. This property is not preserved with commuting gates. Referring to FIG. 1, gate $n_1$ has 4 children, namely $n_4$, $n_5$, $n_6$, and $n_8$.
The parent/child relationship can be used to construct the LDPG efficiently. Each gate depends on its parents and, conversely, its children depend on it. A directed edge is then drawn from each parent to the gate and from the gate to each of its children. For simplicity, quantum operations can be added as nodes of the LDPG starting from the leftmost gates (those that do not have parents). The graph has directed edges, no self-loops, and is acyclic. The nodes with only incoming edges and no outgoing edges (i.e. gates without children) are called leaves in graph theory, but sometimes we say that they belong to the "last generation" following the parent/child relationship.
FIG. 2. LDPG of the quantum circuit in FIG. 1. Gates with the same color are pairwise commuting. According to the rules explained in the main text, observe that gate $n_1$ is a parent of $n_8$ since $n_8$ could be reordered ahead of $n_4$ and $n_6$. In contrast, gate $n_1$ is not a parent of $n_7$ since gate $n_5$ separates them.
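The dependency rule above can be illustrated with a minimal Python sketch. It is a simplification rather than the construction used in this work: it records every pair of gates that share a qubit and do not commute, so it also contains the transitive edges that the LDPG omits (these extra edges do not change the priorities). The function name `build_dependencies`, the gate names, and the `commute` predicate are hypothetical.

```python
from itertools import combinations

def build_dependencies(gates, commute):
    """gates: list of (name, qubits) in circuit order (assumed input format).
    commute: predicate telling whether two gates commute.
    Returns a dict mapping each gate to the set of gates that must follow it."""
    successors = {name: set() for name, _ in gates}
    # combinations() preserves the circuit order, so `a` always precedes `b`.
    for (a, qa), (b, qb) in combinations(gates, 2):
        if set(qa) & set(qb) and not commute(a, b):
            successors[a].add(b)
    return successors

# Hypothetical circuit: two commuting ZZ rotations followed by an X rotation.
gates = [("zz01", (0, 1)), ("zz12", (1, 2)), ("x1", (1,))]
both_zz = lambda a, b: a.startswith("zz") and b.startswith("zz")
deps = build_dependencies(gates, both_zz)
```

In this example the two ZZ rotations share qubit 1 but commute, so neither depends on the other, while the X rotation depends on both.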
After an LDPG is built, priorities are assigned to each node in the graph. An effective strategy (common in scheduling algorithms for classical computers) is to define the priority as the latency-weighted depth of the node. The depth of a node $n_i$ corresponds to the maximum number of nodes traversed along any directed path from $n_i$ to any gate in the last generation (i.e. a leaf). Since not all gates require the same time or effort in actual implementation, we associate a latency that quantifies such effort. For example, the latency may represent the time required for the physical interaction to generate the specific quantum operation, but it may also depend on the fidelity of the quantum operation itself. Hereafter, we will think of the gate latency as the time required by the corresponding quantum operation and denote it as $t_i = \mathrm{latency}(n_i)$. Latencies must be positive, so that every gate has a larger priority than its children and a lower priority than its parents. In practice, the priority is a scalar value consistent with all logical dependencies.
The latency-weighted depth $p_i = \mathrm{priority}(n_i)$ is then computed according to:
$$p_i = t_i + \max_{n_j \in \mathrm{children}(n_i)} p_j , \qquad p_i = t_i \ \text{ when } n_i \in \text{leaves}.$$
Priorities can be efficiently computed by traversing the (directed and acyclic) graph in post-order, starting from the gates in the last generation ($p_j = t_j$ when $n_j \in$ leaves) and, for each node, adding its latency to the maximum of its children's priorities. A simple, but realistic, situation is obtained when all latencies are equal to one. This represents quantum circuits run in a synchronous way in which each and every gate takes a fixed amount of time. For the numerical study in Section V, we consider such a synchronous model together with the simple asynchronous case where $t_i = 1$ for one-qubit gates and $t_i = 2$ for two-qubit gates. See TABLE I and FIG. 3 for an illustration.
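As an illustration of the traversal just described, the following Python sketch computes the latency-weighted depth by memoized recursion, which visits children before their parents. The function name `priorities`, the dictionary-based graph format, and the small example graph are assumptions made for this sketch.

```python
from functools import lru_cache

def priorities(children, latency):
    """children: dict gate -> iterable of child gates (a DAG).
    latency: dict gate -> positive latency.
    Returns a dict gate -> latency-weighted depth."""
    @lru_cache(maxsize=None)
    def p(node):
        kids = children[node]
        # A leaf has priority equal to its own latency.
        return latency[node] + (max(p(k) for k in kids) if kids else 0)
    return {node: p(node) for node in children}

# Hypothetical LDPG: n1 -> n2 -> n3 and n1 -> n4.
children = {"n1": ("n2", "n4"), "n2": ("n3",), "n3": (), "n4": ()}
sync = priorities(children, {g: 1 for g in children})
# Asynchronous example, assuming n1 and n2 are two-qubit gates.
asyn = priorities(children, {"n1": 2, "n2": 2, "n3": 1, "n4": 1})
```

With unit latencies the priority reduces to the plain depth of the node; the second call shows how longer two-qubit gates stretch the critical path.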
IV. CONSTRUCTION OF THE PHYSICAL DATA PRECEDENCE TABLE
Priority values are compatible with all logical dependencies between quantum operations, meaning that no gate can depend, either explicitly or implicitly, on a gate with lower priority. One can therefore use the priority value to schedule the quantum gates and construct the Physical Data Precedence Table. This table has as many rows as physical qubits in the hardware and at least as many columns as the maximum priority value. It indicates which physical qubits are involved in which gate at any given time.
However, the mere knowledge of the priority values is not enough to construct the PDPT. Two non-trivial actions have to be taken in order to satisfy the connectivity constraint and resolve ambiguities for gates with equal priority: Routing operations have to be added and a tie breaking strategy needs to be introduced.
We denote with Greek letters the index of the physical qubits (corresponding to the row index of the PDPT) and with $\tau_i$ the experimental time interval at which the gates scheduled in the $i$-th column of the PDPT are performed.
Entry $(\alpha, i)$ of the PDPT must provide two pieces of information: Which logical qubit, if any, is associated with the $\alpha$-th physical qubit at (physical) time $\tau_i$ and what gate is currently performed on that physical qubit, if it is not idle.
A. Routing operations
First of all, notice that two gates that act on (at least) one common qubit have the same priority if and only if they are both consecutive and commuting. We address this situation in the next subsection and neglect such possibility for the moment. A greedy strategy is applied when filling the PDPT starting from the highest priority gates. Let us assume that we are now scheduling gates with priority p. Routing operations are added at this stage through SWAP gates, the effect of a SWAP gate being to exchange the logical qubits associated with the two physical qubits involved in the SWAP.
To determine which qubits need to be exchanged, one needs to consider the "connectivity graph" C, representing a sort of blueprint of the actual hardware. In fact, each node of C represents a physical qubit, self-loops mark the qubits where single-qubit operations are available and (undirected) edges indicate the pairs of qubits between which it is possible to implement a two-qubit gate.
When all gates with priority (p+1) have been scheduled, we are left with a specific map between logical and physical qubits. We now look at the gates with priority p and color the nodes of C such that two physical qubits have the same color if and only if they correspond to logical qubits involved in the same priority-p gate. No color is given to physical qubits that are idle. From the node coloring perspective, applying a SWAP gate corresponds to exchanging the colors of two connected nodes. The goal is to obtain a node-coloring pattern in which the (at most two) nodes with the same color are always in contact: When this is the case, all connectivity constraints for the execution of logical gates with priority p are satisfied.
The development of an optimal strategy to exchange colors for the nodes in C is an interesting problem in itself and we indicate it as a subtask that could be optimized separately from the rest of the scheduler. We name this subproblem as the "color pairing" problem, requesting to minimize the number of color exchanges. In the following we provide a few considerations on the general case and describe an optimal (despite not unique) strategy for hardware with linear topology.
It is possible to compute a sort of distance between the current coloring pattern and an acceptable one (for which same-color nodes are connected). Given a pair of same-color nodes in C, count the nodes that separate them along the shortest path between them (i.e. the length of the shortest path minus one). Sum up all such counts to obtain the color-pair distance $D$. Every SWAP changes the color of at most two nodes, so the color-pair distance can change by at most 2 (all cases may be realized, with $D$ changing by $\delta \in \{-2, -1, 0, 1, 2\}$).
We propose a heuristic that starts by performing all the SWAPs leading to $\delta = -2$ and subsequently proceeds with those leading to a smaller (or no) reduction of $D$. Acceptable coloring patterns are characterized by zero distance and, therefore, at least $D/2$ SWAPs are required to reach one.
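The color-pair distance can be sketched in Python as follows, under the convention of Appendix A that each pair contributes the number of nodes separating it, so that D = 0 for an acceptable pattern. The function names, the adjacency-dictionary format, and the example coloring are illustrative assumptions.

```python
from collections import deque

def separation(adj, a, b):
    """Number of nodes strictly between a and b on a shortest path (BFS)."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u] - 1
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise ValueError("nodes are not connected")

def color_pair_distance(adj, color):
    """adj: dict node -> neighbors; color: dict node -> color or None."""
    by_color = {}
    for node, c in color.items():
        if c is not None:
            by_color.setdefault(c, []).append(node)
    # Only colors shared by exactly two nodes contribute to D.
    return sum(separation(adj, *pair)
               for pair in by_color.values() if len(pair) == 2)

# Linear connectivity with 5 physical qubits and two priority-p gates.
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in range(5)}
D = color_pair_distance(line, {0: "a", 1: None, 2: "b", 3: "a", 4: "b"})
```

Here the pair "a" is separated by two nodes and the pair "b" by one, giving D = 3 and therefore a lower bound of two SWAPs.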
When C is a linear graph, several optimal strategies are possible. Here is one (for open boundary conditions and node index running from 0 for the leftmost node to N − 1 for the rightmost node):
Algorithm 1 Pair-coloring for linear graph
procedure Left accumulation
    n ← 0
    while n < N do
        if n has unique or no color then
            n ← n + 1
        else
            find m such that color(n) = color(m)
            for k = m − 1, m − 2, ..., n + 2, n + 1 do
                apply SWAP(k, k+1)
            n ← n + 2
In Appendix A we prove that such a procedure is optimal. However, this is not the unique optimal strategy, as can be easily seen by considering the symmetric procedure starting from the rightmost node.
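A direct Python transcription of the left accumulation procedure is sketched below. The list-based representation of the coloring (with `None` for colorless nodes) and the function name are assumptions of this sketch; the logic follows Algorithm 1 and assumes that at most two nodes share a color.

```python
from collections import Counter

def left_accumulation(colors):
    """colors[i] is the color of physical qubit i on a line, or None.
    Returns the list of SWAPs (k, k+1) and the final coloring."""
    colors = list(colors)
    count = Counter(c for c in colors if c is not None)
    swaps = []
    n = 0
    while n < len(colors):
        c = colors[n]
        if c is None or count[c] < 2:
            n += 1                      # unique or no color: nothing to pair
            continue
        m = colors.index(c, n + 1)      # companion always lies to the right
        for k in range(m - 1, n, -1):   # k = m-1, m-2, ..., n+1
            colors[k], colors[k + 1] = colors[k + 1], colors[k]
            swaps.append((k, k + 1))
        n += 2                          # skip the pair just formed
    return swaps, colors

swaps, final = left_accumulation(["a", None, "b", "a", "b"])
```

For this input the companion of "a" is moved two positions to the left, after which the pair "b" is already adjacent.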
B. Tie breaking strategy
When gates that are both consecutive and commuting are present in the quantum circuit, it may happen that multiple gates with equal priority p act on the same qubits. A typical case is when one needs to manipulate each computational state in a coherent way according to some classical function, usually decomposed in several one- and two-qubit gates that pairwise commute. The quantum Fourier transform is another important example. Due to the exclusive activation constraint, we have to decide the order of execution. One possibility is to consider a random order. Here, we aim to do better.
To determine the order of the remaining gates and describe a general prescription, it is convenient to introduce the "interaction graph" $I_p$. Graph $I_p$ is constructed from the set of gates with priority p: Each node corresponds to a logical qubit, self-loops represent single-qubit gates and undirected edges correspond to two-qubit gates.
We observe that satisfying the exclusive activation constraint corresponds to dividing the priority-p gates into subsets composed of gates that act on different qubits. We mark each subset with a different color, effectively associating a color with each edge (including self-loops) of the interaction graph.
The problem of choosing appropriate subsets translates to the standard edge-coloring problem (no edges with the same color can share a node) in which one minimizes the number of colors involved. The fewer the colors, the larger the parallelism exploited by the scheduler. When the edges are divided in subsets, one proceeds to schedule one subset at a time while adding the routing operations according to the procedure described for gates with different priority.
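The edge-coloring step can be sketched with a simple greedy heuristic in Python. This is only an illustration: greedy coloring does not guarantee the minimum number of colors, and the function name and the tuple-based gate format are assumptions of the sketch.

```python
def greedy_edge_coloring(gates):
    """gates: list of tuples of logical qubits, (q,) for a single-qubit gate
    (self-loop) and (q1, q2) for a two-qubit gate (edge).
    Returns a dict gate index -> color such that same-color gates
    never share a qubit."""
    busy = []        # busy[c] is the set of qubits already used by color c
    coloring = {}
    for index, qubits in enumerate(gates):
        for c, used in enumerate(busy):
            if used.isdisjoint(qubits):
                break                 # first color with no conflict
        else:
            c = len(busy)             # open a new color
            busy.append(set())
        busy[c].update(qubits)
        coloring[index] = c
    return coloring

# Interaction graph forming a 4-cycle: two colors suffice.
coloring = greedy_edge_coloring([(0, 1), (1, 2), (2, 3), (0, 3)])
```

Each color corresponds to one set of gates that can be executed in parallel, so the number of colors bounds the number of PDPT columns needed before routing is considered.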
However, while sets of gates with different priority are subject to logical dependencies that pose constraints on their scheduling order, the attribution of edge colors in $I_p$ is completely arbitrary. Here we propose a simple "look ahead" strategy to choose colors leading to a consistent logical-to-physical map between gates of multiple color subsets. This approach, which prioritizes reducing the routing cost over minimizing the number of edge colors, is expected to be advantageous for hardware with limited connectivity.
The logical-to-physical (LTP) qubit map can be seen as the identification of the nodes of the interaction graph with a subset of the nodes of the connectivity graph. If $I_p$ is a subgraph of C according to the current LTP qubit map, no routing operation is required and the gates with priority p can be scheduled according to the solution of the edge-coloring problem. Otherwise, one or more different LTP maps are required.
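Checking whether the interaction graph is a subgraph of the connectivity graph under a given LTP map amounts to a lookup per two-qubit gate, as in the Python sketch below. The function name and data formats are hypothetical, and the sketch assumes that single-qubit gates are available on every physical qubit.

```python
def unroutable_gates(gates, ltp, connectivity):
    """gates: list of tuples of logical qubits.
    ltp: dict logical qubit -> physical qubit (current LTP map).
    connectivity: iterable of physical-qubit pairs (edges of C).
    Returns the two-qubit gates whose qubits are not connected."""
    edges = {frozenset(e) for e in connectivity}
    return [g for g in gates
            if len(g) == 2 and frozenset(ltp[q] for q in g) not in edges]

# Linear hardware with 4 qubits and the identity-like map a->0, ..., d->3.
chain = [(0, 1), (1, 2), (2, 3)]
ltp = {"a": 0, "b": 1, "c": 2, "d": 3}
blocked = unroutable_gates([("a", "b"), ("a", "c"), ("d",)], ltp, chain)
```

An empty result means that no routing is required for the priority-p gates; otherwise the returned gates are the ones that call for a different LTP map.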
We propose to derive the new map from the solution of the maximum subgraph isomorphism problem between $I_p$ and C. Therefore one has both the initial and desired LTP map and must solve the related routing problem. When $I_p$ is not fully contained in C, the edges belonging to the maximum subgraph must be eliminated from $I_p$ and a new maximum subgraph identified.
Ultimately, all gates with priority p will be scheduled in groups: Each group contains the gates of one subgraph (solving edge-coloring may be required), and consecutive groups are separated by the SWAPs required by the routing.
We observe that the number of SWAP gates may be reduced by selecting the next subgraph isomorphism in ways that consider (and try to minimize) the exchange cost between the current and next LTP qubit maps. Notice that, since finding the maximum subgraph isomorphism is a hard problem, approximate solutions are acceptable at every iteration [18].
V. SCHEDULING QAOA FOR A 1D ARRAY OF QUBITS
We illustrate the two phases presented in the previous Sections by scheduling the Quantum Approximate Optimization Algorithm (QAOA) [19] [20] [21] on hardware with linear connectivity. We consider this example significant for two reasons: First, QAOA gives rise to situations where lots of commuting and consecutive gates have the same priority and the tie-breaking strategy plays a relevant role; second, the open-boundary 1D topology reflects actual short-term devices [3, 22, 23] and corresponds to the most connectivity-constrained architecture that is still scalable.
QAOA is a variational algorithm to solve combinatorial problems. The quantum circuit is a sequence of only two kinds of operations, repeated for a desired number of times:
$$\hat{V}(\beta) = \exp\Big(-i\beta \sum_i \hat{X}_i\Big), \qquad \hat{U}(\gamma) = \exp\big(-i\gamma \hat{C}\big),$$
expressed in terms of the Pauli matrices $\hat{X}_i$, $\hat{Z}_i$ acting on the $i$-th logical qubit. $\hat{V}(\beta)$ corresponds to single-qubit rotations by the same angle on each logical qubit. $\hat{U}(\gamma)$ corresponds to a gate diagonal in the computational basis and decomposable in gates involving only $\hat{Z}$ matrices. The specific form of $\hat{C}$ depends on the problem at hand and, for the well-studied case of the MaxCut problem, it is the sum of parity terms like $\hat{Z}_i \hat{Z}_j$ [13, 19-21].
One has:
$$\hat{U}(\gamma) = \prod_{(i,j) \in I} \exp\big(-i\gamma \hat{Z}_i \hat{Z}_j\big),$$
where the notation $(i, j) \in I$ indicates that $(i, j)$ corresponds to an edge of the graph that defines the MaxCut instance and that also defines the interaction graph (this is the reason for the notation).
The complete quantum circuit, for a certain depth $d$, corresponds to the sequence:
with $\{\gamma_k, \beta_k\}_{k=0,1,\dots,d-1}$ being the variational parameters. The following commutation relations hold:
It follows that the only gates that are both commuting and consecutive belong to the same operation $\hat{U}(\gamma_k)$. The Logical Data Precedence Graph (LDPG) is straightforward to build and the priority can be assigned very easily. Denoting by $t_X$ the latency of single-qubit rotations and by $t_{ZZ}$ the latency of the two-qubit parity rotations, one has that all gates $\{\exp(-i\beta_k \hat{X}_i)\}_i$ have the same priority $p = (t_X + t_{ZZ})(k + 1)$ and all gates $\{\exp(-i\gamma_k \hat{Z}_i \hat{Z}_j)\}_{(i,j) \in I}$ have priority $p = (t_X + t_{ZZ})k + t_{ZZ}$.
As anticipated, the scheduler can take advantage of the freedom to order the parity rotations composing each $\hat{U}(\gamma_k)$ by applying the strategies proposed for routing and tie breaking. We consider the case $d = 1$ since a schedule for $d > 1$ that requires a number of gates at most linear in $d$ is always possible. To be convinced of this fact, notice that gates $\{\exp(-i\gamma_k \hat{Z}_i \hat{Z}_j)\}_{(i,j) \in I}$ should be scheduled in the opposite order compared to those at depth $(k - 1)$ to make the logical-to-physical map of the qubits compatible between the end of $\hat{U}(\gamma_{k-1})$ and the beginning of $\hat{U}(\gamma_k)$.
FIG. 4 provides a visualization of the intermediate representations discussed in the Sections above. One starts from the quantum circuit of QAOA with depth $d = 1$, then constructs the LDPG and assigns a priority value to each gate. The PDPT is filled by taking into account the connectivity diagram of the hardware and, to break the ties due to $\hat{U}(\gamma)$, looking for subgraph isomorphism with the interaction graph (here corresponding to the graph for the MaxCut instance).
Due to the linear topology, the maximal subgraph isomorphism can be reformulated as looking for long paths inside I.
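One way to look for long paths is a randomized greedy search repeated several times, keeping the longest path found. The Python sketch below is an assumed heuristic for illustration only, since the text does not specify how the long paths are obtained; the function names, the `repetitions` parameter, and the seed are hypothetical.

```python
import random

def greedy_long_path(adj, rng):
    """Grow a simple path from a random node, extending first the tail
    and then the head until no unvisited neighbor is left."""
    start = rng.choice(sorted(adj))
    path, visited = [start], {start}
    for end in (-1, 0):                      # -1: tail, 0: head
        while True:
            options = sorted(v for v in adj[path[end]] if v not in visited)
            if not options:
                break
            nxt = rng.choice(options)
            visited.add(nxt)
            if end == -1:
                path.append(nxt)
            else:
                path.insert(0, nxt)
    return path

def best_of(adj, repetitions, seed=0):
    """Repeat the randomized search and keep the longest path."""
    rng = random.Random(seed)
    return max((greedy_long_path(adj, rng) for _ in range(repetitions)),
               key=len)

# Interaction graph given by a ring of 6 logical qubits.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
path = best_of(ring, repetitions=4)
```

Mapping the logical qubits of the returned path onto the linear hardware in path order lets every gate along the path be executed without routing.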
In our numerical analysis, we schedule the QAOA for various system sizes $N \in \{10, 20, 30, 40, 60, 80, 100\}$ in a way suitable for linear hardware with the same number $N$ of physical qubits. In practice, we consider the complete utilization of the available hardware resources. We provide the results in terms of total gate count (including the routing operations) and of the circuit depth, averaged over $M = 1000$ instances of the MaxCut problem for random 3-regular graphs. For simplicity, we consider the synchronous model, for which $t_X = t_{ZZ} = t_{SWAP} = 1$.
The schedule for the operation $\hat{V}(\beta)$ is trivial and corresponds to the same single-qubit gate $\exp(-i\beta \hat{X})$ applied on each of the physical qubits, irrespective of the logical qubit associated. No routing operation is therefore required.
The rest of the edges are colored with a (randomized) greedy order. It is easy to see that the initial logical-to-physical map allows the execution of all gates forming the long path without any routing, despite the fact that they have two distinct colors. The subsequent color-pairing subtasks are solved via the left accumulation strategy. The lowest value between the baseline and 4N repetitions of the long-path scheduler is used for each MaxCut instance.
In FIG. 5 we report the total number of gates, including the SWAP operations, to implement U (γ) from the three scheduling strategies above. FIG. 6 shows similar results for the depth of the quantum circuit. The numerics refer to the synchronous model with unit latency per gate.
We observe that the routing overhead can be described by a quadratic function of the number of qubits N , as can be expected in the worst-case scenario. The circuit depth is, instead, linear in N due to the possibility of performing SWAP gates in parallel. Our results suggest that, by adopting increasingly sophisticated strategies for the edge coloring task, one obtains a consistent reduction in the number of necessary SWAP gates and in circuit depth. In Appendix B, we provide additional considerations to relate our results to lower bound estimates. 
VI. CONCLUSIONS
We have presented a two-step approach to schedule quantum circuits. Three constraints have been taken into account: The logical dependency of the gates, the exclusive activation of the qubits, and their hardware-dependent connectivity. We propose to begin by capturing the logical dependencies using the "logical" data precedence graph (LDPG) and include the routing operations in a second phase. We phrase the exclusive activation in terms of the edge-coloring problem for the interaction graph and the connectivity in terms of the color-pairing problem for the nodes of the connectivity graph.
It is important to notice that the scheduling task addressed in this study does not include all the optimizations available when compiling quantum algorithms. In fact, we have supposed that the quantum algorithm is initially provided as a sequence of one- and two-qubit gates that are readily available in the hardware of interest. In general, this sequence is the output of another compilation task that may perform all or a subset of the following optimizations: Decompose an arbitrary single-qubit rotation as a sequence of fixed-angle rotations, possibly introducing a controlled approximation on the actual rotation angle; decompose multi-qubit gates into a sequence of one- and two-qubit gates; combine sequences of operations into a smaller number of gates (for example, an arbitrarily long sequence of single-qubit rotations can be compressed to only three rotations using the Euler decomposition of three-dimensional rotations); attempt to exchange the order of logical operations (even when the corresponding quantum gates do not commute, this might still be possible by properly modifying one or both of the gates) to trigger further simplifications.
Within the scope of our work, the main limitation of the two-step approach is the lack of a "lookahead" strategy, which may cause the routing at priority p to be undone (paying the overhead cost) at priority p − 1 or lower. We suggest a way to mitigate such an effect through the approximate solution of the maximum subgraph isomorphism problem. The expectations are confirmed by the numerics related to the QAOA. In fact, the long-path strategy is the specialization of the subgraph isomorphism for linear connectivity graphs.
Finally, we recall that quantum computing has only recently reached technological relevance and foresee a strong and renewed interest in all the auxiliary tasks required to implement quantum algorithms in the most effective way. We believe that gate scheduling represents one of the most prominent tasks to solve to take full advantage of quantum speedups.
ACKNOWLEDGMENTS
The authors would like to thank Mikhail Smelyanskiy for posing the question at the origin of this study.
Appendix A: Proof of optimality of left accumulation for pair-coloring in 1D
This Appendix demonstrates that the "left accumulation" strategy presented in Section IV is optimal for the color-pairing problem in one-dimensional topology with open boundary conditions, meaning that the physical qubits are disposed on a line with the end qubits not connected to form a ring. Notice that this strategy is not unique (consider for example the symmetric strategy of "right accumulation") and that the overall task of minimizing the routing cost of scheduling a quantum algorithm depends on how the color-pairing strategy of gates with priority p influences the routing cost of those with priority p − 1. It may be possible that solving a subtask according to a locally suboptimal strategy leads to globally more efficient schedule for certain problem instances.
The color-pairing problem can be stated as follows: Given a connectivity graph C, consider a (node) coloring of its nodes such that at most two nodes share the same color. It is possible to exchange the color of two connected nodes (this operation corresponds to a SWAP gate). The goal is to reach a compatible coloring pattern, i.e. one in which all nodes sharing the same color are connected, with the minimum number of exchange operations.
Recall the definition of the color-pair distance when C is a line: The color-pair distance D is the sum of the number of nodes separating any pair of same-color nodes. Since every SWAP exchanges the color of at most two nodes and the color-distance can vary by at most 2 (all cases may be realized, with D changing by δ ∈ {−2, −1, 0, 1, 2}). Acceptable coloring patterns are characterized by a zero distance and, therefore, at least D 2 SWAPs are required to achieve it. For the purpose of color pairing, nodes with unique colors can be safely assumed to be colorless.
In the optimality proof, we consider this situation, in which nodes with unique colors are treated as colorless. Observe that a single exchange operation has only six different outcomes. When only one node is colored (either or ):
1. Moving the colored node closer to its companion reduces D → D − 1.
2. Moving the colored node farther away from its companion increases D → D + 1.
When both nodes have the same color (as for ):
3. No changes concerning the color pairing or D.
When the two nodes have different color (either or ):
4. The "locally best" move brings both same-color node pairs closer to each other. The distance diminishes by one for each color and then D → D − 2.
5. The "locally balanced" move brings one same-color pair closer while moving the other pair farther apart. As a consequence, D is unchanged.
6. The "locally worst" move brings both same-color node pairs farther away from each other. The distance increases by one for each color and then D → D + 2.
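The six outcomes listed above can be checked numerically. The sketch below, with hypothetical helper names and hand-picked example colorings (None marks a colorless node), applies one exchange between neighboring nodes and reports the change in D for one instance of each move type.

```python
def color_pair_distance(colors):
    """D for a line coloring with at most two nodes per color."""
    first_seen, total = {}, 0
    for index, color in enumerate(colors):
        if color is None:
            continue
        if color in first_seen:
            total += index - first_seen[color] - 1
        else:
            first_seen[color] = index
    return total


def swap_delta(colors, left):
    """Change in D when nodes `left` and `left + 1` exchange colors."""
    after = list(colors)
    after[left], after[left + 1] = after[left + 1], after[left]
    return color_pair_distance(after) - color_pair_distance(colors)


# One illustrative (coloring, swap position) pair per move type.
examples = {
    1: (["a", None, "a"], 0),      # colored node moves toward its companion
    2: (["a", "a", None], 1),      # colored node moves away from its companion
    3: (["a", "a"], 0),            # both nodes share the same color
    4: (["a", "b", "a", "b"], 1),  # locally best
    5: (["a", "b", "b", "a"], 0),  # locally balanced
    6: (["a", "a", "b", "b"], 1),  # locally worst
}
deltas = {kind: swap_delta(colors, left) for kind, (colors, left) in examples.items()}
print(deltas)
```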
It is clear that any optimal strategy must not include any move of type-2, 3, or 6. When possible, moves of type-4 are preferable over those of type-1 and 5. The best scenario is when moves of type-4 suffice to obtain a compatible coloring pattern. However, moves of type-1 are unavoidable if there is a colorless node between two same-color ones. In particular, if a pair is separated by k colorless nodes, the number of required moves of type-1 is also k (an example with k = 2 is ). Of course, the same colorless node may separate multiple same-color nodes and therefore contribute a type-1 move for each such pair.
Are there situations in which moves of type-5 are unavoidable? Yes, this happens when two same-color nodes separate another pair (think of the sequence ). To achieve one of the two compatible color patterns (either or ) one needs a type-5 move followed by a type-4 move. Any optimal strategy thus requires one type-5 move for every same-color pair contained between, and separating, another same-color pair.
It is straightforward to verify that the left accumulation strategy presented in Section IV avoids any move of type-2, 3, and 6. In addition, it requires the minimum number of type-1 and 5 moves.
The optimality follows logically.
Appendix B: Lower bound for the number of routing operations
The numerical study in Section V shows that adopting our two-step approach provides encouraging results for the problem of scheduling quantum algorithms on hardware with linear connectivity.
The efficacy of the scheduler is improved by adding stochasticity and by initializing the logical-to-physical (LTP) qubit map according to long paths in the interaction graph. An important question remains unanswered: How close are the schedules we found to the global best schedule?
In this Section, by global best schedule we mean the schedule involving the fewest gates, without considering the circuit depth. The two quantities are clearly related, but the circuit with the fewest gates does not always have the shallowest depth. We consider two approaches: The first is the exhaustive enumeration of all possible circuits, while the second is a heuristic bound.
Exhaustive search
This approach is only feasible for extremely small problem sizes. The number of quantum circuits involving N qubits and including at most S exchange operations grows according to:
$$N!\,(N-1)^{S}$$
There are obvious situations that can be excluded from the search. For example, an initial LTP map and its reverse order are equivalent (reducing the first term in the expression above by a factor 2), and excluding sequences in which the same SWAP gate is applied twice consecutively can be proved not to eliminate uniquely optimal strategies⁷.
The number of possible quantum circuits is reduced to $\frac{N!}{2}(N-1)(N-2)^{S-1}$. Unfortunately, its scaling is still very unfavorable and, in practice, this limits our current numerical results to N ≤ 8. For N = 8, we solved all 150 instances considered by exploring S ≤ 9. To allow such a broad exhaustive search, we further reduce the number of SWAP sequences explored by the following method. We observe that each exchange sequence can be thought of as a number with S digits in base (N − 1), where N − 1 is the number of physically distinct SWAP gates available for a line of length N. We order the SWAP sequences in increasing order, with the most significant digit identifying the first SWAP and the least significant digit identifying the last SWAP, and evaluate them one at a time. For each specific sequence, we compute how many logical gates are left unscheduled because they involve logical qubits that never become adjacent. Since any exchange operation modifies the connectivity between logical qubits in such a way that at most two additional pairs of qubits become connected, we can deduce the minimum number of changes in the SWAP sequence that may allow a solution. For example, if 3 logical gates are not possible with a specific SWAP sequence, at least two final SWAPs have to change.
In general, if r gates are not possible, then at least $\lceil r/2 \rceil$ SWAPs at the end of the sequence have to change. We report the values found for the total number of gates for a single Û(γ) operation in instances. In the exhaustive search with N = 8, all 150 instances were solved with sequences of at most S = 9 exchange operations.
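The ordered enumeration with skipping can be sketched as follows. This is a simplified illustration under stated assumptions: the initial LTP map is fixed to the identity (the search described above also varies it), consecutive repeated SWAPs are not excluded, and all function names are hypothetical. Digit d stands for the SWAP between physical positions d and d + 1.

```python
from math import ceil


def unscheduled_gates(n_qubits, gates, swaps):
    """Count logical gates whose two qubits never become adjacent on the line."""
    layout = list(range(n_qubits))  # layout[p] = logical qubit at position p
    pending = {frozenset(gate) for gate in gates}

    def mark_adjacent():
        for p in range(n_qubits - 1):
            pending.discard(frozenset((layout[p], layout[p + 1])))

    mark_adjacent()
    for d in swaps:
        layout[d], layout[d + 1] = layout[d + 1], layout[d]
        mark_adjacent()
    return len(pending)


def to_digits(value, base, length):
    """Write `value` with `length` digits in `base`, most significant first."""
    digits = []
    for _ in range(length):
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits[::-1]


def first_solution(n_qubits, gates, n_swaps):
    """Scan SWAP sequences in increasing order, skipping hopeless suffixes."""
    base = n_qubits - 1
    value, limit, evaluated = 0, base ** n_swaps, 0
    while value < limit:
        swaps = to_digits(value, base, n_swaps)
        evaluated += 1
        r = unscheduled_gates(n_qubits, gates, swaps)
        if r == 0:
            return swaps, evaluated
        # At least ceil(r / 2) trailing SWAPs must change, so jump to the
        # next sequence that differs in the digit that many places from the end.
        step = base ** (ceil(r / 2) - 1)
        value = (value // step + 1) * step
    return None, evaluated


print(first_solution(4, [(0, 1), (0, 3)], 2))
```

In the last test case used to check this sketch, three gates remain unscheduled after the third sequence, so the fourth sequence is never evaluated.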
Heuristic lower bound
It is possible to provide a lower bound on the number of SWAP operations required to schedule all terms of Û(γ) that is based on the adjacency matrix of the interaction graph I.
Let us fix the initial LTP qubit map, so that the adjacency matrix is uniquely defined. Due to the particular properties of linear connectivity, the distance of every non-zero entry from the diagonal, minus 1, corresponds to the number of SWAP operations required to bring the two qubits involved in the corresponding gate into contact (row-wise or column-wise does not matter, because the interaction graph is undirected and thus its adjacency matrix is symmetric). For example, consider a specific non-zero entry in position (i, j). For the adjacency matrix, it means that the logical qubit mapped to physical qubit i needs to interact with the logical qubit mapped to physical qubit j. One needs |i − j| − 1 SWAPs to move the logical qubits along the line until they are adjacent.
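The count |i − j| − 1 can be checked with a small simulation. The sketch below uses a hypothetical helper that moves one logical qubit toward the other with nearest-neighbor SWAPs on a line and counts the exchanges until the two are adjacent.

```python
def swaps_until_adjacent(line, a, b):
    """Move logical qubit `a` toward `b` with adjacent SWAPs; return the count."""
    line = list(line)  # line[p] = logical qubit at physical position p
    count = 0
    while abs(line.index(a) - line.index(b)) > 1:
        pa, pb = line.index(a), line.index(b)
        step = 1 if pb > pa else -1
        # Exchange `a` with its neighbor on the side facing `b`.
        line[pa], line[pa + step] = line[pa + step], line[pa]
        count += 1
    return count


identity_map = list(range(6))
print(swaps_until_adjacent(identity_map, 1, 4))
```

With the identity map, logical qubit i sits at physical position i, so the printed count for the entry (1, 4) is |1 − 4| − 1.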
If for each row we consider only the non-zero entry that is the most distant from the diagonal, we are effectively relaxing the problem. Let us denote with W the quantity obtained by summing up the distances of all these entries. It is tempting to consider this a lower bound for the number of SWAP gates required to implement Û(γ) given the initial LTP map. This is not strictly correct since W must be:
• divided by 2, since every gate corresponds to two non-zero entries and they may contribute to two different rows. Specifically, the non-zero entry (i, j) has its symmetric non-zero entry in (j, i). Both entries represent the same logical gate, i.e. the same edge of the interaction graph, but contribute to both row i and row j.
• divided by 2k, with k being the maximum degree of the interaction graph, since a single SWAP exchanges two columns (and rows), affecting at most 2k non-zero entries and possibly bringing each of them closer to the diagonal by one position. For example, consider two non-zero entries at positions (i, j) and (h, j) such that i, h > j + 1, i.e. the entries are in the same column and below the diagonal. Swapping columns j and j + 1 moves both entries closer to the diagonal, to positions (i, j + 1) and (h, j + 1). A similar effect may also involve entries originally in column j + 1. Due to the degree of connectivity, each column has at most k non-zero entries.
For the class of instances considered, namely random 3-regular graphs, k = 3.
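For a fixed LTP map, the relaxed estimate can be sketched as below. This is an illustration under an assumption about conventions: each row contributes |i − j| − 1 for its farthest non-zero entry (the SWAP count derived above), and the sum W is then divided by 2 and by 2k. The function names are hypothetical.

```python
def row_reach(adjacency):
    """Per row, SWAPs needed for the non-zero entry farthest from the diagonal."""
    reach = []
    for i, row in enumerate(adjacency):
        farthest = max((abs(i - j) for j, value in enumerate(row) if value), default=0)
        reach.append(max(farthest - 1, 0))
    return reach


def relaxed_swap_estimate(adjacency):
    """W / (2 * 2k), with k the maximum degree of the interaction graph."""
    w = sum(row_reach(adjacency))
    k = max(sum(1 for value in row if value) for row in adjacency)
    return w / (2 * 2 * k) if k else 0.0


# Interaction graph: the 4-cycle 0-1-2-3-0, with the identity LTP map.
cycle = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
print(row_reach(cycle), relaxed_swap_estimate(cycle))
```

For this small example the estimate is well below the two SWAPs actually needed to bring qubits 0 and 3 into contact, which illustrates how loose the relaxation can be.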
Finally, we have to relax the constraint of having a fixed LTP map. This can be done by considering each permutation of the rows and columns of the adjacency matrix. The aim is to find the LTP map that minimizes the profile of the adjacency matrix [25], effectively providing the minimum value of the quantity W.
So far our considerations are exact, but minimizing the profile is an NP-hard problem [26].
We estimate the profile by using a heuristic method based on the reverse Cuthill-McKee (RCM) algorithm [27] .
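To illustrate the kind of reordering involved, here is a simplified pure-Python sketch of a reverse Cuthill-McKee ordering together with a profile-like sum of per-row maximum distances. It is not the implementation used for the results in this work: it starts each breadth-first search from a minimum-degree node rather than a pseudo-peripheral one, and the function names are assumptions.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """adj maps each node to the set of its neighbors (undirected graph)."""
    degree = {v: len(adj[v]) for v in adj}
    visited, order = set(), []
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unvisited neighbors in order of increasing degree.
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    return order[::-1]


def row_max_distance_sum(adj, order):
    """Sum over rows of the distance to the farthest non-zero entry."""
    pos = {v: p for p, v in enumerate(order)}
    return sum(max((abs(pos[v] - pos[w]) for w in adj[v]), default=0) for v in adj)


# A path 0-3-1-4-2 whose labels are scrambled with respect to the line.
path = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
rcm_order = reverse_cuthill_mckee(path)
print(rcm_order, row_max_distance_sum(path, rcm_order))
```

For the scrambled path, the reordering places every pair of interacting qubits next to each other, so the sum drops from 14 (identity order) to 5, one per row.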
In Figure B.1, we report the heuristic lower bound of the number of gates to implement a single operation Û(γ), computed as the minimum profile (minus N) of the adjacency matrix divided by 12, plus the number of two-qubit parity rotations (there are 3N/2 of them). Observe that, while the heuristic estimate probably underestimates the number of SWAP operations, the fact that the solution of the minimum average bandwidth problem is approximate does not allow us to claim a rigorous lower bound.
To understand why we expect the heuristic lower bound to undercount the number of SWAP gates, consider the fact that we divide the average bandwidth by 2k. In the language of Appendix A, this means that only type-4 moves are considered, i.e. those SWAPs that reduce the color-pair distance D by 2. In addition, we also consider that each exchange operation counts as a type-4 move with respect to the color-pair distance for all future logical gates involving one of the exchanged qubits. It would not be surprising for the heuristic lower bound to be, for example, a factor 2 smaller than the actual minimum, at least for large enough systems. To assess the performance of our scheduling methods, it would be interesting to have access to a tighter estimate of the lower bound.
Figure B.1: Each point is obtained as the average over 1000 instances of the MaxCut problem on random 3-regular graphs and the error bars represent the standard deviation. We have considered the synchronous model having unit latency for every gate (including the SWAP operation). The baseline and "long path" approaches are described in Section V. The heuristic lower bound is obtained as described in Appendix B 2: Most probably it underestimates the number of necessary gates, but it is not mathematically guaranteed to be lower than the actual minimum.
