Given a large and a small quantum circuit, we are interested in finding all maximal matches of the small circuit, called template, in the large circuit under consideration of pairwise commutation relations between quantum gates. In this work we present a classical algorithm for this task that provably finds all maximal matches with a running time that is polynomial in the number of gates and the number of qubits of the circuit for a fixed template size. Such an algorithm finds direct applications in quantum circuit optimization. Given a template circuit for which a lower-cost implementation is known, we may search for all instances of the template in a large circuit and replace them with their optimized version.
Introduction
For certain tasks quantum algorithms offer a provable computational advantage compared to the best possible classical methods [1] . For other problems such as finding the prime factors of a large integer [2] or solving linear systems of equations [3] , there is no mathematical proof but good evidence that quantum computers can solve them considerably faster than every classical computer. This raises hope that for many practically relevant problems quantum algorithms can outperform existing classical methods, see e.g., [4, 5] .
To profit from the power of quantum computing as much as possible, it is crucial to further optimize the implementation cost of existing quantum algorithms. One building block for this task is finding small circuits (or parts thereof) efficiently in a larger circuit, a process called template matching. More precisely, suppose we are given a (potentially large) quantum circuit $C$ consisting of $|C|$ gates and $n_C$ qubits and a template, which is another (small) quantum circuit $T$ with $|T|$ gates and $n_T$ qubits. Template matching describes the problem of finding all possible maximal matches of $T$ within $C$. The idea of optimizing quantum circuits with templates was introduced in [6], where a template is defined as a sequence of unitary gates $U_i$ such that $U_{|T|} \cdots U_1 = \mathbb{1}$. Let us now assume that the template matching algorithm finds the gate sequence $U_a \cdots U_b$ in the circuit $C$ for some $1 \le a \le b \le |T|$ that matches the template. Since the full template implements the identity operator and since each unitary $U_i$ has an inverse (U i ) −1 = U
Efficient template matching is a non-trivial task. The classical analog of this problem, where none of the gates commute, is well studied (see for example [7, 8, 9, 10]) and has found many applications in the context of computer-aided design (see [10] and references therein). The classical matching problem was reduced to the subgraph isomorphism problem in [10]. However, template matching in quantum circuits faces additional difficulties (as discussed in Section 2) because some pairs of gates in a circuit commute while others do not. The commutation relations could be mapped to the subgraph isomorphism problem by introducing relations between the vertices in the graph that determine whether two vertices may be interchanged to find a maximal match.
As far as we know, this more complex task has not been studied in the general context of graph matching algorithms so far, but the two extreme cases for commutation relations are well understood. If all gates commute, efficient template matching is straightforward, as we essentially need to check whether all gates in the template can be found in the circuit. In the other extreme case, where none of the gates commute, template matching is not expected to be possible in polynomial time due to close relations with the subgraph isomorphism problem, which is NP-complete [11]. Nevertheless, for a fixed template (and under the assumption that no gates commute), a polynomial-time algorithm for template matching is possible, since for a fixed subgraph the subgraph isomorphism problem can be solved efficiently [12]. However, when certain gates are allowed to commute, it is not clear a priori whether the template matching problem is still efficiently solvable for fixed template size, since the number of possible permutations of the gates in the circuit $C$ can grow exponentially in $|C|$. In this work we present an algorithm that demonstrates that it is indeed possible to find all maximal matches in a quantum circuit efficiently.
Previous work. In [6, 13], heuristic template matching algorithms were introduced. The algorithm in [6] was then applied in [14, 15] for reversible logic synthesis and achieves very low runtimes. In [16], a template matching algorithm is presented that provably finds all matches. It is based on mapping the circuit to a satisfiability modulo theory problem and applying a specific solver to it. Moreover, it is shown that finding all the matches indeed helps to reduce the gate counts significantly further compared to heuristic approaches. Unfortunately, this improvement comes at the cost of the runtime of the algorithm. The algorithm is not efficient, i.e., its worst-case time complexity is exponential in $|C|$ and $n_C$, and its practical runtimes are significantly higher than for the heuristic approaches (see [16] for a comparison with [14]). We provide a step towards overcoming the tradeoff between not finding all the matches and long runtimes by introducing an efficient template matching algorithm that provably finds all maximal matches.
Result. We fix an arbitrary set of quantum gates and assume that we can check in constant time whether these gates commute with each other. Then, for any circuit $C$ with $|C|$ gates (from the fixed gate set) and $n_C$ qubits and any template $T$ with $|T|$ gates and $n_T$ qubits, we present an algorithm that provably finds all maximal matches of $T$ in $C$. Furthermore, the algorithm is efficient in $n_C$ and $|C|$, but inefficient in $n_T$ and $|T|$. More precisely, its running time is
We refer to Algorithm 3 for a detailed description of the algorithm and to Theorem 4.8 and Theorem 4.9 for a precise statement about the correctness and the complexity of the algorithm. We note that (1) is a theoretical worst-case bound on the running time of our algorithm whose purpose is to prove the efficiency in |C| and n C . In practice, we expect the algorithm to run much faster on most instances. In general there is a trade-off between speeding up the runtime of the algorithm using heuristics and finding all maximal matches. There are situations where it is beneficial to speed up the runtime of the algorithm at the cost of missing certain maximal matches. For such scenarios our algorithm should be combined with certain heuristics as discussed in Section 4.6. We also give a worst-case complexity analysis using the heuristics considered in Section 4.6. On the other hand there are situations, such as optimizing a small quantum circuit for current experimental architectures, where one would like to find all the matches to fully optimize the circuit and where the classical computational time is not the bottleneck. For such scenarios, Algorithm 3 should be run without heuristics such that it always finds all the maximal matches while still being efficient. We note that also in intermediate scenarios, where it is not obvious whether a low runtime or finding as many matches as possible is more effective, our algorithm offers a flexible solution. It is possible to change the parameters used for the heuristics such that we can fine-tune the trade-off between lowering the runtime and the number of found matches. Alternatively, one may adapt the search in the algorithm to first look for the most promising matching scenarios and stop the algorithm as soon as one has found sufficient matches for the considered task.
Notation. We write a circuit $C$ as a gate list $C = (C_1, \dots, C_{|C|})$, where the unitary performed by the circuit is given by $U = U_{|C|} U_{|C|-1} \cdots U_1$ and $U_i$ denotes the unitary corresponding to the gate $C_i$. A gate can be any description of a unitary operation together with an ordered list of qubits it acts on, e.g., the gate C-not(1, 4) represents a C-not gate controlling on the qubit with label 1 and acting on the qubit with label 4. If two gates $A$ and $B$ perform the same operation, possibly on qubits with different labels, we write $A \cong B$, e.g., C-not(1, 4) $\cong$ C-not(2, 1). If the unitaries that represent the circuits $C$ and $D$ are equal up to a global phase shift, we say that the two circuits are represented by the same operator, and we write $C \simeq D$. The concatenation of two circuits $C = (C_1, \dots, C_{|C|})$ and
. 4 We denote the commutator of the unitaries corresponding to two gates A and B by [A, B] . Moreover, we write [i, j] C = 0 if and only if i = j or if we can pairwise commute gates in the circuit C such that the order of the gates C i and C j is interchanged, i.e., if i < j the gate C j can be moved before the gate C i and vice versa for the case j < i. The set Perm(1, 2, . . . , n) = {(1, 2, 3, . . . , n), (2, 1, 3, . . . , n), . . . } denotes the set of all possible permutations of (1, 2, . . . , n).
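To make the notation concrete, a gate list could be encoded as follows. This is only a minimal sketch: the class `Gate`, its fields and the helper `same_operation` are hypothetical names chosen for illustration and are not part of the algorithm's specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Gate:
    """Hypothetical encoding of a gate: a description plus an ordered qubit list."""
    name: str                        # description of the unitary, e.g. "C-not"
    qubits: Tuple[int, ...]          # ordered qubit labels (control first for C-not)
    params: Tuple[float, ...] = ()   # optional parameters, e.g. a rotation angle


def same_operation(a: Gate, b: Gate) -> bool:
    """A ≅ B: same operation, possibly acting on qubits with different labels."""
    return (a.name == b.name
            and a.params == b.params
            and len(a.qubits) == len(b.qubits))


# A circuit is an ordered gate list C = (C_1, ..., C_|C|).
circuit = (Gate("C-not", (1, 4)), Gate("C-not", (2, 1)), Gate("Rx", (1,), (3.14,)))
```

With this encoding, C-not(1, 4) and C-not(2, 1) are different gates (`==` is false) but satisfy `same_operation`, mirroring the distinction between equality and $\cong$.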
In the circuit model of quantum computation, information carried in qubit wires is modified by quantum gates which mathematically are described by unitary operations. For the examples in this work, we use C-not gates, Toffoli gates and single-qubit rotations, however, our algorithm works for an arbitrary gate set. For the single-qubit rotations, we will use the following convention
, which correspond to rotations by an angle θ around the x-, y-and z-axes of the Bloch sphere. One important special case is the not gate, σ x = iR x (π) in terms of which the C-not gate can be written
Similarly, the non trivial part of the action of a Toffoli gate is given by
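As a numerical sanity check of the relation $\sigma_x = iR_x(\pi)$, the following sketch assumes the standard convention $R_x(\theta) = \cos(\theta/2)\,\mathbb{1} - i\sin(\theta/2)\,\sigma_x$, which is consistent with that relation. The helper names are hypothetical, and the C-not action is written in the standard form $|0\rangle\langle 0| \otimes \mathbb{1} + |1\rangle\langle 1| \otimes \sigma_x$ on computational basis states.

```python
import math


def rx(theta):
    """Assumed convention: R_x(theta) = cos(theta/2) * 1 - i sin(theta/2) * sigma_x."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


def scale(z, m):
    """Multiply a 2x2 matrix by a scalar."""
    return [[z * x for x in row] for row in m]


def close(a, b, tol=1e-12):
    """Entrywise comparison up to floating-point error."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


SIGMA_X = [[0, 1], [1, 0]]
not_gate = scale(1j, rx(math.pi))  # i * R_x(pi)


def cnot(control_bit, target_bit):
    """C-not on basis states: flip the target iff the control bit is 1."""
    return control_bit, target_bit ^ control_bit


print(close(not_gate, SIGMA_X))
```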
It is convenient to represent quantum circuits diagrammatically. Each qubit is represented by a wire and gates are shown using a variety of symbols. Conventionally time flows from left to right. The circuit depicted in Figure 1 shows a C-not gate controlling on the most significant qubit (with label 1) and acting on the least significant qubit (with label 4), a Toffoli gate controlling on the qubits 1,2 and acting on the qubit 3 and finally a R x rotation with rotation angle π acting on qubit 1.
We will also work with a canonical form of quantum circuits that is independent of how the gates are commuted. This form is represented by directed acyclic graphs. We denote the set of successors of a vertex $v_i$ in such a graph $G$ by $\mathrm{Succ}(v_i, G)$, i.e., $\mathrm{Succ}(v_i, G)$ contains all the vertices $v_j$ for which there is a (forward directed) path from vertex $v_i$ to vertex $v_j$. Similarly, we denote the set of predecessors of a vertex $v_i$ in a graph $G$ by $\mathrm{Pred}(v_i, G)$, i.e., the set $\mathrm{Pred}(v_i, G)$ contains all the vertices $v_j$ such that there is a path from vertex $v_j$ to vertex $v_i$. The direct successors and predecessors are the ones that are connected through only one edge to the considered vertex $v_i$, and we call these sets $\mathrm{DirectSucc}(v_i, G)$ and $\mathrm{DirectPred}(v_i, G)$, respectively.
Structure. We start in Section 2 with a discussion of the difficulties for efficient template matching that arise because quantum gates may commute. In Section 3 we introduce a canonical form of quantum circuits as directed acyclic graphs. We then present and analyze the matching algorithm in Section 4. An overview of the algorithm is given in Section 4.1, followed by the pseudocode in Section 4.2. Section 4.3 presents a pedagogical example that illustrates how the template matching algorithm works in practice. Understanding this example might be simpler than reading through the pseudocode of the algorithm. We discuss the correctness and complexity of the algorithm in Section 4.4 and Section 4.5, where the details of the correctness proof are deferred to Appendix A for readability. Further, we provide some suggestions for heuristics to lower the runtime of practical implementations in Section 4.6.
Difficulties for matching quantum circuits
We start with a discussion of the difficulties in constructing an efficient template matching algorithm. Let us list several kinds of problems that arise because certain gates commute. This list will help us to understand the structure of the template matching algorithm given in Section 4, which handles all of the following problems efficiently.
Ordering:
The simplest problem that appears due to commuting gates is illustrated in Figure 2 .
If we just start matching the first gate of the template with the first gate of the circuit, we assign the second qubit of the template to the third one of the circuit, and hence the third gate of the template will not match. However, the two circuits can clearly be fully matched by commuting the first two gates in the circuit.
Figure 2: Ordering problem when matching a template with a circuit. The numbers above the template label its gates. The numbers above the circuit refer to the labels of the gates in the template that can be matched with the corresponding gate in the circuit.
Additional matching gates:
Consider the case where some gates could be matched but should not be matched in order to find the maximal match. Hence, a straightforward greedy approach is not necessarily optimal. Let us consider the template and circuit depicted in Figure 3. If we match the first two gates, the third gate will not match. Furthermore, it is not possible to commute the matched gates next to the last gate in the circuit (which matches the third gate of the template) or vice versa. However, there exists a full match that one can find by matching the first gate of the circuit (ignoring the second one) and commuting it through the second and the third one to match the full template.
Figure 3: Greedy matching does not always lead to the maximal match. The additional gate that would be matched in a greedy forward matching process is marked with a solid box. The numbers above the template label its gates. The numbers above the circuit refer to the labels of the gates in the template that can be matched with the corresponding gate in the circuit.
Disturbing gates:
We consider a template and a circuit as given in Figure 4 . The second gate in the circuit "disturbs" the match. The maximal match is found by commuting it as far as possible to the left or the right. In the considered case, we can match three gates (instead of two) by not commuting the second gate to the right. The disturbing gates are difficult to handle in general, since it is a priori unclear if one should try to move them to the right or to the left in the circuit. If one always considers both options, the time complexity of such an algorithm would be exponential in the number of disturbing gates.
Figure 4: It might be unclear to which position we should move a disturbing gate (marked by a solid box in the circuit) to find a maximal match. The numbers above the template label its gates. The numbers above the circuit refer to the labels of the gates in the template that can be matched with the corresponding gate in the circuit.
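The exponential blow-up of the naive strategy can be illustrated with a brute-force enumeration. The function below is a hypothetical illustration of that naive strategy only, not part of the matching algorithm: it lists every left/right choice for $d$ disturbing gates, of which there are $2^d$.

```python
from itertools import product


def placements(num_disturbing):
    """All left/right choices for the disturbing gates (naive brute force)."""
    return list(product(("left", "right"), repeat=num_disturbing))


# The number of scenarios doubles with every additional disturbing gate.
for d in (1, 2, 10):
    print(d, len(placements(d)))
```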
The reader may find it helpful to have these problems in mind while reading the proof of the correctness of the template matching algorithm given in Section 4. We close this section with a remark about a possibility to go beyond pairwise commutations.
Remark 2.1 (Beyond pairwise commutations). Throughout this work, we only allow commuting single gates with each other. In general, it could happen that in a circuit $C = (C_1, C_2, C_3)$ no gates commute pairwise, but the unitary corresponding to $(C_1, C_2)$ commutes with the unitary corresponding to $C_3$. Hence, one could bring the circuit $C$ into the form $(C_3, C_1, C_2)$, which could help matching in principle. However, multiplying gates to check commutation relations is computationally expensive, and we do not take such commutation relations into account in this work. Nonetheless, such relations could be used to improve an implementation of the matching algorithm in practice.
Canonical form for quantum circuits
Representations of quantum circuits are not unique in general, because various gates may commute. For example, the two circuits represented in Figure 5a and Figure 5b represent the same operator.
Figure 5 (panel (c): canonical form of circuit C): We can represent circuits in a canonical form which stays the same under gate commutations in a circuit C. The canonical form is a directed acyclic graph where all the gates that can be commuted next to each other and do not commute are connected with an edge. The edge points from the gate that comes first in the circuit to the one that appears later on.
For some applications, it is desirable to work with a canonical form of quantum circuits. The canonical form that we use (and which was introduced in [13] ) is a directed acyclic graph where the vertices are labeled with the gate indices and where all the vertices corresponding to gates that can be commuted next to each other and do not commute are connected with an edge. The edges point from the gate that comes first in the circuit to the one that appears later on (see Figure 5c for an example). Using such a representation is not strictly necessary for our algorithm to be efficient, but it will simplify its description and lower the runtime for some subroutines.
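The construction of such a graph can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: gates are encoded as hypothetical `(name, qubits)` tuples, and the default commutation check is a toy rule that treats two gates as commuting exactly when they act on disjoint qubits (disjoint gates always commute; treating overlapping gates as non-commuting is a simplification). Any other constant-time commutation check can be passed in its place.

```python
from typing import Callable, Dict, List, Set, Tuple

Gate = Tuple[str, Tuple[int, ...]]  # hypothetical toy encoding: (name, qubits)


def commute_if_disjoint(a: Gate, b: Gate) -> bool:
    """Toy rule (assumption): gates commute iff they share no qubit."""
    return not set(a[1]) & set(b[1])


def canonical_form(circuit: List[Gate],
                   commute: Callable[[Gate, Gate], bool] = commute_if_disjoint
                   ) -> Dict[int, Set[int]]:
    """Return the direct-successor sets of the DAG, using 1-based gate labels."""
    n = len(circuit)
    succ: Dict[int, Set[int]] = {j: set() for j in range(1, n + 1)}
    preds: Dict[int, Set[int]] = {j: set() for j in range(1, n + 1)}  # all predecessors
    for j in range(1, n + 1):
        reachable = {i: True for i in range(1, j)}
        for i in range(j - 1, 0, -1):  # scan the earlier gates, latest first
            if reachable[i] and not commute(circuit[i - 1], circuit[j - 1]):
                succ[i].add(j)                 # edge from vertex i to vertex j
                preds[j] |= preds[i] | {i}
                for p in preds[i]:             # gate j can no longer be moved next to p
                    reachable[p] = False
    return succ


circuit = [("C-not", (1, 2)), ("C-not", (3, 4)), ("C-not", (2, 3))]
print(canonical_form(circuit))
```

In this example the first two gates act on disjoint qubits, so interchanging them leaves the graph structure unchanged: in both orders, the third gate is a direct successor of each of the first two.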
In the pseudocode, we allow adding different attributes to vertices, which we access by "vertex.attribute". In particular, as mentioned above, we always add an attribute "label" referring to the gate index corresponding to the vertex. Further, we use $G_i$ to access the vertex with label $i$ in the graph $G$, and we denote the number of vertices in $G$ by $|G|$. Note that we can store the vertices of the graph in an array at positions according to their index to have constant-time access to any vertex with known label. By storing incoming and outgoing edges together with each vertex (as pointers to the direct successors and predecessors), we also have constant-time access to all direct successors and predecessors of any given vertex.
An algorithm that constructs the canonical representation of any quantum circuit $C$ with time complexity $O(|C|^2)$ was given in [13] and is described in Algorithm 1 for completeness.
Algorithm 1 CreateCanonicalForm: Creates the canonical form of a quantum circuit
1: Input: Quantum circuit $C$ with $|C|$ gates
2: Initialize an empty directed acyclic graph $G$
3: for $j \in \{1, 2, \dots, |C|\}$ do
4: Set the attribute isReachable to true for all vertices in $G$
5: Add a vertex with label $j$ to the graph $G$
6:
7: if $G_i$.isReachable and $[C_i, C_j] \neq 0$ then
8: Add a (directed) edge from vertex $G_i$ to vertex $G_j$ in $G$
9: for preDec $\in \mathrm{Pred}(G_i, G)$ do
10: preDec.
To see this, it is enough to notice that the merging process of the ordered lists in Algorithm 2 has worst-case time complexity $O(|C|^2)$. Indeed, consider $k$ ordered lists, each of which contains at most $n$ numbers between 1 and $n$. Merging two of the ordered lists of length $n$ can be done in time $O(n)$. Since we remove duplicates during the merging process, we end up again with a list of length at most $n$ (because there are only $n$ different numbers). Therefore, merging all pairs of the $k$ lists has time complexity $O(kn)$, and we end up with $\lceil k/2 \rceil$ lists of length at most $n$. Going on recursively like this gives a time complexity of
In our case, we have at most |G| direct successors v i for each vertex and each list (v i ).Successors can contain at most |G| entries, i.e., k = n = |G| = |C|. 
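The merging argument can be sketched in code. The functions below are hypothetical illustrations of the counting argument (not a transcription of Algorithm 2): each round merges the lists in pairs while removing duplicates, so every list keeps length at most $n$ and the number of lists halves. The total cost is therefore roughly $(k/2 + k/4 + \dots) \cdot O(n) = O(kn)$, which is $O(|C|^2)$ for $k = n = |C|$.

```python
def merge_unique(a, b):
    """Merge two sorted, duplicate-free lists in O(len(a) + len(b)) time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:  # equal entries: keep one copy only
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_all(lists):
    """Merge k sorted lists by repeated pairwise merging rounds."""
    lists = [list(l) for l in lists]
    while len(lists) > 1:
        merged = [merge_unique(lists[t], lists[t + 1])
                  for t in range(0, len(lists) - 1, 2)]
        if len(lists) % 2:  # an odd list out is carried to the next round
            merged.append(lists[-1])
        lists = merged
    return lists[0] if lists else []


print(merge_all([[1, 3, 5], [2, 3], [5, 6], [1]]))
```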
In the following, we will show some properties of the canonical form, which will turn out to be useful for proving the correctness of our matching algorithm.
Proof. In the construction in Algorithm 1, we only add edges for non-commuting gates. The order of these gates could hence not be interchanged in the circuit $C'$, and one can see that one ends up with the same canonical form (however, the vertices might have been added to the graph in different orders).
Lemma 3.3 (Necessary and sufficient condition for interchanging gates).
Given a circuit C with a canonical form G and two indices i < j, then the following two statements are equivalent:
Proof. From the construction in Algorithm 1 it is clear that if $G_j \in \mathrm{Succ}(G_i, G)$, the gate $C_j$ cannot be moved before the gate $C_i$ and hence $[i, j]_C \neq 0$, because the edges represent non-zero commutation relations. It thus remains to show that
By the construction in Algorithm 1, the vertex G k is still accessible if it is visited in the inner loop with respect to i = n + 1 in the outer loop, and hence, an edge from G k to G n+1 is added and since
Indeed, to see that the vertex G k is still accessible when it is visited in Algorithm 1, assume by contradiction that it would not be accessible. Then, G k must be a predecessor of a vertex
However, we assumed that k is the largest index with these properties, and hence this leads to k = k ′ , which contradicts k < k ′ .
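Reading Lemma 3.3 as the statement that, for $i < j$, the gates cannot be interchanged exactly when $G_j \in \mathrm{Succ}(G_i, G)$ (an interpretation based on the proof above), the interchangeability test reduces to a reachability query in the canonical form. The following sketch assumes the graph is given as a hypothetical dictionary of direct-successor sets.

```python
def successors(direct_succ, start):
    """Succ(start): all vertices reachable from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        for w in direct_succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def can_interchange(direct_succ, i, j):
    """True iff [i, j]_C = 0 in the notation of the text."""
    if i == j:
        return True
    first, second = min(i, j), max(i, j)
    return second not in successors(direct_succ, first)


# Toy canonical form: 1 -> 2 -> 3, with vertex 4 unconnected.
dag = {1: {2}, 2: {3}, 3: set(), 4: set()}
print(can_interchange(dag, 1, 3), can_interchange(dag, 1, 4))
```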
Template matching algorithm
For any given circuit $C$ and any template $T$, TempMatch (see Algorithm 3) finds all maximal matches of $T$ in $C$. The intuition for how TempMatch works is best obtained from the example presented in Section 4.3. Let us first give a rough overview of the idea behind the algorithm, followed by the detailed pseudocode.
Overview of the template matching algorithm
The algorithm TempMatch loops over the gates in the template T and for each such gate T i for 1 ≤ i ≤ |T |, we search for matching gates C r in the circuit C acting on qubits listed in L q = {q 1 , . . . , q l } for 1 ≤ l ≤ n T . Then, we also consider all possibilities for choices of n T − l additional qubits in C that can be assigned with the qubits of the template T . Afterwards, the main subroutines ForwardMatch and BackwardMatch (see Algorithm 4 and Algorithm 5) are called to find the maximal matches of the partial template T ′ = (T i , . . . , T |T | ) in the circuit C under the condition that the gate T i is matched with C r . We think of splitting the template
) of the starting gate, such that all the gates in $T_{\mathrm{forward}}$ cannot be moved to the left of the initially matched gate $T_i$, i.e., $[1, j]_{T_{\mathrm{forward}}} \neq 0$ for all $j \in \{2, 3, \dots, |T_{\mathrm{forward}}|\}$. The algorithm ForwardMatch finds the maximal match of $T_{\mathrm{forward}}$ in the circuit $C$ (under the condition that we match $C_r$ with $T_i$) in time quadratic in the number of gates $|C|$ of the circuit $C$ (see Lemma 4.10 for the derivation of the worst-case complexity). The algorithm BackwardMatch can then be used to expand the match found by ForwardMatch to maximal matches of the template $T'$. The algorithm BackwardMatch thereby tries to add as many matches as possible with $T_{\mathrm{backward}}$. However, matching more of $T_{\mathrm{backward}}$ might destroy some matches with $T_{\mathrm{forward}}$. To handle this tradeoff, BackwardMatch has to go through all matching situations that could lead to maximal matches and is hence computationally more costly than ForwardMatch (but still efficient, see Lemma 4.11 for the worst-case time complexity). In practice, unless nearly all of the gates in the template commute, we expect that several matching scenarios can be ignored directly, since they destroy more matches with $T_{\mathrm{forward}}$ than could possibly be added by matching all the gates in $T_{\mathrm{backward}}$. Hence, the average-case complexity of BackwardMatch is expected to be much lower than the worst-case complexity.
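The outer enumeration described at the beginning of this overview (fix the qubits $L_q$ of a candidate gate $C_r$, choose $n_T - l$ additional circuit qubits, and try all orderings) can be sketched as follows. The function name and the concrete sizes are assumptions chosen for illustration; in the algorithm, most orderings are discarded immediately because the relabeled template gate does not coincide with $C_r$.

```python
from itertools import combinations, permutations


def qubit_assignments(circuit_qubits, gate_qubits, n_template):
    """Yield ordered tuples: template qubit k+1 is assigned to circuit qubit order[k]."""
    rest = [q for q in circuit_qubits if q not in gate_qubits]
    extra = n_template - len(gate_qubits)
    for chosen in combinations(rest, extra):          # additional circuit qubits
        for order in permutations(tuple(gate_qubits) + chosen):
            yield order


# Illustrative sizes: an 8-qubit circuit, a 5-qubit template, and a
# hypothetical starting gate acting on circuit qubits 7 and 6.
orders = list(qubit_assignments(range(1, 9), (7, 6), 5))
print(len(orders))  # C(6, 3) * 5! = 20 * 120 = 2400
```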
Pseudocode for the matching algorithm
In this section, we describe the pseudocode of the template matching algorithm. We stress that the focus of the code is readability and we do not optimize the constants of the runtime. We usually think of working with pointers to circuit or graph objects. Hence, an object might be modified by a method call without being returned as an output. As a result, we may sometimes have to copy an object o by calling o.copy.
• Circuit $C$ with $n_C$ qubits and $|C|$ gates
• Template $T$ with $n_T$ qubits and $|T|$ gates
2: Initialize a list $L_M$ to store matches
⊲ loop through all gate indices of $T$ for starting a match at $T_i$
7: for $r \in \{k \in \{1, \dots, |C|\}$ :
← $\{s \in L_q$ : the gate $C_r$ is acting non-trivially on the qubit with label $s\}$
for $p \in \mathrm{Perm}(1, 2, \dots, n_T)$ do ⊲ loop through all possible qubit orderings
12: $\tilde{T}$ ← Label the qubits in $T$ with the labels in L
# Note that the canonical form of $T$ and $\tilde{T}$ is the same
14: if $C_r = \tilde{T}_i$ then ⊲ check if the qubit permutation is such that $C_r$ matches $\tilde{T}_i$
15: # Rooted template matching: Find the maximal matches of the template
16: # $(\tilde{T}_i, \dots, \tilde{T}_{|T|})$ in the circuit $C$ under the restriction that $\tilde{T}_i$ is matched with $C_r$
17: # We match the maximal part in forward direction of $\tilde{T}_i$:
18: # Expand forward match to maximal ones with partial template $(\tilde{T}_i, \dots, \tilde{T}_{|T|})$:
20: Add the matches in $L_{\mathrm{rooted}}$ to $L_M$
Note that the elements $M$ in the output $L_M$ in Algorithm 3 are sets containing index pairs $(i, j)$ of matched gates, i.e., if the match $M$ contains $(i, j)$, the gate $T_i$ from the template was matched with the gate $C_j$ from the circuit. The qubit mapping can then be recovered from the matched gates.
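Recovering the qubit mapping from a match could look as follows. This is a sketch under the assumption that gates are stored as hypothetical `(name, qubits)` tuples whose qubit lists are in corresponding order (e.g., control first); gates that are symmetric under exchanging their qubits would need extra care.

```python
def recover_qubit_mapping(match, template, circuit):
    """Map template qubit labels to circuit qubit labels.

    `match` is a set of 1-based index pairs (i, j): T_i is matched with C_j.
    """
    mapping = {}
    for i, j in sorted(match):
        t_qubits = template[i - 1][1]
        c_qubits = circuit[j - 1][1]
        for tq, cq in zip(t_qubits, c_qubits):
            # setdefault records the first assignment and exposes conflicts
            if mapping.setdefault(tq, cq) != cq:
                raise ValueError("inconsistent match")
    return mapping


template = [("C-not", (1, 2)), ("C-not", (2, 3))]
circuit = [("C-not", (7, 6)), ("Rx", (1,)), ("C-not", (6, 5))]
print(recover_qubit_mapping({(1, 1), (2, 3)}, template, circuit))
```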
Remark 4.1. The output of TempMatch indicates how to match the gates of the template with the gates in the circuit. However, the information of how to commute the gates in the circuit to move the matched gates next to each other is not contained for simplicity. This information can be restored efficiently by commuting the gates in between the matched gates to the left or to the right of the match.
Algorithm 4 ForwardMatch: Find the maximal match in forward direction
• List L q of qubit labels we are matching on
• Gate indices $r$ in $C$ and $i$ in $T$ (where we start matching)
2: # Initialization:
3: Initialize a set $M \leftarrow \{(i, r)\}$ to store matched gate indices
4: For all vertices in $G^C$, initialize attributes
$v_0$ ← MatchedVertexList.Get(1) ⊲ matched vertex as a root for further matching
9: if $v_0$.SuccessorsToVisit is empty then
10:
GoTo "EndOfWhileLoop" GoTo "EndOfWhileLoop" 16: end if
17:
# We try to expand the match with the vertex v in the following.
18:
CandidateIndices ← FindForwardCandidates (G T , v 0 .matchedWith, M )
19: if there exists a $j \in$ CandidateIndices with $C_s = T_j$ then
20: # We found a match with v:
21: $j$ ← Choose the minimal $j \in$ CandidateIndices with $C_s = T_j$
22:
L ← Order vertices in S in increasing order according to their labels (and store as a list)
MatchedVertexList.Insert (v) 27:
# No match with v was found:
Set the attribute isBlocked equal to true for the vertex v and all of its successors 30:
end if
31:
Label "EndOfWhileLoop"
32: end while
33: Output: M
7 The method SuccessorsToVisit.Get(i) returns the ith vertex from SuccessorsToVisit and removes it from the list.
8 The method MatchedVertexList.Get(i) returns the ith vertex from MatchedVertexList and removes it from the list. The method MatchedVertexList.Insert(v) adds a vertex v at the position according to the label of the first vertex listed in v.SuccessorsToVisit.
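The list operations described in footnotes 7 and 8 could be realized as in the sketch below. The class is a hypothetical illustration: vertices are modeled as dictionaries, and placing vertices with an empty SuccessorsToVisit list at the end is an assumption that the footnotes do not specify.

```python
import bisect


class MatchedVertexList:
    """Vertices kept sorted by the first label in their SuccessorsToVisit list."""

    def __init__(self):
        self._keys = []    # sort keys, kept parallel to self._items
        self._items = []

    @staticmethod
    def _key(vertex):
        todo = vertex["SuccessorsToVisit"]
        # Assumption: vertices with nothing left to visit are sorted last.
        return todo[0] if todo else float("inf")

    def insert(self, vertex):
        """Insert at the position given by the first label to visit."""
        key = self._key(vertex)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._items.insert(pos, vertex)

    def get(self, i):
        """Return the i-th vertex (1-based) and remove it from the list."""
        self._keys.pop(i - 1)
        return self._items.pop(i - 1)


lst = MatchedVertexList()
lst.insert({"label": 8, "SuccessorsToVisit": [9, 12]})
lst.insert({"label": 3, "SuccessorsToVisit": [5]})
lst.insert({"label": 10, "SuccessorsToVisit": []})
first = lst.get(1)
print(first["label"])
```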
Algorithm 5 BackwardMatch:
Find maximal expansions of the forward match in backward direction
• Circuit C with canonical form G C with assigned attributes "matchedWith" and "isBlocked "
• Template T
• List L q of n T qubit labels we are matching on
• Gate indices r in C and i in T (where we start matching) 
s ← (GateIndices ) counter ⊲We consider C s for matching 13: v ← G C s
14:
# Trivial cases 15:
Add M to L M
18:
GoTo "EndOfWhileLoop" if v.isBlocked then
21:
MatchingScenarios.Push(G C , M, counter + 1)
22:
GoTo "EndOfWhileLoop" # Consider the gate C s corresponding to vertex v for matching:
25:
CandidateIndices ← FindBackwardCandidates(T, M, i)
26:
if C s ∈ {T j : j ∈ CandidateIndices } then 27:
# We found a match with the gate C s 28:
# Option 1: we match
30:
Choose a vertex G T j with a label j ∈ CandidateIndices , such that C s = T j 31:
v.matchedWith ← G 
# The gate C s corresponding to v can be moved to the start of the circuit C
47:
# or after the end of the already matched gates:
48:
49:
else 50:
# The gate C s corresponding to v might disturb the expansion of the match.
51:
# We have to consider two options.
52: Label "EndOfWhileLoop"
10 To improve the runtime, we could in addition check whether the length of the match $M'$ plus the number of gates in the template that could possibly be matched in the further backward matching process is smaller than the length of the initial forward match $M_{\mathrm{forward}}$. If this is the case, we can ignore this matching scenario because it cannot lead to a match that is at least as long as the forward match, and hence it cannot lead to a maximal match.
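The pruning test of footnote 10 amounts to a single comparison, sketched below with hypothetical argument names and illustrative numbers.

```python
def can_skip_scenario(len_current_match, remaining_matchable, len_forward_match):
    """True if the scenario cannot reach the length of the initial forward match."""
    return len_current_match + remaining_matchable < len_forward_match


# Illustrative numbers: 4 matched gates and at most 2 further candidates
# cannot reach a forward match of length 7, so the scenario is skipped.
print(can_skip_scenario(4, 2, 7))
```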
Example for a template match using the algorithm TempMatch
To demonstrate the working of our template matching algorithm TempMatch (Algorithm 3), let us consider an example. Suppose we are given a template T and a circuit C as shown in Figure 6 . For simplicity, we represent some gates in parallel. However, we may think of them as being stored in an ordered list and each gate having its own index. We recall that target as well as control nodes of different C-not gates (or Toffoli gates) commute with each other. In addition, R x and R z gates commute with target and control nodes of C-not gates, respectively. Note also that we use a restricted gate set in the example for simplicity, however, the algorithm works for arbitrary sets of gates.
Let us consider a fixed starting point for the matching, and for simplicity, let us assume that we want to start matching with the first gate of the template. 11 In the circuit, the algorithm TempMatch loops over all possible starting points for a match of the first gate and over all choices and orders of five qubits out of the eight qubits of the given circuit.
Let us consider $C_8$ as the starting gate for a match (as shown in Figure 6b) and the case where qubits 3 to 7 have been chosen with the order given by the mapping of the qubit labels (1, 2, 3, 4, 5) of the template to the qubit labels (7, 6, 5, 4, 3) in the circuit. For simplicity, let us denote the template with relabeled qubits (which is denoted by $\tilde{T}$ in Algorithm 3) again by $T$ in the following.
Figure 6: Template T that should be maximally matched with a connected part of the circuit C. We start with matching the two marked gates. The numbers at the top denote the indices of the individual gates.
11 In the algorithm, we loop over all the gates in the template and consider them as starting gates. The given choice does not restrict the generality of the example, since you may consider the given template as a part of a larger template, where the first gate corresponds to a gate with index i of the larger template.
12 This loop leads to the term
The algorithm TempMatch next starts its two main subroutines, first ForwardMatch and afterwards BackwardMatch, to determine the maximal matches of $T$ in $C$ with the chosen starting gates and on the chosen qubits (in the fixed order).
Forward matching. The subroutine ForwardMatch (Algorithm 4) is an algorithm that finds the maximal match of $T$ in $C$ in forward direction, i.e., with the gates that must follow the starting gate. In the graph representation this corresponds to vertices that are successors of the starting vertex. Since the gates considered by ForwardMatch cannot be commuted to the left of the starting gate, it can be proven that a greedy matching strategy is optimal (see Lemma A.1).
The algorithm starts with initializing a list MatchedVertexList = (G First, the vertex with lowest label in SuccessorsToVisit, i.e., the vertex $G^C_9$, is considered for matching (and removed from the list SuccessorsToVisit). The candidates of the template for a match with $G^C_9$ are the direct successors of the starting vertex $G^T_1$, i.e., the gates with labels (4, 6) in the template (which are found by Algorithm 6). Since $T_6 = C_9$, we found a match, and we set the attribute matchedWith of vertex G
Backward matching. It remains to find all maximal matches by also considering the vertices that are not successors of the starting vertex $G^C_8$. In general, it could help to move gates that disturb this matching process as far as possible to the right (which corresponds to blocking the corresponding vertex and its successors). However, this might block some gates that have already been matched in the forward matching process. Hence, there is a tradeoff between matching more gates on the left or on the right of the starting gate. This tradeoff makes the matching process costly in general, since one has to go through all the possibilities of moving disturbing gates to the left or to the right. However, based on the fact that we have already found a maximal match in the forward direction of the starting gate (by running ForwardMatch), we can show that this process is still efficient in the circuit size (but not in the template size in the worst case), as discussed in the proof of Lemma 4.11 given in Section 4.5.
We now start the method BackwardMatch (Algorithm 5). The list of all vertices in $G^C$ that have not been matched and are not blocked is given by GateIndices = (11, 7, 6, 5, 4, 3, 2, 1) (in decreasing order). We start by picking the largest index, i.e., we consider vertex $G^C_{11}$ for matching. The index 5 in the template is then found by FindBackwardCandidates, i.e., the only candidate in the template for a next match (in backward direction) is the direct predecessor G Since $T_5 \neq C_{11}$, we block the vertex $G^C_{11}$ according to line 43 in Algorithm 5. Similarly, we block the vertices $G^C_7$ and $G^C_6$ in the following. Then, vertex $G^C_5$ is considered for matching. We find that it matches the candidate vertex $G^T_5$; however, as we will see in the following, greedy matching is not always the optimal strategy. According to the code after line 27 we have to consider the option that we match the vertices (Figure 10a), but also the option of blocking $G^C_5$ and all of its successors (Figure 10b). Although the second option destroys the already matched gate $G^C_{22}$, blocking $G^C_5$ might help to match further gates on the left (which will turn out to be the case in the considered example). We have to consider both scenarios in the further matching process, and we push both of them onto a stack MatchingScenarios to keep track of them.
In the case where we matched the vertex $G^C_5$, one finds that no further gates can be matched (without blocking the matched gate $G^C_5$, which we do not have to consider, since it would lead to the same scenario as the non-matching case already added to the stack MatchingScenarios). Hence, we could match 10 gates in total in this scenario.
In the case where we do not match vertex $G^C_5$, one finds (by going through some different matching scenarios not listed here for readability) that the vertices G (Figure 11). Hence, we can match 11 gates in total in this matching scenario, which turns out to be the maximal match. The maximal match in the circuit picture is shown in Figure 12.
[Figure 10, caption partially recovered: the gates matched while running BackwardMatch are marked in orange (these matches will never be blocked again). Option 2: situation if we do not match vertex 5 and hence block it; this implies that we also have to block the already matched vertex 22.]
Correctness of the algorithm
In this section we formally prove the correctness of TempMatch (Algorithm 3), i.e., that for any circuit C and any template T the algorithm finds all maximal matches. Let us first formally define what we consider to be a "template match" and when it is called maximal. In terms of the canonical form, a part E of a circuit C is connected if and only if the vertices corresponding to the gates in E in the canonical form of C have the following property: if two of these vertices are connected by a path, then all the vertices that lie on the path must also correspond to gates in E.
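The connectedness condition can be checked mechanically on a DAG. The following is a minimal sketch, assuming the canonical form is available as a dictionary `successors` from vertex labels to direct successors; the helper names `descendants` and `is_connected_part` are hypothetical and do not appear in the algorithms of this paper.

```python
def descendants(successors, v):
    """Return all vertices reachable from v (excluding v itself)."""
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        for w in successors.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def is_connected_part(successors, part):
    """Check that no vertex outside `part` lies on a path between two of its vertices."""
    part = set(part)
    below = set()
    for v in part:
        below |= descendants(successors, v)
    # A vertex outside `part` violates the property if it is reachable from
    # `part` and some vertex of `part` is in turn reachable from it.
    for w in below - part:
        if descendants(successors, w) & part:
            return False
    return True

# Toy DAG (illustrative only): 1 -> 2 -> 3 and 1 -> 3.
toy_dag = {1: [2, 3], 2: [3], 3: []}
```

In the toy graph, the part {1, 3} is not connected in this sense, because vertex 2 lies on the path 1 -> 2 -> 3, whereas {1, 2} and {1, 2, 3} are connected.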
Definition 4.3 (Equivalence of circuits up to qubit relabeling). A circuit C is equivalent to a circuit E up to qubit relabeling if and only if there exists a bijective mapping from the qubit labels in circuit C to the ones in circuit E, such that the circuit $C'$ obtained by relabeling the qubits in circuit C satisfies $C' = E$ (i.e., $C'_i = E_i$ for all $i \in \{1, 2, \dots, |C|\}$, where $|C| = |E|$).
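Definition 4.3 can be tested by brute force for small circuits. The sketch below assumes a circuit is encoded as a list of `(gate_name, qubits)` tuples and only considers qubits that some gate acts on; this encoding and the function name `equal_up_to_relabeling` are illustrative assumptions, and the search over all bijections is exponential in the number of qubits.

```python
from itertools import permutations

def equal_up_to_relabeling(circ_c, circ_e):
    """Return a relabeling dict mapping qubits of circ_c to qubits of circ_e, or None."""
    if len(circ_c) != len(circ_e):
        return None
    labels_c = sorted({q for _, qubits in circ_c for q in qubits})
    labels_e = sorted({q for _, qubits in circ_e for q in qubits})
    if len(labels_c) != len(labels_e):
        return None
    # Try every bijection between the two label sets.
    for image in permutations(labels_e):
        relabel = dict(zip(labels_c, image))
        relabeled = [(name, tuple(relabel[q] for q in qubits))
                     for name, qubits in circ_c]
        if relabeled == circ_e:  # gate-by-gate equality, C'_i = E_i
            return relabel
    return None

circ_c = [("cx", (0, 1)), ("h", (1,))]
circ_e = [("cx", (2, 0)), ("h", (0,))]
relabeling = equal_up_to_relabeling(circ_c, circ_e)
```

For the two toy circuits, the relabeling maps qubit 0 to qubit 2 and qubit 1 to qubit 0.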
Definition 4.4 (Template match). We say that a template T has a match of length m in a circuit C if there exists a connected part $E_T$ of T of length $|E_T| = m$ that is equal up to qubit relabeling to a connected part $E_C$ of C. We refer to such a match M by a set of tuples of gate indices, where a tuple $(i, j)$ means that we matched the gate $T_i$ with the gate $C_j$.
Definition 4.6 (Equivalence of sub-circuits). Let us consider a circuit C and two subsets of gate indices $A, B \subset \{1, \dots, |C|\}$ with $|A| = |B|$. We say that the subsets A and B describe equivalent sub-circuits of C if and only if there exists a bijective mapping $f : A \to B$, such that for all $i \in A$ we have
• the gates in the circuit C can be commuted such that, in the resulting circuit, the gate with index $f(i)$ is at position $i$.
Definition 4.7 (Equivalence of template matches). For a match Q, let us denote the set of matched indices in the template by $Q_T := \{i : (i, j) \in Q \text{ for some } j \in \{1, \dots, |Q|\}\}$ and the set of matched indices in the circuit by $Q_C := \{j : (i, j) \in Q \text{ for some } i \in \{1, \dots, |Q|\}\}$. We say that two matches $M$ and $\tilde{M}$ are equivalent if and only if
• $M_T$ and $\tilde{M}_T$ describe equivalent subcircuits of the template T and
• $M_C$ and $\tilde{M}_C$ describe equivalent subcircuits of the circuit C.
We are now ready to state and prove the formal statement ensuring the correctness of TempMatch. This ensures that Algorithm 3 always succeeds, i.e., there are no situations where the algorithm does not deliver the desired output.
Theorem 4.8 (Correctness of TempMatch). Let C be a circuit and T a template. Then Algorithm 3 finds all maximal template matches (up to equivalent matches) of T in C.
We note that not all the matches returned by Algorithm 3 need to be maximal, and the output may contain equivalent matches. We ignored this to keep the pseudocode simple and because, for certain applications, it might be more efficient to work with this output instead of removing the non-maximal and equivalent matches from it. The proof of Theorem 4.8 is given in Appendix A.
Complexity of the algorithm
In the previous section we have learned that Algorithm 3 is correct in the sense that it always finds all the maximal matches for any given template and any given circuit. What remains to be understood is how efficiently Algorithm 3 finds these matches. This is settled by the following theorem.
Theorem 4.9 (Complexity of TempMatch). The worst-case time complexity of Algorithm 3 for a circuit C and a template T is O |C|
We note that the running time stated in Theorem 4.9 may be simplified as
From this we see immediately that Algorithm 3 is efficient (i.e., polynomial) in $|C|$ and $n_C$ and inefficient (i.e., exponential) in $|T|$ and $n_T$.
To prove the assertion of Theorem 4.9 we first need to understand the running time of the two subroutines ForwardMatch and BackwardMatch. This is done in the following lemmas, where we assume that $|T| \le |C|$. Furthermore, we assume that we can check if two gates commute in constant time. For a fixed (finite) gate set, one possibility to achieve this is to store the commutation relations between all the gates in a table. Since we will account time complexity O(|D|
In the case where we find a match (see line 20 in Algorithm 4), the most expensive part is to order the set S that contains at most $|C|$ successors that we have left to visit from the point of view of the vertex v. Ordering the elements in S has worst-case time complexity $O(|C| \log |C|)$. Since the case where we find a match can occur at most $|T|$ times, the full complexity of this case is $O(|T||C| \log |C|)$.
In the case where we cannot match (see line 28 in Algorithm 4), we have to block the vertex v and all of its successors. Since we assume constant time access to a list containing all successors of v, this can be done in time $O(|C|)$. The case where we cannot match can occur at most as many times as we have to run the while-loop. Hence, the full complexity of this case is $O(|T||C|^2)$. We conclude that the worst-case complexity of ForwardMatch is given by
2 ), where the second term $|T|^2|C|$ arises from line 13 and line 18 in Algorithm 4.
Proof. The complexity of BackwardMatch depends on the number of possible matching conditions that are added to the stack MatchingScenarios during the while-loop. We consider the tree of possible matching conditions, i.e., we start from an initial match $M_{\mathrm{forward}}$ and consider the branching defined by the two options considered in both cases of the if-condition in line 26 in Algorithm 5. First, we calculate an upper bound on the number of vertices of this tree, i.e., an upper bound on the number $\mathrm{It}(C, T)$ of iterations in the while-loop. The branching that happens if we can match (see line 27 and the following lines in Algorithm 5) adds at most one matching condition to the stack with a counter increased by one and one matching condition where in addition a further gate of the template $T^{\mathrm{backward}}$ is matched. The branching that happens if we cannot match (see line 42 and the following lines in Algorithm 5) adds at most one matching condition to the stack with a counter increased by one and one matching condition where in addition at least one of the initially matched gates (listed in $M_{\mathrm{forward}}$) was disturbed. Note that the gates matched during the run of BackwardMatch (and stored in $M_{\mathrm{forward}}$) will not be disturbed again later on (see the condition to add a matching condition to the stack in line 38 and line 59 in Algorithm 5).
Let us denote the number of cases where we disturb an initially matched gate in a certain branch $\beta$ of the binary tree by $t_d(\beta)$, and the number of cases where we add an additional match by $t_m(\beta)$. Then, each branch $\beta$ of the binary tree must satisfy $t(\beta) := t_d(\beta) + t_m(\beta) \le |T|$, since the matching process ends if all gates of the template are matched or were disturbed (and excluded from further matching). Each branch of the binary tree can contain at most $|C|$ vertices, since each vertex reduces the size of the circuit that is left to consider for matching by one (since counter is increased by one). Therefore, and since at each branching either $t_d$ or $t_m$ is increased by 1 for one of the two branches, the number of branches of the binary tree is bounded by
Since each branch is of length at most $|C|$, we find
Let us now consider the complexity of the computation at each vertex of the tree. The complexity of FindBackwardCandidates is $O(|T|^2)$, since, in the worst case, we have to check for each gate in T if it can be commuted to the current matching position. By assumption, we have constant time access to all predecessors of a vertex in the canonical form $G^C$ of the circuit C. Hence, the calculation at each vertex has time complexity $O(|C| + |T|^2) = O(|C||T|)$, where the last step uses the assumption $|T| \le |C|$. This finishes the proof.
Proof of Theorem 4.9. The assertion of Theorem 4.9 now follows from Lemma 4.10 and Lemma 4.11 and inspection of the loop structure in TempMatch.
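The constant-time commutation check assumed in the lemmas above can be realized with a lookup table. The following minimal sketch covers only single-qubit gates from the small set {x, z, h, s}; the encoding of a gate as a `(name, qubit)` tuple and the names `COMMUTE_TABLE` and `commute` are assumptions for illustration. A table for multi-qubit gates would additionally have to be keyed on how the qubits of the two gates overlap.

```python
# Commutation relations for distinct single-qubit gates acting on the same qubit.
COMMUTE_TABLE = {
    frozenset(("x", "z")): False,  # X and Z anticommute
    frozenset(("x", "h")): False,
    frozenset(("x", "s")): False,
    frozenset(("z", "h")): False,
    frozenset(("z", "s")): True,   # both are diagonal
    frozenset(("h", "s")): False,
}

def commute(gate_a, gate_b):
    """Constant-time commutation check for gates encoded as (name, qubit)."""
    (name_a, qubit_a), (name_b, qubit_b) = gate_a, gate_b
    if qubit_a != qubit_b:
        return True   # gates on different qubits always commute
    if name_a == name_b:
        return True   # a gate commutes with itself
    return COMMUTE_TABLE[frozenset((name_a, name_b))]
```

Each call performs a fixed number of comparisons and one dictionary lookup, independent of the circuit size.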
Heuristics to reduce the runtime
Since the worst-case complexity of the algorithm can be a high-degree polynomial in the circuit size and in the number of qubits in the circuit for large templates (see Theorem 4.9), one might want to introduce some heuristics to lower the runtime. This, however, comes at the cost of possibly not finding all the matches. For the complexity analysis here we assume a fixed template size. The computationally expensive steps in the matching algorithm are:
1. Choosing the qubits in a circuit C on which we would like to search for a match with a template T (see line 9 in Algorithm 3).
2. Considering all possible matching conditions during the run of BackwardMatch (see the two options in the two cases starting in line 27 and line 42 in Algorithm 5).
Possible heuristics to reduce the runtime of each of these cases could be:
1. We fix a maximal "looking forward" length $l \in \mathbb{N}$ and start from the first matched gate $C_r$ in the circuit. Then, we take the first $l$ successors of $G^C_r$ with the minimal labels and store them in a gate list L. We consider only the qubits in a set $Q_L$ as possible candidates for the matching, where the set $Q_L$ contains all the qubits for which there is a gate in L that acts nontrivially on the qubit. This reduces the theoretical complexity term $O(n_C^{n_T})$ appearing in Theorem 4.9 to a constant, i.e., $O(1)$, under the assumption that all the gates in the considered gate library act on a bounded number of qubits. (The additional complexity for searching the successors stored in L is $O(1)$ by Remark 3.1.)
Furthermore, we may use some heuristics to choose "promising" subsets of the qubits listed in $Q_L$ during the matching process. For example, one may consider the first gate $T_a$ in the template that connects the already matched gates with a new qubit. Then we search for all gates of type $T_a$ in the gate list L whose relation to the qubits on which $C_r$ (and the other matched gates in the circuit) act corresponds to the relation of the qubits in the template. The qubits connected through the gates found in this way might be chosen as candidates for the corresponding qubit in the template.
2. We fix a maximal branch length $b \in \mathbb{N}$ and a maximal number of survivors $s \in \mathbb{N}$. In the algorithm BackwardMatch, we add all the matching conditions to the stack until each of them has undergone $b$ iterations of the while-loop. Then, we evaluate how promising each of the matching conditions is under some chosen metric. For example, one may choose the number of already matched gates as such a metric. We remove all matching conditions from the stack except the $s$ ones with the highest value with respect to the chosen metric. Then we evolve these matching conditions further, and so on. Note that such a heuristic reduces the complexity of BackwardMatch from $O(|C|^{2+|T|})$ (see Lemma 4.11 and recall that we have a fixed template size) to $O(|C|^2)$.
Alternatively, one may also consider a more clever order in which to visit the gates in the circuit. The current algorithm traverses the gates in decreasing order of their labels (by using a counter over an ordered list GateIndices of gate indices, see Algorithm 5). However, changing the order to increase the chances that matching gates are considered earlier may significantly lower the runtime of a practical implementation.
We conclude that under the two mentioned heuristics the full matching algorithm has a worst-case time complexity of $O(|C|^3)$ for a fixed template size.
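The pruning step of the second heuristic (keeping only the $s$ most promising matching conditions) can be sketched as follows. The sketch assumes that a matching scenario is represented simply by its list of matched index pairs and uses the number of matched gates as the metric; the function name `prune_scenarios` and this representation are illustrative assumptions. The $b$ iterations of the while-loop between two pruning steps are not modelled here.

```python
import heapq

def prune_scenarios(scenarios, num_survivors, metric=len):
    """Keep the `num_survivors` scenarios with the highest metric value."""
    return heapq.nlargest(num_survivors, scenarios, key=metric)

# Hypothetical scenarios: lists of (template index, circuit index) pairs.
scenarios = [
    [(1, 8), (6, 9)],
    [(1, 8)],
    [(1, 8), (6, 9), (5, 5)],
]
survivors = prune_scenarios(scenarios, num_survivors=2)
```

With two survivors, the scenario with a single matched gate is dropped. Other metrics can be plugged in through the `metric` argument.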
Remark 4.12. Let us assume that an application uses a certain "stopping criterion" for the algorithm TempMatch; for example, it might be fine to finish the search for maximal matches once a match of a certain length is found. For this use case, it is important to traverse the tree of matching scenarios, as discussed in item 2 above, in a way that is adapted to the class of circuits one considers. For example, one could start with a greedy strategy and always match gates if possible. Then one could consider the options where we decline to match exactly once, and so on.
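The traversal order suggested in Remark 4.12 can be illustrated on a simplified model in which there is a fixed number of decision points, each offering the choice "match" or "do not match". This is an assumption made for illustration only: in BackwardMatch the available decisions depend on the earlier ones. The generator name `decision_orders` is hypothetical.

```python
from itertools import combinations

def decision_orders(num_decisions):
    """Yield decision tuples (True = match) ordered by increasing number of non-matches."""
    for num_skips in range(num_decisions + 1):
        for skipped in combinations(range(num_decisions), num_skips):
            yield tuple(i not in skipped for i in range(num_decisions))

orders = list(decision_orders(2))
```

The first tuple produced is the fully greedy strategy, followed by all strategies that decline exactly one match, and so on. Combined with a stopping criterion, the enumeration can be aborted as soon as a sufficiently long match is found.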
Outlook
In this paper we presented an efficient algorithm for template matching in quantum circuits and hence, in particular, also for sub-circuit searches in reversible logic circuits. This provides a step to overcome the tradeoff existing today between not finding all the matches and long runtimes. Currently, we are working on an implementation of the matching algorithm in the IBM software library Qiskit [17] to test its runtime for practical applications. As mentioned, we expect considerably better scaling on average than given by the theoretical worst-case analysis (see Theorem 4.9). Furthermore, we plan to implement the heuristics explained in Section 4.6 to investigate how much they can speed up the algorithm and how many matches are missed in the average case, depending on the choice of the parameters for the heuristics. It would be interesting to see how many of the maximal matches can be found in a fraction of the runtime of the full algorithm using search techniques adapted to certain classes of circuits (see Remark 4.12). Plotting the number of found matches against the runtime of the algorithm for different greedy search heuristics might lead to insights into the structure of common quantum circuits.
To use our algorithm for circuit optimization, it would be interesting to find further templates. In general, not too many templates are known. Universal decomposition schemes (such as the one given in [18] and implemented in [19] ) can in principle produce whole parameter groups of templates. It remains to be investigated how to make use of such templates in practice.
Another application of our algorithm is peephole optimization of quantum circuits. There, instead of searching for a template T in a circuit C, we may search for the longest connected parts in C on a chosen subset of qubits $L_q$ (i.e., every gate that only acts on these qubits can be considered as a match). If one finds a connected part E in C that requires more gates than a generic unitary on $|L_q|$ qubits (see [18] for the best known gate counts for arbitrary isometries), one may multiply the gates listed in E to find the unitary $U_E$ that describes the whole operation performed by E. The unitary $U_E$ could then be synthesized with the best known methods, and the circuit E can be replaced with the newly synthesized circuit. A version of peephole optimization on two qubits is already implemented in the IBM software library Qiskit [17] and turned out to be very useful for circuit optimizations.
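The step of multiplying the gates of a connected part E into a single unitary $U_E$ can be sketched in plain Python. The sketch assumes that every gate is already given as a full matrix on the chosen qubits (the embedding of gates that act on only some of the qubits is omitted), and the helper names `matmul` and `circuit_unitary` are hypothetical. Gates are listed in circuit order, so later gates multiply from the left.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def circuit_unitary(gates):
    """Return U_E = E_m ... E_1 for gates listed in circuit order E_1, ..., E_m."""
    n = len(gates[0])
    unitary = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for gate in gates:
        unitary = matmul(gate, unitary)  # later gates multiply from the left
    return unitary

r = 1 / math.sqrt(2)
H = [[r, r], [r, -r]]
Z = [[1.0, 0.0], [0.0, -1.0]]
X = [[0.0, 1.0], [1.0, 0.0]]
u_e = circuit_unitary([H, Z, H])  # three gates that together implement X
```

In this single-qubit toy example the three gates H, Z, H multiply to the Pauli X gate (up to floating-point rounding), so the part could be replaced by one gate.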
Another interesting direction is to generalize the template matching algorithm to arbitrary DAGs, where vertices can be interchanged under certain predefined rules (corresponding to the commutation relations between the different quantum gates in the circuit picture). Representing quantum circuits as DAGs, our current algorithm could be considered as a subgraph search algorithm. However, it requires some additional structure, such as that the number of incoming edges to a vertex is equal to the number of outgoing edges (since quantum gates are reversible). It would be interesting to see if all such restrictions could be removed or if some of them are necessary for the efficiency of the template matching algorithm. Searching for patterns in graph-structured data has applications in a broad range of areas, such as biology, computer vision, electronics, computer-aided design, social networks, intelligence analysis, and artificial intelligence (see [20] for an overview). Hence, we would expect our algorithm to find further applications in different fields, where it is natural that certain vertices of the graph can be interchanged.
and keeps the qubit order of the template. 13 Furthermore, ForwardMatch only blocks vertices in the canonical form of the circuit C that correspond to gates that could never be matched with gates in the full template T (or which would lead to equivalent matches).
Proof. Let us denote the canonical form of the circuit C and of the template T by $G^C$ and $G^T$, respectively. We have only to consider gates $C_s$ with G Matching gates must correspond to vertices in $G^C$ that are successors of G Since a proper match is connected, we find that all the gates with labels corresponding to the vertices on the path p must also be matched with gates in the circuit. Therefore, we find a corresponding path from G
It remains to be shown that ForwardMatch handles each successor of $G^C_r$ in a way that leads to one of the (equivalent) longest matches. Since a proper match must be connected and we start matching $C_r$ with $T^{\mathrm{forward}}_1$ from the left to the right in the circuit, the indices CandidateIndices of the gates in the template that could be matched next must be direct successors of the already matched gates. These successors are found in Algorithm 6. We have the following two cases in the while-loop for a matched root vertex $v_0$ and a direct successor $v$ of $v_0$ with label $s$ that we consider for matching in the canonical form $G^C$ of the circuit C.
1. Gate $C_s$ matches with a gate $T_j$ with $j \in$ CandidateIndices. In the following we show that the optimal strategy in this case is indeed to greedily match the two gates. There are two cases:
(a) If the vertex $v = G^C_s$ is the only successor of $G^C_r$ such that $C_s = T_j$, we should match the two gates, since matching them will not disturb any possible further matches (but can only lead to longer matches). Indeed, not matching would reduce the possible candidates in the template for further matches and might block vertices in $G^C$ that are successors of the unmatched gate. Note that the order in which the gates are visited in the while-loop ensures that all predecessors of $v$ that are also successors of the starting vertex G $\notin \mathrm{DirectSucc}(v_0, G^C)$, it has to be a successor of $v_0$ with at least one vertex in between (because $[C_{v_0.\mathrm{label}}, C_t] \neq 0$, since $C_t = T_j = C_s$ and $G^C_s \in \mathrm{DirectSucc}(v_0, G^C)$). In this case, the vertex in between cannot be matched, and hence this scenario would not lead to a connected match (as long as we do not block the matched predecessors of $G^C_t$, which is not allowed since it would also block the starting vertex $G^C_r$).
2. Gate $C_s$ does not match with a gate $T_j$ with $j \in$ CandidateIndices. Similarly as above, one can see that matching the gate $C_s$ with any gate in $T^{\mathrm{forward}}$ or in $T^{\mathrm{backward}}$ cannot lead to a connected match. Since G with $C_r$), such that there is no sub-match in $M_{\max}$ that is equivalent to $M$. Since the algorithm ForwardMatch goes through all gates that could possibly be matched with $T^{\mathrm{forward}}$ (see the proof of Lemma A.1), there is a first gate $C_s$ considered in this process that is not matched in the maximal match $M_{\max}$ found by ForwardMatch (and there is also no equivalent match). Since matching a gate in this process can never disturb future matches (apart from equivalent ones) as shown in case 1 in the proof of Lemma A.1, we conclude that the gate $C_s$ could also be matched in the forward matching process. However, according to the algorithm ForwardMatch, the gate would then indeed be matched, leading to a contradiction with the assumption that $C_s$ is not matched in the maximal match $M_{\max}$ found by ForwardMatch.
Let us first give an overview of the proof idea. Since we have already given a maximal match of $T^{\mathrm{forward}}$ in C, it remains to verify how many gates of $T^{\mathrm{backward}}$ can be matched. In general, it can happen that a gate that disturbs the match of $T^{\mathrm{backward}}$ can be moved to the right in C. This may allow us to match further gates in $T^{\mathrm{backward}}$, but may also disturb the maximal match of $T^{\mathrm{forward}}$. To handle this tradeoff, we have to consider both possibilities: (i) moving the disturbing gate as far as possible to the right, and (ii) moving it as far as possible to the left. These options correspond to blocking the successors or the predecessors of the vertex corresponding to the disturbing gate in the canonical form of the circuit C. We then have to keep track of both possibilities and go on with matching in both cases, building up a stack of possible matching scenarios that could lead to a maximal match.
Let us now give a detailed proof of the correctness of the algorithm.
Proof of Lemma A.4. The matching process works with the canonical representation $G^C$ of the circuit C, where the matched gates listed in $M_{\mathrm{forward}}$ are marked as matched in $G^C$, and their successors are blocked (corresponding to the state after running ForwardMatch). Let us first show that the strategy of first finding a maximal forward match and afterwards starting the backward match leads to all possible maximal matches of the full template. A priori it could happen that a non-maximal forward match could lead to a maximal match of the full template. However, by Lemma A.3, different forward matches could only be sub-matches of the maximal match. Since in the process of backward matching (as analyzed in detail below) further matched gates on the right of the starting gate $C_r$ with gates in $T^{\mathrm{forward}}$ can only lead to longer matches, we conclude that the strategy of first finding the maximal forward match is indeed optimal.
Let us now consider the backward matching process. The while-loop of the matching process in Algorithm 5 goes through all the vertices in the canonical form $G^C$ of the circuit that are not blocked or already matched (as long as there are gates left to consider in $T^{\mathrm{backward}}$ that could possibly be matched). Since the blocked gates will never match (see Lemma A.1), we loop through all vertices corresponding to gates that could possibly be matched with $T^{\mathrm{backward}}$. The indices of these vertices are stored in a list GateIndices in decreasing order (see line 4 and line 5 in Algorithm 5). The variable counter keeps track of the number of gates with indices listed in GateIndices that we have already considered in the backward matching process. During the matching, we create a stack of possible matching scenarios that may lead to a maximal match. All of these scenarios are then considered for further matching in the next steps of the while-loop. Therefore, it remains to show that each step in the while-loop with parameters $(G^C, M, \mathrm{counter})$ puts all the matching possibilities for the gate with index $s = (\mathrm{GateIndices})_{\mathrm{counter}}$ that might lead to a maximal match on the stack MatchingScenarios for further investigation. The indices CandidateIndices of the gates in the template that could possibly match are the ones that can be moved to the left of the already matched gates in the template. These gates are found in Algorithm 7. To show the correctness of one cycle of the while-loop, let us consider two cases separately.
1. Gate $C_s$ matches with a gate $T_j$ in the template with index $j \in$ CandidateIndices. First we note that if we decide to match the gate $C_s$, then matching it with another gate $T_k = T_j$ with $k \in$ CandidateIndices would lead to equivalent matches (since the gates $T_j$ and $T_k$ could in this case be commuted to the same place in the template). If we do not match, the gate $C_s$ is moved as far as possible to the right in the circuit C in Algorithm 5 and the gates that cannot be commuted to the left of it are removed, i.e., we block the vertex $G^C_s$ and all its successors. If we disturb the match of the starting gate by doing so, we can ignore this matching scenario. Moreover, if we block a vertex corresponding to a gate $C_t$ that is matched with a gate in $T^{\mathrm{backward}}$ by blocking the successors of $G^C_s$, we can also ignore this scenario, since it is already on the stack (because in an earlier cycle of the while-loop, we considered the case of not matching $C_t$). Hence, it remains to show that if we do not match, it is never necessary to move the gate $C_s$ as far as possible to the left in the circuit to find a maximal match. Indeed, moving the gate to the left (and ignoring it for the further matching process) could lead to one of the following two cases:
(a) we match the gate $T_j$ later on in the matching process with a different gate in the circuit,
(b) we never match the gate $T_j$.
It can be verified that case (a) leads to matches that are equivalent to the ones where we match $C_s$ with $T_j$. In case (b), not matching $T_j$ with $C_s$ can only disturb the backward matching process later on (since it blocks gates that do not commute to the right of $C_s$ in the circuit and reduces the possible matching candidates from the template), but cannot increase the match. Hence, we can ignore the case where we move $C_s$ as far as possible to the left.
2. Gate $C_s$ does not match with any gate in the template with index $j \in$ CandidateIndices.
One can see that the gate $C_s$ can never match and may disturb the matching. We consider both options: moving it as far as possible to the left or to the right in the circuit and removing the trailing gates. If we can move the gate $C_s$ to the right of all the gates in $M_{\mathrm{forward}}$ matched with $T^{\mathrm{forward}}$ or to the start of the circuit C (see the condition in line 45 in Algorithm 5), doing so will lead to a maximal match, since the gate $C_s$ does not disturb the following matching process or the current match. Hence, in this case we only add this possibility to the stack MatchingScenarios, whereas otherwise we have to add both options of moving the gate $C_s$ as far as possible to the left or to the right. As in case 1, we do not have to add matching conditions to the stack if we disturb the initial match or any match with gates in $T^{\mathrm{backward}}$.
This finishes the proof.
We are finally ready to prove the assertion of Theorem 4.8.
Proof of Theorem 4.8. Assume that there is a maximal match $M$ of a template T in the circuit C. We have to show that this match is found by Algorithm 3. The algorithm loops over all possible choices of $n_T$ "matching" qubits out of the $n_C$ qubits of the circuit, as well as over all possible permutations of them. The qubit relabeling in the template in line 12 in Algorithm 3 then ensures that there is a run of the loop with a template $\tilde{T}$, such that for all the index pairs $(j, s) \in M$, we have $\tilde{T}_j = C_s$ (which in particular means that the two gates are acting on qubits with the same labels). Assume that the indices of the gates $\tilde{T}_1, \tilde{T}_2, \dots, \tilde{T}_k$ are not listed (as a first entry of an element) in $M$, but $\tilde{T}_{k+1}$ is the first matched gate in the template, i.e., there is a tuple $(k + 1, r) \in M$ for some $r \in \{1, 2, \dots, |C|\}$. Then, once in the loop of the algorithm TempMatch, we set the start index of the template to $i := k + 1$ and the start index in the circuit to $r$. By the correctness of ForwardMatch and BackwardMatch (see Lemma A.1 and Lemma A.4), all the maximal matches (up to equivalent ones) of $(\tilde{T}_i, \dots, \tilde{T}_{|T|})$ in the circuit C are found starting with matching the gate $\tilde{T}_i$ with $C_r$. Since the gates $\tilde{T}_1, \tilde{T}_2, \dots, \tilde{T}_{i-1}$ are not contained in the match we are searching for, TempMatch hence finds the maximal match $M$ (or an equivalent one).
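The loop over qubit choices described in the proof can be sketched with the standard library. The function name `qubit_assignments` is hypothetical, and the sketch only follows the description given in the proof (all ordered selections of $n_T$ out of $n_C$ qubits); Algorithm 3 itself may restrict these choices further, for example based on the chosen starting gate.

```python
from itertools import permutations
from math import factorial

def qubit_assignments(num_circuit_qubits, num_template_qubits):
    """Yield tuples whose k-th entry is the circuit qubit assigned to template qubit k."""
    return permutations(range(num_circuit_qubits), num_template_qubits)

assignments = list(qubit_assignments(4, 2))
# Number of ordered selections: n_C! / (n_C - n_T)!
expected_count = factorial(4) // factorial(4 - 2)
```

For a circuit on four qubits and a template on two qubits this enumerates 12 ordered qubit pairs, which is polynomial in $n_C$ for fixed $n_T$ but grows exponentially with $n_T$.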
