The design and optimization of quantum circuits is central to quantum computation. This paper presents new algorithms for compiling arbitrary $2^n \times 2^n$ unitary matrices into efficient circuits of $(n-1)$-controlled single-qubit and $(n-1)$-controlled-NOT gates. We first present a general algebraic optimization technique, which we call the Palindrome Transform, that can be used to minimize the number of self-inverting gates in quantum circuits consisting of concatenations of palindromic subcircuits. For a fixed column ordering of two-level decomposition, we then give an enumerative algorithm for minimal $(n-1)$-controlled-NOT circuit construction, which we call the Palindromic Optimization Algorithm. Our work dramatically reduces the number of gates generated by the conventional two-level decomposition method for constructing quantum circuits of $(n-1)$-controlled single-qubit and $(n-1)$-controlled-NOT gates.
Introduction
The recent discovery of algorithms for prime factorization, discrete logarithms and other important problems [10, 16] that are more efficient on quantum computers than classical computers has escalated interest in quantum computing. However, physical limitations of current quantum technologies, such as coherence time and the number of available qubits, prevent the usage of quantum algorithms in any computationally significant setting. It is important, therefore, for any implementation of a quantum algorithm to make efficient use of the underlying quantum computing resources.
No matter what technology will ultimately be used to implement quantum computers, the quantum circuit is most likely to remain the primary model for quantum computation [8, 13, 17]. It allows us to represent an algorithm to be implemented by any quantum computer as a composition of quantum gates. Although it is analogous to a classical logic circuit, a quantum circuit requires novel compilation and optimization algorithms since the criteria for efficient quantum computation are radically different from classical computation. It is particularly important to reduce the size of quantum circuits in the early phases of compilation since the later phases may increase circuit sizes dramatically for each additional gate in the initial circuit representation [2, 4, 11, 15]. Ideally we would like to achieve the best circuit for a given class of gates and a given technology taking into account all relevant factors such as size, noise, decoherence time, and so forth. A general-purpose quantum compiler will require both technology-independent and technology-dependent optimization techniques to achieve these efficiency goals. Until a fully scalable quantum computer technology emerges, we will restrict ourselves to machine-independent techniques.
In this paper, we focus on the design and optimization of quantum circuits consisting of controlled single-qubit gates for arbitrary $2^n \times 2^n$ unitary matrices. In particular, we focus on the reduction of $(n-1)$-controlled-NOT gates in such circuits. To achieve this reduction, we introduce a general algebraic gate-minimization technique, which we call the Palindrome Transform. We then present an efficient iterative method, the Palindromic Optimization Algorithm, for decomposing a quantum circuit into matrices acting nontrivially on two or fewer vector components (two-level matrices). These algorithms are useful in the first phase of any general procedure for decomposing a quantum computation into an efficient quantum circuit. Ultimately we would like to produce efficient quantum circuits for different quantum technologies from high-level specifications of quantum computations.
The Quantum Circuit Model
We use the standard Dirac notation for quantum states, where a quantum state $\psi$ is written in ket form as $|\psi\rangle$. A quantum bit, or qubit, has state $|0\rangle$, state $|1\rangle$, or a linear combination of these states, written as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where $\alpha$ and $\beta$ are complex numbers and $|\alpha|^2 + |\beta|^2 = 1$. The state space of $n$ qubits, a $2^n$-dimensional complex Hilbert space, can be represented as a tensor product of the state spaces of the individual qubits
$$\mathcal{H} = \mathcal{H}_{n-1} \otimes \cdots \otimes \mathcal{H}_1 \otimes \mathcal{H}_0$$
and a state can be described by the vector
$$|\psi\rangle = \sum_{x \in \{0,1\}^n} \alpha_x |x\rangle$$
where the computational basis states are of the form $|x_{n-1} \ldots x_1 x_0\rangle$ and the probability of measuring state $|x\rangle$, where $x = x_{n-1} \ldots x_1 x_0$, is $|\alpha_x|^2$.
We can model quantum computation using the quantum circuit model developed by Deutsch [8] and Yao [17]. The quantum circuit model consists of qubits, quantum wires, and quantum gates, where quantum wires provide communication between the sequential quantum gates by transporting output from one computation to serve as input to another. To identify the matrix elements of particular quantum gates, we order our states lexicographically. In our circuit diagrams, time increases from left to right, but the operators in a matrix sequence are applied to the state from right to left.
In the quantum circuit model, a quantum gate on $n$ qubits is a $2^n \times 2^n$ unitary matrix $U$. A composition of quantum gates $G_k \ldots G_1$ is called a quantum circuit $C$, where the product $G_k \cdots G_1$ represents the unitary operator computed by $C$. Two quantum circuits are equivalent if the composition of their respective gates represents the same unitary matrix. That is, if circuit $C_1$ represents the matrix $U_1$ and $C_2$ represents $U_2$, and if $U_1 = U_2$, then $C_1$ and $C_2$ are equivalent.
A set of quantum gates is exactly universal if it can represent any unitary operation exactly by a composition of its gates; a set is approximately universal if it can approximate any unitary operation to an arbitrary accuracy by a composition of its gates [7, 12]. Since there are uncountably many unitary operations, exact universality requires an infinite generating set of quantum gates. However, approximate universality can be achieved by certain discrete sets of quantum gates. In this paper, we consider exact universality using the universal set of $(n-1)$-controlled single-qubit and $(n-1)$-controlled-NOT gates [7].
We use the following standard gates in our quantum circuits. The single-qubit Pauli-X operator
$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
is similar to the classical NOT operation and takes the state $|x\rangle \rightarrow |1 - x\rangle$. There also exist operations on multiple qubits, such as the ability to conditionally apply a single-qubit gate. Controlled gates perform the target operation $S$ only if the control qubits are set appropriately. The $(n-1)$-controlled gate, written as $\Lambda_{n-1}(S)$, denotes $n-1$ qubits controlling the application of the operator $S$ to the target qubit. Throughout this paper, $S$ represents a single-qubit gate. The controlled operation $\Lambda_{n-1}(S)$ is defined by
where $x_{n-1} \wedge \ldots \wedge x_1 \wedge x_0$ in the exponent of $S$ denotes the Boolean product of the bits $x_{n-1}, \ldots, x_1, x_0$. If the product of these bits is 0, then the operator is not applied. The $\Lambda_1(X)$ gate is known as the controlled-NOT gate (CNOT) and performs the operation $|x, y\rangle \rightarrow |x, x \oplus y\rangle$, where $\oplus$ denotes the logical exclusive-or operation. Henceforth, we will refer to $\Lambda_1(X)$ as the CNOT gate. In matrix form, the CNOT gate is
$$\Lambda_1(X) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
In this paper, we focus on decomposition techniques using two-level unitary matrices, where a two-level unitary matrix acts nontrivially on two or fewer vector components. Figure 1 shows a two-level matrix $M$. The row $c$ contains 0's except for the two complex numbers $\alpha$ and $\beta$ shown. Likewise the row $r$ contains 0's except for the two complex numbers $\gamma$ and $\delta$. The rest of the matrix has 1's on the diagonal and 0's elsewhere. $M$ acts nontrivially on the space spanned by the vector components $c$ and $r$. We define $\tilde{M}$ to be the $2 \times 2$ unitary submatrix consisting of $\alpha$, $\beta$, $\gamma$ and $\delta$ shown in Figure 2. We call this matrix the component matrix of $M$. Clearly, $\tilde{M}$ is a unitary operator that acts on a single qubit. When necessary, we will indicate the vector components $c$ and $r$ on which $M$ nontrivially acts by writing $M_{c,r}$ and $\tilde{M}_{c,r}$.
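To make the definition concrete, the following Python sketch embeds a $2 \times 2$ component matrix into a two-level matrix acting on components $c$ and $r$. The function name `two_level` and the nested-list matrix representation are assumptions made for this illustration and do not come from the paper.

```python
import math

def two_level(dim, c, r, comp):
    """Embed the 2x2 component matrix `comp` into a dim x dim identity so
    that the result acts nontrivially only on vector components c and r."""
    M = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    (alpha, beta), (gamma, delta) = comp
    M[c][c], M[c][r] = alpha, beta    # row c holds alpha and beta
    M[r][c], M[r][r] = gamma, delta   # row r holds gamma and delta
    return M

# Example: a Hadamard-like component matrix acting on components 0 and 3
s = 1 / math.sqrt(2)
M = two_level(4, 0, 3, [[s, s], [s, -s]])
```

All other diagonal entries stay 1, so components 1 and 2 are left untouched.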
A Framework for Quantum Circuit Compilation
We now describe the first phase of our quantum circuit compilation process that generates for an arbitrary unitary matrix $U$ an exact quantum circuit consisting of $(n-1)$-controlled single-qubit gates and $(n-1)$-controlled-NOT gates [13, 14]. The compilation steps of this phase are shown in Figure 3. This first phase, called two-level decomposition, takes as input a $2^n \times 2^n$ unitary matrix $U$ and an ordering of decomposition and outputs a sequence of two-level matrices $V_1 \ldots V_k$. This output is then converted into an optimized circuit, $G_m \ldots G_1$, of $\Lambda_{n-1}(S)$ and $\Lambda_{n-1}(X)$ gates. Using standard techniques, the circuit of controlled operations can be further decomposed into a circuit composed of gates drawn from some universal set of basis gates [2]. One common exactly universal set is the set of single-qubit and CNOT gates [2]. Our framework here builds on and refines the conventional ordering and two-level decomposition method described in [2, 13, 14].
In this paper, we improve the first phase by finding an optimal ordering of decomposition for the two-level decomposition phase to minimize the number of $\Lambda_{n-1}(X)$ gates generated for the circuit $G_m \ldots G_1$ corresponding to $U$. The remaining sections of this paper are organized as follows. In Section 4, we describe the conventional ordering and two-level decomposition algorithm used in the first step. In Section 5, we describe the second step that constructs a circuit of controlled single-qubit gates from the sequence of two-level matrices. In Section 6, we describe the Palindrome Transform that characterizes the optimal ways to order subcircuits to maximize the amount of cancellation of self-inverting gates. In Section 7, we introduce our Palindromic Optimization Algorithm (POA) that dramatically improves upon the conventional ordering used in the two-level decomposition algorithm of the first phase. In Sections 8 and 9 we derive equations for the number of generated gates and compare the sizes of optimized and unoptimized circuits.
Two-Level Decomposition
We now describe the first phase of our quantum circuit compiler. This phase, called two-level decomposition, takes as input an arbitrary $2^n \times 2^n$ unitary matrix $U$ and produces as output a composition of two-level matrices $V_1 \ldots V_k$ such that the product of $V_1 \ldots V_k$ equals $U$. Phase I as described in this section uses the conventional ordering for two-level decomposition. In Section 7, we give a method for computing an improved ordering that dramatically reduces the size of the generated circuit.
We define the order of two-level decomposition as the sequence of vector component pairs that are nontrivially acted on by the two-level matrices in the decomposition $V_1 \ldots V_k$. We will associate an ordering pair $(r, c)$ with a two-level matrix $V_j$ to identify the four complex numbers
The sequence of ordering pairs defines the order of the two-level decomposition. To avoid repetition in a two-level decomposition, we only allow pairs (r, c) where r > c. Throughout this paper, the first number of an ordering pair represents a row and the second a column in a matrix.
In all our sequences of ordering pairs, we begin with the pairs for column 0, followed by those for column 1, followed by those for column 2, and so on up to column $2^n - 2$. We call this a fixed-column ordering. In the conventional algorithm for two-level decomposition, the ordering has the pairs $(c+1, c), (c+2, c), \ldots, (2^n - 1, c)$ for column $c$ followed by the pairs $(c+2, c+1), (c+3, c+1), \ldots, (2^n - 1, c+1)$ for column $c+1$, and so on. We will use a triangular array $\mathit{order}_n$ to store the ordering pairs. The entries in rows $1, 2, \ldots, 2^n - c - 1$ of column $c$ hold the ordering pairs for that column. Note that row 0 and column $2^n - 1$ are not used in the two-level decomposition algorithm since they violate the condition that the row value must be greater than the column value, but they are included for notational convenience.
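The conventional fixed-column ordering described above can be enumerated in a few lines. This is a minimal sketch; the function name `conventional_order` and the flat list of pairs (rather than a triangular array) are choices made for this example only.

```python
def conventional_order(n):
    """Ordering pairs (r, c) with r > c, listed column by column:
    (c+1, c), (c+2, c), ..., (2^n - 1, c) for c = 0, 1, ..., 2^n - 2."""
    size = 1 << n  # 2^n
    return [(r, c) for c in range(size - 1) for r in range(c + 1, size)]

pairs = conventional_order(2)
# pairs == [(1, 0), (2, 0), (3, 0), (2, 1), (3, 1), (3, 2)]
```

The list has $2^{n-1}(2^n - 1)$ entries, one for each pair with $r > c$.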
Algorithm 1: Two-Level Decomposition
Input: A $2^n \times 2^n$ unitary matrix $U$ and a $2^n \times 2^n$ array $\mathit{order}_n$ dictating the order of the two-level decomposition.
Output: A sequence of two-level matrices $V_1 \ldots V_k$ such that $V_1 \ldots V_k = U$.
To perform a conventional two-level decomposition on $U$, we call the procedure TwoLevelDecompose on $U$ and the conventional ordering array $\mathit{order}_n$ using Algorithm 1. With the conventional ordering array as input, the algorithm applies a transformation $M_1$ to $U$ to set the matrix entry
It continues in this fashion until column 0 has a 1 in the top entry and 0's everywhere else. This process is sometimes called a quantum Givens operation [6]. It then iteratively applies this process to the $(2^n - 1) \times (2^n - 1)$ unitary submatrix in the lower right-hand corner of $M_{2^n - 1} M_{2^n - 2} \ldots M_1 U$, ultimately decomposing $U$ into a product of two-level unitary matrices.
Algorithm 1 produces as output a sequence of two-level unitary matrices $V_1 \ldots V_k$, where $V_j = M_j^{\dagger}$, the adjoint of $M_j$. We can easily verify that $V_1 \ldots V_k = U$, and that $k \le 2^{n-1}(2^n - 1)$. We denote the complex conjugate of a complex number $\zeta = a + ib$ as $\zeta^* = a - ib$.
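The following pure-Python sketch illustrates a two-level decomposition in the conventional order. It is not a transcription of Algorithm 1: the helper names (`matmul`, `dagger`, `identity`, `two_level_decompose`), the tolerance `tol`, and the particular choice of each $M_j$ (a standard Givens-style elimination, with the last pair undoing the remaining $2 \times 2$ block) are assumptions made for this example.

```python
import cmath
import math

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(A):
    n = len(A)
    return [[A[j][i].conjugate() for j in range(n)] for i in range(n)]

def identity(n):
    return [[complex(i == j) for j in range(n)] for i in range(n)]

def two_level_decompose(U, tol=1e-12):
    """Return two-level matrices V_1..V_k with V_1 ... V_k = U, visiting the
    ordering pairs (r, c) column by column in the conventional order."""
    d = len(U)
    W = [[complex(x) for x in row] for row in U]
    Vs = []
    for c in range(d - 1):
        for r in range(c + 1, d):
            a, b = W[c][c], W[r][c]
            M = identity(d)
            if c == d - 2:
                # Final pair: M is the adjoint of the remaining 2x2 block.
                M[c][c], M[c][r] = W[c][c].conjugate(), W[r][c].conjugate()
                M[r][c], M[r][r] = W[c][r].conjugate(), W[r][r].conjugate()
            else:
                norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
                if norm < tol or (abs(b) < tol and abs(a - 1) < tol):
                    continue  # entry already in the desired form
                # Chosen so that (M W)[r][c] = 0 and (M W)[c][c] = norm.
                M[c][c], M[c][r] = a.conjugate() / norm, b.conjugate() / norm
                M[r][c], M[r][r] = b / norm, -a / norm
            W = matmul(M, W)
            Vs.append(dagger(M))  # V_j is the adjoint of M_j
    return Vs

# Example input: the 4 x 4 quantum Fourier transform matrix
d = 4
w = cmath.exp(2j * cmath.pi / d)
U = [[w ** (j * k) / math.sqrt(d) for k in range(d)] for j in range(d)]
Vs = two_level_decompose(U)
```

Multiplying the returned matrices in order reproduces $U$ up to floating-point error, and at most $2^{n-1}(2^n - 1)$ matrices are produced.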
Controlled Single-Qubit Gate Circuit Construction
After performing the two-level decomposition on $U$, we need to construct a circuit from the sequence $V_1 \ldots V_k$ of two-level matrices using $\Lambda_{n-1}(S)$ and $\Lambda_{n-1}(X)$ gates. To compute each $V_j$, the circuit must perform a sequence of state changes in order to bring together the two vector components that are nontrivially acted on by $V_j$. The algorithm uses Gray codes to transform each $V_j$ in $V_1 \ldots V_k$ into a circuit of controlled single-qubit gates. We can determine the state changes needed for $V_j$ by constructing a Gray code between the two computational basis states $|c\rangle$ and $|r\rangle$ of $V_j$. Let us define GrayCode($c$, $r$) between state $|c\rangle$ and state $|r\rangle$ to be a minimal sequence of binary numbers $g_1, g_2, \ldots, g_m$ in which $g_1 = c_{n-1} c_{n-2} \ldots c_0$ is the binary expansion of $c$, $g_m = r_{n-1} r_{n-2} \ldots r_0$ is the binary expansion of $r$, and two adjacent binary expansions $g_j$ and $g_{j+1}$ differ by only one bit for $1 \le j \le m - 1$. That is, only one bit flip occurs between two binary numbers in the sequence. We call the order of bit flips between the binary expansion of $c$ and the binary expansion of $r$ in the Gray code the Gray code ordering for $c$ and $r$. Note that a bit flip may not be required for every bit position. Also, there are at most $n + 1$ binary numbers in a Gray code between any pair of states. From the Gray code sequence, we determine the corresponding quantum circuit.
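As an illustration of the definition, the sketch below produces one such Gray code between $c$ and $r$. It assumes the differing bits are flipped from the lowest-order bit upward, which is one admissible Gray code ordering; the name `gray_code` and the string output format are choices made for this example.

```python
def gray_code(c, r, n):
    """Binary strings g_1, ..., g_m from c to r in which adjacent entries
    differ in exactly one bit (lowest-order differing bit flipped first)."""
    g = c
    seq = [format(g, f"0{n}b")]
    for k in range(n):              # bit 2^0 first, then 2^1, and so on
        if (g ^ r) >> k & 1:        # bit k still differs from r
            g ^= 1 << k
            seq.append(format(g, f"0{n}b"))
    return seq

code = gray_code(0, 7, 3)
# code == ['000', '001', '011', '111']
```

Only differing bit positions are visited, so the sequence has one more entry than the Hamming distance between $c$ and $r$, and never more than $n + 1$ entries.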
To construct a circuit from the Gray code $g_1, g_2, \ldots, g_m$ for the two-level unitary matrix $V_j$, we create a $\Lambda_{n-1}(X)$ gate to transform state $|g_j\rangle$ into $|g_{j+1}\rangle$, for $1 \le j \le m - 2$. Each gate performs a controlled bit flip on the differing qubit, conditional on all other qubits having the values they share in states $|g_j\rangle$ and $|g_{j+1}\rangle$.
After the bit-flipping operations, we create a $\Lambda_{n-1}(\tilde{V}_j)$ gate to transform state $|g_{m-1}\rangle$ into $|g_m\rangle$ with the differing qubit as target and conditional on all other qubits being the same as in state $|g_m\rangle$. We then create a sequence of $\Lambda_{n-1}(X)$ gates to undo the initial sequence of bit-flipping operations by repeating them in reverse order.
Algorithm 2 presents the details of this circuit-construction process. It constructs a sequence of controlled single-qubit gates for each two-level matrix $V_j$ in $V_1 \ldots V_k$. Note that the output of Algorithm 2 is a sequence of palindromic subcircuits, that is, subcircuits that read the same forwards as backwards. We will discuss the optimization of palindromic circuits in detail in the next section.
As an example, Table 1 contains a Gray code between basis states $|000\rangle$ and $|111\rangle$. Figure 4 contains the corresponding quantum circuit of five gates, where ⊕ represents the Pauli-X operator, ◦ represents a control on 0, and • represents a control on 1.
Figure 4: The circuit for the two-level matrix $V_j$ that nontrivially acts on states $|000\rangle$ and $|111\rangle$.
Output: A circuit composed of $\Lambda_{n-1}(\tilde{V}_j)$ and $\Lambda_{n-1}(X)$ gates, for each $V_j$ in $V_1 \ldots V_k$.
$g_1, g_2, \ldots, g_m$ = GrayCode($c$, $r$);
procedure GrayCode($c$, $r$) {
    let $g = g_{n-1} g_{n-2} \ldots g_0$ be the binary expansion of $c$;
    let $h = h_{n-1} h_{n-2} \ldots h_0$ be the binary expansion of $r$;
    output $g$;
    while $g \ne h$ do {
        let $g_k$ be the rightmost bit in $g$ that is different from the corresponding bit in $h$;
        flip $g_k$;
        output $g$;
    }
}
output the $(n-1)$-controlled single-qubit gate $\Lambda_{n-1}(S)$ targeting the bit differing between $g_j$ and $g_{j+1}$ and conditional on the other qubits being the same as in $g_j$;
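The circuit-construction step can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: gates are represented as tuples `(kind, target_bit, controls)`, where `kind` is `"X"` for a $\Lambda_{n-1}(X)$ gate or `"V"` for the $\Lambda_{n-1}(\tilde{V}_j)$ gate, and `controls` maps every other qubit to the bit value it must hold. The function name `palindromic_subcircuit` is hypothetical, and $c \ne r$ is assumed.

```python
def palindromic_subcircuit(c, r, n):
    """Gate list for the two-level matrix acting on |c> and |r> (c != r)."""
    # Gray code from c to r, flipping the lowest differing bit first.
    states = [c]
    for k in range(n):
        if (states[-1] ^ r) >> k & 1:
            states.append(states[-1] ^ (1 << k))

    def gate(kind, a, b):
        target = (a ^ b).bit_length() - 1   # the single bit where a, b differ
        controls = {q: (a >> q) & 1 for q in range(n) if q != target}
        return (kind, target, controls)

    # X gates for every Gray code step except the last one ...
    flips = [gate("X", a, b) for a, b in zip(states[:-2], states[1:-1])]
    # ... the controlled component-matrix gate for the last step ...
    middle = gate("V", states[-2], states[-1])
    # ... and the same X gates in reverse order to undo the bit flips.
    return flips + [middle] + flips[::-1]

circuit = palindromic_subcircuit(0b000, 0b111, 3)
kinds = [g[0] for g in circuit]
# kinds == ['X', 'X', 'V', 'X', 'X']
```

For the pair $|000\rangle$, $|111\rangle$ this yields five gates, in agreement with the example above, and the gate list reads the same forwards and backwards.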
The Palindrome Transform
In this section we present a general algorithmic optimization technique, which we call the Palindrome Transform, that can be used to minimize the number of self-inverting gates in quantum circuits composed of concatenated palindromic subcircuits. The minimization arises from determining an optimal ordering for concatenating the palindromic subcircuits that induces the maximal amount of cancellation due to the juxtaposition of self-inverting gates. We then characterize the orderings of palindromic subcircuits that maximize the total amount of cancellation.
We call a gate $A$ self-inverting if $AA = I$, that is, if $A$ is its own inverse. If we generate a sequence of self-inverting gates of the form
$$A_1 A_2 \cdots A_k A_k \cdots A_2 A_1$$
then we can eliminate this sequence by replacing it with the empty sequence. We call such a sequence self-annihilating.
A number of quantum-circuit-generation algorithms produce subcircuits consisting of sequences of gates in which a prefix and suffix of each subcircuit forms a palindrome of self-inverting gates. That is, a subcircuit is of the form
$$A_1 A_2 \cdots A_m \, \beta \, A_m \cdots A_2 A_1 \tag{6}$$
for $m \ge 0$, where each $A_j$ is a self-inverting gate and $\beta$ is a unique gate that is not necessarily self-inverting. For the purposes of this paper, we assume $\beta$ is a controlled single-qubit gate $\Lambda_{n-1}(S)$, where $S$ is a component matrix. We call a sequence of the form (6) a palindromic subcircuit. Writing $\alpha^R$ for the reverse of a gate sequence $\alpha$, we define the overlap between two successive palindromic subcircuits $\alpha_1 A_1 \alpha_1^R$ and $\alpha_2 A_2 \alpha_2^R$ to be the longest common prefix $\gamma$ of $\alpha_1$ and $\alpha_2$, or equivalently the longest prefix $\gamma$ of $\alpha_2$ such that $\gamma^R \gamma$ is a self-annihilating sequence. For example, if we concatenate the two palindromic subcircuits $ABCA_1CBA$ and $ABA_2BA$, we get the circuit $ABCA_1CBAABA_2BA = ABCA_1CA_2BA$. Here, $AB$ is the overlap between these two palindromic subcircuits and $BAAB$ is a self-annihilating sequence.
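The cancellation in the example above can be reproduced mechanically with a stack that removes adjacent pairs of identical self-inverting gates. The function name `cancel` and the use of strings as gate labels are assumptions made for this sketch.

```python
def cancel(gates, self_inverting):
    """Remove adjacent pairs of identical self-inverting gates."""
    out = []
    for g in gates:
        if out and out[-1] == g and g in self_inverting:
            out.pop()          # g g = I, so the pair disappears
        else:
            out.append(g)
    return out

first = ["A", "B", "C", "A1", "C", "B", "A"]   # A B C A_1 C B A
second = ["A", "B", "A2", "B", "A"]            # A B A_2 B A
merged = cancel(first + second, {"A", "B", "C"})
# merged == ['A', 'B', 'C', 'A1', 'C', 'A2', 'B', 'A']
```

The twelve gates of the plain concatenation shrink to eight, since the overlap $AB$ has length 2 and each overlapping gate removes two gates.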
If we have a set $PS$ of palindromic subcircuits, then we can use the following algorithm to find an optimal ordering of all the subcircuits in $PS$ that maximizes the sum of the overlaps between successive subcircuits in any composition of the subcircuits. We call such an ordering a maximal overlap sequence for $PS$.
The algorithm uses a data structure called a trie [1], sometimes called a radix tree [5], to store the prefix $\alpha_j A_j$ of each palindromic subcircuit $\alpha_j A_j \alpha_j^R$. The trie is an ordered labeled tree in which there is a path from the root to a leaf that spells out the string $\alpha_j A_j$. The root is labeled by the empty string and each non-root node is labeled by a gate. If there is another string $\alpha_k A_k$ that has a common prefix $\gamma$ with $\alpha_j A_j$, then the paths for $\alpha_j A_j$ and $\alpha_k A_k$ in the trie each share the prefix $\gamma$. For notational convenience, we will just use the middle $A_j$ to represent a palindromic subcircuit in a maximal overlap sequence.
Algorithm 3: The Palindrome Transform
Input: A set $PS$ of $m$ palindromic subcircuits $\alpha_1 A_1 \alpha_1^R, \ldots, \alpha_m A_m \alpha_m^R$.
Output: An ordering $A_{j_1}, A_{j_2}, \ldots, A_{j_m}$ for the concatenation of these palindromic subcircuits such that the sum of $\mathrm{length}(\gamma)$ over the overlaps $\gamma$ between successive subcircuits is maximized, where $\mathrm{length}(\gamma)$ is the number of gates in the sequence $\gamma$.
Method:
procedure PalindromeTransform($PS$, $m$) {
    initialize a trie $T$;
    for $j = 1$ to $m$ do
        enter($\alpha_j A_j$, $T$);
    dfsPrint($T$);
}
procedure enter(string, $T$) {
    let string = $A_1 A_2 \ldots A_k$;
    start at root of $T$;
    follow the longest path $A_1 A_2 \ldots A_p$ in $T$ that spells out a prefix of string ending at node $x$;
    create a new path starting at node $x$ that spells out $A_{p+1} \ldots A_k$;
}
procedure dfsPrint($T$) {
    visit the nodes of $T$ in a depth-first-search order printing the label of each leaf when it is first encountered;
}
Footnote 1: The results in this section also apply to subcircuits of the form A 1 . . . 1 , but these do not arise in the context of two-level decomposition.

We call the trie produced by Algorithm 3 the palindrome trie. By entering the $\alpha_j A_j$'s into the trie, we identify the maximal length common prefixes for all palindromic subcircuits. Note that we are using $A_j$ to represent the palindromic subcircuit $\alpha_j A_j \alpha_j^R$. By grouping the labels of the leaves of the trie in a depth-first-search order [1, 5], we order the palindromic subcircuits to achieve the maximal possible total overlap of self-inverting gates between successive subcircuits.
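A compact Python sketch of the Palindrome Transform follows. It assumes each palindromic subcircuit is given as a pair `(alpha, middle)`, where `alpha` is the tuple of self-inverting gates preceding the middle gate; the nested-dictionary trie and the name `palindrome_transform` are choices made for this illustration.

```python
def palindrome_transform(subcircuits):
    """Return the middle gates in a depth-first order of the trie built
    from the strings alpha + (middle,)."""
    trie = {}
    for alpha, middle in subcircuits:
        node = trie
        for gate in alpha:                      # follow or create the path
            node = node.setdefault(("gate", gate), {})
        node[("leaf", middle)] = None           # the middle gate is a leaf

    order = []

    def dfs(node):
        for (kind, label), child in node.items():
            if kind == "leaf":
                order.append(label)
            else:
                dfs(child)

    dfs(trie)
    return order

ps = [(("A", "B", "C"), "A1"), (("D",), "A3"), (("A", "B"), "A2"), ((), "A4")]
ordering = palindrome_transform(ps)
# ordering == ['A1', 'A2', 'A3', 'A4']
```

Subcircuits sharing the prefix $AB$ end up adjacent, so their shared gates cancel when the subcircuits are concatenated in this order. Any other depth-first order of the same trie would serve equally well.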
We can characterize the orderings of the leaves of the palindrome trie that are maximal overlap sequences. Let $T$ be a trie whose root node has $p$ subtries with exactly one child labeled $A_1, \ldots, A_p$, $p \ge 0$, and $q$ subtries $T_1, \ldots, T_q$, $q \ge 0$, where each subtrie $T_k$ has more than one child, as shown in Figure 5. We assume that $p + q > 0$ and that the $p + q$ subtries can appear in any order. We define
$$\mathrm{mos}(T) = \mathrm{permutation}(A_1, \ldots, A_p, \mathrm{mos}(T_1), \ldots, \mathrm{mos}(T_q))$$
where $\mathrm{permutation}(x_1, \ldots, x_m)$ is the set of all sequences that are permutations of the sequences $x_1, \ldots, x_m$. We shall show that any sequence in $\mathrm{mos}(T)$ is a maximal overlap sequence and conversely every maximal overlap sequence is in $\mathrm{mos}(T)$. Listing the leaves of the trie in a depth-first-search order is one efficient way to produce such a sequence.
Theorem 1 Let $T$ be a palindrome trie for a set $PS$ of palindromic subcircuits. A sequence of palindromic subcircuits from $PS$ is a maximal overlap sequence if and only if it is in $\mathrm{mos}(T)$.
Proof. To show that every sequence in $\mathrm{mos}(T)$ is a maximal overlap sequence we use structural induction on $T$. The sequences in $\mathrm{mos}(T)$ recursively keep the leaves of the subtries of $T$ contiguous. Single-leaf subtries of $T$ correspond to palindromic subcircuits that cannot participate in any prefix sharing. If $T_j$ is a subtrie of $T$ with $k$ leaves, where $k > 1$, then $T_j$ adds $2(k - 1)$ to the number of cancelling contiguous self-inverting gates by sharing the gate represented by the branch from the root of $T$ to the root of the subtrie $T_j$. Assuming every sequence in $\mathrm{mos}(T_j)$ is a maximal overlap sequence, then every sequence in $\mathrm{mos}(T)$ attains the maximal amount of sharing and thus maximizes the sum of the lengths of the overlaps between successive palindromic subcircuits. Thus every sequence in $\mathrm{mos}(T)$ is a maximal overlap sequence. Conversely, it is easy to show that every maximal overlap sequence for $PS$ corresponds to some traversal of the palindrome trie for $PS$ represented in $\mathrm{mos}(T)$.
Corollary 1
The procedure PalindromeTransform($PS$, $m$) produces an ordering for the $m$ circuits in $PS$ that maximizes the total number of cancelling self-inverting gates.
Proof. The depth-first-search ordering of the leaves of the palindrome trie for $PS$ has the mos property.
Corollary 2
The number of gates in the circuit produced by the palindrome transform ordering after cancelling all self-inverting gates is (number of leaves in trie) + 2 (number of non-root interior nodes in trie).
Proof. Note that a path $\alpha_j$ from the root of the palindrome trie to a leaf labeled by $A_j$ followed by the reverse path $\alpha_j^R$ defines a palindromic subcircuit $\alpha_j A_j \alpha_j^R$. One gate is generated for each leaf. Each incoming branch to an interior node generates one gate before the leaf to perform an operation and one gate after the leaf to invert the effect of that operation.
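The count in Corollary 2 can be checked on a small example by comparing it with an explicit cancellation. The sketch below assumes the root is not counted among the interior nodes (it has no incoming branch), represents subcircuits as `(alpha, middle)` pairs, and obtains a depth-first leaf order by sorting on `alpha`, which keeps all subcircuits sharing a prefix contiguous. All names are hypothetical.

```python
def trie_gate_count(subcircuits):
    """leaves + 2 * (non-root interior nodes) of the palindrome trie."""
    prefixes = set()
    for alpha, _ in subcircuits:
        for i in range(1, len(alpha) + 1):
            prefixes.add(tuple(alpha[:i]))    # one node per distinct prefix
    return len(subcircuits) + 2 * len(prefixes)

def cancelled_length(ordered):
    """Number of gates left after concatenating alpha, middle, reversed alpha
    for each subcircuit and cancelling adjacent equal self-inverting gates."""
    out = []
    for alpha, middle in ordered:
        side = [("X", g) for g in alpha]
        for g in side + [("mid", middle)] + side[::-1]:
            if out and out[-1] == g and g[0] == "X":
                out.pop()
            else:
                out.append(g)
    return len(out)

ps = [(("A", "B", "C"), "A1"), (("D",), "A3"), (("A", "B"), "A2"), ((), "A4")]
by_formula = trie_gate_count(ps)                               # 4 + 2 * 4
by_cancelling = cancelled_length(sorted(ps, key=lambda s: s[0]))
```

Both counts equal 12 here: four leaves, plus two gates for each of the interior nodes reached by the prefixes $A$, $AB$, $ABC$ and $D$.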
The palindrome transform assumes the palindromic subcircuits can be concatenated in any order. If we treat the middle gate of each palindromic subcircuit as a generic gate, then we can use the palindrome transform to generate for an arbitrary unitary matrix U a sequence of controlled single-qubit gates in which the maximum amount of cancelling of self-inverting gates takes place, assuming a fixed column order of two-level decomposition.
To do this, we first construct palindromic subcircuits with a generic middle gate from the Gray codes for the conventional ordering of two-level decomposition for $U$. From these palindromic subcircuits, we use the palindrome transform to find an mos ordering of the generic gates. Using this mos ordering, we then use Algorithms 1 and 2 of the previous sections to construct the quantum circuit $C$ of $\Lambda_{n-1}(\tilde{V}_j)$ and $\Lambda_{n-1}(X)$ gates such that $C$ computes $U$. The circuit $C$ will have the maximal amount of cancellation of $\Lambda_{n-1}(X)$ gates due to the juxtaposition of self-annihilating sequences. Note that any mos ordering produced in this fashion generates a circuit that computes $U$.
In the next section, we will give a direct enumerative method of constructing a circuit of this nature without having to construct the palindrome trie.
Palindromic Optimization Algorithm
We now describe our Palindromic Optimization Algorithm (POA). It takes as input a $2^n \times 2^n$ unitary matrix $U$ and produces as output a circuit $G_m \ldots G_1$ of controlled single-qubit gates that computes $U$ while minimizing the number of $\Lambda_{n-1}(X)$ gates in the generated circuit. POA performs a two-level decomposition on $U$, assuming a fixed-column order $0, 1, \ldots, 2^n - 2$, where the columns of the matrix are labeled 0 to $2^n - 1$ [13]. It uses a specially computed $\mathit{array}_n$ to direct the two-level decomposition in order to minimize the number of $\Lambda_{n-1}(X)$ gates in the generated circuit. The order of two-level decomposition directs the generation of a sequence $V_1 \ldots V_k$ of two-level matrices such that $V_1 \ldots V_k = U$.
Figure 6: A subsequence of the unoptimized circuit for an arbitrary $2^3 \times 2^3$ unitary matrix using the conventional ordering.
POA uses Algorithm 2 to generate the output circuit from $V_1 \ldots V_k$. It uses the Gray code algorithm described in Section 5 to determine the sequences of $\Lambda_{n-1}(X)$ gates to perform the state changes to bring together the two nontrivial vector components for each controlled $\Lambda_{n-1}(\tilde{V}_j)$ gate. We require the Gray code ordering to be $2^0, 2^1, \ldots, 2^{n-1}$, where $n$ is the number of qubits, to achieve the minimal number of $\Lambda_{n-1}(X)$ gates. If a different Gray code order is used, the minimal number of $\Lambda_{n-1}(X)$ gates may not be achieved for all $n$. For the stated setting, POA maximizes the overlap of $\Lambda_{n-1}(X)$ gates over all two-level matrix decompositions, thus minimizing the number of $\Lambda_{n-1}(X)$ gates in the generated circuit.
Algorithm 4: Palindromic Optimization Algorithm
Input: A $2^n \times 2^n$ unitary matrix $U$ and $n$, the number of qubits.
Output: A circuit of $(n-1)$-controlled single-qubit gates that computes $U$.
Method:
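The effect of reordering on gate counts can be explored with the brute-force stand-in below. It is not the $\mathit{array}_n$ construction of POA: within each column it simply sorts the rows so that Gray code prefixes are grouped (a depth-first trie order), reversing the sort in even columns so that a subcircuit beginning with a flip of bit $2^0$ sits at each even-to-odd column boundary. Gates are symbolic tokens, only gate counts are computed, and every name here is hypothetical.

```python
def flip_bits(c, r, n):
    """Bit positions flipped by the Gray code from c to r, bit 2^0 first."""
    return [k for k in range(n) if (c ^ r) >> k & 1]

def subcircuit(c, r, n):
    """Palindromic subcircuit for ordering pair (r, c) as hashable tokens.
    An X gate is identified by the two basis states it exchanges."""
    g, alpha = c, []
    for k in flip_bits(c, r, n)[:-1]:
        alpha.append(("X", frozenset((g, g ^ (1 << k)))))
        g ^= 1 << k
    return alpha + [("V", c, r)] + alpha[::-1]

def conventional_order(n):
    size = 1 << n
    return [(r, c) for c in range(size - 1) for r in range(c + 1, size)]

def overlap_order(n):
    """Per-column prefix-grouped order (assumption: stand-in for array_n)."""
    size, order = 1 << n, []
    for c in range(size - 1):
        def key(r, c=c):
            bits = flip_bits(c, r, n)
            return [(0, k) for k in bits[:-1]] + [(1, bits[-1])]
        rows = sorted(range(c + 1, size), key=key, reverse=(c % 2 == 0))
        order += [(r, c) for r in rows]
    return order

def gate_count(order, n):
    """Concatenate the subcircuits and cancel adjacent identical X gates."""
    out = []
    for r, c in order:
        for g in subcircuit(c, r, n):
            if out and out[-1] == g and g[0] == "X":
                out.pop()
            else:
                out.append(g)
    return len(out)

optimized = [gate_count(overlap_order(n), n) for n in (2, 3, 4)]
conventional = [gate_count(conventional_order(n), n) for n in (2, 3, 4)]
```

For $n = 2, 3, 4$ the two lists should reproduce the "Palindromic" and "Conventional" columns of Table 2, which gives an independent check on the reported circuit sizes.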
We now prove the optimality of POA assuming a fixed-column ordering $0, 1, \ldots, 2^n - 2$ for a two-level decomposition, a right-to-left bit ordering $2^0, 2^1, \ldots, 2^{n-1}$ for the Gray code order, and ordering pairs $(r, c)$ in which $r > c$ and the sequence of state changes must occur from $c$ to $r$.
Figure 7: A subsequence of the optimized circuit for a $2^3 \times 2^3$ unitary matrix using POA.
Let $PS(c, r)$ be the palindromic subcircuit generated for the Gray code sequence returned by the procedure GrayCode($c$, $r$) in Algorithm 2. First, we examine the intercolumn ordering of the entries in $\mathit{array}_n$ and the row ordering within a given column necessary to achieve a minimal $\Lambda_{n-1}(X)$ circuit for $U$. Then we prove that the ordering of the entries from row 1 to row $2^n - c - 1$ in each column $c$ in $\mathit{array}_n$ is a maximal overlap sequence for $0 \le c \le 2^n - 2$.
Lemma 1
The maximum possible overlap of $\Lambda_{n-1}(X)$ gates between the last palindromic subcircuit generated for column $c$ and the first palindromic subcircuit generated for column $c + 1$ is 1, for $0 \le c \le 2^n - 2$. Further, an overlap of 1 is achieved between the circuit $PS(c, r_{last})$ followed by the circuit $PS(c + 1, r_{first})$, where $r_{last}$ is the last entry in column $c$ and $r_{first}$ is the first entry in column $c + 1$, only when $c$ is even, $r_{last}$ is odd, and $r_{first}$ is even.
Proof. For $n$ qubits, we have a fixed column ordering $0, 1, 2, \ldots, 2^n - 2$. Let us first consider the case where column $c$ is even.
We would like $PS(c, r_{last})$ and $PS(c + 1, r_{first})$ to overlap and thus share one or more $\Lambda_{n-1}(X)$ gates. Since $c$ is even and $c + 1$ is odd, the $2^0$ bit of the binary expansion of $c$ is 0 and the $2^0$ bit of $c + 1$ is 1. For an overlap to occur, the $2^0$ bit of $r_{last}$ must be 1 and the $2^0$ bit of $r_{first}$ must be 0. Thus, an overlap between subcircuits $PS(c, r_{last})$ and $PS(c + 1, r_{first})$ occurs only when $r_{last}$ is odd and $r_{first}$ is even. Furthermore, the maximum overlap is 1 since after flipping the $2^0$ bit of $r_{last}$ to 1, it remains 1. Similarly, the $2^0$ bit of $r_{first}$ remains 0. Thus only one overlap can occur.
Now consider the case where column $c$ is odd. Using the same reasoning as above, an overlap can occur between $PS(c, r_{last})$ and $PS(c + 1, r_{first})$ only when $r_{last}$ is even and $r_{first}$ is odd. But, if $c$ is odd, there must be at least one 1 in the binary expansion of $c + 1$ that is not present in $c$. Since the first bit flip is on bit $2^0$, there cannot be an overlap due to this differing 1 and thus the maximum overlap is 0.
Lemma 2 Within a column c, an overlap can occur between the subcircuits generated for two adjacent rows only if the entries for both rows are even or both are odd.
Proof. First consider the case where column $c$ is even and $r_1$ and $r_2$ are the entries for two adjacent rows in column $c$. We have the following combinations:
i. $r_1$ is odd, $r_2$ is even: Since only GrayCode($c$, $r_1$) requires a $2^0$ bit flip, $PS(c, r_1)$ and $PS(c, r_2)$ cannot have an overlap.
ii. $r_1$, $r_2$ are both odd: Since both pairs require a $2^0$ bit flip, there exists at least one overlap.
iii. $r_1$ is even, $r_2$ is odd: There cannot be an overlap.
iv. $r_1$, $r_2$ are both even: Since both pairs have a 0 in bit $2^0$, there may be an overlap.
Similarly, if $c$ is an odd column, then an overlap can occur only when $r_1$ and $r_2$ are both even or both odd.
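Lemmas 1 and 2 can be checked exhaustively for small $n$. The sketch below models each $\Lambda_{n-1}(X)$ gate by the pair of basis states it exchanges and measures the overlap of two subcircuits as the length of the common prefix of their X-gate sequences; this modelling and all function names are assumptions made for the check.

```python
def x_prefix(c, r, n):
    """X gates preceding the middle gate of PS(c, r), lowest bit first."""
    g, gates = c, []
    for k in range(n):
        if (g ^ r) >> k & 1:
            gates.append(frozenset((g, g ^ (1 << k))))
            g ^= 1 << k
    return gates[:-1]   # the final flip is performed by the middle gate

def overlap(p, q):
    k = 0
    while k < min(len(p), len(q)) and p[k] == q[k]:
        k += 1
    return k

def check_lemmas(n):
    size = 1 << n
    for c in range(size - 1):
        rows = range(c + 1, size)
        # Lemma 2: an overlap within a column requires equal parity.
        for r1 in rows:
            for r2 in rows:
                if r1 != r2 and overlap(x_prefix(c, r1, n), x_prefix(c, r2, n)):
                    if r1 % 2 != r2 % 2:
                        return False
        # Lemma 1: inter-column overlap is at most 1, and equals 1 only
        # when c is even, r_last is odd and r_first is even.
        if c + 1 < size - 1:
            for r_last in rows:
                for r_first in range(c + 2, size):
                    ov = overlap(x_prefix(c, r_last, n),
                                 x_prefix(c + 1, r_first, n))
                    if ov > 1:
                        return False
                    if ov == 1 and not (c % 2 == 0 and r_last % 2 == 1
                                        and r_first % 2 == 0):
                        return False
    return True

lemmas_hold = check_lemmas(3) and check_lemmas(4)
```

The check passes for $n = 3$ and $n = 4$ under this model, which is consistent with the two lemmas but is of course not a substitute for the proofs.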
We now prove that POA generates maximal overlap sequences. Let R
Let us now examine how the palindromic subcircuits generated by the columns of $\mathit{array}_m$ are related to the subcircuits generated from $\mathit{array}_{m-1}$. Let PS
Theorem 2 For a fixed-column two-level decomposition of an arbitrary $2^n \times 2^n$ unitary matrix, the Palindromic Optimization Algorithm produces a circuit that achieves the maximal length of overlaps between successive palindromic subcircuits and thus minimizes the number of $\Lambda_{n-1}(X)$ gates generated in the quantum circuit of $(n-1)$-controlled single-qubit and $(n-1)$-controlled-NOT gates.
Proof. The proof follows from Lemmas 1-3.
Gate Count Equations
We now quantify the number of gates in the circuits generated by our algorithms. In all our equations n is the number of qubits. We first derive the equation for the number of gates produced by using the conventional two-level decomposition algorithm assuming no cancelling of self-inverting gates. We then give the gate count for conventional two-level decomposition with cancellation. Finally, we derive the equation that gives the number of gates in the optimized circuit resulting from performing two-level decomposition in the order specified by POA.
Conventional Circuit Size
We will show that $c_n$, the number of gates in the unoptimized circuit produced using the conventional order of two-level decomposition, is given by
$$c_n = 2^{n-1}\left((n-1)2^n + 1\right). \tag{9}$$
We can determine the size of the circuit produced by the two-level decomposition algorithm for a $2^n \times 2^n$ unitary matrix using the conventional ordering by taking the number of Gray codes of length $j$ (that is, with $j$ bit flips) generated by Algorithm 2, given by $2^{n-1}\binom{n}{j}$, and multiplying this number by $2j - 1$, the number of gates in the circuit generated for a Gray code of length $j$. Thus the number of gates in the conventional circuit for $n$ qubits is given by
$$c_n = \sum_{j=1}^{n} 2^{n-1}\binom{n}{j}(2j - 1) = 2^{n-1}\left((n-1)2^n + 1\right).$$
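The counting argument above can be confirmed by brute force: each ordering pair $(r, c)$ whose binary expansions differ in $j$ bits contributes $2j - 1$ gates. The closed form used in the sketch, $2^{n-1}((n-1)2^n + 1)$, is obtained here by evaluating the sum with $\sum_j j\binom{n}{j} = n2^{n-1}$ and $\sum_{j \ge 1} \binom{n}{j} = 2^n - 1$; the function names are hypothetical.

```python
from math import comb

def size_bruteforce(n):
    """Sum 2j - 1 over all pairs r > c, where j is the Hamming distance."""
    size = 1 << n
    return sum(2 * bin(r ^ c).count("1") - 1
               for c in range(size - 1) for r in range(c + 1, size))

def size_sum(n):
    """2^(n-1) * C(n, j) Gray codes with j flips, 2j - 1 gates each."""
    return sum(2 ** (n - 1) * comb(n, j) * (2 * j - 1) for j in range(1, n + 1))

def size_closed(n):
    return 2 ** (n - 1) * ((n - 1) * 2 ** n + 1)

sizes = [size_bruteforce(n) for n in range(2, 6)]
# sizes == [10, 68, 392, 2064]
```

These values agree with the "No canceling" column of Table 2.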
Conventional Circuit Size with Cancelling
The number of gates in the unoptimized circuit after cancelling adjacent $\Lambda_{n-1}(X)$ gates between palindromic subcircuits follows directly from Equation 9. From Lemmas 1 and 2, we conclude that only the inter-column overlaps allow for annihilation of gates using the conventional ordering array $\mathit{order}_n$. By Lemma 1, the number of gates that cancel is $2(2^{n-1} - 1)$, so the gate count equation is then
$$c_n - 2\left(2^{n-1} - 1\right) = 2^{n-1}\left((n-1)2^n + 1\right) - 2^n + 2. \tag{10}$$
POA Circuit Size
We will show that the number of gates $poa_n$ in the optimal circuit produced by the Palindromic Optimization Algorithm for an arbitrary $2^n \times 2^n$ unitary matrix is
$$poa_n = \frac{7}{3}\left(2^{2n-1}\right) - 7\left(2^{n-1}\right) + \frac{10}{3}. \tag{11}$$
To derive Equation 11 for $2^n \times 2^n$ unitary matrices, we consider the ordering $\mathit{array}_{n-1}$ and apply POA to determine $\mathit{array}_n$ and the corresponding number of gates for the circuit for $n$. From column $c$ of $\mathit{array}_{n-1}$, POA determines columns $2c$ and $2c + 1$ of $\mathit{array}_n$.
Consider the case of the even column $2c$ in $\mathit{array}_n$. We note from the proof of Lemma 3 that the subtrie for this column is exactly the subtrie for column $c$ in $\mathit{array}_{n-1}$ with two additional branches as given in Equation 7: one branch at one further depth containing a copy of the subtrie and a single leaf containing a single gate. This implies that the number of gates generated by column $2c$ in $\mathit{array}_n$ is twice the number of gates generated by column $c$ in $\mathit{array}_{n-1}$ plus three, two for the additional branch and one for the additional leaf.
Similarly, the odd column $2c + 1$ in $\mathit{array}_n$ generates two times the number of gates generated for column $c$ in $\mathit{array}_{n-1}$ plus two gates required for the additional branch as given in Equation 8.
Let $poa_n$ be the total number of gates generated by POA using $\mathit{array}_n$. Summing the gate counts for every column and recalling that the number of gates that cancel due to inter-column overlaps is $2(2^{n-1} - 1)$, $poa_n$ is then given by the recurrence
Simplifying this equation, we get
$$poa_n = \frac{7}{3}\left(2^{2n-1}\right) - 7\left(2^{n-1}\right) + \frac{10}{3}.$$
Results
The Palindromic Optimization Algorithm results in a dramatic reduction in circuit size over the conventional method. Table 2 lists circuit sizes for $n = 2, \ldots, 7$ qubits resulting from two-level decomposition using the ordering produced by POA, the conventional ordering, and the conventional ordering with no annihilation of self-inverting gates. When we use the conventional ordering [13] for two-level decomposition on a $2^3 \times 2^3$ unitary matrix, the resulting circuit contains 62 gates. Figure 6 shows the initial sequence of gates in this circuit. However, our palindromic optimization algorithm produces a circuit with 50 gates. Figure 7 shows the initial sequence of gates in this optimized circuit.
The ratio of conventional to optimized circuit size grows roughly linearly with the number of qubits, since the conventional gate count grows as $(n-1)2^{2n-1}$ while the POA gate count grows as $\frac{7}{3}2^{2n-1}$. For example, when $n = 7$, our method reduces the number of gates from 49,090 to 18,670 over the conventional method, a reduction of more than 60%.
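The figures quoted above and listed in Table 2 can be checked against the gate count expressions of the previous section. In the sketch below, `poa_size` is the closed form stated for $poa_n$, `no_cancel_size` evaluates the sum $\sum_j 2^{n-1}\binom{n}{j}(2j-1)$ in closed form, and `conventional_size` subtracts the $2(2^{n-1} - 1)$ cancelled gates; the function names are assumptions made for this check.

```python
from fractions import Fraction

def poa_size(n):
    return (Fraction(7, 3) * 2 ** (2 * n - 1)
            - 7 * 2 ** (n - 1) + Fraction(10, 3))

def no_cancel_size(n):
    return 2 ** (n - 1) * ((n - 1) * 2 ** n + 1)

def conventional_size(n):
    return no_cancel_size(n) - 2 * (2 ** (n - 1) - 1)

# Rows of Table 2: n -> (Palindromic, Conventional, No canceling)
table2 = {
    2: (8, 8, 10),
    3: (50, 62, 68),
    4: (246, 378, 392),
    5: (1086, 2034, 2064),
    6: (4558, 10210, 10272),
    7: (18670, 49090, 49216),
}

matches = all(
    (poa_size(n), conventional_size(n), no_cancel_size(n)) == row
    for n, row in table2.items()
)
reduction_at_7 = 1 - Fraction(18670, 49090)   # about 0.62
```

Exact rational arithmetic is used so that the thirds in the POA expression do not introduce rounding error.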
Conclusions
In this paper we have presented a framework for compiling an arbitrary $2^n \times 2^n$ unitary matrix into a quantum circuit of $(n-1)$-controlled single-qubit and $(n-1)$-controlled-NOT gates in which the initial phase of the framework decomposes the matrix into a sequence of two-level matrices. We have shown that the order of two-level decomposition can have a dramatic impact on the size of the resulting quantum circuits and we have characterized those orders of two-level decomposition that, for a fixed-column ordering, minimize the number of $(n-1)$-controlled-NOT gates that get generated. We have also presented an enumerative Palindromic Optimization Algorithm that produces circuits with the minimal number of controlled-NOT gates. This algorithm yields circuits that are significantly smaller than those produced by the conventional ordering for two-level decomposition.

n    Palindromic    Conventional    No canceling
2    8              8               10
3    50             62              68
4    246            378             392
5    1086           2034            2064
6    4558           10210           10272
7    18670          49090           49216
Table 2: Number of $(n-1)$-controlled gates in an $n$-qubit circuit using our algorithm, the conventional ordering, and the conventional ordering without canceling palindromes.
