We show how to implement an arbitrary two-qubit unitary 
Introduction
In this work we deal with the processing of quantum information, which can be stored, e.g., in electron energy levels, nuclear spins, photon polarizations, or other quantum-mechanical artifacts. Unlike most of the work on CMOS and nano-technology, quantum information processing is appealing not because of faster switching, lower power or cheaper manufacturing, but rather because it offers genuinely new opportunities at the logical level. Known applications can be broadly classified into computing and communication. Quantum computers can, in principle, quickly solve some computational problems considered hopeless for classical (non-quantum) computers, such as number-factoring [6]. Quantum communication promises to expose many eavesdropping attempts because quantum information can only be read once and cannot be copied [6]. As in the classical case, quantum communication involves some computation or decoding.
Quantum computation can be modeled using quantum Turing machines, quantum finite automata and quantum circuits. Circuits are much easier to analyze and, in fact, constitute the dominant model in the field (see examples in Figure 1). All major quantum algorithms, including Shor's for number-factoring and Grover's for search, are available in this model [6]. Quantum circuits are useful in quantum communication and cryptography, the fields where commercialization of basic research has already started.
As with conventional circuits, a quantum logic circuit outlines how to implement a given computation in hardware, and the type of hardware used may affect the choice of gate library. This consideration motivates circuit synthesis, i.e., finding circuits that implement given functionally-specified computations. However, unlike conventional logic circuits, for which CMOS implementations dominate the industrial and academic agenda, quantum circuits have been implemented in a variety of fundamentally different technologies. For example, in Nuclear Magnetic Resonance (NMR) technologies, where number-factoring algorithms have been demonstrated, quantum logic states are stored in nuclear spins. In optical technologies, used for quantum communication and cryptography, quantum bits are encoded as polarizations of photons. Solid-state silicon-based quantum technologies and electrons floating on liquid helium encode quantum bits in electron spins. Trapped ions encode qubits in orbital states of electrons. Other successful research uses quantized currents in semiconductors.
* Contact author: Prof. Igor Markov, imarkov@eecs.umich.edu
[Table 1. Gate libraries; Lower ... In particular, we never use input-independent rotation gates. Bounds that may potentially be tightened are shown in bold.]
To be consistent with quantum mechanics [6], quantum gates and circuits must be reversible. This means having the same number of inputs and outputs, and also being able to reconstruct input values from output values. In 1980, Toffoli initiated the study of reversible circuits that use purely classical gates and operate on purely classical 0-1 states. His work was motivated by low-power considerations, following Bennett's 1973 proof that information loss implies energy loss. Work on synthesis of reversible circuits appeared at DAC and ICCAD in 2002 and 2003, and work on test generation at VTS 2003. However, the "classical" reversible circuits discussed above, which consist of the gates NOT, CNOT, TOFFOLI and their immediate generalizations, do not possess any more computational power than AND-OR-NOT circuits. A quantum algorithm that outperforms the best known classical algorithms for the same task (e.g., Shor's poly-time number-factoring algorithm) cannot be implemented using only classical reversible gates. On the other hand, common gate libraries of quantum circuits contain some gates with classical behavior, such as the CNOT gate, and useful quantum circuits may contain large "classical" reversible sub-circuits that can be optimized without any knowledge of quantum mechanics. Thus, while the study of classical reversible circuits may be useful, one also has to study purely quantum gates [2].
The main difference between a purely quantum gate and a classical reversible gate is the ability to produce a "superposition" state such as $(|0\rangle + |1\rangle)/\sqrt{2}$ out of a classical ("basis") input state such as $|0\rangle$. In this example, we are dealing with one quantum bit, and the ability to store and transmit a linear combination of zero and one is a quantum property unavailable in textbook CMOS circuits. Moreover, while recent work in the VLSI community on smaller transistors is beginning to use quantum effects for faster switching, no quantum wires are available to transmit quantum states from one device to another. Recent implementations of quantum circuits by experimental physicists and chemists imply several independent solutions to this problem. Indeed, photons are convenient mobile carriers of quantum information, and when stationary particles are used, quantum gates are "brought to qubits" with tuned RF pulses.
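As a minimal sketch of the superposition example above, the snippet below applies the Hadamard gate H (a gate discussed later in the text, chosen here only for illustration) to the basis state |0⟩. The helper name `apply_gate` is hypothetical and the state is stored as a plain list of two amplitudes.

```python
import math

# Hadamard gate: an illustrative choice of a purely quantum one-qubit gate.
s = 1 / math.sqrt(2)
H = [[s, s],
     [s, -s]]

def apply_gate(gate, state):
    """Apply a 2x2 gate to a one-qubit state via a matrix-vector product."""
    return [sum(gate[r][c] * state[c] for c in range(2)) for r in range(2)]

ket0 = [1.0, 0.0]                    # classical ("basis") state |0>
plus = apply_gate(H, ket0)           # amplitudes of (|0> + |1>)/sqrt(2)
probs = [abs(a) ** 2 for a in plus]  # probabilities of reading 0 or 1
print(plus, probs)
```

Both amplitudes come out equal, so a measurement would return 0 or 1 with equal probability; no classical reversible gate maps a basis state to such a vector.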
Our work follows up on recently proposed synthesis algorithms for generic quantum circuits [2] and applies to a variety of implementation technologies. To discuss those algorithms we recall that quantum bits are complex-valued vectors. According to quantum mechanics, quantum computations and quantum gates that operate on those bits are represented by linear operators. They can be captured by matrices. For example, a one-qubit gate operates on a complex two-dimensional vector space with basis vectors $|0\rangle$ and $|1\rangle$, making it a 2 × 2 matrix.
Unlike classical logic gates that operate on bit-strings, quantum logic gates operate on complex vectors that are complex linear combinations (superpositions) of bit-strings. The dimension of the respective linear space is the number of different bit-strings. As in classical circuit diagrams, quantum circuit diagrams (see Figure 1) represent every bit by a wire. However, one speaks of qubits instead of bits in order to emphasize the availability of superpositions. A two-qubit computation is thus represented by a matrix acting on four-dimensional vectors. According to quantum mechanics, all quantum computations (including gates and circuits) act as unitary operators: in particular, all matrices are square and have inverses. This implies that any quantum gate and any quantum circuit have as many input qubits as output qubits. Furthermore, due to the no-cloning theorem [6], non-trivial fan-outs are not allowed in quantum circuits. Consequently, all vertical cuts that do not cross any gates cross the same number of wires.
We demand that quantum measurements [6] only be applied after the quantum circuit under consideration, which is typical. We make use of the fact that measurements are unaffected by global phase, that is, by multiplication by a scalar. That said, the reader can largely ignore measurements from now on and focus on combinational circuits.
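The claim that measurements are unaffected by global phase can be checked on a small example. The sketch below assumes the standard rule that an outcome's probability is the squared magnitude of its amplitude; the state, the phase value and the function name `measurement_probabilities` are arbitrary illustrations.

```python
import cmath

def measurement_probabilities(state):
    """Squared magnitudes of the amplitudes (standard measurement rule)."""
    return [abs(a) ** 2 for a in state]

state = [0.6 + 0j, 0.8j]                     # a normalized one-qubit state
global_phase = cmath.exp(0.7j)               # arbitrary scalar of modulus 1
shifted = [global_phase * a for a in state]  # same state up to global phase

p_original = measurement_probabilities(state)
p_shifted = measurement_probabilities(shifted)
print(p_original, p_shifted)
```

The two probability lists agree up to rounding, which is why operators that differ only by a scalar of modulus one can be treated as the same computation.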
In our quantum circuit diagrams the more significant qubits correspond to higher lines. Gates are applied left to right and are chosen from the gate library below, known to be universal [1] . The term "rotation" here refers to "Bloch sphere rotations" from quantum mechanics.
• The x-axis rotation:
• The CNOT gate $C^i_j$ flips the i-th bit if the j-th is 1.
(Here, $i \neq j$.) For example, on two qubits,
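The CNOT definition above ("flip the target bit if the control bit is 1") can be sketched as a permutation matrix. The snippet assumes 0-based wire indices with wire 0 as the most significant qubit; this numbering and the function name `cnot` are illustrative choices and need not match the paper's own labelling of wires.

```python
def cnot(target, control, n=2):
    """Permutation matrix of a CNOT on n qubits.

    Assumption: wires are numbered from 0, and wire 0 is the most
    significant bit of the basis-state index.
    """
    dim = 2 ** n
    matrix = [[0] * dim for _ in range(dim)]
    for basis in range(dim):
        bits = [(basis >> (n - 1 - w)) & 1 for w in range(n)]
        if bits[control] == 1:
            bits[target] ^= 1          # flip the target bit
        image = sum(b << (n - 1 - w) for w, b in enumerate(bits))
        matrix[image][basis] = 1       # column `basis` maps to row `image`
    return matrix

top_controls_bottom = cnot(target=1, control=0)
bottom_controls_top = cnot(target=0, control=1)
for row in top_controls_bottom:
    print(row)
```

With the control on the more significant qubit the matrix swaps the basis states |10⟩ and |11⟩; with the roles exchanged it swaps |01⟩ and |11⟩.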
In some technologies, an arbitrary one-qubit operator may be just as easy to implement as the specific ones shown above. Therefore, we will also consider the basic-gate library, which consists of arbitrary one-qubit operators and the CNOT. It is easy to change from one library to the other: using Euler angles to describe rotations in $\mathbb{R}^3$ gives a decomposition of an arbitrary one-qubit gate U into $e^{i\Phi} R_z(\theta) R_y(\phi) R_z(\psi)$ [1, Lemma 4.1]. For example, the Hadamard gate H can be expressed as
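As a hedged illustration of such a decomposition: under the convention $R_n(\theta) = e^{-iN\theta/2}$ stated in the Preliminaries, one valid choice (not necessarily the one displayed by the authors) is $H = e^{i\pi/2} R_y(\pi/2) R_z(\pi)$, i.e. $\Phi = \pi/2$, $\theta = 0$, $\phi = \pi/2$, $\psi = \pi$. The sketch below checks this numerically; the helper names `matmul`, `Rz` and `Ry` are illustrative.

```python
import cmath
import math

def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def Rz(t):
    """z-axis rotation exp(-i * sigma_z * t / 2)."""
    return [[cmath.exp(-1j * t / 2), 0], [0, cmath.exp(1j * t / 2)]]

def Ry(t):
    """y-axis rotation exp(-i * sigma_y * t / 2)."""
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]

phase = cmath.exp(1j * math.pi / 2)              # e^{i Phi} with Phi = pi/2
product = matmul(Ry(math.pi / 2), Rz(math.pi))   # R_y(pi/2) R_z(pi)
candidate = [[phase * x for x in row] for row in product]

h = 1 / math.sqrt(2)
hadamard = [[h, h], [h, -h]]
max_error = max(abs(candidate[r][c] - hadamard[r][c])
                for r in range(2) for c in range(2))
print(max_error)
```

The reported deviation is at the level of floating-point rounding, so under this sign convention the three-parameter form reproduces H exactly, including the global phase.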
Our work proposes analogous minimal decompositions for arbitrary two-qubit operators. In general, given the matrices of individual gates, the matrix of the circuit is computed bottom-up using the following linear algebra operations. First, when two gates (A and B) with equal numbers of input/output wires are composed sequentially (A on the left and B on the right), their matrices are multiplied (BA). Second, when two gates (no constraints here) act on disjoint inputs, i.e., are composed in parallel, the respective composite gate is represented by the tensor (Kronecker) product of the two matrices. In particular, if we wish to augment the gate A by a wire whose value does not change, the resulting matrix can be $A \otimes I$ or $I \otimes A$, where I is the 2 × 2 identity matrix.
Note that two gates composed in parallel can be instead "moved apart", augmented with unchanged wires, and then viewed as composed sequentially. This is captured by the equation A ⊗ B = (A ⊗ I)(I ⊗ B) with identity matrices of appropriate dimensions. Augmenting with identity allows one to capture sequential composition of gates with different numbers of inputs, and mixed sequential-parallel composition when only some of the wires are read by both gates. A simple way to simulate a quantum computation on an input vector is to compute matrix-vector products.
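The composition rules above can be sketched in a few lines. The matrices A and B below are arbitrary integer matrices (not unitary gates) chosen so that the comparisons are exact; `matmul` and `kron` are illustrative helper names.

```python
def matmul(A, B):
    """Matrix product of nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    """Tensor (Kronecker) product of nested-list matrices."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

I2 = [[1, 0], [0, 1]]
A = [[1, 2], [3, 4]]   # placeholder matrices, integer-valued for exactness
B = [[0, 1], [1, 0]]

sequential = matmul(B, A)                        # A applied first, then B
parallel = kron(A, B)                            # A and B on disjoint wires
moved_apart = matmul(kron(A, I2), kron(I2, B))   # (A x I)(I x B)

# Simulating the computation on an input vector is a matrix-vector product.
state = [1, 0, 0, 0]
output = [sum(parallel[r][c] * state[c] for c in range(4)) for r in range(4)]
print(parallel == moved_apart, sequential, output)
```

Here `parallel` and `moved_apart` coincide, matching the identity $A \otimes B = (A \otimes I)(I \otimes B)$. The order of the factors in `matmul(B, A)` reflects that the gate drawn on the left acts first.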
Recent empirical work on quantum communication, cryptography and computation [6] resulted in a number of experimental systems that can implement two-qubit circuits [4] . Thus, decomposing arbitrary two-qubit operators into fewer gates from a universal library may simplify such physical implementations. Two-qubit synthesis is also interesting because it can be used in the context of peephole optimization to simplify larger circuits. Additionally, quantum communication protocols usually transmit one qubit at a time, and encoding/decoding circuits typically require only two or three qubits.
While the universality of various gate libraries has been established in the past [3, 1] , the minimization of gate counts has only been studied recently. To this end, Zhang et al. [8] propose a generic quantum circuit with six CNOT gates that can implement an arbitrary two-qubit operator. Bullock and Markov [2] [7] proved that three CNOT gates are necessary and sufficient, and proposed another generic quantum circuit. Our work improves or broadens each of the above results, as summarized in Table 1 . When applying our results to specific useful computations, we discovered that a decomposition that is inferior in the worst-case sense sometimes produces smaller circuits.
The remainder of the paper is structured as follows. Section 2 provides background on quantum circuits. Constructive upper bounds are proven in Section 3, and lower bounds outlined in Section 4. Our algorithms for synthesis of quantum circuits are applied to useful operators in Section 5, and Section 6 summarizes our results. Many proofs are omitted, but can be found in our pre-print at the Los Alamos site http://xxx.lanl.gov/abs/quant-ph/0308033
Preliminaries
The Bloch sphere isomorphism [6] identifies a unit vector $n = (n_x, n_y, n_z)$ with $N = n_x\sigma_x + n_y\sigma_y + n_z\sigma_z$, where
$$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$
are the Pauli matrices. Under this identification, rotation by the angle θ around the vector n is given by the special unitary operator $R_n(\theta) = e^{-iN\theta/2}$. It is from this identification that the decomposition of an arbitrary one-qubit gate $U = e^{i\Phi} R_z(\theta) R_y(\phi) R_z(\psi)$ arises. In fact, one may take any pair of orthogonal vectors in place of y, z.
Lemma 2.1 Let $n, m \in \mathbb{R}^3$, $n \cdot m = 0$, and $U \in SU(2)$. Then there exist θ, φ, and ψ such that $U = R_n(\theta) R_m(\phi) R_n(\psi)$.
Moreover, the product $R_n(\theta) R_m(\phi) R_n(-\theta)$ is again a rotation through the angle φ around the axis p, where p is the image of m under the rotation $R_n(\theta)$. In the special case of $n \perp m$ and $\theta = \pi/2$, we have $p = n \times m$. For convenience, we define $S_m = R_m(\pi/2)$, and note that $S_m^* = S_{-m}$. The $S_z$ gate is the same (up to phase) as the usual S gate.
Given a one-qubit gate, e.g., $R_x$, a superscript defines the wire on which it is applied, e.g., $R_x^j$. As for two-qubit gates, $C^a_b$ denotes the controlled-not (CNOT) gate that inverts wire a (target) when wire $b \neq a$ (control) carries $|1\rangle$. $\chi^{j,k}$ represents the two-qubit SWAP gate that interchanges wires j and k. Table 2 summarizes circuit identities used in our work.
Table 2. Circuit identities used in our work.
Circuit identity | Description
$R_n(\theta) R_n(\phi) = R_n(\theta + \phi)$ | merging $R_n$ gates
$n \perp m \implies S_n R_m(\theta) = R_{n \times m}(\theta) S_n$ | changing axis of rotation
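The "changing axis of rotation" identity can be spot-checked numerically. The sketch below assumes the convention $R_n(\theta) = e^{-iN\theta/2} = \cos(\theta/2) I - i \sin(\theta/2) N$ for a unit vector n; the particular axes, the angle and the helper names `rot`, `cross` and `matmul` are arbitrary illustrations.

```python
import math

PAULI = (
    [[0, 1], [1, 0]],        # sigma_x
    [[0, -1j], [1j, 0]],     # sigma_y
    [[1, 0], [0, -1]],       # sigma_z
)

def rot(axis, theta):
    """R_n(theta) = cos(theta/2) I - i sin(theta/2) N for a unit axis n."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    N = [[sum(axis[k] * PAULI[k][r][col] for k in range(3))
          for col in range(2)] for r in range(2)]
    return [[c * (1 if r == col else 0) - 1j * s * N[r][col]
             for col in range(2)] for r in range(2)]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

n = (0.0, 0.6, 0.8)      # unit vector
m = (1.0, 0.0, 0.0)      # unit vector orthogonal to n
theta = 0.37             # arbitrary rotation angle
S_n = rot(n, math.pi / 2)

lhs = matmul(S_n, rot(m, theta))              # S_n R_m(theta)
rhs = matmul(rot(cross(n, m), theta), S_n)    # R_{n x m}(theta) S_n
max_error = max(abs(lhs[r][c] - rhs[r][c]) for r in range(2) for c in range(2))
print(max_error)
```

The two products agree up to rounding, consistent with $S_n$ conjugating a rotation about m into a rotation about $n \times m$.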
The canonical decomposition of SU(4) [5] states that for any $U \in SU(4)$ there exist $a, b, c, d \in SU(2)$ and δ diagonal in the magic basis such that $U = (a \otimes b)\,\delta\,(c \otimes d)$. We implement δ in two steps. First, the following implementation is known for a generic diagonal matrix in SU(4) [2].
where $a = R_z^*(\theta + \phi)$, $b = R_z^*(\theta + \psi)$, and $c = R_z(\phi + \psi)$. Second, we implement the change-of-basis matrix.
This circuit is much smaller than the 7-gate circuit given for E in [2], which includes three CNOT gates.
Minimal two-qubit circuits
This section details the results summarized in Table 1. We set $bS_z^* = b'$ and $S_z d = d'$ to absorb the outermost S gates. Remaining S gates get absorbed by moving $S_x$ and $S_z$ through CNOT gates, by merging $S_z$ and $R_z$ gates, and by the "changing axis of rotation" identity from Table 2. Furthermore, we replace the pair of adjacent CNOT gates with a CNOT and a SWAP (the CNOT elimination rule), and push the SWAP to the end of the circuit. As we started by computing $U' = e^{i\pi/4}\chi^{1,2}U$, the two SWAP gates cancel out. Thus the circuit in Fig. 1 [...]
We do not have a sharp result for the $\{R_x, R_z, \mathrm{CNOT}\}$ library, but fall short by one gate.
Proposition 3.3 Any two-qubit operator can be simulated by nineteen gates in the $\{R_z, R_x, \mathrm{CNOT}\}$ gate library.
Proof: Use the decomposition of Theorem 3.1, and expand the $a$, $b'$, $c$, $d'$ gates using the $R_z R_x R_z$ decomposition. Move the bottom-line $R_z$ gates inward, and combine them with the $R_y$ gates. Next, re-expand the conglomerated $R_y$, $R_z$ gates using the $R_x R_z R_x$ decomposition, and combine neighboring $R_x$ gates to obtain
This circuit has nineteen $R_x$, $R_z$, and CNOT gates. Finally, it may be that in a given quantum technology, it is just as easy to implement any one-qubit operator as to implement the $R_z$ and $R_y$ gates. We count gates in Fig. 1.
Lower bounds
In this work we distinguish two types of gate libraries for quantum operators that are universal in the exact sense (cf. the Solovay-Kitaev theorem). The basic-gate library [1] contains the CNOT and all one-qubit gates. Libraries of the second type also contain the CNOT gate and one-qubit gates, but we additionally require that such libraries contain only finitely many one-parameter subgroups of SU(2). We call these elementary-gate libraries, and Lemma 2.1 indicates that if such a library includes two one-parameter subgroups of SU(2), corresponding to rotations $R_n(t)$ around orthogonal axes, then the library is universal. Below, we derive gate-count lower bounds for both library types.
A circuit topology is a circuit diagram (a graph) with CNOT gates and placeholders for one-qubit gates. As in Fig. 1, the placeholders are labelled so as to specify a subset of one-qubit operators; in this work, the only such subsets are one-parameter subgroups of SU(2) and all of SU(2). We say a circuit topology (of n wires) is universal if it can compute any operator in $U(2^n)$, up to a global phase. We have $\dim[U(2^n)] = 2^{2n}$, which for two qubits yields 16. Ignoring phase decreases the dimension by one, hence a universal two-qubit topology T needs at least 15 one-parameter placeholders or at least 5 arbitrary one-qubit placeholders. □
To produce a sharper lower bound on the total number of gates required, we use identities from Table 2 to prove that three CNOT gates are additionally required to prevent cancellation and absorption of one-parameter placeholders. Indeed, it is clear that a circuit with n wires and k CNOT gates needs no more than n + 2k one-qubit gates. One can further eliminate redundant parameters by observing that an $R_z$ gate can pass through the control of a CNOT, and an $R_x$ may pass through the target.
Proof: Proposition 4.1 implies that at least three CNOT gates are necessary in general; at least five one-qubit placeholders are required for dimension reasons. The resulting overall lower bound of eight basic gates can be improved further by observing that given any placement of five one-qubit gates around three CNOTs, one can find two one-qubit gates on the same wire, separated only by a CNOT. Using the $R_z R_x R_z$ or $R_x R_z R_x$ decomposition as necessary, the five one-qubit gates can be replaced by fifteen one-parameter gates in such a way that the closest parameterized gates arising from the adjacent one-qubit gates can be combined. Thus, if five one-qubit placeholders and three CNOTs suffice, then so do fourteen one-parameter placeholders and three CNOTs, which contradicts dimension-based lower bounds. □
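The dimension count behind these bounds can be written out as a short worked calculation. It uses only the figures quoted above plus the standard fact that $\dim SU(2) = 3$, so an arbitrary one-qubit placeholder contributes at most three parameters.

```latex
\begin{align*}
\dim U(2^n) &= (2^n)^2 = 2^{2n}, \qquad \dim U(4) = 2^{4} = 16, \\
\dim U(4) - 1 &= 15 \quad \text{(global phase ignored)}, \\
\#\{\text{one-parameter placeholders}\} &\ge 15, \\
\#\{\text{arbitrary one-qubit placeholders}\} &\ge \lceil 15/3 \rceil = 5, \\
5 \cdot 3 - 1 &= 14 < 15 \quad \text{(after combining two adjacent one-parameter gates)}.
\end{align*}
```

The last line is the contradiction used in the proof: five one-qubit placeholders around three CNOTs would collapse to fourteen one-parameter placeholders, one short of the fifteen required.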
Synthesis of useful operators
In the generic case, our constructions are optimal or near-optimal. However, practical circuits need not expose worst cases, and our synthesis techniques can benefit from additional optimization. In this section, we examine how our synthesis procedure can be modified to better handle useful circuits from the literature.
Example 5.1 Consider the operator $H \otimes H$ obtained by applying the Hadamard gate to both lines of a two-qubit circuit. This operator is used to create superpositions, and appears in Grover's quantum search and Shor's number-factoring algorithms [6]. While it is clear that $H \otimes H$ can be implemented using only two basic gates and no CNOTs, the decomposition process of Theorem 3.1 yields the following highly suboptimal circuit. The excess gates occur because the decomposition process of Theorem 3.1 begins by replacing U with $U' = e^{i\pi/4}\chi^{1,2}U$, thus obscuring the tensor-product structure. On the other hand, the original operator $H \otimes H$ appears prominently at the beginning of the circuit. It can be shown that this happens in general; that is, if $U = X \otimes Y$, then the matrices c, d of the proof of Theorem 3.1 satisfy $c \otimes d = X \otimes Y$, and the remainder of the circuit computes the identity. This suggests a general optimization: find such contiguous subcircuits that compute the identity and remove them. In our case, it would ensure the automatic discovery of small circuits.
Another way to recognize tensor products is to decompose U directly instead of replacing U with $e^{i\pi/4}\chi^{1,2}U$. In general, this yields one additional CNOT as the CNOT-pair elimination rule from Table 2 no longer applies. Yet, sometimes this alternative decomposition yields smaller circuits.
Example 5.2 Let $U = C^1_2 (I \otimes \sigma_x)$ be the matrix that interchanges $|00\rangle \leftrightarrow |01\rangle$ while fixing $|10\rangle$ and $|11\rangle$. This operator occurs in the Deutsch-Jozsa algorithm, one of the orig-
