In this paper we investigate how quantum architectures affect the efficiency of executing the Quantum Fourier Transform (QFT) and linear transformations, which are essential parts of stabilizer/Clifford group circuits. In particular, we show that in the most common and realistic architectures, including Linear Nearest Neighbor (LNN), 2D lattice, and bounded degree graph (containing a chain of length n), n-qubit QFT and n-qubit stabilizer circuits can be parallelized to linear depth. We assume that no auxiliary qubits can be used during the computation. We also construct lower bounds that show the efficiency of our approach.
Introduction
Quantum computation has attracted attention because it appears to reduce the computational complexity of certain calculations. For the quantum circuit computational model, there exist a number of physical quantum information processing implementations, such as, but not limited to, liquid NMR (up to 12 qubits at a time) [11] and trapped ions (8 qubits) [9]. Generally, a large number of qubits is required for computational purposes. In this work, we do not allow any auxiliary qubits, to reflect the apparent hardness of scaling quantum information processing devices.
Quantum circuits have been optimized to require less space, fewer gates, and smaller depth. This is important for the efficient realization of quantum algorithms. However, in realistic quantum computations [9, 11, 16] not all interactions between pairs of qubits can be established directly. A study of quantum computing architectures for existing and emerging quantum technologies shows that the fastest/possible direct interactions form a bounded degree graph (e.g. liquid NMR quantum information processing) or 1D or 2D (sub)lattices [15]. A mixed architecture, where qubits may be teleported to where they are needed, was studied in [5]. However, such an architecture cannot, in general, be realized in an arbitrary technology, and it was shown that teleporting a single value (simultaneous teleportation of many qubits may be less efficient) is only efficient when compared to more than 2-4 levels of SWAPs. The latter is important for this work since we only use depth-1 swapping of multiple qubits via SWAP gates.
Generally speaking, due to spatial constraints it seems unrealistic to believe that a direct scalable implementation of the "sea-of-qubits" architecture (where every two qubits are neighbors) will ever be found. Furthermore, in classical computation the number of neighbors is limited, and there is no obvious reason to believe that the quantum world is different. Thus, the complexity of circuit designs must be refined to take into account the limitations of possible quantum computing architectures. In the architectures that we consider, two gates can be executed in parallel as long as they operate on different sets of qubits. This is a natural assumption for most quantum technologies.
The linear nearest neighbor (LNN) architecture, also known as chain nearest neighbor, is often considered a good (and, in fact, very restrictive) approximation of what a scalable quantum architecture may be. Mathematically, in an LNN architecture with n qubits $q_1, q_2, ..., q_n$, two-qubit gates are allowed only between qubits whose subscript values differ by one. The LNN architecture describes 1D lattices. It misses possible direct interactions in 2D lattices and may restrict the number of useful interactions in connected graphs. However, if one can show that a circuit can be efficiently reorganized to be executed in the LNN architecture, such a circuit can be run in many other architectures.
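As a minimal sketch of the LNN constraint and of the parallel-execution rule stated above, the following Python snippet checks whether a two-qubit gate is allowed on a chain and whether a set of gates can share one time step. The function names are illustrative and do not come from the paper.

```python
def is_lnn_gate(s, t):
    """A two-qubit gate on q_s, q_t is allowed in LNN iff the subscripts differ by one."""
    return abs(s - t) == 1


def is_parallel_stage(gates):
    """Gates can share a time step iff no qubit is used by more than one of them."""
    used = [q for gate in gates for q in gate]
    return len(used) == len(set(used))


stage = [(1, 2), (3, 4)]
print(all(is_lnn_gate(s, t) for s, t in stage), is_parallel_stage(stage))  # True True
print(is_lnn_gate(1, 3), is_parallel_stage([(1, 2), (2, 3)]))              # False False
```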
The Quantum Fourier Transformation (QFT) is an analogue of the classical discrete Fourier transformation; in the quantum case, however, the transformation is applied to the amplitudes. QFT serves as a basis for a number of efficient quantum algorithms. Most notably, it is at the heart of the polynomial time quantum algorithms for integer factoring and discrete logarithm [14]. Efficient implementation of QFT is therefore important, and the topic has been studied extensively [3, 4, 10]. Researchers have presented linear depth circuits and logarithmic depth circuits (the latter using a number of auxiliary qubits). However, we point out that the computational model used is "sea-of-qubits". Known circuits for QFT have a regular structure [10, 12], but they require direct interaction between every two qubits, which makes such circuits especially inconvenient for quantum architectures where only a finite number of neighbors is allowed. In an architecture with a finite number of neighbors, such as LNN, state transfer down the chain may require up to n SWAP gates. A linear depth QFT circuit that can be executed in the LNN architecture has been reported in [6]. We rediscover this circuit with our generalized technique and study lower bounds.
Stabilizer circuits (also known as unitary stabilizer circuits or Clifford group circuits) were introduced and studied for their use in the encoding and decoding stages of quantum error-correction codes [2, 8, 10]. Stabilizer circuits were efficiently simulated [1] as an 11-stage sequence of Hadamard (H), Phase (P), and linear function (C) stages, H-C-P-C-P-C-H-P-C-P-C. Each P and H stage is a depth-1 computation composed of single-qubit gates. The complexity of stabilizer circuits is thus defined by the complexity of realizing linear functions. Efficient circuits for linear functions are, therefore, at the very core of quantum computation. In this paper we show that every stage C can be executed in linear depth in the LNN architecture. Thus, the entire stabilizer circuit requires only linear time to be executed.
The remainder of the paper is organized as follows. We start by introducing the concept of a skeleton circuit and study its properties. In Subsections 2.1 and 2.2 the lessons learned are applied to show that QFT and linear reversible/stabilizer circuits can be executed in linear time in the LNN architecture. Section 3 reports lower bounds for a particular, and apparently very important, class of skeleton circuits. Concluding remarks can be found in Section 4.
Skeleton circuit
Any quantum circuit composed of single-qubit and two-qubit gates can be thought of as a circuit composed of generic two-qubit operations, each of which consists of a two-qubit gate of the initial circuit with the surrounding gates absorbed into it (the trivial case in which only single-qubit gates are applied to a specific qubit throughout the entire computation is ignored as uninteresting). We call such a circuit a skeleton circuit. The complexity of a skeleton circuit defines the complexity of the initial circuit (assuming any two-qubit gate has a finite cost) and vice versa. We next study skeleton circuits of a certain type and apply the lessons learned to construct circuits for QFT and linear reversible/stabilizer circuits that can be executed in linear time in the LNN architecture.
The basic skeleton circuit we consider is illustrated in Figure 1(a). Mathematically, the skeleton circuit SC is defined as

$$SC := G_1^{i_1}(q_1, q_2)\, G_2^{i_2}(q_1, q_3) \cdots G_{n-1}^{i_{n-1}}(q_1, q_n)\, G_n^{i_n}(q_2, q_3) \cdots G_{n(n-1)/2}^{i_{n(n-1)/2}}(q_{n-1}, q_n), \qquad (1)$$

where $G_*$ is a two-qubit gate that operates on the qubits indicated in brackets, $i_*$ take Boolean values, and for a gate $G$, $G^1$ is the gate $G$ itself, whereas $G^0 = Id$ (identity, i.e. this gate is not applied). In other words, $i_*$ indicate whether a gate is present. Since every two quantum gates that operate on non-intersecting sets of qubits commute, the SC circuit can be executed in parallel in $2n - 3$ stages. This is illustrated in Figure 1(b) for the case $n = 5$.
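The parallelization can be sketched in a few lines of Python. The sketch assumes the gate order suggested by the text (first $q_1$ with each of $q_2, ..., q_n$, then $q_2$ with each of $q_3, ..., q_n$, and so on) and assumes all gates are present; the function names are ours. Gates are placed as early as possible, and two gates that share a qubit are never reordered.

```python
def skeleton_gates(n):
    """Pairs (s, t), s < t, in the assumed SC order: q_1 with q_2..q_n, then q_2 with q_3..q_n, ..."""
    return [(s, t) for s in range(1, n) for t in range(s + 1, n + 1)]


def parallelize(gates, n):
    """As-soon-as-possible scheduling that only commutes gates on disjoint qubits."""
    busy_until = [0] * (n + 1)          # last stage in which each qubit was used
    stages = {}
    for s, t in gates:
        stage = max(busy_until[s], busy_until[t]) + 1
        busy_until[s] = busy_until[t] = stage
        stages.setdefault(stage, []).append((s, t))
    return [stages[k] for k in sorted(stages)]


# The number of stages is 2n - 3 for every n tried here.
for n in range(2, 9):
    assert len(parallelize(skeleton_gates(n), n)) == 2 * n - 3

# The case n = 5: seven stages.
for k, stage in enumerate(parallelize(skeleton_gates(5), 5), start=1):
    print(k, stage)
```

In this schedule the gate on $q_s$ and $q_t$ lands in stage $s + t - 2$, so the last gate, on $q_{n-1}$ and $q_n$, lands in stage $2n - 3$.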
Next, the circuit can be adapted to the LNN architecture by inserting a SWAP gate $SWAP(q_s, q_t)$ after each gate $G_k^{i_k}(q_s, q_t)$. This is illustrated in Figure 1(c). Let us note that the skeleton circuit that we consider can be executed in linear time in the LNN architecture for any initial LNN connectivity pattern of the input, and can return the output in any desired order. For that, at most a linear depth swapping stage before and after the circuit is required, which does not change the overall linearity of the depth. The circuit illustrated in Figure 1(c) not only allows execution in the LNN architecture, it also does not change the LNN connectivity pattern ($q_1 - q_2 - ... - q_n$), and thus such circuits can be cascaded one after the other with no swapping in between. This observation will be used in Subsection 2.2.
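The following Python sketch simulates this adaptation on a chain. It assumes that all gates are present, that the skeleton pairs are ordered as $q_1$ with $q_2, ..., q_n$, then $q_2$ with $q_3, ..., q_n$, and so on, and that the gate on $q_s, q_t$ is placed in stage $s + t - 2$ (the as-soon-as-possible schedule). The function name is ours. Each gate is immediately followed by a SWAP of the same two qubits, and the code asserts that every gate acts on neighbors at the moment it is applied.

```python
def run_on_chain(n):
    """Run the parallelized skeleton circuit on a chain, swapping the two qubits after every gate.

    Returns the final left-to-right order of the qubits and the number of stages used.
    """
    # Stage k holds the pairs (s, t), s < t, with s + t - 2 == k.
    stages = {}
    for s in range(1, n):
        for t in range(s + 1, n + 1):
            stages.setdefault(s + t - 2, []).append((s, t))

    chain = list(range(1, n + 1))       # chain[p] is the qubit sitting at position p
    for k in sorted(stages):
        for s, t in stages[k]:
            ps, pt = chain.index(s), chain.index(t)
            assert abs(ps - pt) == 1, "gate would act on non-neighboring qubits"
            chain[ps], chain[pt] = chain[pt], chain[ps]   # SWAP(q_s, q_t) right after the gate
    return chain, len(stages)


print(run_on_chain(5))   # ([5, 4, 3, 2, 1], 7)
```

In this sketch the qubits end up in reversed order along the chain, which leaves the set of neighboring pairs unchanged.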
QFT in the LNN architecture
A circuit that realizes QFT and requires no ancilla qubits is illustrated in Figure 2(a). Its skeleton circuit, Figure 2(b), is of the type considered in the previous section with all $i_* = 1$. Therefore, QFT can be executed in linear time. This is, however, a known result: [6] reports a construction that is equivalent to ours. The new results that we add to this discussion are the lower bounds presented in Section 3.
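To make the connection between QFT and the skeleton circuit concrete, the sketch below builds the textbook ancilla-free QFT circuit (Hadamard gates and controlled phase rotations) in plain Python and checks it against the discrete Fourier transform. We assume this textbook circuit coincides with the one in Figure 2(a) up to output ordering; the function names and the state-vector representation are ours.

```python
import cmath
import math


def qft_circuit(n):
    """Gate list of the textbook QFT circuit; qubit 1 is the most significant bit."""
    gates = []
    for j in range(1, n + 1):
        gates.append(("H", j))
        for k in range(j + 1, n + 1):
            gates.append(("CPHASE", j, k, 2 * math.pi / 2 ** (k - j + 1)))
    return gates


def apply(gates, state, n):
    """Apply the gates in order to a state vector of length 2**n."""
    for gate in gates:
        if gate[0] == "H":
            bit = 1 << (n - gate[1])
            new = state[:]
            for x in range(len(state)):
                if not x & bit:
                    a, b = state[x], state[x | bit]
                    new[x] = (a + b) / math.sqrt(2)
                    new[x | bit] = (a - b) / math.sqrt(2)
            state = new
        else:
            _, j, k, theta = gate
            bj, bk = 1 << (n - j), 1 << (n - k)
            state = [amp * cmath.exp(1j * theta) if (x & bj) and (x & bk) else amp
                     for x, amp in enumerate(state)]
    return state


def bit_reverse(y, n):
    return int(format(y, "0{}b".format(n))[::-1], 2)


def matches_dft(n):
    """Compare the circuit output (read in bit-reversed order) with the DFT on every basis state."""
    size = 2 ** n
    gates = qft_circuit(n)
    for x in range(size):
        state = [0j] * size
        state[x] = 1 + 0j
        out = apply(gates, state, n)
        for y in range(size):
            expected = cmath.exp(2j * math.pi * x * y / size) / math.sqrt(size)
            if abs(out[bit_reverse(y, n)] - expected) > 1e-9:
                return False
    return True


n = 4
pairs = [(g[1], g[2]) for g in qft_circuit(n) if g[0] == "CPHASE"]
print(matches_dft(n), len(pairs) == n * (n - 1) // 2)
```

The two-qubit gates of this circuit touch every pair of qubits exactly once, in the order $q_1$ with $q_2, ..., q_n$, then $q_2$ with $q_3, ..., q_n$, and so on, which is why its skeleton has all gates present.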
Stabilizer/linear circuits
Synthesis of efficient linear circuits has been studied in [13]. The authors report a synthesis algorithm capable of producing a circuit containing $O(n^2 / \log n)$ CNOT gates. It was also proven that their synthesis is asymptotically optimal in that there exists a linear function that requires $\Theta(n^2 / \log n)$ CNOT gates. In this paper, the goal is different. We target minimization of the depth as opposed to the number of gates used. The depth of our circuit is linear in the number of qubits n, and it is upper bounded by 18n + O(1) CNOTs (assuming every SWAP is substituted with a suitable 3-CNOT implementation). We also prove asymptotic optimality, which in our case is a trivial result.
Every reversible linear function of n variables $q = (q_1, q_2, ..., q_n)^t$ can be written as a matrix multiplication $Aq$, where $A$ is an $n \times n$ non-singular Boolean matrix. Synthesizing such a function is equivalent to composing a sequence of gate operations that transform the matrix $A$ to its reduced echelon form. Due to reversibility, the reduced echelon form of $A$ is the identity matrix. A standard technique for transforming a matrix $A$ to the identity is the Gauss-Jordan elimination algorithm. In the following, we illustrate the application of the Gauss-Jordan elimination algorithm and then modify its circuit to allow it to be executed with a linear number of computational stages. Parameters $i_*$ and $p_*$ take Boolean values and indicate whether a gate has been applied (1) or not (0). Parameters $p_*$ are reserved for the gates applied to update the values of the diagonal elements of the matrix $A$ during Gauss-Jordan elimination.
• Step 1. Make sure that the pivot element $a_{1,1} \neq 0$. If $a_{1,1} \neq 0$, assign $p_1 := 0$. Otherwise choose $a_{j,1} \neq 0$, apply gate $CNOT(q_j, q_1)$, and assign $p_1 := 1$.
• Steps $s = 2..n$. Transform each $a_{s,1}$ to 0 through application (if needed) of the gate $CNOT(q_1, q_s)$. If a gate was applied at step $s$, set $i_s := 1$; otherwise, $i_s := 0$.
• Step $n + 1$. Make sure that the pivot element $a_{2,2} \neq 0$. If $a_{2,2} \neq 0$, do nothing ($p_2 := 0$); otherwise choose $a_{j,2} \neq 0$, apply gate $CNOT(q_j, q_2)$, and set $p_2 := 1$.
• Steps $s = n+2..2n-1$. Transform each $a_{s-n+1,2}$ to 0 through application (if needed) of the gate $CNOT(q_2, q_{s-n+1})$. If a gate was applied at step $s$, set $i_s := 1$; otherwise, $i_s := 0$.
• . . .
• Step $\frac{n(n+1)}{2} - 2$. Make sure that the pivot element $a_{n-1,n-1} \neq 0$. If $a_{n-1,n-1} \neq 0$, do nothing ($p_{n-1} := 0$); otherwise apply gate $CNOT(q_n, q_{n-1})$ and assign $p_{n-1} := 1$. After this step, all parameters $p_*$ must be set.
• Step $\frac{n(n+1)}{2} - 1$. Transform $a_{n,n-1}$ to 0 through application (if needed) of the gate $CNOT(q_{n-1}, q_n)$. If a gate was applied at this step $s$, set $i_s := 1$; otherwise, $i_s := 0$.
• Steps $s = \frac{n(n+1)}{2}..n^2 - 1$. If $a_{k,l} \neq 0$, apply $CNOT(q_l, q_k)$ for $k = l-1..1$ inside for $l = n..2$, and set $i_s$ to one iff a gate has been applied.
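The elimination steps above can be sketched in Python as follows. This is an illustrative implementation, not the paper's exact procedure: it uses 0-based indices, always picks the first suitable row when a pivot needs fixing, and records only the applied gates rather than the indicators $i_*$ and $p_*$. A gate CNOT(c, t) is modeled as adding row c to row t over GF(2).

```python
def synthesize_cnots(A):
    """Return CNOT(control, target) gates (0-based qubit indices) that reduce A to the identity.

    CNOT(c, t) is modeled as the row operation row_t := row_t XOR row_c over GF(2).
    A must be a non-singular 0/1 matrix given as a list of rows.
    """
    n = len(A)
    M = [row[:] for row in A]
    gates = []

    def cnot(c, t):
        M[t] = [x ^ y for x, y in zip(M[t], M[c])]
        gates.append((c, t))

    # Forward pass: fix each pivot if needed, then clear the entries below it.
    for col in range(n):
        if M[col][col] == 0:
            j = next(r for r in range(col + 1, n) if M[r][col])
            cnot(j, col)
        for r in range(col + 1, n):
            if M[r][col]:
                cnot(col, r)

    # Backward pass: clear the entries above the diagonal, last column first.
    for col in range(n - 1, 0, -1):
        for r in range(col - 1, -1, -1):
            if M[r][col]:
                cnot(col, r)

    assert M == [[int(r == c) for c in range(n)] for r in range(n)]
    return gates


def apply_rows(gates, M):
    """Apply the row operations of the given gates, in order, to a copy of M."""
    M = [row[:] for row in M]
    for c, t in gates:
        M[t] = [x ^ y for x, y in zip(M[t], M[c])]
    return M


A = [[0, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]
gates = synthesize_cnots(A)
print(gates)   # [(1, 0), (0, 1), (0, 2), (2, 1), (2, 0)]
```

Because every such row operation is its own inverse, applying the same gates in reverse order to the identity matrix reproduces $A$.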
We next use the gate commutation rule (two CNOT gates commute iff the target of one gate is not equal to the control of the other) and the circuit identity CNOT(a, c)CNOT(c, b) = CNOT(c, b)CNOT(a, b)CNOT(a, c) to move all (n − 1) gates CNOT(a, c) with parameter $p_*$ to the front of the network. Note that every time the commutation rule is used the gates just change their position, and every time the circuit identity is applied we introduce a new gate CNOT(a, b). However, such a gate can always be commuted to the closest CNOT(a, b) on its left, and this is accounted for by the updates to the $i_*$ gate presence indicators. The circuit gets transformed to the one illustrated in Figure 4. Parameters $i_*$ are changed by EXORing each $i_j$, $j < \frac{n(n+1)}{2}$, with $p_k$, for $k < n$ such that $q_k$ is the target of the gate used at step $j$. The constructed circuit consists of three parts marked I-III in Figure 4.

[Figure 4: the transformed circuit, parts I-III, with gates labeled by step number.]

The skeleton of each of these parts is described by equation (1), which is obvious for parts II and III and requires a short explanation for part I. Divide the skeleton circuit (Figure 1(a)) into (n − 1) parts, with the first containing the first (n − 1) gates, the second containing the next (n − 2) gates, and so on, the last, (n − 1)st part containing the one last gate. Then, gate $G_i$ for $i = 1..n-1$ from part I of the circuit in Figure 4 can be matched (via "skeletonization") to some gate in the $i$th part of the skeleton circuit SC. Thus, every linear reversible function can be computed by a circuit of depth at most $3(2n - 3) = 6n + O(1)$. Furthermore, since each SWAP-CNOT pair can be rewritten as two CNOTs (Figure 5) and a SWAP requires no more than 3 CNOT gates, the overall depth in terms of CNOTs can be upper bounded by 18n + O(1). We note that in some quantum information processing proposals the pair CNOT-SWAP can be executed more efficiently than a single CNOT or a single SWAP, such as in [7], Fig. 1. Due to locality reasons our upper bound has the same asymptotics as the lower bound, and thus our circuits are asymptotically optimal. Using the H-C-P-C-P-C-H-P-C-P-C decomposition for stabilizer circuits [1], these upper bounds directly translate to a circuit of depth at most 30n + O(1) composed of generic two-qubit gates, and a circuit of depth at most 90n + O(1) in the library with single-qubit and CNOT gates.
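The gate identities used in this subsection can be checked exhaustively on classical bit strings, which suffices because CNOT and SWAP circuits permute the computational basis states. The Python sketch below (helper names are ours) verifies the three-CNOT identity, the commutation rule on two examples, the 3-CNOT implementation of SWAP, and the rewriting of a CNOT-SWAP pair as two CNOTs.

```python
from itertools import product


def run(gates, bits):
    """Apply gates, in time order, to a tuple of classical bits."""
    bits = list(bits)
    for kind, x, y in gates:
        if kind == "CNOT":              # x is the control, y is the target
            bits[y] ^= bits[x]
        else:                           # "SWAP"
            bits[x], bits[y] = bits[y], bits[x]
    return tuple(bits)


def equivalent(circ1, circ2, width):
    """True iff both circuits act identically on every basis state."""
    return all(run(circ1, b) == run(circ2, b) for b in product((0, 1), repeat=width))


a, b, c = 0, 1, 2
lhs = [("CNOT", a, c), ("CNOT", c, b)]
rhs = [("CNOT", c, b), ("CNOT", a, b), ("CNOT", a, c)]
# The identity holds whether the product is read left-to-right or right-to-left in time.
assert equivalent(lhs, rhs, 3) and equivalent(lhs[::-1], rhs[::-1], 3)

# Commutation rule: a shared target commutes, target-equals-control does not.
assert equivalent([("CNOT", 0, 2), ("CNOT", 1, 2)], [("CNOT", 1, 2), ("CNOT", 0, 2)], 3)
assert not equivalent([("CNOT", 0, 1), ("CNOT", 1, 2)], [("CNOT", 1, 2), ("CNOT", 0, 1)], 3)

# SWAP as three CNOTs, and a CNOT followed by a SWAP as two CNOTs.
assert equivalent([("SWAP", 0, 1)], [("CNOT", 0, 1), ("CNOT", 1, 0), ("CNOT", 0, 1)], 2)
assert equivalent([("CNOT", 0, 1), ("SWAP", 0, 1)], [("CNOT", 1, 0), ("CNOT", 0, 1)], 2)
print("all identities hold")
```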
Lower bounds
In this section we study lower bounds on the depth of the skeleton circuit SC defined in equation (1), assuming all gates are present (i.e. each $i_* = 1$). We further assume that a pair of gates $G(q_i, q_j)SWAP(q_i, q_j)$ requires two units of execution time, one for each of the gates. In practice, a direct implementation of the pair $G(q_i, q_j)SWAP(q_i, q_j)$ may be more efficient [17], but the particulars of such a construction depend on the specific Hamiltonian, which is unknown in the general case. The depth of the circuit illustrated in Figure 1(c) is thus $4n - 6$. The lower bounds obtained below are directly applicable to the QFT circuit.
To prove lower bounds, we need to restrict the set of possible computations. We define two circuit-type quantum computational models, A and B. We require that in each of them, in order to compute the SC (equation (1)), all $\frac{n(n-1)}{2}$ two-qubit gates need to be executed, and no ancilla qubits can be used. Furthermore,
• in model A we assume that the gates required to be executed in SC cannot be commuted (other than trivially: a pair of gates operating on non-intersecting sets of qubits always commutes);
• in model B we allow the gates to be executed in any order (i.e. this lets us obtain bounds that allow commuting gates through the circuit, without worrying about which gates actually commute, and what kind of corrections are needed in case they do not commute).
The architectures considered in this paper are LNN, 2D square lattice, and bounded degree graph with the degree of each vertex no more than k. We next prove a number of lower bounds; refer to Table 1.

Every three stages of the SC have a single fixed qubit that interacts with three other qubits. This is either $q_1$, $q_2$, or $q_n$. Thus, every three logic levels have to be separated by a round of SWAPs, each having depth at least 1, i.e. each sequence LLL must be replaced by LSLL or LLSL to be able to run the circuit in the LNN architecture. We call this the 3L → 1S requirement. With the 3L → 1S requirement, the total depth must be at least $2n - 3 + \frac{1}{2}(2n - 3) = 3n + O(1)$ logic levels. Therefore, using just the 3L → 1S requirement proves that our circuit is at most a factor of $\frac{4}{3}$ off the optimum. We now improve this bound to $\frac{10n}{3} + O(1)$ by showing that every 4 computational stages must be separated by a swapping stage of depth at least 2 (the 4L → 2S requirement). 4L → 2S is slightly more restrictive than 3L → 1S: the difference between the two is that LLSLL is allowed under 3L → 1S but not under 4L → 2S. We next prove, by exploring the properties of SC and the LNN architecture, that a depth-1 swapping level does not suffice to separate two computational stages from the next two.
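The arithmetic behind the two bounds can be spelled out as follows. This is a sketch that assumes $2n - 3$ logic levels and a depth of $4n - 6$ for the circuit of Figure 1(c), as stated in this section, and that ignores boundary effects. The swap-to-logic ratio of $\frac{2}{3}$ under 4L → 2S (for instance, the pattern LLSLS repeated) is our reading of the requirement, chosen because it reproduces the stated $\frac{10n}{3} + O(1)$.

```latex
\begin{align*}
\text{3L} \to \text{1S}: \quad & (2n-3) + \tfrac{1}{2}(2n-3) = 3n + O(1),
  & \frac{4n-6}{3n + O(1)} &\to \frac{4}{3}, \\
\text{4L} \to \text{2S}: \quad & (2n-3) + \tfrac{2}{3}(2n-3) = \tfrac{10n}{3} + O(1),
  & \frac{4n-6}{\tfrac{10n}{3} + O(1)} &\to \frac{6}{5}.
\end{align*}
```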
[Table 1: lower bounds for the LNN, 2D square lattice, and bounded degree graph architectures.]
Assume all 4 computational stages $L_i$, $L_{i+1}$, $L_{i+2}$, and $L_{i+3}$ are solely in the first half of SC. The second half is symmetric to the first half, and thus a similar proof holds for it. We do not prove the boundary case (where one part of the 4-stage computation is in the first half of the SC and the other part is in the second half) because its contribution to the final figure is only a constant. Next, assume $i$ is odd. The proof for even values of $i$ is analogous. Name the qubits $q_1, q_2, ..., q_n$ top to bottom. The computational stages $L_i$ and $L_{i+1}$ use interactions $q_{i+2} - q_1$, [...], which in the LNN architecture can only be aligned as follows: [...] The best depth-1 swapping reduces the architectural distance between these qubits by 2, which is not enough for the desired interaction to be allowed. Thus, the depth of swapping must be at least 2. This concludes the proof of the 4L → 2S requirement.
This completes the proof of the $\frac{10n}{3} + O(1)$ lower bound. It implies that the circuit we constructed explicitly (Figure 1(c)) is within a factor of $\frac{6}{5}$ of the optimum.

[...] (this means that all computational stages are in the first part of SC; the proof for the symmetric second part is similar) must contain at least one swapping stage if run in the 2D square lattice architecture. We prove this by finding three interactions that form a loop. The vertices of such a loop cannot be isomorphically mapped to the vertices of a 2D square lattice. The interactions that form such a loop, assuming qubits are named $q_1, q_2, ..., q_n$ top to bottom, are $q_{i-1}$ [...]

The lower bound that we just proved may be interesting to physicists working on implementing 2D architectures for quantum information processing. It shows that, with certain restrictions, running QFT in 2D square lattices cannot in principle be more than a factor of $\frac{4}{3}$ (133.(3)%) faster than the QFT circuit executed in LNN.

[...] and they all require different qubit-to-qubit interactions to be available. Next, note that in the LNN architecture the application of a single SWAP may make at most two new interactions available for a gate to be applied. Thus, the total number of SWAPs that one must execute in a circuit to go through all $\frac{n(n-1)}{2}$ possible interactions is at least [...]. This means that the total number of gates to be executed in the LNN architecture to compute SC must be at least [...]. At most $\frac{n}{2}$ gates can be executed in parallel. Thus, the depth of the circuit is at least the least total number of gates to be executed divided by the maximum number of gates that can be executed simultaneously, i.e. [...]. This lower bound is constructed based on the assumption that all gates in SC need to be executed, and does not take into account that the order in which they are executed is important. Thus, the restriction on the form of the computation is significantly weaker than that for model A, and the proven lower bound is looser.
Using similar techniques it can be shown that in an architecture where each qubit has a finite number of neighbors bounded by a number k:
• the lower bound for executing SC is (2 + [...]
• the lower bound for executing SC is (1 + [...]
Conclusion
In this paper we studied the complexity of execution of quantum Fourier transformation and linear reversible functions (and thus stabilizer circuits) in restricted architectures.
We rediscovered the depth 4n + O(1) circuit (composed of SWAPs and controlled-Z gates) for QFT initially reported in [6], which can be executed in the LNN architecture. We proved a number of lower bounds for the depth of the QFT circuit, all within a constant factor (ranging from $\frac{1}{4}$ to $\frac{5}{6}$, depending on the computational model and assumptions made) of the above upper bound. Some of our lower bounds could be used by experimentalists working on implementing advanced architectures as a guide to how complex architectures may need to be for particular types of computations. For instance, we proved that, with certain restrictions, executing QFT in 2D square lattices cannot in principle be more than a factor of $\frac{4}{3}$ (133.(3)%) faster than the QFT circuit executed in LNN.
More importantly, we presented a constructive algorithm for synthesizing linear depth stabilizer circuits in the LNN architecture. In particular, we showed that any stabilizer circuit can be executed in at most 30n + O(1) stages, each composed of generic two-qubit gates, which in the library with CNOT and single-qubit gates translates to a circuit of depth at most 90n + O(1). This upper bound is asymptotically optimal (due to, for instance, locality reasons).
