During physical design, the second process of the quantum circuit design flow, applying optimization techniques after layout generation can help improve the metrics or meet the design constraints.
Introduction
As the transistor size continues to shrink to atomic scales, Moore's law confronts the small-scale limitation that prevents feature size from being made smaller than atoms [1]. On the other hand, as the quantum regime is approached, quantum effects become increasingly significant. Although these quantum effects are great barriers to progress in classical CMOS, they can be used to develop a radically different form of computation [2]. Theoretically, quantum computers, which exploit quantum effects, could outperform their classical counterparts when solving certain problems. Factorization [3], unsorted database search [4], and the simulation of quantum mechanical systems [5] are some problems thought to be intractable on a classical machine that can benefit from quantum algorithms. For example, in quantum cryptography, the no-cloning property of quantum states [6] and the phenomenon of entanglement [7] have been utilized to help in the exchange of secret keys between various parties, thus ensuring high security of cryptosystems using public key [8]. MagiQ Technologies [9] and IdQuantique [10] have built such cryptographic systems based on single-photon communication.
A quantum algorithm requires a quantum circuit for a successful implementation. Broadly, the quantum circuit design flow can be divided into two main processes: synthesis and physical design (Figure 1). Optimization techniques can improve the results of both processes. In recent works, the physical synthesis concept [11] [12] [13] was introduced to improve the objectives or meet the design constraints by local manipulation of the netlist using layout information. Following the same optimization concept, this paper proposes a new technique, called gate location changing, for the optimization of quantum circuits in the physical design stage. The proposed technique takes an initial netlist and a layout, and tries to change the locations of the gates that are on the critical path, taking the scheduling information into account. The purpose of the gate location changes is to obtain a circuit with a lower latency.
Ion trap technology [14] is used as the underlying technology. Ion trap technology has been physically realized using universal elements for quantum computation with a clear scalable model [15] .
The paper continues as follows: an overview of the prior work is presented in Section 2, followed by an introduction to the ion trap technology in Section 3. Section 4 includes the details of the proposed optimization procedure. Section 5 shows the experimental results, and Section 6 concludes the paper.
Related Work
Despite significant work done on optimization in the synthesis stage [16] - [21] , only a few studies have been done on optimization in the physical design stage.
Svore et al. [37] [38] proposed a design flow that starts with a quantum program and generates its corresponding physical operations. Their work outlined various file formats and provided initial implementations of some of the necessary tools. Their design flow, which has four phases, converts a high-level program specified in the mathematical abstractions of quantum mechanics and linear algebra into a low-level set of machine instructions scheduled on a fixed H-tree-based layout [38] .
Similarly, Balensiefer et al. [39] [40] proposed a design flow which takes a quantum description in QCL 1 [41] and generates a technology-dependent netlist. In the physical design phase, the generated netlist is scheduled on a fixed layout by a list-scheduling algorithm [42]. No optimization is done on the laid-out circuit.
Whitney et al. [16] suggested a quantum design flow that takes a description and generates its layout in ion trap technology. They proposed new heuristics for layout generation and scheduling. Their physical design stage includes laying out and scheduling a fixed netlist. The technique proposed in their paper merges some gate locations during layout generation to improve latency. Overall, this approach can be considered an optimization technique in the physical design stage.
Additionally, hand-optimized layouts have been proposed in the literature [43]. Metodi et al. proposed a uniform Quantum Logic Array architecture [44] and later extended it in [45]. Since the focus of their work was on the architectural perspective, the details of physical layout or scheduling were not explored. The same group later developed a tool to automatically schedule physical operations, given a quantum circuit and a fixed grid-based layout structure [46]. The Aqua group [22], in conjunction with the Yamamoto group [23], is working on new principles [36]. Their approaches are based on an optically connected network of few-qubit quantum registers. They use nuclear spins for realizing such registers.
1 QCL (Quantum Computation Language), defined by B. Omer [41], utilizes a syntax derived from C and provides a quantum simulator for code development and testing on a classical computing platform.
In our previous papers [11] [12], we introduced the physical synthesis concept in quantum circuits. Physical synthesis includes techniques that change the netlist using layout information to reach better circuits in terms of latency and/or area. In [13], we proposed a physical synthesis technique called Auxiliary Qubit Selection. While our previous papers introduced the physical synthesis concept and proposed some techniques for applying it, this paper proposes a layout-level optimization technique that uses scheduling and layout information to change the locations of gates on the layout and decrease the latency.
Technology Abstraction
In ion trap technology, a physical qubit is an ion, and a gate is a location where a trapped ion may be operated upon by a modulated laser. Pulse sequences applied to discrete electrodes on the edges of the ion traps cause the ions to be trapped or ballistically moved between traps. Figure 2a shows a layout that was experimentally demonstrated for a three-way intersection [47] .
In this paper, the library of macroblocks defined in [16] is used for two reasons. First, by using the macroblocks, some of the low-level details can be abstracted away, and the analyses do not have to consider variations in the implementation of ion trap technology. Details such as ion species, electrode sizing and geometry, and the exact voltage levels necessary for trapping and moving ions are all summarized within the macroblocks. Second, ballistic movement along a channel requires a carefully timed application of pulse sequences to electrodes in non-adjacent traps. Using basic blocks consisting of a few ion traps has the benefit that building an interface between the basic blocks requires communication only between the two blocks involved. Figure 3 shows the library defined in [16]. In this library, each macroblock consists of a 3x3 structure of trap regions and electrodes with some ports to allow qubit movements between the macroblocks. The black squares are gate locations. Gates may not be performed at intersections or turns in ion trap technology. Different orientations of each of these macroblocks can be used in a layout. Figure 2 shows a possible mapping of a demonstrated layout (Figure 2a) to a macroblock abstraction (Figure 2b). As Figure 2c shows, the laser pulses are guided to the gate locations by an array of MEMS mirrors located above the ion trap plane in order to apply quantum gates [48].
Figure 2. c) MEMS mirrors placed above the ion trap plane guide the laser beams to gate locations [16].
Figure 3. Library of basic macroblocks used in this paper. Ports (P0-P3) and electrodes of each macroblock make it possible for ions to be moved and trapped. Some macroblocks contain a trap region where gates may be performed (black squares) [15].
Some key characteristics of ion trap technology can be summarized as follows:
• Rectangular channels lined with electrodes make "wires" in ion traps. Atomic ions (qubits) can be suspended above the channel regions and moved ballistically by application of voltages on the channel electrodes [49] . Therefore, a movement control circuitry is required for each wire to handle any qubit communication.
• Any operation available in the ion trap technology can be performed at each gate location. This makes it possible to reuse gate locations for different operations within a quantum circuit.
• Fabrication and control of ion traps in the third dimension is difficult. Thus, scalable ion trap systems are two-dimensional [47]. Therefore, routing channels should have T-junction(s) or cross-junction(s) to allow ions to move from one channel to another.
• Multiple ions may use any routing channel as long as control circuits prevent one channel from having more than one ion at each instant of time.
• Aside from Manhattan distance between the source and target locations for an ion movement, the geometry of the wire channel is also important in the calculation of movement latency. Experiments have shown that a right angle turn takes substantially longer than a straight channel over the same distance [49] .
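The last point can be illustrated with a minimal sketch. The per-macroblock latencies below are placeholder values chosen only for illustration (they are not the entries of Table 1), and the function name `movement_latency` is hypothetical.

```python
# Placeholder per-macroblock latencies in microseconds.
# These are assumed values for illustration, not the entries of Table 1.
STRAIGHT_US = 1.0
TURN_US = 10.0


def movement_latency(straights: int, turns: int,
                     straight_us: float = STRAIGHT_US,
                     turn_us: float = TURN_US) -> float:
    """Latency of one ion movement that crosses the given macroblocks."""
    if straights < 0 or turns < 0:
        raise ValueError("macroblock counts must be non-negative")
    return straights * straight_us + turns * turn_us


# Two routes that cross the same number of macroblocks (six) but differ in shape.
one_turn = movement_latency(straights=5, turns=1)
three_turns = movement_latency(straights=3, turns=3)
```

Under these assumed costs, the route with more turns is slower even though both routes have the same length, which is why route geometry matters in addition to distance.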
Gate Location Changing Technique
After generating the layout, the scheduling information is available and can be used to find the critical path. Changing gate locations for the gates located on the critical path may decrease the latency of the critical path. Following this idea, a new optimization technique, called gate location changing, is proposed that uses the scheduling information to find the critical path and change the gate locations for the gates located on the critical path in a manner that the latency can be decreased.
To illustrate the proposed technique, Figure 4a shows a QASM [50] instruction sequence operating on qubits Q0,…,Q4. Figure 4b shows the equivalent quantum circuit. Figure 4c shows the layout generated for the netlist by the dataflow-based algorithm proposed in [16], which appears to give the best reported results in terms of latency. In Figure 4c, each gate location is labeled with the number of the gate that is performed in it. The dataflow graph of the circuit is shown in Figure 5a. The label of each edge shows the delay between two nodes, whereas the label of each node represents its delay to the end of the tree and is used as the node's priority. Physical latencies shown in Table 1 [50] [51] are used to calculate operation latencies. The resulting latency of the circuit shown in Figure 4c is 245 µs.
The critical path is highlighted in Figure 5 by solid bold arrows. If the gate locations of G6 and G8, as well as those of G7 and G9, are exchanged, the layout and the labels on the dataflow graph are modified as shown in Figures 4d and 5b, respectively. In this example, the latency of the modified circuit is 219 µs. Consequently, proper gate location changing reduces the latency of this circuit from 245 µs to 219 µs, an improvement of about 11%.
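The role of node priorities in finding the critical path can be sketched as follows. This is a minimal illustration on a small hypothetical dataflow graph (not the circuit of Figure 4); the function names and edge delays are assumptions. Each node's priority is computed as its longest delay to the end of the graph, and the critical path is traced by always following the successor that realizes that delay.

```python
def node_priorities(succ):
    """Priority of a node = longest delay from that node to the end of the graph.

    succ maps every node to a dict {successor: edge delay in microseconds}.
    """
    prio = {}

    def visit(node):
        if node not in prio:
            prio[node] = max(
                (delay + visit(nxt) for nxt, delay in succ[node].items()),
                default=0,
            )
        return prio[node]

    for node in succ:
        visit(node)
    return prio


def critical_path(succ):
    """Return (path, latency) for the longest-delay path through the graph."""
    prio = node_priorities(succ)
    has_pred = {nxt for edges in succ.values() for nxt in edges}
    sources = [node for node in succ if node not in has_pred]
    node = max(sources, key=prio.get)
    path = [node]
    while succ[node]:
        # Follow the successor that realizes this node's priority.
        node = max(succ[node], key=lambda nxt: succ[node][nxt] + prio[nxt])
        path.append(node)
    return path, prio[path[0]]


# Hypothetical graph with made-up delays.
toy_succ = {
    "G1": {"G3": 12},
    "G2": {"G3": 20, "G4": 5},
    "G3": {"G5": 30},
    "G4": {"G5": 11},
    "G5": {},
}
toy_prio = node_priorities(toy_succ)
toy_path, toy_latency = critical_path(toy_succ)
```

In this sketch the circuit latency equals the priority of the first node on the critical path, so shortening any edge on that path is the only way to reduce the overall latency.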
The Optimization Procedure
The optimization procedure is shown in Figure 6. It takes a netlist and a scheduled layout and applies the gate location changing technique to the scheduled layout. The procedure starts with the analysis of the scheduled dataflow graph (Figure 5) to find the critical path (highlighted by solid bold arrows in Figure 5). Then, in Step II, cut sets 2 are extracted from the critical path ({G2}, {G7}, {G8}, {G9}, {G10}, {G11}, {G12}, {G13}, and {G14} in Figure 5a). The procedure continues by examining each cut set to find alternative locations for the gates in the set that reduce the latency (Step III). If such gate locations are found (e.g., for {G7} and {G8}), the new gate locations are accepted (e.g., the gate locations of G7 and G8 in Figure 4 are exchanged with those of G9 and G6, respectively). Cut sets are examined one by one from the beginning to the end of the critical path (from {G2} to {G14}). Only gate locations in the same column or in the neighboring columns are checked for exchanging. Changing the locations of some gates changes the routes that start or end at them. Therefore, these routes should be modified. For example, in Figure 4, after exchanging gate locations G6 and G8, the routes (G1,G6), (G3,G6), (G6,G9), (G7,G8), (G5,G8), (G8,G9), and (G8,G10) should be modified. A modified version of Maze routing [53] is used to find routes in Step V.
2 A cut set is a set of gates in a graph that, when removed, breaks the graph into unconnected pieces.
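The greedy loop over cut sets can be sketched as below. This is a simplified illustration under several assumptions, not the implementation evaluated in the paper: gate locations are modeled as (column, row) pairs, only exchanges between two placed gates are tried, and the hypothetical callback `latency_of` stands in for re-routing and re-scheduling a trial placement.

```python
def greedy_gate_location_changing(placement, cut_sets, latency_of):
    """Greedily exchange gate locations to reduce latency.

    placement  : dict gate -> (column, row)
    cut_sets   : cut sets in critical-path order, each a list of gates
    latency_of : hypothetical callback returning the latency of a placement
    """
    placement = dict(placement)
    best = latency_of(placement)
    for cut_set in cut_sets:
        for gate in cut_set:
            column = placement[gate][0]
            # Only the same column and the neighboring columns are examined.
            candidates = [other for other, (col, _) in placement.items()
                          if other != gate and abs(col - column) <= 1]
            best_trial, best_cost = None, best
            for other in candidates:
                trial = dict(placement)
                trial[gate], trial[other] = trial[other], trial[gate]
                cost = latency_of(trial)
                if cost < best_cost:  # accept only latency reductions
                    best_trial, best_cost = trial, cost
            if best_trial is not None:
                placement, best = best_trial, best_cost
    return placement, best


# Toy latency model (an assumption): total Manhattan distance along a chain.
CHAIN = ["A", "B", "C"]


def chain_latency(placement):
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1])
               for a, b in zip(CHAIN, CHAIN[1:]))


initial = {"A": (0, 0), "B": (1, 3), "C": (2, 0), "D": (1, 0)}
final, final_latency = greedy_gate_location_changing(
    initial, [["A"], ["B"], ["C"]], chain_latency)
```

Because each cut set is visited once and only improving exchanges are kept, the loop cannot increase latency, but it may stop short of the best placement reachable by a different sequence of exchanges.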
Update Scheduling (VI)
After updating the routes, the scheduling should also be revised to reflect the changes in gate locations. However, performing a complete scheduling in each iteration can dramatically increase the total run time of the optimization program for large netlists. Since the proposed approach usually modifies only a small part of the scheduled dataflow graph, a complete scheduling is often unnecessary in each iteration. Therefore, in the proposed procedure, the scheduling is incrementally updated in each iteration of the optimization loop. This decreases the run time of each iteration and leads to an overall run time reduction.
The scheduler selects operations based on their dependencies and priorities 3 . The gate location changing procedure changes the priorities of operations. Therefore, the update-scheduling operation must modify the priorities of the modified nodes and propagate the effects of these changes to the nodes located higher than the modified nodes in the dataflow graph. For example, after exchanging gate locations (G7↔G9) and (G8↔G6), the priorities of G1 to G9 should be updated. Since the delay to the end of the tree does not change for the nodes located below the modified nodes, their priorities do not need to be modified. The propagation continues up to the root of the dataflow graph (i.e., a dummy node with the first-level gates as its children). The optimization loop continues until there is no unprocessed cut set.
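The upward propagation of priorities can be sketched as follows. This is a minimal illustration on a hypothetical graph with made-up delays; the function names are assumptions. As a small refinement that the text does not state, the sketch stops propagating along a branch as soon as a recomputed priority turns out to be unchanged.

```python
from collections import deque


def build_pred(succ):
    """Invert a successor map {node: {successor: delay}} into predecessor lists."""
    pred = {node: [] for node in succ}
    for node, edges in succ.items():
        for nxt in edges:
            pred[nxt].append(node)
    return pred


def update_priorities(succ, prio, modified):
    """Recompute priorities of modified nodes and propagate changes upward.

    succ     : {node: {successor: edge delay}} after the location change
    prio     : priorities before the change (updated in place)
    modified : nodes whose outgoing edge delays changed
    Returns the set of nodes whose priority was recomputed.
    """
    pred = build_pred(succ)
    queue = deque(modified)
    recomputed = set()
    while queue:
        node = queue.popleft()
        new_prio = max((delay + prio[nxt] for nxt, delay in succ[node].items()),
                       default=0)
        recomputed.add(node)
        if new_prio != prio[node]:
            prio[node] = new_prio
            queue.extend(pred[node])  # only nodes above can be affected
    return recomputed


# Hypothetical graph: the delay of edge G3 -> G5 dropped from 30 to 10.
succ_after = {
    "G1": {"G3": 12},
    "G2": {"G3": 20, "G4": 5},
    "G3": {"G5": 10},
    "G4": {"G5": 11},
    "G5": {},
}
priorities = {"G1": 42, "G2": 50, "G3": 30, "G4": 11, "G5": 0}
touched = update_priorities(succ_after, priorities, ["G3"])
```

Nodes below the modified node (G5) and nodes on unrelated branches (G4) are never visited, which is the source of the run time saving over a complete rescheduling.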
Experimental Results
We experimented with a number of quantum circuit benchmarks from [54] (1 to 67) and [55] (68 to 78). Table 2 lists the benchmarks with their quantum gate counts and the number of qubits processed in each circuit. Error probabilities and physical latencies shown in Table 1 are used for the gates and for the two types of move operations in ion trap technology [50] [51]. The correctness of our approach was verified by Quiver, a tool to aid in visualization and verification of quantum circuits [56]. Table 3 shows the experimental results: the latency of the benchmark circuits achieved by the proposed technique compared with the best one in the literature [16]. The latencies of the circuits before and after applying the proposed technique are shown in the columns "Before Applying GLC" and "After Applying GLC", respectively. The results reported in the column "Before Applying GLC" are obtained by the best prior physical design flow in terms of latency. The column "Improvement" shows the latency improvement resulting from the proposed technique. As can be seen, an average latency improvement of 11.06% is achieved over the benchmarks.
3 Defined as 'the delay of a node to the end of the tree'.
Heuristic Algorithm Analysis
As stated before, we follow a greedy approach to accept or reject each gate location change. In other words, changes in gate locations that increase the latency are rejected. To examine the impact of applying other heuristics on the result, we used the simulated annealing (SA) heuristic [57], which attempts to avoid local minima. Table 4 shows the results of using this heuristic. The column "Our Approach Based on SA" under "Latency" shows the latency obtained when we substitute SA for our greedy approach. The columns "Our Approach Based on Greedy" and "Our Approach Based on SA" under "Runtime" show the runtime of the optimization technique using the greedy approach and SA, respectively. The column "SA/Greedy Ratio" under "Latency" contains the ratio of the latency obtained by SA to that achieved by our greedy approach. The last column includes the ratio of the runtime of the flow based on the SA approach to that based on our greedy approach. It can be observed from the table that SA produces better results than the greedy approach in most cases. However, on average, the runtime of SA is almost 16 times longer. This observation suggests that while various heuristics may provide slightly different results, it is the execution time that varies the most among them. In other words, the execution time appears to be the determining factor in choosing among the heuristic approaches. Based on this, we chose the greedy approach for the remainder of this paper.
4 All results of this section are obtained on a 3 GHz Pentium IV with 1 gigabyte of memory.
5 As calculated by the "Rational Quantify" suite.
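The difference between the two heuristics lies in the acceptance rule for a candidate change. The sketch below contrasts them; the SA rule uses the standard Metropolis criterion, and the temperature handling is an assumption, since the cooling schedule used in the experiments is not described here. Function names are hypothetical.

```python
import math
import random


def accept_greedy(old_latency: float, new_latency: float) -> bool:
    """Greedy rule: keep a change only if it reduces the latency."""
    return new_latency < old_latency


def accept_sa(old_latency: float, new_latency: float,
              temperature: float, rng=random) -> bool:
    """Metropolis rule: always keep improvements; keep a worse result
    with probability exp(-(increase) / temperature)."""
    if new_latency < old_latency:
        return True
    if temperature <= 0:
        return False
    increase = new_latency - old_latency
    return rng.random() < math.exp(-increase / temperature)


# A change that raises latency from 219 to 245 microseconds.
greedy_keeps_worse = accept_greedy(219, 245)
sa_keeps_worse_when_cold = accept_sa(219, 245, temperature=0.0)
sa_keeps_worse_when_hot = accept_sa(219, 245, temperature=1e9,
                                    rng=random.Random(0))
```

Occasionally accepting a worse placement lets SA escape local minima, at the price of evaluating many more candidate changes, which is consistent with the longer runtimes reported in Table 4.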
Error Analysis
To evaluate the proposed technique in terms of reliability, we use the critical error path calculation proposed in [58]. The critical error path is the sequence of qubit interactions that introduces the highest error into the circuit, in a way similar to the critical latency path through a circuit. Figure 7 illustrates the process of estimating the critical error path. It uses a simple but effective model of a complicated error propagation process to estimate a parameter referred to as Error Distance [58]. We use the method proposed in [58], but we consider other physical operations in addition to gates when calculating the critical error path. Error probabilities shown in Table 1 … and T respectively show the number of straight and turn macroblocks that should be traversed by a qubit to reach the next gate location. Table 5 shows the maximum error distance for the benchmarks before and after applying the proposed technique. As can be seen, the gate location changing technique improved the maximum error distance of quantum circuits by up to 18.5% for the attempted benchmarks.
Time Complexity
The time complexity of the proposed technique can be calculated based on the runtimes of the steps shown in Figure 6 . The runtime of each step is calculated as follows.
Step I can be done in O(g) where g is the number of gates.
Step II examines the critical path graph to find cut sets. Therefore, the time complexity of this step is again O(g). After finding the cut sets, the procedure enters an optimization loop. The number of iterations of the loop is equal to the number of cut sets, which is bounded above by the number of gates. The first step of the loop, Step III, checks gate locations in the same column or in the next column to find better gate locations for the gates that are in the current cut set. The number of locations that should be checked is bounded. Therefore, the time complexity of Step III is O(1). If such gate locations are found, the layout is updated in Step IV, "Modify the Locations". This step includes only a few operations. Therefore, the upper bound of the time complexity of this step is O(1).
Step V, which updates the routing, finds the routes between the modified gate locations and their neighbors. Since the routes should be found between gate locations in the same column or in neighboring columns, the time complexity of the route-finding is O(1). As shown in Figure 6, the scheduling should be updated in each iteration. The dominant part of the update-scheduling runtime is the runtime of updating the priorities. The number of steps for updating the priorities is equal to the level of the lowest gate whose location has been modified, because when gate locations are modified, only the priorities of the nodes located higher than the lowest gate in the dataflow graph need to be updated. Therefore, the time complexity of the update-scheduling process is O(g). Based on this analysis, the overall time complexity of applying the proposed technique can be found as
Overall time complexity = O(g^2)
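This bound follows from summing the per-step costs stated in this section. Written out, with T(g) denoting the total runtime and c the number of cut sets (both symbols are introduced here only for illustration):

```latex
\begin{align*}
T(g) &= \underbrace{O(g)}_{\text{Step I}} + \underbrace{O(g)}_{\text{Step II}}
      + \sum_{i=1}^{c} \Bigl( \underbrace{O(1)}_{\text{Step III}}
      + \underbrace{O(1)}_{\text{Step IV}} + \underbrace{O(1)}_{\text{Step V}}
      + \underbrace{O(g)}_{\text{Step VI}} \Bigr) \\
     &= O(g) + c \cdot O(g), \qquad c \le g \\
     &= O(g^{2}).
\end{align*}
```

The quadratic term comes entirely from the update-scheduling step, which is executed once per cut set.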

Conclusion
In this paper, we proposed a layout-level optimization technique called gate location changing that modifies the layout by considering scheduling and layout information to improve the latency of quantum circuit execution. In the proposed technique, layout and scheduling information is used to find better gate locations for gates located on the critical path, thereby decreasing the overall latency. Experimental results show that the proposed technique improves the latency of quantum circuits by up to 26.1% for the attempted benchmarks.
We followed a greedy approach in applying our optimization technique. To analyze the effectiveness of our greedy approach, we used the simulated annealing heuristic in accepting or rejecting a change. The results show that although SA leads to slightly better results in most cases, its runtime is, on average, much longer.
We evaluated the proposed technique in terms of reliability using a metric called maximum error distance. As the results show, our layout optimization approach reduces the maximum error distance.
