Proposing an architecture that efficiently compensates for the inefficiencies of physical hardware with extra resources is one of the key issues in quantum computer design. Although demonstrations of quantum systems have so far been limited to a few dozen qubits, scaling the current small lab systems to large-scale quantum systems capable of solving meaningful practical problems is the main goal of much research. Focusing on this issue, this article proposes a scalable architecture for quantum information processors, called SAQIP. Moreover, a flow is presented to map and schedule a quantum circuit on this architecture. Experimental results show that the proposed architecture and design flow decrease the average latency and the average area of quantum circuits by about 81% and 11%, respectively, for the attempted benchmarks.
INTRODUCTION
Quantum computing is a rapidly evolving research area. Its huge computational power [1] [2] [3] [4] compared to classical computing is the main motivation of researchers. The architecture of a quantum computer plays a key role in its performance. The main task of a quantum architecture is to determine the organization of processing units and how they interconnect and communicate to manipulate data [5, 6]. Much of the quantum research until now has concentrated on two extremes: quantum algorithms and complexity theory at the top, and quantum physics at the bottom. Quantum information processing has advanced to the point where architecture-level solutions can help close this gap [7]. In spite of their importance, such solutions have not received much attention, leaving much room for high-impact research. Although several candidate technologies have been presented for the realization of a quantum computer to date [8, 9], developing a scalable computer architecture for each one is still a challenging problem. The fundamental criteria [10] that a quantum-computing technology must satisfy, and the challenge of designing quantum systems that are able to perform quantum error correction, have motivated some research groups to describe extra categories to address the architectural requirements of a large-scale quantum machine [7, 11].
Ion trap technology is a very promising candidate for building a quantum computer [1] [2] [3]. It not only satisfies DiVincenzo's requirements but also has several important properties that have been demonstrated experimentally. It provides robust, high-fidelity state preparation [4] and readout [5], high-fidelity universal gate operations [6], and long qubit coherence times [4, 7, 8, 9]. Kielpinski et al. [10], in a high-impact paper, presented an architecture for an ion-trap quantum processor, called QCCD, 1 which uses separate memory and interaction regions. The scalability of this architecture faces a wide range of technological challenges [11]. For example, it has been noticed that the individual addressing of ions becomes prohibitively difficult due to the geometry constraints of planar traps containing in excess of 100 electrodes. Hence, processing a very large number of ions in these chips seems practically difficult. This problem can be managed if reliable means of connecting these restricted-size chips can be found. Monroe et al. [12, 13] presented a modular distributed quantum-computer architecture, called MUSIQC, in which Elementary Logic Units (ELUs), each composed of a string of 50-100 trapped ions, are connected through an optical switch. The hybrid structure of MUSIQC benefits from the strengths and counters the weaknesses of both ionic and photonic technologies. This new architecture also has some challenges. For example, Monz et al. [14] showed that the coherence of an N-qubit state decays by a factor of $N^2$ faster than the coherence of a single qubit. However, the authors of Reference [12] showed that with loading of up to ∼10 qubits an acceptable fidelity is reachable. Decreasing the number of qubits in each ELU, however, means that many ELUs are needed, which is challenging because of the complexity of the resulting optical switch. Ahsan et al. [15, 16] used the MUSIQC hardware but modified each ELU to have the QLA architecture [17] instead of being a string of ions. This architecture also faces some technological difficulties. First, since each ELU in this architecture is an array of gate locations including at most two qubits, the resulting layout has a large number of electrodes and therefore a large area. Second, laser beams should be distributed to a large number of gate locations over the entire layout. Hence, building such a large-area chip seems practically hard.
Focusing on these issues, our architecture, SAQIP, mitigates these scalability concerns by using technology capabilities and creating customized ELUs. Our architecture consists of a large number of full-custom blocks connected by a reconfigurable optical switch network. An overview of the SAQIP architecture is shown in Figure 1. The reconfigurable optical switch network can also be organized in a tree-like structure such that the height of the tree scales only logarithmically with the number of blocks. The blocks are designed in the full-custom style based on the multiplexed ion trap structure [6]. In other words, in our architecture, the structure of each processing unit is fully customized for a partition. Each block consists of some interaction zones, and each zone confines a few ions. Quantum data is transferred between zones of a block by ballistically shuttling ions and between blocks by photons via the optical switch network.
Mapping and scheduling a given quantum circuit on this architecture is a significant task, but existing quantum computer-aided design flows are not applicable to it because of its different structure. Therefore, we propose a design flow to map and schedule a quantum circuit on this architecture. Figure 2 shows our computer-aided design (CAD) flow, which has four main parts: partitioning, placement, channel routing, and scheduling. The partitioning step divides qubits among blocks such that the number of connections between blocks is minimized. The placement and routing steps generate a physical layout for each block. Finally, the scheduling step determines the execution order of gates on the generated layout. We automate the tasks involved in the flow. Although at first glance some parts of this flow resemble a classical CAD flow, the different nature of the quantum domain necessitates new algorithms for these tasks. Full details of the flow, including the algorithms, are given in Section 3.
The rest of this article is organized as follows: An overview of the prior work is presented in Section 2, followed by the details of the proposed architecture in Section 3. Section 3 also includes the proposed design flow to map and schedule a quantum circuit on this architecture. To illustrate the proposed architecture and design flow, an example is presented in Section 4. Section 5 shows the experimental results, and Section 6 concludes the article.
RELATED WORK
The quantum architecture determines the structure of processing and memory units and how they interconnect and communicate to manipulate data [18] .
The nearest-neighbor architecture that has attractive properties from a hardware design point of view has been investigated in some research works [19] [20] [21] [22] [23] [24] . The recent achievements in application of the topological quantum error-correcting codes and their good properties such as needing only nearest-neighbor operations have inspired some researchers to focus on architectures that are capable of generating multidimensional cluster states [25] [26] [27] [28] [29] [30] [31] [32] .
QuMA implements the quantum microinstruction set QuMIS [33] . QuMIS does not offer feedback control, and is tightly bound to the hardware implementation. Moreover, as the number of qubits grows, QuMA cannot fetch and execute instructions fast enough to apply all operations on qubits on time. Therefore, Fu et al. [34] proposed eQASM 2 that mitigates the quantum operation issue rate problem by efficient timing specification, single-operation-multiple-qubit execution, and a very-long-instruction-word architecture. Some methods [35, 36] have been proposed to map a quantum circuit onto the IBM QX Architectures [37] . All of these studies are based on superconducting technology. Linke et al. [2] compared the ion-trap system proposed by Monroe et al. [13] and the IBM Quantum Experience superconducting system experimentally. The results showed that the ion-trap system outperforms the superconducting system on all results [2] .
Since our architecture is based on the ion-trap technology, in the remainder of this section, we review the studies whose underlying technology is the trapped ion one. We partition the studies into two parts. In all works mentioned in the first part, qubits are transported ballistically within regions and teleported between regions. In the second part, we describe most recent and related works based on photonic interconnects. At the end, we explain the main differences between the most recent works and ours.
The movement of data in a quantum computer has been studied in some research works. Kielpinski et al. [10] presented an architecture for a scalable ion-trap quantum processor, called QCCD 1 , which uses separate memory and interaction regions. Some homogeneous systems have been developed around this basic idea. Mariantoni et al. [38] proposed the quantum von Neumann architecture including dedicated hardware for memories, zeroing registers, and a quantum bus. Metodi et al. [17] used local gates and quantum teleportation to transport quantum data across their ion-trap QLA architecture. Automated layout generation and optimization of quantum circuits for ion-trap architectures were explored in References [39, 40] . Isailovic et al. [41, 42] studied interconnection networks and communication channel bandwidth for such architectures with emphasis on the ancilla preparation for teleportation-based communication [43] . Svore et al.'s tool [44, 45] maps a technology-independent QASM netlist 3 to a technology-dependent fault-tolerant netlist during its design flow. The authors automatically generate an H-Tree-based layout built from a single tile and schedule a circuit on it using the ASAP 4 method.
Four main microarchitectures have been proposed for quantum computing in ion trap technology. They are QLA [17, 46, 47], CQLA 5 [48], Qalypso [40], and Requp 6 [49], and can be thought of as a range from inflexible to flexible architectures. The arrangement of compute regions, ancilla generation regions, memory regions for idle qubits, and teleportation network resources differs among these microarchitectures. In the QLA microarchitecture, all units are identical, as in an FPGA. Each unit can perform a two-qubit gate. Each such compute region also includes ancilla generation resources, enough room for two encoded qubits, and a teleportation router for communication. CQLA enhanced QLA by adding a new type of data region to it. Therefore, compute regions are identical to those in QLA, and memory regions store eight qubits. Qalypso [41] is an enhanced version of CQLA with more flexibility in the assignment of ancilla generation resources. It utilizes optimized and pipelined ancilla generators to create ancilla for use in processing units. Dousti et al. proposed Requp [49], a multi-core reconfigurable quantum processor architecture. The quantum reconfigurable compute region (QRCR) proposed in this architecture distributes the ancilla qubits in the processing unit based on the issued instructions. This feature prevents resources from being wasted by overestimating the number of ancilla qubits needed.
2 Executable quantum instruction set architecture. 3 A netlist consists of a list of connections in a circuit and a list of the gates they are connected to. 4 As Soon As Possible. 5 Compressed quantum logic array. 6 Reconfigurable quantum processor.
Heckey et al. [50] proposed an ion trap multi-SIMD architecture composed of k SIMD operating regions, each with a width or qubit capacity of d, to execute many quantum applications written in the Scaffold programming language. Its architecture model is very similar to QLA and has the same disadvantages. This work and the subsequent studies based on it [51] [52] [53] [54] have mainly focused on compiling and scheduling a quantum program on this architecture.
Maslov [55] presented an approach to compile quantum algorithms into physical-level operations. Goudarzi et al. [56] formulated the quantum instruction placement problem and used a net-weighting timing-driven placement solution based on a modified version of the force-directed placement tool, SimPL [57] to solve it.
Lekitsch et al. [1] proposed a trapped-ion quantum computer module based on long-wavelength radiation quantum operations. They use microwave-based multi-qubit operations instead of laser-based ones, and laser beams are applied only for state preparation and detection, photoionization, and sympathetic cooling. In this architecture, modules are aligned next to each other in a two-dimensional array, and ions are ballistically transported from one module to the adjacent module. This architecture encounters some technical issues, such as the creation of strong magnetic field gradients and the need for calibration operations and well-controlled voltages, which are required to perform quantum operations. Another challenging part of this scheme is to strictly align all modules to each other to avoid large barriers or interruptions of the overlapping electric fields. Communication between distant ions is another challenging issue in this architecture. Since it features nearest-neighbor ion-ion interactions, interactions can take place only between nearest-neighbor modules. Therefore, if two distant qubits (ions) in two non-adjacent modules need to communicate with each other, then one of them must be ballistically transported along the entire path. Ballistic transport of an ion qubit decreases the fidelity of its state. Therefore, the distance that an ion qubit may be transported ballistically before quantum error correction must be applied is limited. There is general agreement that large quantum systems need different forms of communication for longer distances [58, 59].
Coupling of far qubits and communication between distant sites is a challenging issue in the designing of a quantum processor architecture. Cirac et al. [60] suggested using photonic interconnects for communication between distant atoms. Using the photons in the cavity to act as a quantum channel to mediate interactions between non-local qubits has been explored in other works [61] [62] [63] [64] [65] . These photonic channels have been utilized to propose a modular, scalable distributed quantum architecture [66] .
Distributed quantum computing has recently drawn attention of some groups of researchers. Devitt et al. integrated the topological quantum model with the photonic module to build a photonic topological quantum network over different areas [67] . Van Meter et al. investigated the performance of arithmetic algorithms on a quantum multicomputer system [68] . One of the main foundations of their paper is that large-scale quantum systems need heterogeneous interconnections. Van Meter et al. extended this idea with a design in that electron-spin quantum dots were coupled through nano-photonic waveguides [69] . The different nodes of this quantum multicomputer communicate via a multi-level interconnect.
Although the QCCD architecture [10] proposed by Kielpinski et al. seems to be good for small ion-trap quantum processors with about 100-1000 qubits, scaling it to a million-qubit processor is difficult because of the challenges with interconnects, diffraction of optical beams, and the complexity of qubit control [12] . To scale beyond the QCCD architecture, Monroe et al. proposed a modular distributed quantum-computer architecture, called MUSIQC, in which Elementary Logic Units (ELU), each composed of a string of 50-100 trapped ions, are connected through an optical switch ( Figure 3 (a)) [12, 13] . Ahsan et al. used the MUSIQC architecture but modified each ELU to have the QLA architecture [17] instead of being a string of ions used in the original MUSIQC proposal (Figure 3(b) ) [15, 16] . However, the area and delay of this architecture framework can be improved by customizing each ELU based on a sub-circuit. In other words, Ahsan et al.'s proposal is not optimized in terms of area and delay, because all units have an identical structure like an FPGA. Focusing on this issue, this article modifies Ahsan's proposal to have an architecture with full-custom blocks and photonic interconnects (Figure 1 ). Moreover, this article proposes a flow to map a quantum circuit on this architecture. Since building blocks of this architecture are fully customized, quantum circuits mapped onto this architecture potentially have less delay and less area than those mapped onto Ahsan et al.'s architecture.
THE SAQIP ARCHITECTURE AND THE MAPPING PROCEDURE
In the original MUSIQC architecture proposed by Monroe et al. [13] , the basic logic unit is composed of a string of 50-100 trapped atomic ions. Ahsan et al. [15, 16] modified the original proposal and changed the ELU architecture from the string of ions to the QLA architecture. This architecture makes some assumptions that can be relaxed based on the recent advancements in ion-trap technology [12, [70] [71] [72] [73] [74] [75] to reach a better architecture. The current state of ion-trap technology provides the requirements to present a suitable quantum architecture. In the following four paragraphs, the new technology features used by our architecture are mentioned.
A circuit mapped onto the QLA occupies more area and takes longer than the same circuit mapped onto a full-custom fabric. In other words, if each ELU is customized based on the circuit to be mapped on it, then the latency and area may be dramatically improved.
The second assumption of the QLA is that the maximum number of qubits placed in each gate location is two. This assumption increases communication in each ELU and has a high cost in the time and fidelity of a circuit [76]. Moreover, it results in a large number of gate locations in each ELU. In ion-trap technology, switching the laser from one trap to the next over the entire layout in random order has a significant overhead. Therefore, the architecture must minimize the number of gate locations in the layout where lasers are applied. The disadvantages caused by the second assumption may suggest that increasing the number of qubits in each gate location improves the error and the latency. However, trapping a very large number of qubits in a single gate location is impractical, because lengthening the ion chain progressively increases the difficulty of identifying the vibration modes [77] and therefore decreases the gate fidelities. Moreover, the heating rate, and hence the dephasing rate, increases.
Experiments show that placing up to ∼10 qubits in each gate location gives a good fidelity for gates [12] . Putting more than two qubits in each location has another significant effect. The gates that have the same control qubits and are ready to be scheduled can be performed at the same time. This capability may reduce the circuit latency dramatically. Following this line of reasoning, the number of qubits in each gate location can increase up to 10.
The third assumption of QLA restricts quantum operations to only one-and two-qubit gates. The gates with more than two qubits such as a Toffoli gate have been already demonstrated with high fidelity in the ion-trap technology [70, 73, 74, 78, 79] . Using gates with more than two-qubits may decrease the latency and error probability.
The aforementioned matters motivated us to propose a Scalable Architecture for Quantum Information Processors (SAQIP) that uses the full capabilities of the technology. This architecture is composed of full-custom blocks interconnected by photonic links. 
• Insert routing channels
• Schedule the dataflow graph

Figure 4 shows the big picture of our mapping flow. Its pseudocode is given in Algorithm 1. Its input and output are a fault-tolerant logic circuit netlist and the mapping of the circuit on the architecture, respectively. We use the same quantum error correcting (QEC) code (Steane code [80]) as Ahsan's proposal to make the netlist fault-tolerant. In the rest of this section, we explain the technical details of the procedure, the justification for using the Steane code, and the time complexity of the flow.
Dataflow Graph and Interaction Hypergraph Generation
In the beginning, the dataflow graph of the circuit, $G_D$ (line 1, Algorithm 1), and the interaction hypergraph 7 of the qubits, $G_{CI}$ (line 2, Algorithm 1), are generated. In the dataflow graph, each node represents a gate and each edge represents a qubit dependency. In the interaction hypergraph of qubits, each node denotes a qubit and each edge (hyperedge) denotes the gate that is performed between the qubits it connects. Each edge (hyperedge) in the interaction graph is labeled with the level number of the gate in the netlist. Since qubits interact in different parts of the circuit, the numbers on the edges (hyperedges) between two (three) qubits may be far from one another. Therefore, removing the edges (hyperedges) with outlier values from the graph is expected to make the subsequent clustering step more efficient. In this article, the IBM SPSS analytics toolkit [81] is used to detect and remove edges (hyperedges) with outlier values (line 3, Algorithm 1).
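The construction of the two graphs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_graphs`, the netlist format (a list of `(gate_name, qubits)` tuples in program order), and the use of the as-soon-as-possible level as the "level number" are assumptions made for illustration. Outlier removal, which the article delegates to SPSS, is not shown.

```python
def build_graphs(netlist):
    """Build dataflow edges and labeled interaction hyperedges.

    netlist: list of (gate_name, qubits) tuples in program order (assumed format).
    Returns (dataflow_edges, hyperedges), where dataflow_edges is a list of
    (predecessor_gate_index, gate_index) pairs and hyperedges is a list of
    (frozenset_of_qubits, level) pairs for every multi-qubit gate.
    """
    last_gate = {}        # qubit -> index of the most recent gate acting on it
    level = {}            # gate index -> ASAP level (assumed labeling rule)
    dataflow_edges = []
    hyperedges = []
    for g, (_name, qubits) in enumerate(netlist):
        # A gate depends on the last gate that touched each of its qubits.
        preds = {last_gate[q] for q in qubits if q in last_gate}
        for p in sorted(preds):
            dataflow_edges.append((p, g))
        level[g] = 1 + max((level[p] for p in preds), default=0)
        # Only multi-qubit gates create an interaction between qubits.
        if len(qubits) > 1:
            hyperedges.append((frozenset(qubits), level[g]))
        for q in qubits:
            last_gate[q] = g
    return dataflow_edges, hyperedges
```

Two-qubit gates yield ordinary edges (hyperedges of size two), while a Toffoli gate yields a hyperedge over three qubits.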
The K-way Partitioning
The modified interaction graph G I is partitioned into k subgraphs where k is the number of ELUs (line 4, Algorithm 1). The multilevel k-way hypergraph partitioning algorithm [82] is used to partition the hypergraph into subgraphs where each subgraph is to be mapped onto an ELU. This algorithm generates high-quality parts while enforcing tight balancing constraints. It reduces the size of the hypergraph by collapsing vertices and edges (coarsening phase), finds a k-way partition of the smaller graph, and then it constructs a k-way partition for the original graph by projecting and refining the partition to successively finer graphs (uncoarsening phase). The various steps of the multilevel k-way partitioning algorithm are depicted in Figure 5 .
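The multilevel algorithm itself is beyond a short example, but the quantity a partition is judged by, namely the number of hyperedges that span more than one ELU, is easy to state in code. The following Python sketch only evaluates that cut size for a given assignment; the helper name `cut_hyperedges` and the data layout are hypothetical.

```python
def cut_hyperedges(hyperedges, part):
    """Count hyperedges whose qubits are spread over more than one block.

    hyperedges: iterable of sets of qubit names.
    part: dict mapping each qubit name to its block (ELU) index.
    """
    return sum(1 for edge in hyperedges if len({part[q] for q in edge}) > 1)
```

A lower count means fewer gates that require communication through the optical switch network.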
Qubit Arrangement Problem and Solution
After the hypergraph is partitioned into the subgraphs, a full-custom physical layout is generated for each subgraph (ELU) (lines 5-9, Algorithm 1). Since each region should have up to 10 qubits, qubits of each ELU are segmented into s strings. Before partitioning qubits into strings, a linear arrangement is found for all qubits assigned to an ELU to keep a global view in the partitioning step (line 6, Algorithm 1). The algorithm used for finding linear arrangement for qubits of an ELU is mentioned in Algorithm 2.
The qubit arrangement problem takes the interaction graph $G_i(Q_i, E_i)$ and puts each qubit $q_k$ in place $x_k$ on a straight line such that the metric $\sum_{i,j} w_{ij} |x_i - x_j|$ is minimized. The places are labeled 1, 2, 3, ..., n. The weight $w_{ij}$ is the number of edges between vertices $q_i$ and $q_j$ in $G_i$. This problem is known as the optimal linear arrangement problem [83, 84]. The qubit arrangement problem is NP-hard. In other words, no polynomial-time algorithm is known that finds the optimal order. Therefore, the approximate polynomial-time algorithm proposed in References [83, 84] is used to obtain an approximately optimal arrangement.
The algorithm used for solving the qubit arrangement problem is provided in Algorithm 2. It generates the adjacency matrix $A(G_i)$ representation of the interaction graph $G_i(Q_i, E_i)$ (line I). Then, the corresponding Laplacian matrix $L(G_i)$ of $A(G_i)$ is calculated, which is positive semidefinite (line II). When $L(G_i)$ is diagonalized, its spectrum has non-negative eigenvalues (line III). In the next step, the eigenvector $V_2$ corresponding to the second smallest eigenvalue $\lambda_2$ is computed (line IV). The components of $V_2$ are indexed using the qubit identifier $q_i$ for i = 1, 2, 3, ..., n, where n is the number of qubits assigned to the ELU (line V). In other words, $V_2[1]$ is assigned to qubit 1, $V_2[2]$ is assigned to qubit 2, and so on. At the end, the position of qubit $q_i$ is the index of $V_2[i]$ in the sorted copy of $V_2$ (lines VI and VII).
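The spectral steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the input is assumed to be a symmetric weight matrix, and the cost helper counts each qubit pair once.

```python
import numpy as np

def spectral_arrangement(weights):
    """Order qubits by the eigenvector of the second smallest Laplacian eigenvalue.

    weights: symmetric n x n matrix, weights[i][j] = number of edges
             between qubits i and j (assumed input format).
    Returns (order, position): order[p] is the qubit placed at slot p + 1,
    and position[i] is the 1-based slot of qubit i.
    """
    A = np.asarray(weights, dtype=float)
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian, positive semidefinite
    _eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues returned in ascending order
    v2 = eigvecs[:, 1]                       # eigenvector of the second smallest eigenvalue
    order = np.argsort(v2, kind="stable")
    position = np.empty(len(order), dtype=int)
    position[order] = np.arange(1, len(order) + 1)
    return order.tolist(), position.tolist()

def arrangement_cost(weights, position):
    """Sum of w_ij * |x_i - x_j| over unordered qubit pairs."""
    A = np.asarray(weights)
    n = len(position)
    return sum(A[i][j] * abs(position[i] - position[j])
               for i in range(n) for j in range(i + 1, n))
```

Because the sign of an eigenvector is arbitrary, the resulting order may come out reversed; both orders have the same cost.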
Algorithm 3: Partition(S)
Cut string S at the minimum-cut point into two strings S_p1 and S_p2
If the number of qubits in S_p1 > 10 then
    Partition(S_p1);
If the number of qubits in S_p2 > 10 then
    Partition(S_p2);
Partitioning of the Sorted String of Qubits
When the sorted long string is obtained, it is segmented into strings with 3-10 qubits by Algorithm 3 (line 7, Algorithm 1). A hierarchical minimum-cut approach 8 is followed to segment the string obtained from the previous step into s strings. Algorithm 3 recursively partitions the long sorted string S into shorter strings of 3-10 qubits. In each function call, the function cuts the input string at the minimum-cut point into two shorter strings. If one of the generated strings is longer than 10 qubits, then the function is recursively called on it. If the interaction graph is highly connected, then the partitioning may not perform well. However, it should be noted that removing the edges with outlier values from the original interaction hypergraph in the previous step improves the result of this step.
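A minimal Python sketch of this recursive cut is given below. It assumes pairwise interaction weights stored in a dictionary keyed by `frozenset` pairs, and it enforces the lower bound of three qubits by restricting the candidate cut points; the article does not state how that bound is enforced, so this restriction is an assumption, as are the function names.

```python
def cut_weight(order, weights, c):
    """Total weight of interactions crossing the cut between order[:c] and order[c:]."""
    left, right = order[:c], order[c:]
    return sum(weights.get(frozenset((a, b)), 0) for a in left for b in right)

def partition(order, weights, max_len=10, min_len=3):
    """Recursively split a sorted string of qubits at its minimum-cut point."""
    if len(order) <= max_len:
        return [order]
    # Candidate cut points leave at least min_len qubits on each side (assumption).
    cuts = range(min_len, len(order) - min_len + 1)
    best = min(cuts, key=lambda c: cut_weight(order, weights, c))
    return (partition(order[:best], weights, max_len, min_len)
            + partition(order[best:], weights, max_len, min_len))
```

For a 12-qubit chain whose only weak link lies between the sixth and seventh qubits, the sketch splits the string there and returns two strings of six qubits each.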
Placement and Routing
As the partitioning step divided the string into shorter strings, a partition-aware placement is performed (line 8, Algorithm 1). The algorithm used for placement and routing (lines 8 and 9 of Algorithm 1) is given in Algorithm 4. A position should be determined for each of the strings generated in the previous step. The cluster growth placement idea [85] from the classical CAD literature is used to place strings in this article. It is a constructive placement algorithm using a bottom-up approach. The algorithm selects unplaced strings sequentially and places them in the layout. In this algorithm, the string that is most highly connected to the already placed ones is selected to be placed. Then, an exhaustive search is carried out to find the best possible location for the string. The outline of the cluster growth algorithm is shown in Algorithm 4. At this point, the positions of strings are determined, and channels are added to the layout for communication between strings. Ions (qubits) can be ballistically transported along these channels. Although the approach proposed in this article tries to minimize the communication between strings, as the number of strings increases, the number of routes for ion transportation increases, which may make the routing challenging to design. In this article, a modified version of a maze routing algorithm [86] is used to choose a path for ion transportation.

Algorithm 4: Cluster growth placement
Let S be the set of strings to be placed;
Select the string S_j from S with the most connectivity;
Place S_j in the layout;
S = S − S_j;
While (S ≠ ∅) do
    Select the string S_j that is highly connected to the already placed strings;
    Place S_j in the layout;
    S = S − S_j;
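The cluster growth idea can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the article: the layout is a small grid of slots, the seed string is placed at the grid center, and the "best possible location" is the free slot with the smallest connectivity-weighted Manhattan distance to the strings already placed. The function name and data layout are hypothetical.

```python
from itertools import product

def cluster_growth(strings, conn, rows, cols):
    """Place strings on a rows x cols grid by cluster growth.

    strings: list of string identifiers (rows * cols must be >= len(strings)).
    conn: dict mapping frozenset({a, b}) to the connectivity between a and b.
    Returns a dict mapping each string to its (row, col) slot.
    """
    def w(a, b):
        return conn.get(frozenset((a, b)), 0)

    free = list(product(range(rows), range(cols)))
    unplaced = set(strings)

    # Seed: the string with the largest total connectivity, placed at the center.
    seed = max(sorted(unplaced), key=lambda s: sum(w(s, t) for t in strings if t != s))
    center = (rows // 2, cols // 2)
    placed = {seed: center}
    free.remove(center)
    unplaced.remove(seed)

    while unplaced:
        # Pick the string most strongly connected to the placed cluster.
        nxt = max(sorted(unplaced), key=lambda s: sum(w(s, t) for t in placed))

        def cost(slot):
            return sum(w(nxt, t) * (abs(slot[0] - p[0]) + abs(slot[1] - p[1]))
                       for t, p in placed.items())

        best = min(free, key=cost)    # exhaustive search over the free slots
        placed[nxt] = best
        free.remove(best)
        unplaced.remove(nxt)
    return placed
```

Ties are broken by identifier order and by slot order, which keeps the result deterministic.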
Schedule the Netlist
When the layouts of all ELUs are generated, the timing information is extracted and the dataflow graph of the circuit is updated with it. Then, the dataflow graph including the timing information is scheduled (line 10, Algorithm 1). The scheduler implements a greedy scheduling scheme. It maintains a list of operations that have all their dependencies fulfilled and are therefore ready to be executed. Among the ready instructions, the instruction with the highest priority is considered first and is more likely to gain access to the resources it needs. These contested resources include both gates and channels/intersections. Once all the possible instructions are scheduled, time advances until one or more resources are freed and more instructions can be scheduled. This scheduling process continues until the full instruction sequence is executed.
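A greedy list scheduler of this kind can be sketched as follows. The sketch is a simplification under stated assumptions: each gate needs exactly one resource (for example a gate location), durations are positive, the dependency graph is acyclic, and the priority values are supplied by the caller, since the article does not define the priority function. All names are hypothetical.

```python
import heapq

def list_schedule(gates, deps, duration, resource, priority):
    """Greedy list scheduling; returns a dict mapping each gate to its start time.

    gates: list of gate identifiers.
    deps: dict gate -> set of predecessor gates (missing key means no predecessors).
    duration, resource, priority: dicts keyed by gate; higher priority goes first.
    """
    remaining = {g: set(deps.get(g, ())) for g in gates}
    start = {}
    done = set()
    busy_until = {}     # resource -> time at which it becomes free
    running = []        # min-heap of (finish_time, gate)
    t = 0
    while len(start) < len(gates):
        # Ready gates: not yet started and all predecessors finished.
        ready = [g for g in gates if g not in start and remaining[g] <= done]
        for g in sorted(ready, key=lambda x: -priority[x]):
            if busy_until.get(resource[g], 0) <= t:
                start[g] = t
                busy_until[resource[g]] = t + duration[g]
                heapq.heappush(running, (t + duration[g], g))
        # Advance time to the next completion and retire every gate finishing then.
        t, finished = heapq.heappop(running)
        done.add(finished)
        while running and running[0][0] == t:
            done.add(heapq.heappop(running)[1])
    return start
```

In the example below, two gates contend for the same zone, so the higher-priority gate runs first and the dependent gate starts only after both have finished.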
The Choice of Quantum Error Correcting Code
We have two reasons to choose the Steane code. The first reason is to be fair in comparing our architecture with Ahsan's work, because it uses the Steane code as the quantum error correcting code. The second reason is the good fault-tolerance properties of this code. These properties can be summarized as follows:
a. Well-known encoding and error correction procedures: Encoding and error correction on the encoded qubit block are well-defined processes and can be easily confirmed [87].
b. Transversal implementation of Clifford gates: The Clifford gates can be executed in a bitwise manner, which ensures that the error in one qubit does not propagate to multiple qubits in the encoded qubit block.
c. Distillation-free demonstration of fault-tolerant non-Clifford gates: The implementation method for non-Clifford gates such as the Toffoli and T gates is slightly complicated but can be broken down into two steps: magic-state preparation and data injection into the magic state.
d. Scheduling on hardware with different types of constraints: The fault-tolerant procedures of the Steane code can be efficiently scheduled on hardware with different qubit connectivity constraints specified by the physical environment of the quantum device. In contrast to the well-known topological codes [88], which are tailored to hardware that supports 2D nearest-neighbor interactions, the Steane code can be mapped onto a range of hardware architectures such as a linear chain of qubits, 2D nearest-neighbor hardware [89], and hardware that supports a flexible physical shuttling mechanism for the qubits for the implementation of non-distance fault tolerant quantum computation [12, 13]. Historically, the Steane code has been considered a premier choice for protecting trapped ion qubits, which are considered one of the strongest candidates for a realistically achievable large-scale quantum system [12]. Even though the topological codes generally achieve a higher noise threshold and thus provide greater protection against noise than the Steane code, when it comes to the overall resource overhead, reliability, and the time to execute large quantum algorithms on a trapped ion computer, the Steane code edges ahead of the topological codes [90].
e. Supporting a variety of noise models: The Steane code is one of the codes that has been studied in the context of noise models other than the depolarization channel. It is also known that the Steane code can be used to correct errors that are correlated in time [91]. In quantum hardware where qubits suffer from errors specified by multiple types of noise processes, the Steane code acts as a powerful tool of protection against noise.
It is worth noting that other concatenated codes can be supported with slight modifications to the basic structure of our toolset.
Time Complexity Analysis
The time complexity of our mapping flow can be calculated as follows. Since the input netlist should be parsed to generate the dataflow graph and the interaction graph, the time complexities of Steps 1 and 2 are $O(n_g)$, where $n_g$ is the number of gates. All edges of the interaction graph should be examined in Step 3; therefore, this step needs $O(n_g)$ examinations. The k-way multilevel partitioning algorithm (Step 4) computes a k-way partitioning of the interaction graph of qubits in $O(n_e) = O(n_g)$ time [50], where $n_e$ is the number of edges of the interaction graph. The time complexity of Algorithm 2 (Line 6) is $O((n_q/k)^2)$, where $n_q$ and $k$ are the total number of qubits and the number of ELUs, respectively [51]. Algorithm 3 partitions a sorted string of qubits in logarithmic time. The complexity of the placement step (Step 8) is determined by $s$, the number of strings, because first the strings should be sorted based on the connectivity metric and then strings are selected one by one. Step 9 can be performed in time quadratic in the number of qubits. Finally, the time complexity of the list-based scheduling algorithm is $O(n_g \log(n_g))$ [92]. Therefore, the overall complexity of the flow can be calculated as $O(n_g \log(n_g) + n_q^2)$.
The time complexity scales log-linearly with the number of gates, but quadratically with the number of qubits in an application. The scheduling step contributes n д log(n д ) and the routing and finding qubit linear arrangement steps contribute n 2 q to the time complexity.
AN EXAMPLE
In this section, an example is given to illustrate the SAQIP architecture and the mapping flow. Figure 6 (a) shows the QASM instruction sequence operating on qubits q0, q1, q2, . . . , q17 for an 8-qubit quantum ripple carry adder [93] . This netlist has 52 gates and 18 qubits. The last qubit of each gate is the target qubit. Figure 6 (b) shows the interaction hypergraph of the qubits. The output of the k-way partitioning step is presented in Figure 6 (c). In this example, the qubits have been partitioned into two parts.
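As a rough illustration of how an interaction hypergraph relates to a partition of qubits into ELUs, the following Python sketch collects the multi-qubit gates of a netlist into weighted hyperedges and counts the gates whose qubits end up in more than one part. The gate list, the helper names `build_interaction_hypergraph` and `cut_weight`, and the two-way assignment are hypothetical; they do not reproduce Figure 6, and the sketch only evaluates a given partition rather than computing one as the multilevel k-way partitioner does.

```python
from collections import Counter

def build_interaction_hypergraph(gates):
    """Map each set of qubits that share a gate to the number of such gates.

    `gates` is a list of (name, qubits) pairs; the last qubit is the target,
    following the convention of the netlist in the example.
    """
    hyperedges = Counter()
    for _name, qubits in gates:
        if len(qubits) > 1:  # single-qubit gates create no interaction
            hyperedges[frozenset(qubits)] += 1
    return hyperedges

def cut_weight(hyperedges, part_of):
    """Count gates whose qubits are spread over more than one ELU."""
    return sum(
        weight
        for edge, weight in hyperedges.items()
        if len({part_of[q] for q in edge}) > 1
    )

# Hypothetical 4-qubit netlist (not the adder of Figure 6).
gates = [
    ("CNOT", ("q0", "q1")),
    ("TOFFOLI", ("q0", "q1", "q2")),
    ("CNOT", ("q2", "q3")),
    ("H", ("q3",)),
    ("CNOT", ("q0", "q1")),
]
hyperedges = build_interaction_hypergraph(gates)
part_of = {"q0": 0, "q1": 0, "q2": 1, "q3": 1}  # assumed ELU assignment
print(cut_weight(hyperedges, part_of))  # only the Toffoli crosses parts
```

A partitioner that minimizes this cut weight reduces the number of gates that need a qubit such as q3 to be brought in from another ELU.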
To save space, only the qubits assigned to ELU1 are placed and routed. Figure 7 shows the steps taken to place and route the qubits of ELU1. According to the interaction hypergraph, some gates are performed between qubit q3 from ELU2 and qubits q12 and q4 from ELU1. Therefore, q3 should be transferred from ELU2 to ELU1 when needed. Thus, in addition to the qubits assigned to ELU1, a dummy qubit, called Dummy_q3, is also considered in placement and routing. Figure 7(a) is the Laplacian matrix calculated from the adjacency matrix. The eigenvalues are listed in Figure 7(b). The eigenvector V2 corresponding to the second smallest eigenvalue (λ2 = 1.00013098006427) is computed as shown in Figure 7(c). The qubits are sorted and partitioned based on this vector (Figure 7(d)). The vertical dotted lines in this figure separate the qubits into three strings. Then, the strings are placed and routed by Algorithm 4, as presented in Figure 7(e). Next, the dataflow graph is annotated with delays and the critical path is extracted. Each edge is labeled with the time a qubit requires to move from one gate location to the next gate location plus the delay of the gate located at the head of that edge.

The latency of a circuit is the total time it takes to execute the circuit on a particular layout. After a circuit is scheduled on a layout, its latency can be calculated. We use the same assumptions as SQripT [16, 94] to calculate the execution time. The latency of a circuit can be written as the sum of the delays on the critical path, which has four components. The first component is the sum of delays due to the transportation of qubits of gates located on the critical path through ballistic shuttling channels inside ELUs. The second component is the sum of delays due to the logical EPR pair generations for communication that are on the critical path. The third component is the sum of delays due to swap operations that are on the critical path and are required at each gate location to access a qubit. Finally, the fourth component is the sum of delays of the gates located on the critical path. The latency of this circuit, calculated from the scheduled dataflow graph and based on the values listed in Table 1, is 67,327 μs.
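The latency calculation described above amounts to a longest-path computation on the delay-annotated dataflow graph. A minimal Python sketch follows; the graph, the delay values, and the helpers `edge_delay` and `critical_path_latency` are illustrative assumptions and do not reproduce Table 1 or the scheduled graph of this example. Each edge weight is assumed to combine the shuttling, EPR-generation, swap, and gate delays that apply to that edge.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def edge_delay(move, epr, swap, gate):
    """Combine the four delay components assumed for one dataflow edge."""
    return move + epr + swap + gate

def critical_path_latency(edges):
    """Return the longest-path length of a DAG given as {(u, v): delay}."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    finish = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        finish[node] = max(
            (finish[p] + edges[(p, node)] for p in preds[node]),
            default=0,
        )
    return max(finish.values(), default=0)

# Hypothetical dataflow graph with made-up delays (arbitrary time units).
edges = {
    ("start", "g1"): edge_delay(move=10, epr=0, swap=0, gate=5),
    ("start", "g2"): edge_delay(move=2, epr=100, swap=0, gate=5),
    ("g1", "g3"): edge_delay(move=4, epr=0, swap=3, gate=5),
    ("g2", "g3"): edge_delay(move=1, epr=0, swap=0, gate=5),
}
print(critical_path_latency(edges))
```

In this made-up instance the path through g2 dominates because of its EPR-generation delay, which mirrors the observation that inter-ELU communication is the most expensive contribution.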
THE SAQIP PERFORMANCE
To quantify the performance of the SAQIP architecture, a CAD tool, called SAQIPSim, has been developed that (1) places and routes a quantum circuit onto the SAQIP architecture, (2) schedules the sequence of quantum operations, and (3) reports performance metrics such as total execution time. The device parameters listed in Table 1 are used in the SAQIPSim tool. Although the assumed parameter values are optimistic, they could be achieved in the near future through rapid technology advancement [95, 96].
Benchmark Circuits
To evaluate the proposed architecture and compare it with the best architecture in the literature [16], quantum adders of different sizes, two quantum Fourier transform (QFT) circuits, and two randomly generated quantum circuits are used. All results in this section were obtained on a 2.6 GHz Intel Core2 Duo with 4 gigabytes of memory. The first step of Shor's quantum factorization algorithm is called quantum modular exponentiation [97, 98], which can be constructed from quantum adder circuits [99]. In this article, two candidate adders that use different addition strategies, the quantum ripple carry adder (QRCA) and the quantum carry look-ahead adder (QCLA), are chosen to evaluate SAQIPSim. QRCA has a linear-depth structure. An n-bit QRCA uses about 2n qubits to perform 2n Toffoli and 5n CNOT gates [93]. QCLA, in contrast, has a logarithmic-depth structure using 4n qubits to perform up to n concurrently executable gates [100]. This circuit is composed of about 5n − 3 log2 n CNOT and Toffoli gates for n-bit addition. The benchmarks are shown in Table 2, which lists the number of qubits and the number of gates of each circuit. Table 3 shows the latency of the benchmark circuits achieved by SAQIPSim compared with the best tool in the literature, the SQripT toolbox [16] (source code available at https://users.cs.duke.edu/~ahsan/SQrIpT/ToolBox/). The column "SAQIPSim (SQripT Config.)" under "Latency" shows the latency of circuits obtained by our toolbox with the configuration reported as the optimal architecture configuration in the SQripT toolbox. The column "SAQIPSim (Optimal Config.)" under "Latency" contains the best latency of circuits achieved by our toolbox.
Simulation Results
The column "SQripT" shows the latency of circuits obtained by the best prior toolbox, SQripT. The columns "SQripT Config." and "Optimal Config." under "Improvement" give the improvement of our architecture over SQripT with the best SQripT configuration and with our best configuration, respectively. The results show that the SAQIP architecture and our mapping flow can dramatically decrease circuit latency compared with the best previous architecture and toolbox. The QCLA circuits in particular appear to benefit greatly from the SAQIP architecture. The experiments show that using the k-way method for partitioning qubits into ELUs and the eigenvector approach for segmenting the qubits of each ELU into strings results in a significant decrease in the latency of executing a circuit.
It is worth noting that the structure of a circuit has a great impact on the effectiveness of a mapping flow. The QFT applications have more regular structures than the other benchmark circuits. These circuits can be partitioned into recognizable parts. In part p, qubit q_p is the target qubit of the gates, and their control qubits are q_{p+1}, q_{p+2}, ..., q_n, respectively. Because of this regular structure, Ahsan's approach behaves more efficiently for the QFT applications than for the other benchmark circuits. Therefore, the improvement percentages for these circuits are, on average, smaller than those for the other benchmark circuits.
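The regular structure of the QFT circuits described above can be made concrete with a short Python sketch that lists, for each part p, the (control, target) pairs of its two-qubit gates. The helper name `qft_parts` and the 1-based qubit indexing are assumptions made for illustration, and single-qubit gates are omitted.

```python
def qft_parts(n):
    """Return {p: [(control, target), ...]} for an n-qubit QFT-like circuit.

    In part p the target is qubit p and the controls are p+1, ..., n.
    """
    return {
        p: [(control, p) for control in range(p + 1, n + 1)]
        for p in range(1, n + 1)
    }

parts = qft_parts(4)
for p, pairs in parts.items():
    print(p, pairs)
```

Each part touches a contiguous run of qubit indices, so even a simple linear arrangement keeps interacting qubits near each other. This is one plausible reading of why the gap between the two approaches is smaller for QFT.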
Quantum data is transferred inside ELUs by ballistically shuttling ions and between ELUs by photons via the optical switch network. Our mapping algorithm relies on the fact that minimizing communication both between ELUs and inside ELUs has a great effect on circuit latency. Since communication between ELUs incurs a significant delay compared with other physical operations, the main objective of the algorithm that partitions qubits into ELUs (the multilevel k-way hypergraph partitioning algorithm [82]) is to minimize the interaction between ELUs. Moreover, since qubits move ballistically to communicate with other qubits inside an ELU, minimizing the distance that each qubit traverses has a direct effect on circuit latency. Therefore, the proposed partitioning algorithm (Algorithm 3) and placement algorithm (Algorithm 4), which partition and place the qubits assigned to an ELU, keep the most strongly interacting qubits close together. These algorithms play a major role in our improvement.

Table 4 shows the runtime of the mapping program for the benchmark circuits in detail. Our approach has a smaller runtime than Ahsan's approach. The main reason for this improvement is that Ahsan's toolset solves the linear arrangement problem to partition all qubits into ELUs, whereas we use the k-way partitioning technique. It is worth noting that in each group of benchmarks (QCLA and QRCA), the improvement increases as the number of qubits increases. The results show that our mapping flow performs well not only in latency but also in runtime.

Table 5 reports the area consumption for the benchmark circuits obtained by our toolset and by Ahsan's, measured as the number of macroblocks consumed by each benchmark circuit. Unlike Ahsan's architecture, which uses similar QLA-like ELUs, ours maps sub-circuits onto full-custom fabrics. Therefore, circuits mapped by our flow theoretically consume less area than those mapped by Ahsan's flow. As the results show, our approach decreases the area (number of macroblocks) by about 11.62% on average for the attempted benchmark circuits.
CONCLUSIONS
In this article, an architecture called SAQIP was presented, based on the MUSIQC hardware, for building large quantum computers using technology features that are available today. It benefits from the capabilities recently provided by advancements in ion-trap technology to generate layouts with better latency and area. It is composed of full-custom blocks connected by a photonic network, and it gives better area and delay for quantum circuits than Ahsan's SQripT architecture. We use area and latency as metrics because, to a first approximation, lower area and lower latency are likely to decrease decoherence and error probability. Moreover, a design flow was proposed to map a quantum circuit onto the proposed architecture. The experimental results show that the proposed architecture and flow can improve the latency and the area by up to 99% and 27%, respectively, for the attempted circuits.
