More computational resources (i.e., more physical qubits and qubit connections) on a superconducting quantum processor not only improve the performance but also result in a more complex chip architecture with a lower yield rate. Optimizing both of them simultaneously is a difficult problem due to their intrinsic trade-off. Inspired by the application-specific design principle, this paper proposes an automatic design flow to generate simplified superconducting quantum processor architectures with negligible performance loss for different quantum programs. Our architecture-design-oriented profiling method identifies program components and patterns critical to both the performance and the yield rate. A follow-up hardware design flow decomposes the complicated design procedure into three subroutines, each of which focuses on different hardware components and cooperates with corresponding profiling results and physical constraints. Experimental results show that our design methodology could outperform IBM's general-purpose design schemes with better Pareto-optimal results.
Introduction
As a promising computation paradigm, Quantum Computing (QC) has been rapidly growing in the last two decades and has shown strong potential in many important areas, including machine learning [14, 20], chemistry simulation [32, 39], etc. In particular, the superconducting quantum circuit [13] has become one of the most promising technique candidates for building QC systems [5, 9, 37] due to the ever-increasing qubit coherence time, individual qubit addressability, fabrication technology scalability, etc. Towards efficient superconducting quantum circuit based QC systems, significant research has recently been conducted, ranging from compiler optimization [34, 47] to periphery control hardware support [16, 50] and device innovation [27, 33].
Despite these system optimizations, the performance of a superconducting quantum processor is still highly limited by the amount of computation resources on it. Researchers have been trying to integrate more qubits and qubit connections on one superconducting quantum processor substrate. For example, IBM's first superconducting quantum chip on the cloud has 5 qubits with 6 qubit connections, while its latest published chip has 20 qubits with 37 qubit connections [10]. Increasing the number of physical qubits on a superconducting quantum processor allows programs with more logical qubits to be executed. Denser qubit connections can increase the overall chip performance by reducing the overhead of qubit mapping and routing [28, 35, 48, 56].
Nevertheless, more qubits and qubit connections will, unfortunately, increase the probability of defect occurrence on a chip, leading to a lower yield rate and blocking future development of larger-scale superconducting quantum processors. For example, the yield rate of a 17-qubit chip can be lower than 1% under IBM's state-of-the-art technology [43]. Such a low yield rate comes from frequency collision, a unique defect on superconducting quantum processors [6, 30]. The frequencies of physically connected qubits may 'collide' with each other when their values satisfy some specific conditions. More qubit connections naturally increase the probability of frequency collision and lower the yield rate.
Optimizing both the yield rate and performance would be desirable, but it is difficult in general due to the inherent trade-off between these two objectives. Most previous efforts on them are direct device-level improvements [26, 27, 33, 44], while little attention has been given to the architectural design of a superconducting quantum processor. This paper fills the gap by exploring the possibility of efficient application-specific architecture design to reach an optimized balance between yield rate and performance. We envision that an array of QC accelerators, each of which is tailored to a specific application, is much more likely to be adopted in the near term, where computation resources are still limited, before we can reach a universal quantum computer (i.e., one quantum computer that runs all kinds of quantum programs). Our design shares the same high-level spirit with the hardware architecture designs in classical computing (e.g., machine learning [8, 19], graph processing [1, 18]), but faces different scenarios because both the program patterns and the hardware design space are different in QC.

Session 11B: Quantum computing - Who says you can't watch two talks at once? ASPLOS '20, March 16-20, 2020, Lausanne, Switzerland

Figure 1. Overview of the Proposed Architecture Design Flow
In particular, we highlight two key challenges to be addressed before the application-specific principle can be applied in superconducting quantum processor design. First, we need to identify and abstract the computation pattern of quantum programs that can guide the hardware architecture design. Prior quantum program analysis studies [21, 24, 38, 53-55] mainly focused on software or compiler optimization and cannot extract appropriate information for hardware architecture optimization. Second, the abstracted computation pattern must give guidance to efficient architectural designs, which employ fewer computation resources with physical constraints satisfied to achieve both high yield rate and performance. Existing superconducting quantum processor design schemes cannot handle such irregular/complicated application-specific architecture design tasks [7, 12, 29, 43].
To overcome these two challenges, we design a systematic design flow to automatically generate efficient superconducting quantum processor architecture designs for different quantum programs (shown in Figure 1). We first identify two key computation patterns in quantum programs, the coupling degree list and the coupling strength matrix. A profiler is built to automatically extract them from an input quantum program. Both of them are critical to the program performance and hardware yield rate, and thus optimizing their underlying architecture support can potentially achieve a better balance between the performance and yield rate. We then propose an architecture design flow, which comes with three key subroutines: layout design, bus selection, and frequency allocation. Each subroutine focuses on different hardware resources and must cooperate with corresponding profiling results and physical constraints. We further propose an array of heuristics to ensure the scalability and effectiveness of the architecture search process. Empirical studies show that these heuristics can find 'near-optimal' solutions in the reduced search space.
In summary, this paper makes the following contributions:
• We are the first to identify the optimization opportunity from the architecture level to push forward the balance between performance and hardware yield rate for superconducting QC processors.
• We formalize an end-to-end design flow, equipped with a set of novel algorithmic primitives, to automatically generate a series of application-specific architectural designs under different hardware resource limits.
• Comprehensive experiments show that our design flow could outperform IBM's general-purpose designs with better Pareto-optimal results, e.g., magnitudes of yield improvement with negligible performance loss.
Background
In this section, we will introduce the necessary QC basics for understanding the following program profiling and superconducting quantum processor architecture design.
QC Program Basics
A quantum program can be represented in the well adopted quantum circuit model [36] . We will start from the basic components in a quantum circuit and then illustrate how they compose a quantum circuit.
Logical Qubit and Quantum Operation. A quantum program consists of some logical qubits as variables and some quantum operations which can modify the state of the qubits. A qubit is the basic information processing unit in QC, which has two basis states denoted as $|0\rangle$ and $|1\rangle$. One qubit can be in not only the basis states themselves but also their linear combinations, which can be depicted by a vector in the Hilbert space. The state of the qubits can be modified by quantum operations. The first type of quantum operation is the unitary operation, also known as a quantum gate in the circuit model, which can implement a unitary transformation on the qubit state. Quantum gates can be applied on a single qubit or on multiple qubits. The second type is the measurement operation, which forces the qubits to collapse to basis states.
Quantum Circuit. The quantum circuit is a model of QC in which the computation is a sequence of quantum gates and measurement operations. The state of the qubits is first initialized and then manipulated by a sequence of operations. Single-qubit gates and measurement operations are applied on individual qubits while two-qubit gates are applied on two logical qubits. It has been proved that any multi-qubit gate can be decomposed into a series of single-qubit gates and CNOT gates (a specific two-qubit gate) [4]. This is also the basic gate set directly supported on IBM's devices. As a result, this paper assumes that the quantum circuit has been decomposed and gates with three or more qubits are not considered.
Superconducting Quantum Circuit Basics
All the qubits and quantum operations in a quantum circuit must be implemented in a real physical QC system to execute the program. In this paper, we focus on superconducting quantum processors with fixed-frequency Josephson-junction-based transmon qubits [27] and all-microwave cross-resonance two-qubit gates [41] adopted by IBM [43].
Physical Qubit and Frequency. Figure 2 shows the physical circuit and energy levels of a transmon qubit [27]. Due to the nonlinearity of the Josephson junction, the gaps between the energy levels in this quantum anharmonic oscillator are different, which allows us to use the ground state $|0\rangle$ and the first-excited state $|1\rangle$ as the computation basis without populating other states. Suppose the energy gap between $|0\rangle$ and $|1\rangle$ for a qubit is $E_{01}$. The frequency of this qubit $f_{01}$ is defined as $f_{01} = E_{01}/h$, where $h$ is the Planck constant. Similarly, we use $f_{12}$ to represent the frequency corresponding to the energy gap between $|1\rangle$ and $|2\rangle$. For a typical qubit design with effective operations [30], $f_{01}$ and $f_{12}$ are about 5 GHz and 4.66 GHz, respectively. The anharmonicity of this qubit is defined to be $\delta = f_{12} - f_{01}$, which is $-340$ MHz under this typical design [7, 46].
Qubit Layout. The superconducting physical qubits are confined on a 2-dimensional planar substrate. Although the qubit placement can be flexible, major vendors fabricate the qubits in a regularized structure to ensure scalability and […] [23] placed their qubits on the nodes of $2 \times 8$ and $4 \times 5$ lattices, respectively. Google's 72-qubit chip placed its qubits on some nodes of an $11 \times 12$ lattice [25].

Qubit Connection. To enable two-qubit gates between two physical qubits, resonators, also known as qubit buses, are employed to connect nearby qubits [41]. For example, Figure 2 shows two types of commonly used buses. The first one is a 2-qubit bus connecting two physical qubits. The second one is a 4-qubit bus, which connects four physical qubits in a square together. The coupling graphs of these two types of buses are shown on the right. Compared with a 2-qubit bus, a 4-qubit bus supports two-qubit gates on not only the four qubit pairs on the edges but also the two qubit pairs on the diagonals.
Qubit Mapping. It is usually assumed that a two-qubit gate can be applied on any two logical qubits in a quantum program, but some two-qubit gates may not be executable due to the limited qubit connections on a superconducting quantum processor. On the hardware side, this problem can be relieved by employing more physical qubit connections so that two-qubit gates can be directly supported on more qubit pairs. On the software side, a qubit-remapping compiler [31] can resolve the dependency of the remaining unexecutable two-qubit gates, but additional operations must be introduced, with longer execution time and higher error rate. Therefore, more physical qubit connections can help with the overall performance by allowing native two-qubit gates on more physical qubit pairs.
Fabrication Variation. Variation is inevitable when fabricating a superconducting quantum processor. If a qubit is designed to have frequency $f$, the actual frequency after fabrication will be $f' = f + n_f$, where $n_f$ follows the Gaussian distribution $N(0, \sigma)$. $\sigma$ is the fabrication precision parameter, which is around 130 MHz to 150 MHz under IBM's state-of-the-art technology [43]. Such noise makes it hard to predict the post-fabrication frequency precisely, which introduces a probability of frequency collision.
Frequency Collision. When two or three qubits are connected, frequency collision may happen and cause defects on the device. Figure 3 summarizes seven qubit frequency collision conditions in IBM's devices [6, 43]. On the left is a […]

Figure 3. Frequency Collision Conditions [6, 43]
Quantum Program Profiling
The first step towards the development of an architecture-specific quantum processor for both high performance and yield rate is to determine what program information we should focus on. There are several different types of components in a quantum circuit but not all of them will significantly affect the hardware design. Our target program component(s) should satisfy two conditions: 1) the component's execution is a performance bottleneck which can be dramatically improved with optimized hardware support, and 2) the component's required hardware should significantly affect the yield rate. We found that two-qubit gates can be a key factor to bridge performance and yield. To execute two-qubit gates on a quantum processor with limited qubit-to-qubit coupling, a large number of additional operations are introduced to satisfy their dependencies. But implementing two-qubit gates on two physical qubits requires on-chip qubit connections, which can lower the yield rate by increasing the probability of frequency collision. Therefore, we give logical qubits and qubit pairs priorities based on the number of two-qubit gates they are involved in, to help with the following architecture design. Critical qubits and qubit pairs will have more hardware support to improve the efficiency of the generated architectures.
The remaining components, i.e., single-qubit gates, initialization, and measurement operations, do not involve qubit-to-qubit interactions and all happen locally on individual qubits when they are implemented on hardware. As a result, hardware support for these components will not affect the chip yield through frequency collision.
Profiling Method
As discussed above, our profiling will focus on the logical qubits and the two-qubit gates. Figure 4 shows an example to illustrate the profiling procedure. Suppose we have a quantum circuit as shown in Figure 4 (a). It has 5 logical qubits denoted by $q_{0,1,2,3,4}$. All of them are initialized to be $|0\rangle$. Then some single-qubit gates and two-qubit gates are applied. Measurement operations are at the end.
We first ignore all single-qubit gates, initialization, and measurement operations. Then we create a logical coupling graph, in which each vertex represents one logical qubit in the circuit. Two vertices are connected by an undirected edge if there exist two-qubit gates applied on the two corresponding logical qubits. The weight of an edge is the number of two-qubit gate instances on the two connected vertices. Figure 4 (b) shows the generated graph for the example circuit. The weight of the edge between vertex $q_0$ and vertex $q_4$ is 2 since there are two two-qubit gates on $q_0$ and $q_4$. For all other edges, the weight is 1 because there is only one two-qubit gate on each of those qubit pairs. The first profiling result is the weighted adjacency matrix of the logical coupling graph, namely the coupling strength matrix. The element with indices $(i, j)$ represents the number of two-qubit gates between $q_i$ and $q_j$. Figure 4 (c) shows the coupling strength matrix for the example circuit. Note that the coupling strength matrix is always a symmetric matrix.
The second result is the coupling degree list. For each qubit, we sum the weights of the edges that connect to its corresponding vertex and define this sum, i.e., the number of two-qubit gates applied on the qubit, as its coupling degree. If one qubit is associated with more two-qubit gates in a quantum circuit than other qubits, this qubit will use the physical qubit connections more frequently when executing on the chip. Naturally, we should pay more attention to those qubits with larger coupling degree. Therefore, all qubits are placed in a sorted list, namely the coupling degree list. Figure 4 (d) is the coupling degree list in this example. The first one in this list is $q_4$ because it has the largest coupling degree. All qubits are in descending order.
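The two profiling results can be computed in a few lines. The following is a minimal sketch, assuming the circuit has already been reduced to a list of two-qubit gates given as pairs of qubit indices. The function name `profile` and the gate list are illustrative assumptions; the gate list is chosen to be consistent with the description above but is not claimed to be the circuit of Figure 4.

```python
def profile(num_qubits, two_qubit_gates):
    """Return (coupling strength matrix, coupling degree list) for a gate list."""
    strength = [[0] * num_qubits for _ in range(num_qubits)]
    for a, b in two_qubit_gates:
        strength[a][b] += 1
        strength[b][a] += 1  # undirected edge, so the matrix stays symmetric
    # Coupling degree of a qubit = sum of the weights of its incident edges.
    degree = [sum(row) for row in strength]
    # Descending order; ties keep the lower qubit index first (an arbitrary choice).
    order = sorted(range(num_qubits), key=lambda q: -degree[q])
    return strength, [(q, degree[q]) for q in order]


# Hypothetical gate list (pairs of qubit indices), not the circuit of Figure 4.
gates = [(0, 4), (0, 4), (1, 4), (2, 4), (3, 4), (0, 1)]
strength, degree_list = profile(5, gates)
print(strength[0][4])  # 2
print(degree_list)     # [(4, 5), (0, 3), (1, 2), (2, 1), (3, 1)]
```

Single-qubit gates, initialization, and measurement never enter the gate list, which mirrors the first step of the profiling procedure.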
Gate Pattern Examples
In this section, we show the existence of distinct two-qubit gate patterns and discuss the opportunity for application-specific architecture design with two examples. Figure 5 shows their coupling strength matrices. On the left is an 8-qubit UCCSD ansatz for VQE, a quantum simulation algorithm [39]. The high coupling strength qubit pairs form a chain structure marked by a red rectangle. $Q_0$ and $Q_1$ have a large number of two-qubit gates between them, as well as […] For other qubit pairs, the coupling strength is much lower (only about 10%). On the right is a 15-qubit quantum arithmetic function [52]. […] different two-qubit gate patterns.

Figure 4. Example of the Profiling Method

These observations suggest that quantum processors can be customized for different programs with different patterns. An efficient architecture can focus on supporting the high-density coupling in a quantum program to reduce the number of connections on-chip. For example, a quantum processor with an 8-qubit chain structure (8 qubits and 7 qubit connections) can immediately support most of the two-qubit gates in the 8-qubit UCCSD ansatz program. The remaining two-qubit gates can be supported through remapping without introducing too many additional operations because their total number is relatively small. Such application-specific QC accelerators with simplified architectures can be a more realistic goal in the near term than a general-purpose quantum processor with a large number of hardware resources.
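To make the chain argument concrete, one can measure which fraction of a program's two-qubit gates fall on qubit pairs that a candidate topology connects directly. The sketch below assumes a symmetric coupling strength matrix and a list of physical connections. The function name and the 4-qubit matrix are hypothetical and only mimic a chain-dominated pattern; they are not the UCCSD data of Figure 5.

```python
def directly_supported_fraction(strength, physical_edges):
    """Fraction of two-qubit gates whose qubit pair is directly connected."""
    n = len(strength)
    total = sum(strength[i][j] for i in range(n) for j in range(i + 1, n))
    edges = {frozenset(e) for e in physical_edges}  # undirected, deduplicated
    direct = sum(strength[min(e)][max(e)] for e in edges)
    return direct / total if total else 1.0


# Hypothetical chain-dominated coupling strength matrix for 4 logical qubits.
s = [
    [0, 10, 1, 0],
    [10, 0, 10, 1],
    [1, 10, 0, 10],
    [0, 1, 10, 0],
]
chain = [(0, 1), (1, 2), (2, 3)]  # 4 qubits, 3 connections
print(directly_supported_fraction(s, chain))  # 0.9375
```

In this toy case three connections cover 30 of the 32 gates; only the remaining two would need remapping.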
Architecture Design
After a quantum circuit is profiled, a straightforward quantum processor architecture for such a circuit is to organize the on-chip qubits and qubit connections directly based on the logical coupling graph. However, we must consider the physical constraints for a practical architecture. For example, a logical coupling graph may not be perfectly fabricated on hardware since the allowed connections among superconducting qubits are very limited. Moreover, we hope to improve the yield rate by delivering architecture designs with fewer hardware resources. Therefore, the proposed hardware design flow must not only invest more hardware resources on frequent operations based on the profiling results, but must also obey the physical constraints on the arrangement of hardware components.
To accomplish such a complicated task in a scalable way, we decouple the hardware design procedure into three subroutines and each subroutine focuses on different architecture components, i.e., qubit layout, connection, and frequency. For each subroutine, we first review the difficulty and the physical constraints considered. Then we discuss the design objectives, and how they are achieved in the proposed design algorithms.
Layout Design
The first step is to determine where to place the qubits. To ensure scalability and modularity, we follow the convention from major vendors introduced in Section 2 and will only place qubits on the nodes of a 2D lattice. We start from a large 2D lattice, in which each node is initialized to be empty (Figure 6 (a)). Then physical qubits can be placed in the empty nodes and one node can contain at most one qubit. There are many ways to place a given number of qubits on a 2D lattice. For example, 16 qubits can constitute a $4 \times 4$ lattice, a $2 \times 8$ lattice, or other more irregular structures. But we need to select one qubit layout that is most suitable for executing the program, i.e., most operations can be directly supported or indirectly supported with low overhead. The objectives of this qubit layout design subroutine are summarized as follows.
• Since we need to consider the profiling information, we create a pseudo mapping between logical qubits in the profiled program and the physical qubits in the hardware architecture to be delivered. For two logical qubits with a large number of two-qubit gates between them, we hope to place their corresponding physical qubits in adjacent nodes so that later those two-qubit gates can be directly supported by the connection between the two physical qubits.
• One physical qubit can only have a limited number of directly connected qubits. For those two-qubit gates that cannot be directly supported, we hope to reduce the number of additional operations introduced for remapping the qubits.
We propose a coupling-based qubit placement algorithm to determine the geometric locations of the qubits on a 2D lattice (pseudocode shown in Algorithm 1). We illustrate the algorithm with an example in Figure 6. First, we put the first qubit in the coupling degree list, $q_4$, on one node of the 2D lattice. Since the initial 2D lattice is empty, the location of $q_4$ does not matter. We set the geometric coordinate of the first qubit to be $(0, 0)$ and then place the remaining qubits around $q_4$. $q_4$ has four neighbors, $q_{0,1,2,3}$, in the logical coupling graph. We need to select the next one to place. By checking the coupling degree list, we can see that $q_0$ is the one with the largest coupling degree. The node occupied by $q_4$ has four equivalent adjacent nodes and we can place $q_0$ on any of them. In this example, we select the node on the north of $q_4$ with coordinate $(0, 1)$. Such an algorithm design ensures that the strongly coupled qubit pairs are given higher priority and placed on adjacent nodes, accomplishing the first objective mentioned above.
Then we need to place $q_1$ since its coupling degree is larger than that of $q_2$ and $q_3$. $q_1$ is connected to both $q_4$ and $q_0$, so we need a more sophisticated way to evaluate all potential nodes for $q_1$. We use the function in line 13 of Algorithm 1 to find the node that can make $q_1$ close to its strongly coupled neighbors in the logical coupling graph. This function is the summation over all of $q_1$'s placed neighbors. Each term in the summation is the product of the coupling strength between $q_1$ and one logical coupling neighbor $q'$ and the Manhattan distance between the evaluated node location and the location of $q'$. After evaluating all the empty nodes that are adjacent to placed nodes $q_4$ and $q_0$, we will find that the nodes on the east and west of $q_4$ are the best ones because they are closest to $q_4$ but not far away from $q_0$. Here we select the one on the west of $q_4$ with coordinate $(-1, 0)$. This summation function can help reduce the number of operations for later remapping and achieve the second design objective. The remaining qubits can be placed in a similar procedure until all the qubits have been placed on the 2D lattice. In this example, $q_2$ and $q_3$ are placed on the nodes with coordinates $(0, -1)$ and $(1, 0)$, respectively. All the qubits have their locations (coordinates) on a 2D lattice where we can fabricate one physical qubit on each occupied node. Finally, the nodes with no qubits are removed.

Algorithm 1 (fragment)
      […]
      Find the qubit q with the largest coupling degree in qubit_candidate_list;
      /* Determine the placement location */
  12  for location of the nodes that are empty and connected to at least one occupied node do
  13      […]
  14  end  /* q' must be placed neighbor qubits */
  15  Place q in the location with the minimal score;
  16  […].remove(q);
  17  end
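The placement procedure described above can be sketched as follows. This is an illustrative reading of the prose, not the authors' Algorithm 1: the candidate rule (unplaced qubits coupled to at least one placed qubit), the fallback for disconnected qubits, and the tie-breaking order of lattice directions are all assumptions, and the coupling strength matrix in the usage example is hypothetical.

```python
def place_qubits(strength, degree_order):
    """Greedy coupling-based placement on an unbounded 2D lattice."""
    # Direction order N, W, S, E; it only breaks ties (an arbitrary assumption).
    steps = [(0, 1), (-1, 0), (0, -1), (1, 0)]
    loc = {degree_order[0]: (0, 0)}  # highest coupling degree goes to the origin
    while len(loc) < len(degree_order):
        unplaced = [q for q in degree_order if q not in loc]
        # Prefer qubits coupled to an already placed qubit; degree_order is
        # sorted, so the first entry has the largest coupling degree.
        coupled = [q for q in unplaced if any(strength[q][p] for p in loc)]
        q = (coupled or unplaced)[0]
        occupied = set(loc.values())
        frontier = []  # empty nodes adjacent to at least one occupied node
        for x, y in loc.values():
            for dx, dy in steps:
                node = (x + dx, y + dy)
                if node not in occupied and node not in frontier:
                    frontier.append(node)

        def score(node):
            # Sum over placed qubits of coupling strength x Manhattan distance.
            return sum(strength[q][p] * (abs(node[0] - px) + abs(node[1] - py))
                       for p, (px, py) in loc.items())

        loc[q] = min(frontier, key=score)  # first minimal node wins ties
    return loc


# Hypothetical 5-qubit coupling strength matrix (symmetric).
s = [[0] * 5 for _ in range(5)]
for a, b, w in [(0, 4, 2), (1, 4, 1), (2, 4, 1), (3, 4, 1), (0, 1, 1)]:
    s[a][b] = s[b][a] = w
layout = place_qubits(s, [4, 0, 1, 2, 3])
print(layout[4], layout[0])  # (0, 0) (0, 1)
```

Unplaced qubits with zero coupling to placed ones contribute nothing to the score, so only placed logical neighbors influence the chosen node. When several nodes tie, the outcome depends entirely on the assumed direction order.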
Bus Selection
In the second step, we need to connect the placed physical qubits to enable two-qubit gates. The difficulty comes from the large size of the design space. For $N$ qubits, there are […] after considering the nearest-neighbor coupling constraint in which one qubit can only connect with a few qubits around it on the lattice, the size of the design space is still $O(\exp(N))$. More importantly, more qubit connections will improve the performance but lower the yield rate in general, so we need to identify those connections with the most potential performance benefit in a very large design space. This paper simplifies the connection design problem by considering two types of common buses, the 2-qubit bus and the 4-qubit bus (shown in Figure 2). These two types of buses naturally fit in the 2D lattice qubit layout and can be easily fabricated because at most 4 nearby qubits are connected by one bus. After placing the qubits on a 2D lattice in the first step, 2-qubit buses can be directly generated on the edges that connect two occupied nodes, but it is not yet clear where to apply 4-qubit buses, which also connect the qubits on a diagonal, to achieve the Pareto-optimal results. The bus selection subroutine is proposed to identify the locations for 4-qubit buses. Other potential bus designs are left as future research directions and will be discussed in Section 6.

Instead of considering the nodes in a 2D lattice, we consider the squares that are naturally formed by the edges in the 2D lattice. Each square can be configured to use 2-qubit buses or a 4-qubit bus. Now the problem is on which squares we should use 4-qubit buses. The size of the search space, even for this 4-qubit bus square selection problem, is still $O(\exp(N))$. But the simplification allows us to design high-quality heuristics to guide the selection. Before introducing our solution, one additional prohibited condition must be considered.

Algorithm 2 (fragment)
      […]
      Select the square with the highest filtered_weight;
  10  Set the weights of squares (i+1, j), (i, j+1), (i-1, j), and (i, j-1) to be 0 and mark them to be blocked;
      […]
Prohibited Condition. One physical constraint that we must consider when applying 4-qubit buses is that we cannot have 4-qubit buses in two adjacent squares. The reason is explained with the example in Figure 7 (a). Suppose we have two adjacent squares and both of them are using 4-qubit buses. Then there will be two physical connections between qubits $i$ and $j$. When we use one of the connections, the other one will bring unexpected effects, so employing a 4-qubit bus in one square will immediately block using 4-qubit buses in any of its adjacent squares.
Considering the physical constraints mentioned above, the objectives of this step are summarized as follows:
• Since adding more qubit connections will increase the probability of frequency collision and lower the yield, we hope to apply 4-qubit buses on those squares that can benefit the performance most. In other words, the additional connections are expected to directly support as many two-qubit gates as possible.
• Applying a 4-qubit bus in one square will block adjacent squares, making it impossible to directly support some two-qubit gates in those blocked squares. This effect should also be considered when selecting the 4-qubit squares.
We propose a 4-qubit bus selection algorithm to select some squares for 4-qubit buses (pseudocode shown in Algorithm 2). In each iteration, one square that could benefit most from a 4-qubit bus will be selected. Users can specify the maximum number of 4-qubit buses they hope to have. By varying the number of selected squares, a series of architectures can be generated with a trade-off between yield and performance.
To find the most fitting square, we first need to calculate how much one square could benefit from a 4-qubit bus. Since the difference between a 2-qubit bus square and a 4-qubit bus square is whether the qubit pairs on the diagonals are connected, we define the cross-coupling weight for each square as the sum of the coupling strength of the qubit pairs on the diagonals. For the example in Figure 7 (c), the cross-coupling weight of the green square is the coupling strength of $(q_0, q_3)$ plus that of $(q_1, q_2)$. A corner case in the coupling weight computation is the square with only 3 qubits (shown in Figure 7 (b)). In such squares, 4-qubit buses can naturally reduce to 3-qubit buses which support coupling between any two of the three connected qubits. The weight of a 3-qubit square is only the weight of the logical coupling between the two qubits on one diagonal since the other diagonal only has one qubit. For example, the weight of the 3-qubit square in Figure 7 (b) is the $(i, j)$ element in the coupling strength matrix. Except for this small modification, 3-qubit squares are treated the same as 4-qubit squares in our bus selection step. This cross-coupling weight can estimate the potential benefit of applying a 4-qubit bus in one square and realize the first objective.
However, the cross-coupling weight is not accurate enough to evaluate the benefit of a 4-qubit bus for a square because the prohibited condition is not yet considered. We design a filter to apply this constraint. For each square, the filtered weight is its original cross-coupling weight minus all its neighbors' weights. For example, in Figure 7 (c), the filtered weight of the green square is its original weight minus the weights of the four blue squares. This filter can take the prohibited condition into consideration and achieve the second objective.
After applying the filter, we will select the square with the highest filtered weight. Then we will label the selected square and its adjacent neighbors so that they will no longer be available for future 4-qubit buses. We also change their weights to zero because they should not affect the 4-qubit selection among the remaining squares. The algorithm will iterate again to select the next square until there are no more squares available or we have already applied enough 4-qubit buses.
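The selection loop can be sketched as below. This is one possible reading of the description, under stated assumptions: squares are identified by their lower-left corner, filtered weights are recomputed in every iteration from the current (possibly zeroed) weights, and ties are broken by coordinate order. The function name, the layout, and the coupling strengths are hypothetical.

```python
def select_four_qubit_buses(loc, strength, max_buses):
    """Greedily pick squares (lower-left corners) that receive a 4-qubit bus."""
    at = {xy: q for q, xy in loc.items()}  # lattice node -> qubit
    weight = {}  # cross-coupling weight of each square with >= 3 qubits
    for x, y in at:
        for i, j in ((x, y), (x - 1, y), (x, y - 1), (x - 1, y - 1)):
            corners = [(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)]
            if (i, j) in weight or sum(c in at for c in corners) < 3:
                continue
            diagonals = (((i, j), (i + 1, j + 1)), ((i + 1, j), (i, j + 1)))
            # A diagonal with a missing qubit contributes nothing (3-qubit square).
            weight[(i, j)] = sum(strength[at[a]][at[b]]
                                 for a, b in diagonals if a in at and b in at)

    def neighbors(square):
        i, j = square
        return [(i + 1, j), (i, j + 1), (i - 1, j), (i, j - 1)]

    available, chosen = set(weight), []
    while available and len(chosen) < max_buses:
        def filtered(square):
            # Own weight minus the weights of the squares it would block.
            return weight[square] - sum(weight.get(t, 0) for t in neighbors(square))

        best = max(sorted(available), key=filtered)
        chosen.append(best)
        for square in [best] + neighbors(best):
            if square in weight:
                weight[square] = 0  # blocked squares no longer affect the filter
            available.discard(square)
    return chosen


# Hypothetical 2 x 3 layout: qubits 0..2 on the bottom row, 3..5 on the top row.
layout = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (0, 1), 4: (1, 1), 5: (2, 1)}
s = [[0] * 6 for _ in range(6)]
for a, b, w in [(0, 4, 3), (1, 3, 1), (1, 5, 2)]:
    s[a][b] = s[b][a] = w
print(select_four_qubit_buses(layout, s, max_buses=2))  # [(0, 0)]
```

Here the left square has cross-coupling weight 4 and the right square 2, so the left one wins and blocks its neighbor; only one bus is placed even though two were allowed.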
Frequency Allocation
After the two steps above, we now have a complete coupling topology design of a superconducting quantum processor. In the third step, we need to designate the pre-fabrication frequency of each qubit. IBM's 5-frequency scheme is a regular frequency designation [43]. However, the generated qubit layout and connection in our design flow can be irregular since more hardware resources are invested in locations that can benefit the performance most. Thus, we need a more flexible frequency allocation scheme to leverage this unbalanced qubit layout and connection. The objective of this step is to minimize the probability of post-fabrication frequency collision and improve the yield rate. The physical constraints are the frequency collision conditions in Figure 3.
Finding the qubit frequency allocation plan to maximize the yield rate is a hard problem. The complex collision conditions make it difficult to find an analytic expression for the yield rate, and a brute-force search over all possible frequency configurations will be very time-consuming. For example, if there are $M$ candidate frequencies for each qubit and we have $N$ qubits in total, the total number of possible frequency configurations is $M^N$. For each of these potential configurations, we need to run a yield simulation (introduced in Section 4.3.1) and then select the one with maximal yield rate. This method is not acceptable due to its high complexity.

We propose to optimize the qubit frequency allocation algorithm based on the facts that 1) the physical qubits in the geometric center of the qubit lattice are more likely to be involved in a frequency collision since they usually have more qubit connections, and 2) frequency collision only happens among nearby qubits. Our algorithm determines the qubit frequencies from the center to the periphery (pseudocode shown in Algorithm 3). Since this step is purely about hardware, the input of our algorithm is only the qubit location and connection generated from the previous two subroutines. To reduce the manufacturing difficulty and help prevent collision condition 4, we follow the convention from IBM and set an allowed frequency interval of 5.00 GHz to 5.34 GHz. All pre-fabrication frequencies are limited within this interval. First, we locate the qubit that is closest to the center of the qubit lattice and assign its frequency to be the center of the allowed frequency interval. Then we apply breadth-first traversal on the coupling graph from the first qubit in the center. For example, $q_5$ is the center qubit in the example shown in Figure 8. In the breadth-first traversal, we will first access $q_{4,9,10,6,1}$ as shown on the right. Each time we access one new qubit, we will immediately determine its frequency. A list of candidate frequencies is prepared. In this paper, the candidate frequencies are 5.00, 5.01, 5.02, ..., 5.33, 5.34 GHz to achieve an accuracy of 0.01 GHz. We can also have more candidate frequencies but it will take more time to evaluate all of them.

Algorithm 3 (fragment)
      […]
      Assign the frequency with maximal yield rate to $q_i$;
   9  until the frequencies of all qubits are determined;
To evaluate a candidate frequency on a new qubit, we temporarily assign the candidate frequency to the new qubit and then simulate the yield rate within its local region. The local region of a qubit is defined as the sub-graph of the original chip coupling graph that contains every qubit that may collide with the new qubit. For example, in Figure 8, when we are searching for the best frequency of $q_{12}$, the local region is marked in blue. Note that it is necessary to consider two hops when allocating the frequency for one qubit because the frequency collision conditions in rows 5, 6, and 7 of Figure 3 involve three connected physical qubits. Qubits outside this region, like $q_5$, cannot collide with $q_{12}$. We select the frequency with the maximal yield rate and assign it to the new qubit. The time complexity of the frequency allocation algorithm is now $O(MN)$, where $M$ is the number of candidate frequencies and $N$ is the number of qubits.
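The center-to-periphery procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 3: the function names, the `grid` helper, and in particular the collision test (`has_collision` with a placeholder `min_gap`) are hypothetical stand-ins, since the actual collision conditions are those of Figure 3. Frequencies are in GHz. With 35 candidates and, say, 16 qubits, exhaustive search would need about 5.1 × 10^24 yield simulations, whereas this traversal needs on the order of 35 × 16 = 560 local ones.

```python
import random
from collections import deque


def local_region(adj, qubit, hops=2):
    """Return all qubits within `hops` edges of `qubit` (including itself)."""
    seen = {qubit}
    frontier = [qubit]
    for _ in range(hops):
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return seen


def has_collision(freqs, adj, min_gap):
    """ASSUMPTION: placeholder for the Figure 3 conditions. Two connected
    qubits collide when their frequencies are closer than `min_gap`."""
    return any(
        abs(freqs[u] - freqs[v]) < min_gap
        for u in freqs
        for v in adj[u]
        if v in freqs and u < v
    )


def local_yield(freqs, adj, region, sigma, min_gap, trials, seed=0):
    """Monte Carlo yield restricted to already-assigned qubits in `region`."""
    rng = random.Random(seed)  # same noise samples for every candidate
    assigned = sorted(q for q in region if q in freqs)
    ok = 0
    for _ in range(trials):
        noisy = {q: freqs[q] + rng.gauss(0.0, sigma) for q in assigned}
        ok += not has_collision(noisy, adj, min_gap)
    return ok / trials


def allocate_frequencies(adj, pos, candidates, sigma=0.03, min_gap=0.017, trials=200):
    """Assign frequencies by breadth-first traversal from the center qubit."""
    cx = sum(x for x, _ in pos.values()) / len(pos)
    cy = sum(y for _, y in pos.values()) / len(pos)
    center = min(pos, key=lambda q: (pos[q][0] - cx) ** 2 + (pos[q][1] - cy) ** 2)
    freqs = {center: candidates[len(candidates) // 2]}  # middle of the interval
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in freqs:
                continue
            region = local_region(adj, v)
            # Keep the candidate with the highest simulated local yield.
            freqs[v] = max(
                candidates,
                key=lambda f: local_yield({**freqs, v: f}, adj, region, sigma, min_gap, trials),
            )
            queue.append(v)
    return freqs


def grid(rows, cols):
    """Hypothetical helper: a rows x cols lattice with nearest-neighbor edges."""
    pos = {r * cols + c: (r, c) for r in range(rows) for c in range(cols)}
    adj = {q: [] for q in pos}
    for q, (r, c) in pos.items():
        for dr, dc in ((0, 1), (1, 0)):
            if r + dr < rows and c + dc < cols:
                n = (r + dr) * cols + (c + dc)
                adj[q].append(n)
                adj[n].append(q)
    return adj, pos
```

Each candidate is scored with the same random seed, so the comparison between candidates is not dominated by sampling noise. A call such as `allocate_frequencies(*grid(3, 3), [round(5.00 + 0.01 * i, 2) for i in range(35)])` returns a dictionary mapping each qubit index to its chosen frequency.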
Yield Simulation.
We developed a yield simulator based on IBM's yield model [6, 43]. The fabrication process can be modeled by adding Gaussian noise $\mathcal{N}(0, \sigma)$ to the pre-fabrication frequency of a qubit to generate its post-fabrication frequency, where $\sigma$ is the fabrication precision parameter. For a given superconducting quantum processor design, we estimate its yield rate through Monte Carlo simulation. Each trial simulates whether one fabrication is successful. We first generate the post-fabrication frequencies by adding random noise sampled from the Gaussian distribution mentioned above. Then we check whether any frequency collision condition listed in Figure 3 occurs among the post-fabrication frequencies. If so, this fabrication fails. Otherwise, it is successful. All possible cases are taken into account. For example, we examine the two frequencies of every connected physical qubit pair for conditions 1, 2, 3, and 4. If they satisfy any one of the inequalities in these conditions, a frequency collision is considered to occur in this trial. This simulation process is repeated many times. The yield rate is estimated as the ratio of the number of successful trials to the total number of trials.
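The Monte Carlo estimate described above can be sketched as follows. This is an illustrative sketch, not the authors' simulator: `estimate_yield` and `near_degenerate` are hypothetical names, and the collision predicate is passed in as a parameter because the real checks are the conditions of Figure 3, which are not reproduced here. The `near_degenerate` predicate and its `min_gap` value are assumed placeholders that only test connected pairs. Frequencies and the noise width are in GHz.

```python
import random


def estimate_yield(design_freqs, edges, sigma, has_collision, trials=10_000, seed=0):
    """Estimate the yield rate as successful trials / total trials.

    design_freqs: dict qubit -> pre-fabrication frequency
    edges: iterable of connected qubit pairs
    sigma: standard deviation of the Gaussian fabrication noise
    has_collision: predicate(post_freqs, edges) -> bool
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # One simulated fabrication: perturb every qubit independently.
        post = {q: f + rng.gauss(0.0, sigma) for q, f in design_freqs.items()}
        if not has_collision(post, edges):
            successes += 1
    return successes / trials


def near_degenerate(post, edges, min_gap=0.017):
    """ASSUMPTION: placeholder collision test, not the Figure 3 conditions.
    A connected pair collides when its frequencies differ by < min_gap."""
    return any(abs(post[a] - post[b]) < min_gap for a, b in edges)
```

Passing the predicate as an argument keeps the sampling loop independent of the collision model, so the three-qubit conditions could be added without touching `estimate_yield`.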
Evaluation
To demonstrate that the proposed application-specific architecture design flow can deliver hardware designs with better Pareto-optimal results in terms of performance and yield rate, we conduct experiments over various benchmarks to show not only the overall improvement but also the breakdown of benefits from each of our hardware design subroutines.

Figure 9. Baseline Qubit Frequency, Layout, and Connection Designs
Experiment Setup
Benchmarks. Twelve quantum programs are collected from IBM's QISKit [2] and RevLib [52], or compiled from ScaffCC [24]. These benchmarks cover several important domains (e.g., simulation, arithmetic) and have various sizes (from 7 to 16 qubits) for a versatility test of the proposed design flow.
Metrics. To evaluate the efficiency of an architecture, we need both the yield rate and the performance. An architecture with a higher yield rate can be successfully fabricated with fewer attempts, indicating a lower hardware cost. In our experiments, the yield rate is simulated with IBM's yield model [6, 43] as introduced in Section 4.3.1. For the performance evaluation, we adopt the total post-mapping gate count metric widely used in previous studies [28, 48, 56]. More gates lead to longer execution time and a larger probability of error on QC devices. If a hardware architecture can execute the program with fewer gates, its performance is considered to be better.
Yield Simulation Configuration. The number of trials in the Monte Carlo simulation for each architecture is 10,000 to 100,000, which is 10× to 100× the number used in IBM's experiments [6, 7, 22], to ensure the simulation accuracy. The fabrication precision parameter $\sigma$ is set to 30 MHz, a realistic extrapolation of progress in hardware by IBM [7, 43]. IBM has improved $\sigma$ from 200 MHz [42] to 130 MHz [43] in the last few years, and 30 MHz is a reasonable projection to achieve a useful yield as predicted by IBM [7].
Experiment Methodology
To illustrate the benefit of our design flow, five experiment configurations are designed to show the overall improvement and the performance/yield trade-off gain at each of the three subroutines in Section 4. Among them, ibm is a set of general-purpose architectures from IBM that are not tailored for any application. The remaining four configurations are application-specific architectures generated by all or part of the proposed design flow.
ibm. We use IBM's design scheme as the baseline configuration. It has two layout options, a 2×8 lattice with 16 qubits and a 4×5 lattice with 20 qubits. The qubit connection design can either use 2-qubit buses only or use as many 4-qubit buses as possible. In total, there are four architectures combining the layout and connection options (shown in Figure 9). The frequency allocation scheme is a 5-frequency scheme [7, 43]. The five frequencies form an arithmetic progression from 5 GHz to 5.27 GHz, and their arrangement is also shown in Figure 9.

Session 11B: Quantum computing - Who says you can't watch two talks at once? ASPLOS '20, March 16-20, 2020, Lausanne, Switzerland
eff-full. We apply all three subroutines and generate a series of efficient superconducting quantum processor architectures by varying the number of 4-qubit buses. The number of designs we can obtain for a quantum program depends on its number of qubits, as more qubits provide more squares in the generated layout where 4-qubit buses can be applied. In this paper, we obtain the eff-full data series by iterating over all possible numbers of 4-qubit buses in the second subroutine (bus selection). This experiment shows the overall architecture design improvement when compared with the baseline ibm.
eff-5-freq. We only apply the first two subroutines to generate the qubit layout and connection design, while the frequency allocation is done with IBM's 5-frequency scheme. The yield benefit from the proposed frequency allocation algorithm can be demonstrated by comparing with the results from eff-full.
eff-rd-bus. We keep the first and the third subroutines but randomly select some squares to employ 4-qubit buses, with the prohibited condition constraint satisfied. This demonstrates the effect of our filtered-weight-based 4-qubit bus selection algorithm by comparing with the results from eff-full.
eff-layout-only. We apply our profiling method and perform a layout design. The connection design has two options. One uses only 2-qubit buses. The other uses as many 4-qubit buses as possible. The frequency design follows the baseline ibm. The benefit of our layout optimization can be shown by comparing with the results from ibm.
For each benchmark, we run all five configurations to generate different superconducting quantum processor architectures with different yield rates. Then we apply a state-of-the-art qubit mapping algorithm [28] on these architectures to obtain the total number of gates when running the benchmark on the generated or baseline architectures.

Figure 10 shows the yield and performance results for all benchmarks and the five experiment configurations. There are 12 subfigures, and each subfigure contains the results of the five experiment configurations for one benchmark. The X-axis represents the normalized reciprocal of the post-mapping gate count, so data points further to the right have better performance. The Y-axis represents the yield rate, so data points closer to the top have higher yield rates. The legend at the bottom of Figure 10 shows the markers for the five configurations. The data points for the four designs in the baseline are labeled (1), (2), (3), and (4), according to Figure 9.
Overall Improvement
Optimality. The optimal solution in this paper means the Pareto-optimal solution in terms of post-mapping gate count and yield rate. A series of architectures with better Pareto-optimal results can be generated by our design flow, as the data of eff-full lies to the upper right of ibm. The most simplified designs generated by our design flow (the top-left-most blue triangle data point in eff-full, with zero 4-qubit buses) outperform the 16-qubit baseline design without 4-qubit buses (data point (1) in ibm) in both performance (about 7.7%) and yield rate (about 4×). Compared with the 16-qubit baseline with four 4-qubit buses (data point (2) in ibm), our designs with zero 4-qubit buses achieve over 100× better yield rate with less than 1% performance loss. On the other side, compared with IBM's 20-qubit chip design with six 4-qubit buses (the baseline design with the most hardware resources, data point (4) in ibm), the designs with the maximum number of 4-qubit buses generated by our design flow (the bottom-right-most data points in eff-full) have over 1000× yield rate improvement on average with only about 3.5% performance loss.
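To make the notion of Pareto-optimality used here concrete, the following sketch filters a set of (performance, yield) points down to the non-dominated ones. It is only an illustration: `dominates` and `pareto_front` are hypothetical names, both coordinates are assumed to be "higher is better" (performance as the normalized reciprocal of gate count), and the sample points are made-up values, not results from the paper.

```python
def dominates(a, b):
    """True if design `a` is at least as good as `b` in both performance
    and yield, and strictly better in at least one of them."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points):
    """Keep the (performance, yield) points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (performance, yield) pairs for illustration only.
designs = [(1.00, 1e-4), (0.90, 1e-2), (0.80, 1e-3), (0.95, 1e-3)]
front = pareto_front(designs)
```

In this made-up set, the point (0.80, 1e-3) is dropped because (0.90, 1e-2) is better in both coordinates, while the other three points each win on at least one axis and remain on the front.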
Controllability. The proposed design flow can easily control the trade-off between yield and performance by only changing the number of 4-qubit buses, without traversing, or sampling a large number of designs in, the entire search space. Depending on the number of qubits in different target programs, we can trade in around 10× to 50× yield rate for 10% to 33% performance improvement.
Special Case.
The results of ising_model are significantly different because the logical qubit coupling in this benchmark forms a chain structure. The mapping algorithm can always find the perfect initial mapping without inserting additional operations. As a result, the post-mapping gate count is the same for all tested hardware architectures, and all data points for this program lie on one vertical line. Only one architecture is generated by our design flow because there is no need to add any 4-qubit bus. All the two-qubit gates can be executed through the edges of the 2D lattice. Because of the chain coupling structure, no two-qubit gate is applied to two qubits on a diagonal. In this case, 4-qubit buses can only lower the yield rate without improving the performance.
Effects from Individual Subroutines
The overall improvement has already been discussed, but one interesting question is how much improvement the layout and connection optimization contribute and how much comes directly from the optimized frequency allocation. The five configurations decouple the proposed design flow and provide a breakdown of the effect of individual subroutines.
Effect of Layout Design. The difference between ibm and eff-layout-only illustrates the effect of layout design since the other two subroutines are the same. An architecture with more hardware resources is expected to provide higher performance by allowing more flexibility in qubit mapping. But our optimized layout design can use comparable or fewer hardware resources while achieving even better performance. For example, we compare the 2-qubit-bus-only data point of eff-layout-only (the upper left one) with the 16-qubit baseline with four 4-qubit buses (labeled (2) in each subfigure). eff-layout-only provides better or comparable performance most of the time, with about 35× yield improvement on average. The improvement at this step depends on the program size, as programs with fewer qubits use fewer qubits and connections in an optimized architecture. This result shows that our layout design can generate qubit layouts with high performance while using far fewer hardware resources for different programs.

Figure 10. Yield vs. Normalized Reciprocal of Post-mapping Gate Count (one subfigure per benchmark, e.g., cm152a_212, 12-qubit)
4-qubit Bus Selection Quality. By comparing the results from eff-full and eff-rd-bus, we can see that the architectures generated by our bus selection algorithm are better than those from random selection in trading in yield for performance most of the time. The data points of eff-rd-bus reveal the distribution of the yield and performance sampled from random bus designs. Note that the performance of eff-rd-bus is usually confined by the two data points in eff-layout-only because adding connections can improve the performance most of the time. For most benchmarks except qft, the results from eff-full are close to the upper bound formed by the random samples, which shows that our weight-based bus selection can generate a series of near Pareto-optimal hardware architectures with various numbers of qubit connections.
The result of qft is much worse than that of other programs due to the unique uniform two-qubit gate pattern in this program. The number of two-qubit gates between any two logical qubits is always two in qft, which makes all logical qubit pairs identical in terms of coupling strength during profiling. Then, in the bus selection subroutine, all the squares share the same weight, and the weight-based selection is the same as random selection.
For the two small benchmarks, sym6 and UCCSD_ansatz, the number of available squares in the generated qubit layout is small and there are very few options when applying 4-qubit buses. Therefore, most of the architectures generated by the random 4-qubit bus selection are the same as those from the proposed design flow, which makes the results from eff-full and eff-rd-bus very close.
Frequency Allocation Optimization. By comparing eff-full and eff-5-freq, we can see that the proposed frequency allocation algorithm provides about 10× yield rate improvement on average. This improvement is slightly smaller when the yield from the baseline 5-frequency scheme is already high, e.g., in the results from sym6 and UCCSD_ansatz. The fabrication variance makes the ideal yield of 100% unreachable, and it is hard to optimize the yield when it is already high.
Discussion
This paper studies application-specific efficient superconducting quantum processor design. In particular, we formalize the architecture design for superconducting quantum processors with three key steps, each of which comes with an optimization subroutine. This is the first attempt, to the best of our knowledge, to identify the optimization opportunity at the architecture level to push forward the balance between QC performance and hardware yield rate. Effort in this direction can be in significant demand for near-term QC with limited computation resources and immature fabrication technology.
Although we show that improved Pareto-optimal designs can be generated with a static program analysis and three optimized design algorithms, several future research directions can be explored as with any initial research.
Improving Profiling Method. This paper focused on the logical qubit coupling topology in a quantum program, but other patterns may also be leveraged. We omitted the temporal information of the two-qubit gates and all information about other program components. But the locations of two-qubit gates in a quantum program may also be leveraged for a finer-grained evaluation of the coupling strength for different logical qubit pairs at different times during the execution. The single-qubit patterns can also help with the basic gate set design.
Exploring More Design Space. In the proposed design flow, the number of physical qubits is the same as that of logical qubits for a higher yield rate. However, we can still add auxiliary physical qubits since they can also be used during qubit routing, trading in more yield rate for higher performance. How to add auxiliary qubits at appropriate locations and how to connect them are interesting problems to explore in the future. To ensure modularity and scalability, the qubits are forced to be embedded in a 2D lattice and we only consider two types of buses lying in the lattice. However, the qubit placement and connection could be more flexible if we trade in part of the scalability. For example, one bus could also connect more than four qubits [17]. The design space in this direction is not yet explored.
Optimizing Frequency Allocation. This paper tried to optimize the qubit frequency selection from the center to the periphery and only searched for the optimal frequency for one qubit at a time, resulting in a sub-optimal frequency allocation. A global optimization, e.g., with formal methods, can be explored to further optimize the frequency allocation result. One alternative approach to resolve the frequency collision issue is to use flux-tunable transmon qubits [26], whose frequencies can be dynamically tuned with additional control signals. The design trade-off between different types of qubits is not yet explored, and additional signals bring more noise and increase the control complexity. The proposed design flow is still valuable even with frequency-tunable qubits because the simplified architectures with fewer on-chip connections can not only reduce the fabrication complexity but also benefit the overall performance by lowering the crosstalk error.
Related Work
This paper ranges across multiple topics, i.e., program profiling, superconducting processor design, application-specific design, and qubit mapping. We briefly introduce related work for each of them.
Application-specific Design. The closest related work is SPARQS, a superconducting planar architecture proposed by Wilhelm et al. [12, 29] targeting a specific Fermi-Hubbard model simulation program. However, they only provide an implementation-independent design at the theoretical physics level. This paper formalizes, for the first time, a systematic end-to-end design flow with automatic program profiling and realistic physical constraints included. With no limitation on the target program, we can generate a series of Pareto-optimal hardware architecture designs in a controllable way.
Quantum Program Profiling and Analysis. Program profiling and analysis are very important for software and compiler optimization. Previous works on quantum program analysis [21, 24, 38, 53, 54, 55] have studied entanglement, termination, non-cloning checking, etc. The profiling method in this paper is proposed to guide the hardware design, fulfilling a different goal.
Superconducting Quantum Processors. As one of the most promising candidate technologies to implement QC, superconducting quantum techniques have been employed in two mainstream QC computation models. The circuit-model-based processors [23, 25, 40] support the quantum circuit model [36], and the quantum annealers [11] can implement adiabatic QC [15]. The programming models and hardware architectures of these two QC approaches are different. The design flow in this paper is proposed for circuit-model-based quantum processors, while efficient quantum annealer design can be a future research direction.
Qubit Mapping. Formal and heuristic methods have been attempted to solve the qubit mapping problem [28, 45, 48, 51, 56] and minimize the total gate count. Recently, several studies [3, 34, 49] have applied the actual gate error rates for fine-grained optimization. All these optimizations are pure software-level modifications. This paper attempts to improve the performance by reducing the mapping overhead at the hardware level. We adopt the gate count metric to estimate the mapping overhead since our experiments are performed on artificial hardware architectures.
Conclusion
The demand for larger computation capability in a superconducting quantum processor naturally calls for more hardware resources, which also increase the design complexity and lower the yield rate. This paper explored application-specific architecture design for superconducting quantum processors to achieve both high performance and high yield rate. Gate patterns in a quantum program can be extracted by the proposed profiling method and then utilized in the follow-up hardware architecture design. Three subroutines are designed to generate the qubit layout, connection, and frequency, respectively, with physical constraints taken into consideration. Experimental results show that the proposed design flow can automatically deliver architectures with both high yield rate and high performance for different applications, except those with extremely special gate patterns.
