Hierarchical System Mapping for Large-Scale Fault-Tolerant Quantum
  Computing by Hwang, Yongsoo & Choi, Byung-Soo
arXiv:1809.07998v1 [quant-ph] 21 Sep 2018
Hierarchical System Mapping for
Large-Scale Fault-Tolerant Quantum Computing
Yongsoo Hwang and Byung-Soo Choi
Abstract—For a large-scale quantum computer, it is important to know precisely and quickly how much
quantum computational resource is necessary. Unfortunately, previous methods cannot practically support
the mapping, and therefore the analysis, of large-scale quantum computing because they usually use
non-structured code. To overcome this problem, we propose a fast mapping that uses hierarchical assembly
code, which is much more compact than non-structured code. During the mapping process, the necessary
modules and their interconnections are mapped dynamically by using a communication bus at the cost of
additional qubits. In our study, the proposed method is very fast: about 1 hour rather than 1500 days
for Shor's algorithm factoring a 512-bit integer. Meanwhile, since the hierarchical assembly code has a
high degree of locality, it has shorter SWAP chains, and hence the quantum computation time does not
increase as much as expected.
Index Terms—system mapping, quantum assembly code, large-scale quantum computing, quantum computer architecture
1 INTRODUCTION
THE era of quantum computing has already started. Several gigantic IT corporations have been
devoted to developing quantum computing devices, and some of them now provide quantum computing
cloud services to the public [1], [2]. But the scale of the devices is still small (dozens of
qubits), and therefore applications for them are also very limited. While we look forward to
seeing quantum supremacy with such a small-scale quantum device soon, most of the public is
interested in a large-scale universal quantum computer that can run large quantum algorithms
to solve real-world problems.
Suppose that you have quantum computing hardware and a quantum algorithm. What do you have to
do to run the algorithm on the hardware? A quantum algorithm is a logical description of how to
solve a given problem. It usually assumes ideal quantum computing hardware with noiseless gates
and long-range qubit interactions. In reality, however, a quantum computer as a physical entity
has certain physical and logical limitations. For example, qubits might be arranged and connected
with each other in a peculiar way (see Refs. [1], [2]). Therefore, to run a quantum algorithm on
a quantum computer, we have to prepare an architecture-specific description of the algorithm
beforehand. This is why quantum system mapping is necessary.
The principle of quantum system mapping is straightforward: initialize a quantum computer
architecture and recast a quantum algorithm for that architecture. The detailed procedure depends
entirely on the quantum algorithm, the quantum computer architecture and the mapping algorithm.
An architecture-specific description of a quantum algorithm, the main product of the system
mapping, is called a system code in this work. A quantum algorithm can be run on the quantum
computer by executing the system code. Another output of the system mapping is the expected
performance of the quantum computation. Since the system mapping processes all quantum
instructions of a quantum algorithm under the pre-assumed quantum computer architecture, it is
possible to analyze performance metrics of the computation, e.g., circuit depth and execution time.
• Y. Hwang and B.-S. Choi are with Electronics and Telecommunications
Research Institute, Daejeon, Republic of Korea, 34129.
E-mail: bschoi3@etri.re.kr
The quantum algorithm for the system mapping is provided in a quantum assembly code format. A
quantum assembly code (QASM) is an intermediate representation of a quantum algorithm between an
abstract description and a physical machine instruction description [3], [4], [5]. It is a list
of quantum instructions, each denoting a combination of a quantum gate and its target qubit(s),
and is generated by compiling a programmed quantum algorithm. Due to the lack of a standard for
QASM, the specific representations vary slightly [2], [3], [4], [6].
There are two kinds of QASM in terms of structure: the non-modular type and the modular type. A
non-modular QASM naively enumerates all quantum instructions. A modular QASM, on the other hand,
is hierarchically structured with multiple functions called modules. A module corresponds to a
composite quantum operation composed of multiple quantum instructions, including calls to other
sub-modules. In general, a modular QASM is composed of one main module and multiple
sub-modules [4].
To date, most quantum system mappings have been performed on non-modular QASM. This is because
the mapping is straightforward due to the trivial structure of the QASM. Since the number of
qubits developed so far is very small, as mentioned above, small quantum algorithms have been
targeted for the system mapping, and for such cases the system mapping with non-modular QASM has
worked well.
However, when we treat a large quantum algorithm, serious practical problems arise with the
non-modular QASM. Recall that a non-modular QASM is a simple list of quantum instructions. As
the size of a quantum algorithm increases, so does the size of its non-modular QASM; it follows
the complexity of the algorithm. We obtained a 39 TB non-modular QASM for Shor's algorithm
factoring a 512-bit integer (see Table 1). Due to the lack of classical storage and memory, we
could not even attempt to generate a non-modular QASM for a larger quantum algorithm. Obviously,
the problems of most interest for quantum computing are beyond the capacity of classical
supercomputing. Therefore, we will certainly have to deal with much larger QASMs, and the
practical problems caused by such enormous QASMs will be one of the critical issues in the
classical control part of quantum computing.
TABLE 1: The QASM size comparison for Shor's factoring algorithm. The QASMs are generated by
using the compiler ScaffCC [4], [7].
Input Size    Non-Modular    Modular
128           1.7 TB         23.5 MB
256           14.2 TB        88.1 MB
512           39.0 TB        338.6 MB
In this work, we present a quantum system mapping for large-scale fault-tolerant quantum
computing. To this end, we turn our attention to modular QASM instead. As mentioned above, a
modular QASM is hierarchically structured as a composition of modules, and we found that such a
hierarchical structure can suppress the growth in the size of the QASM. Suppose that there is an
n-qubit composite quantum operation U composed of K quantum gates and that it is called N times.
To represent these N iterations, a non-modular QASM requires K × N quantum instructions, but a
modular QASM requires only K + N instructions by defining U as a module. Empirically, the values
of K and N are not small in a non-trivial quantum algorithm (footnote 1).
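The size argument above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are ours, and the values of K and N are assumed round numbers at the orders of magnitude quoted in footnote 1, not measured data.

```python
def nonmodular_size(k: int, n: int) -> int:
    """Instruction count when the K-gate body of U is written out for each of N calls."""
    return k * n


def modular_size(k: int, n: int) -> int:
    """Instruction count when U is defined once (K gates) and then called N times."""
    return k + n


# Assumed illustrative values: K ~ 10^2 gates per module, N ~ 10^4 calls.
K, N = 100, 10_000
flat = nonmodular_size(K, N)
hier = modular_size(K, N)
print(f"non-modular: {flat} instructions")
print(f"modular:     {hier} instructions")
print(f"reduction:   {flat / hier:.1f}x")
```

For a single module the saving approaches a factor of K when N is much larger than K; nesting modules inside modules compounds the effect.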
In this regard, a modular QASM is much smaller than a non-modular QASM (see Table 1). Therefore,
it is possible to generate and manage a QASM for a large quantum algorithm. Currently, only a
few quantum compilers support modular (or hierarchically structured) QASM. This work is
currently compatible with the open-source quantum compiler ScaffCC [4], [7]. In this work, we
describe how to exploit the modular structure of the modular QASM on a pre-assumed quantum
computer, and discuss the strengths and weaknesses of the proposed mapping.
2 HIERARCHICAL SYSTEM MAPPING
2.1 Hierarchically Structured Qubit Layouts
System mapping begins with the initialization of a quantum computer architecture, i.e., a qubit
array. The qubit array is not restricted to a particular form, but in this work we assume that
it is hierarchically structured. A quantum computer is then made of modules (computing regions)
and a communication bus connecting them. Modules and the bus are composed of logical qubits
encoded by quantum error-correcting codes. Within a module, qubits are manipulated according to
the QASM, and they are transmitted between modules through the communication bus. The bus makes
quantum computing more reliable because a logical data qubit interacts only with ancilla qubits
on the bus, and therefore a quantum error does not propagate to other data qubits. Furthermore,
the communication bus permits parallel qubit movements and therefore, as will be discussed
later, the hierarchically structured quantum computer can make quantum computing more efficient.
Fig. 1 shows an example of a quantum computer architecture where modules are arranged in a 1D
array and the qubits in each module are arranged in a 2D array.
1. In general, K is O(10^2) and N is O(10^4) ∼ O(10^6) in Shor's algorithm.
In a modular QASM, qubits are classified into two types: local qubits and parameter qubits.
Local qubits are initialized, manipulated and measured within a module, whereas parameter qubits
are passed between modules over the communication bus. Therefore, a qubit array requires
physical space for both types of qubits. In Fig. 1, the dark grey cells indicate parameter
qubits, and the white cells represent local qubits. In addition, dummy qubits (or empty space),
shown as light grey cells denoted by “NULL”, are sometimes required to give a module a
2-dimensional rectangular shape.
A qubit that resides inside a module supports universal quantum computing. Such a logical qubit
is composed of data qubits for holding data and ancilla qubits for quantum error correction and
logical operations. On the other hand, the qubits composing a communication bus do not need to
support universal quantum computing, in particular logical non-transversal gates. Therefore, the
composition of logical qubits for computing and for communication may differ slightly depending
on the quantum error-correcting code.
2.2 Mapping Algorithm
The proposed system mapping proceeds module by module, starting from the main module. To map a
module, we first allocate physical space for the module next to the previously allocated module,
and arrange the qubits of the QASM in that space. The space consists of as many cells as there
are qubits in the module. Here (footnote 2), we simply arrange the qubits in a
first-come-first-served manner. We then read each quantum instruction from the QASM and conduct
the appropriate mapping process. After a module has been mapped, the space allocated for it is
de-allocated.
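As a minimal sketch of the allocation step, the following Python fragment places the qubits of a module on a rectangular 2D array in first-come-first-served order and pads the remainder with NULL cells. The function names and the near-square choice of width are assumptions made for illustration; the text above does not prescribe a particular aspect ratio.

```python
import math

NULL = None  # dummy cell used only to complete the rectangle


def place_fcfs(qubits, width=None):
    """Place qubits row by row, in the order given, on a rectangular grid."""
    n = len(qubits)
    if n == 0:
        return []
    if width is None:
        width = math.ceil(math.sqrt(n))  # assumed: near-square module
    height = math.ceil(n / width)
    cells = list(qubits) + [NULL] * (width * height - n)
    return [cells[r * width:(r + 1) * width] for r in range(height)]


def position_of(grid, qubit):
    """Return the (row, column) of a qubit in the grid."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == qubit:
                return (r, c)
    raise KeyError(qubit)


# Two parameter qubits (p0, p1) followed by three local qubits (q0..q2).
grid = place_fcfs(["p0", "p1", "q0", "q1", "q2"])
for row in grid:
    print(row)
```

De-allocation then amounts to discarding the grid once the module has been mapped.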
The proposed mapping requires two kinds of lookup tables: one global table and several local
tables. These tables are used to record the performance of the modules and of the qubits of each
module. At the beginning of the mapping of a module M, a local table is initialized as
(time[q_i] = 0, cycle[q_i] = 0) for i = 1, ..., n, where n is the number of qubits of the module.
Here time[q_i] (cycle[q_i]) is a metric for the execution time (circuit depth) of the module. As
the mapping proceeds, the metric is updated as time[q_i] = time[q_i] + wait_t + u_t
(cycle[q_i] = cycle[q_i] + wait_c + 1), where u_t is the execution time of a quantum gate u, and
wait_t (wait_c) is the waiting time (number of waiting cycles) required for a multi-qubit quantum
operation. When the mapping finishes, the performance of the module M is recorded in the global
table by picking the maximum values from the local table: time[M] = max_i{time[q_i]} and
cycle[M] = max_i{cycle[q_i]}.
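The bookkeeping described above can be sketched as follows. The class and variable names are hypothetical, and the gate times are arbitrary numbers chosen for the example; only the update and maximum rules are taken from the text.

```python
class LocalTable:
    """Per-module table holding time[q] and cycle[q] for every qubit q."""

    def __init__(self, qubits):
        self.time = {q: 0.0 for q in qubits}
        self.cycle = {q: 0 for q in qubits}

    def apply(self, qubit, gate_time, wait_time=0.0, wait_cycles=0):
        # time[q] = time[q] + wait_t + u_t ; cycle[q] = cycle[q] + wait_c + 1
        self.time[qubit] += wait_time + gate_time
        self.cycle[qubit] += wait_cycles + 1

    def summary(self):
        # time[M] = max_i time[q_i] ; cycle[M] = max_i cycle[q_i]
        return max(self.time.values()), max(self.cycle.values())


global_table = {}  # module name -> (time, cycle)

local = LocalTable(["q1", "q2", "q3"])
local.apply("q1", gate_time=1.0)
local.apply("q1", gate_time=1.0)
local.apply("q2", gate_time=2.0, wait_time=0.5, wait_cycles=1)
global_table["M"] = local.summary()
print(global_table)
```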
2. In Section 3, we will discuss an optimized qubit placement of a module.
Fig. 1: [Figure: Module M1 (Main), ..., Module Mi and Module Mi+1 placed along the communication
bus (BUS), with a NULL cell and arrows numbered ① to ⑦.] The mapping process for calling a module
is composed of seven steps: 1. (forward) move qubits to the bus, 2. (forward) move to the target
module, 3. (forward) move to the parameter qubit cells (dark grey cells), 4. module operations,
5. (backward) move qubits to the bus, 6. (backward) move to the original module, and
7. (backward) move to the original qubit positions.
In a modular QASM, there are three kinds of quantum instructions: 1-qubit gates, 2-qubit gates
and modules. The mapping of a 1-qubit quantum gate is straightforward. It can be done
independently of other qubits by scheduling the gate on the target qubit at the proper time
time[q_i]. The execution of the gate finishes at time time[q_i] + u_t. On the other hand, the
mapping of a 2-qubit gate requires that the two qubits be ready both spatially and temporally.
To the best of our knowledge, feasible implementations of a 2-qubit gate are local. If the two
qubits are located far apart, they must be brought adjacent to each other by moving qubits. In
addition, the gate can act only when both qubits are idle, i.e., from time
max{time[q_i], time[q_j]}, where time[q_i] (time[q_j]) is the time at which the previous
operation on the qubit q_i (q_j) finishes.
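The timing rule for a 2-qubit gate can be illustrated with the sketch below. It rests on several simplifying assumptions that are ours rather than the paper's: qubits sit on a 2D array, one qubit is moved toward the other along a shortest path by a serial chain of SWAPs, and the positions and timings of the displaced qubits are not tracked. The function name and arguments are hypothetical.

```python
def schedule_two_qubit_gate(time, cycle, pos, qi, qj, gate_time, swap_time):
    """Update time/cycle of qi and qj for one 2-qubit gate; return the SWAP count."""
    # Manhattan distance on the 2D array; adjacent qubits have distance 1.
    dist = abs(pos[qi][0] - pos[qj][0]) + abs(pos[qi][1] - pos[qj][1])
    n_swaps = max(dist - 1, 0)  # SWAPs needed to make the two qubits adjacent

    # The operation can start only when both qubits are idle.
    start_time = max(time[qi], time[qj])
    start_cycle = max(cycle[qi], cycle[qj])

    # Assumed: SWAPs run one after another, followed by the gate itself.
    time[qi] = time[qj] = start_time + n_swaps * swap_time + gate_time
    cycle[qi] = cycle[qj] = start_cycle + n_swaps + 1
    return n_swaps


time = {"a": 3.0, "b": 5.0}
cycle = {"a": 2, "b": 4}
pos = {"a": (0, 0), "b": (0, 3)}
swaps = schedule_two_qubit_gate(time, cycle, pos, "a", "b",
                                gate_time=1.0, swap_time=3.0)
print(swaps, time, cycle)
```

Here the waiting time of qubit "a" is the gap between its own time and the common start time, which corresponds to the wait_t term in the local table update.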
The third type of quantum instruction, a module, looks like a multi-qubit quantum function.
Therefore, on the surface, the mapping process required for a module is similar to that for a
2-qubit gate. The execution of a module begins when all of its qubits are ready both temporally
and spatially. The only difference from the case of a 2-qubit gate is, as mentioned above, that
physical space is allocated for a module. Therefore, to map a module, we have to consider qubit
movements between the present (calling) module and the target (called) module.
Suppose that a module M_i is being mapped now, and we have to process an instruction
“M_k(q_a, q_b, q_c)”, which calls a module M_k with qubits q_a, q_b and q_c. For that, we pause
the mapping of the module M_i and turn our attention to the mapping of the module M_k. We
allocate physical space for the module M_k next to M_i, and pass the qubits to the parameter
qubit area of M_k through the communication bus. We call this movement a forward qubit passing.
After all qubits are placed at their designated locations, the quantum instructions of M_k are
performed. If another module is called before M_k finishes, the mapping of the present module is
paused and the above-mentioned processes for the newly called module are performed recursively.
After all instructions of M_k are executed, the passed qubits have to be returned to their
original places in M_i. This is called a backward qubit passing. After the mapping of M_k, we
de-allocate the space for M_k and continue the mapping of M_i. Note that a stack (footnote 3)
can be very useful for tracing the mappings of modules.
3. A typical first-in-last-out data structure in computer science.
Fig. 2: [Bar chart “Hierarchical Mapping vs Non-Modular Mapping”; vertical axis: Mapping Time
(secs), logarithmic scale. Modular: Shor-128 1.2×10^2, Shor-256 5.9×10^2, Shor-512 3.3×10^3.
Non-Modular: Shor-128 1.8×10^6, Shor-256 1.5×10^7, Shor-512 1.3×10^8.] The comparison of the
required time between the hierarchical mapping (red) and the non-modular mapping (blue). Each
numerical value denotes the mean time over several repetitions.
We have mentioned that a module acts as a composite quantum operation and is called many times
during the execution of a quantum algorithm. Whenever a module is called, it always works in the
same way on the argument qubits placed in the same locations. This means that by mapping a
module once, we can completely obtain its system code and its performance. Note that we still
have to measure the distance for the qubit passings every time a module is called. Accordingly,
if during the mapping we encounter a module that has never been mapped before, we perform the
mapping for the called module. Otherwise, we simply refer to the mapping result of the module
recorded in the global lookup table.
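The map-once rule can be sketched in Python as below. This is a deliberately simplified model, not the actual implementation: instructions are treated as strictly sequential (so only a cycle count is accumulated instead of per-qubit maxima), Python's call stack plays the role of the explicit stack, and the instruction tuples and the passing_cost callback are hypothetical names introduced for the example.

```python
def map_program(modules, main, passing_cost):
    """Map every module body once; return (global_table, mapped_count)."""
    global_table = {}   # module name -> cycles of its body
    mapped_count = {}   # how many times each body was actually mapped

    def map_module(name):
        if name in global_table:
            return global_table[name]  # already mapped: reuse recorded result
        mapped_count[name] = mapped_count.get(name, 0) + 1
        cycles = 0
        for instr in modules[name]:
            if instr[0] == "call":
                callee = instr[1]
                # Forward and backward qubit passing are measured on every call.
                cycles += 2 * passing_cost(name, callee) + map_module(callee)
            else:
                cycles += 1  # 1- or 2-qubit gate, one cycle in this toy model
        global_table[name] = cycles
        return cycles

    map_module(main)
    return global_table, mapped_count


modules = {
    "main": [("call", "add"), ("gate", "H", "q0"), ("call", "add")],
    "add": [("gate", "CNOT", "q0", "q1"), ("gate", "T", "q1")],
}
table, count = map_program(modules, "main", passing_cost=lambda src, dst: 3)
print(table, count)
```

Although "add" is called twice, its body is mapped only once; the second call costs a table lookup plus the passing distance.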
By performing the mapping of a module only once, the duration of the mapping can be reduced
substantially compared with the non-modular mapping. In the above case where K quantum
instructions are repeated N times, the non-modular mapping has to map the instructions once per
iteration. As mentioned before, both the number of repetitions N and the number of quantum
instructions in such a bunch (corresponding to a module) are large for a non-trivial quantum
algorithm.
3 DISCUSSION
We discuss the strengths of the proposed mapping compared with the non-modular mapping. First,
it is possible to perform the mapping for a large quantum algorithm. As mentioned above, for the
same quantum algorithm, a modular QASM is much smaller than a non-modular QASM. Therefore, it
becomes possible to generate and manage the QASM for a larger quantum algorithm (see Table 1).
Second, the mapping takes much less time. This is due to the smaller QASM and the one-time
module mapping. Fig. 2 compares the mapping time required for both QASMs. We implemented both
mappings in Python and tested them on a computer system with a 3.5 GHz CPU and 128 GB of memory.
The times required for the non-modular mapping are estimated from the statistics obtained from
the proposed mapping. The mapping for Shor-512 with a non-modular QASM requires about 1500 days,
but the proposed mapping can be done within 1 hour.
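The quoted durations follow from the mean mapping times reported in Fig. 2 by a unit conversion. The short script below redoes that arithmetic; the dictionary layout and names are ours, while the numbers are the values shown in the figure.

```python
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 24 * 60 * 60

# (modular, non-modular) mean mapping times in seconds, as reported in Fig. 2.
mapping_time_s = {
    "Shor-128": (1.2e2, 1.8e6),
    "Shor-256": (5.9e2, 1.5e7),
    "Shor-512": (3.3e3, 1.3e8),
}

report = {}
for name, (modular, nonmodular) in mapping_time_s.items():
    report[name] = {
        "modular_minutes": modular / SECONDS_PER_MINUTE,
        "nonmodular_days": nonmodular / SECONDS_PER_DAY,
        "speedup": nonmodular / modular,
    }
    print(name, report[name])
```

For Shor-512 this gives roughly 55 minutes against roughly 1500 days, a ratio of order 10^4.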
Third, it is possible to analyze and optimize a physically implemented quantum algorithm
efficiently. Through the proposed mapping, we can find out which modules are bottlenecks of the
algorithm and/or are executed most frequently. Then, by improving such modules, we can improve
the algorithm. At the same time, the pre-analysis of the QASM required for an optimization can be
done very efficiently in the proposed mapping. As an example, we optimize the qubit placement
[8], [9], [10] of a module. By doing so, the number of SWAP operations required for a CNOT gate
between distant qubits can be reduced. In this work, we apply a linear programming approach [8]
to the Shor and BWT (Binary Welded Tree) [4], [11] algorithms. Unfortunately, the performance
gain from the optimization in the case of Shor's algorithm is negligible (circuit depth
reduction of less than 0.01%), but the degree of improvement depends on the quantum algorithm.
We observed that the circuit depth of the BWT algorithm can be reduced by 1∼6%, e.g.,
2.42 × 10^5 → 2.26 × 10^5 for BWT-10.
TABLE 2: The quantum resources of the hierarchical mapping and the non-modular mapping for
Shor's algorithm. The qubits in the hierarchical mapping are given as “computing qubits (bus
qubits)”. We determine the number of bus qubits under the assumption that the bandwidth of the
bus equals the maximum number of parameter qubits.
             Non-Modular              Hierarchical
Algorithm    Qubits   SWAPs           Qubits         SWAPs
Shor-4       46       4.08 × 10^4     115 (374)      1.23 × 10^6
Shor-8       90       4.71 × 10^5     227 (1,122)    1.02 × 10^7
Shor-16      178      5.82 × 10^6     448 (3,055)    7.66 × 10^7
We now discuss the weaknesses of the proposed mapping in terms of quantum resources. The
proposed mapping assumes a hierarchically structured quantum computer composed of multiple
computing regions (modules) and a communication bus connecting them. Without doubt, the proposed
mapping requires more qubits and qubit movements (see Table 2). On average, 2.5 times more
computing qubits plus an arbitrary number of additional bus qubits are necessary, and more SWAPs
are performed on the increased number of qubits. However, we observed that, surprisingly, the
length of the quantum computation does not increase in proportion to the increased number of
SWAPs. This is because the proportion of SWAPs to total gates is small (10%) and most of the
SWAPs (99%) are used in the qubit passings, where parallel SWAPs are allowed. We observed that
only a half or a third of the SWAP gates affect the circuit depth of the quantum computation.
Moreover, as shown in Table 2, the ratio of the number of SWAPs between the two mappings
decreases as the input size increases (30.3X→21.8X→13.16X). Along this line, we conjecture that
for large quantum algorithms the difference in the number of SWAPs between the two mappings will
vanish, or the hierarchical mapping may even require fewer SWAPs in the best case.
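The overhead factors can be recomputed from the entries of Table 2, as in the sketch below. The variable names are ours. Because the table entries are rounded to three significant figures, the recomputed SWAP ratios (about 30.1, 21.7 and 13.2) are close to, but not always identical with, the factors quoted in the text, which were presumably obtained from unrounded counts.

```python
# (non-modular qubits, non-modular SWAPs, hierarchical computing qubits,
#  hierarchical SWAPs), copied from Table 2.
table2 = {
    "Shor-4": (46, 4.08e4, 115, 1.23e6),
    "Shor-8": (90, 4.71e5, 227, 1.02e7),
    "Shor-16": (178, 5.82e6, 448, 7.66e7),
}

qubit_ratio = {}
swap_ratio = {}
for name, (q_flat, s_flat, q_hier, s_hier) in table2.items():
    qubit_ratio[name] = q_hier / q_flat
    swap_ratio[name] = s_hier / s_flat
    print(f"{name}: qubits x{qubit_ratio[name]:.2f}, SWAPs x{swap_ratio[name]:.2f}")
```

The computing-qubit overhead stays near 2.5 for all three input sizes, while the SWAP overhead falls monotonically.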
We also emphasize that the number of SWAPs for a CNOT within a module is negligible. In the
proposed mapping, qubits that interact with each other are always positioned in the same module,
and the area of each module is smaller than the space used by the non-modular mapping. For
Shor's algorithm (N = 4, 8, 16), the proposed mapping requires on average only 0.04, 0.02 and
0.02 SWAPs per CNOT gate, whereas the non-modular mapping requires 1.00, 1.42 and 2.18 SWAPs. As
the input size increases, the number of SWAPs per CNOT gate for the proposed mapping stays
small, at 0.009 (Shor-128), 0.007 (Shor-256) and 0.005 (Shor-512), whereas the non-modular
mapping requires more SWAPs per CNOT gate.
In this regard, we believe that, in the large-scale quantum computing regime, the proposed
mapping allows a quantum computation to be completed in less running time. Furthermore, the
difference in the number of qubits between the two mappings can be reduced by controlling the
area of the communication bus, in particular the bandwidth of the bus. Reducing the bandwidth
may delay SWAPs in some rare cases. Note that the qubit passing rarely exploits the full
bandwidth of the bus.
4 CONCLUSION
Our future work is to apply a more realistic quantum computer architecture [12] and to improve
the mapping algorithm. In the present work, we assume that all qubits work concurrently. In
reality, however, some qubits take a longer time for error correction and/or non-transversal
logical gates, which may complicate the gate scheduling. The communication bus also has to be
controlled more delicately. Currently, we assume that the bus works on demand and that qubit
passings are done on time. In practice, however, we believe that network traffic may disturb
such ideal communication and that the bus may become congested, in particular for highly
parallel quantum applications [13]. Lastly, the proposed mapping currently uses many qubits
(physical space), but the majority of them just wait for called modules to return during most of
the execution time. In future work, we will improve the spatial efficiency of the mapping
algorithm.
ACKNOWLEDGMENTS
This work was supported by an Electronics and Telecommunications Research Institute (ETRI) grant
funded by the Korean government [17ZH1200, Research and Development of Quantum Computing
Platform and its Cost-Effectiveness Improvement].
REFERENCES
[1] IBM. IBM Quantum Experience. https://quantumexperience.ng.bluemix.net/qx/experience
[2] Rigetti. Rigetti Forest. https://www.rigetti.com/index.php/forest
[3] A. W. Cross et al. Open Quantum Assembly Language. https://arxiv.org/abs/1707.03429
[4] A. JavadiAbhari et al. ScaffCC: Scalable compilation and analysis of quantum programs,
Parallel Computing, vol. 45, pp. 2–17.
[5] K. M. Svore et al. A layered software architecture for quantum computing design tools,
Computer, vol. 39, no. 1, pp. 74–83.
[6] A. S. Green et al. Quipper, in PLDI '13, pp. 333–342.
[7] A. JavadiAbhari. ScaffCC. https://github.com/ScaffCC/ScaffCC
[8] A. Shafaei et al. Qubit placement to minimize communication overhead in 2D quantum
architectures, in 2014 19th ASP-DAC. IEEE, Feb. 2014, pp. 495–500.
[9] M. Siraichi et al. Qubit Allocation, CGO 2018, pp. 113–125.
[10] M. Pedram and A. Shafaei. Layout Optimization for Quantum Circuits with Linear Nearest
Neighbor Architectures, IEEE Circuits and Systems Magazine, vol. 16, no. 2, pp. 62–74.
[11] A. M. Childs et al. Exponential algorithmic speedup by a quantum walk, STOC '03, pp. 59–68.
[12] C.-C. Lin et al. PAQCS: Physical Design-Aware Fault-Tolerant Quantum Circuit Synthesis,
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 7, pp. 1221–1234.
[13] A. Javadi-Abhari et al. Optimized surface code communication in superconducting quantum
computers, in the 50th Annual IEEE/ACM International Symposium on Microarchitecture,
pp. 692–705.
