Resource Optimized Quantum Architectures for Surface Code
  Implementations of Magic-State Distillation by Holmes, Adam et al.
Adam Holmes,∗ ,† Yongshan Ding,∗ ,† Ali Javadi-Abhari,‡ Diana Franklin,† Margaret Martonosi,‡ and Frederic T. Chong†
†Department of Computer Science, University of Chicago, Chicago IL 60637, USA
‡Department of Computer Science, Princeton University, Princeton NJ 08540, USA
ABSTRACT
Quantum computers capable of solving classically intractable
problems are under construction, and intermediate-scale de-
vices are approaching completion. Current efforts to design
large-scale devices require allocating immense resources to
error correction, with the majority dedicated to the produc-
tion of high-fidelity ancillary states known as magic-states.
Leading techniques focus on dedicating a large, contiguous
region of the processor as a single “magic-state distillation
factory” responsible for meeting the magic-state demands of
applications.
In this work we design and analyze a set of optimized
factory architectural layouts that divide a single factory into
spatially distributed factories located throughout the proces-
sor. We find that distributed factory architectures minimize
the space-time volume overhead imposed by distillation. Ad-
ditionally, we find that the number of distributed components
in each optimal configuration is sensitive to application char-
acteristics and underlying physical device error rates. More
specifically, we find that the rate at which T-gates are de-
manded by an application has a significant impact on the
optimal distillation architecture. We develop an optimization
procedure that discovers the optimal number of factory distil-
lation rounds and number of output magic states per factory,
as well as an overall system architecture that interacts with
the factories. This yields between a 10x and 20x resource
reduction compared to commonly accepted single factory
designs. Performance is analyzed across representative ap-
plication classes such as quantum simulation and quantum
chemistry.
Keywords
Quantum Computing; ECC; Distributed System; Modeling
1. INTRODUCTION
Quantum computers promise to provide the computational
power required to solve classically intractable problems and
to have significant impacts in materials science, quantum
chemistry, cryptography, communication, and many other fields.
Recently, much focus has been placed on constructing and
optimizing Noisy Intermediate-Scale Quantum (NISQ) computers
[1]; however, over the long term quantum error correction
will be required to ensure that large quantum programs
can execute with high success probability. Currently, the
leading error correction protocol is the surface code [2, 3],
which benefits from low overheads in terms of both
fabrication complexity and the amount of classical processing
required to perform decoding.
*Corresponding authors: {adholmes, yongshan}@uchicago.edu.
These two authors contributed equally.
A common execution model of machines protected by
surface code error correction requires a process called magic-
state distillation. In order to perform universal computation
on a surface code error corrected machine, special resources
called magic states must be prepared and interacted with
qubits on the device. This process is very space and time in-
tensive, and while much work has been performed optimizing
the resource preparation circuits and protocols to make the
distillation process run more efficiently internally [4, 5, 6, 7,
8], relatively little focus has been placed upon the design of
an architecture that generates and distributes these resources
to a full system.
This study develops a realistic estimate of resource over-
heads of, and examines the trade-offs present in, the architec-
ture of a system that prepares and distributes magic states. In
particular, instead of using a single large factory to produce
all of the magic states required for an application, the key
idea of our work is to distribute this demand across several
smaller factories that together produce the desired quantity.
We specifically characterize these types of distributed factory
systems by three parameters: the total number of magic states
that can be produced per cycle, the number of smaller fac-
tories on the machine, and the number of distillation rounds
that are executed by each factory.
The primary trade-off we observe is between the number
of qubits (area/space) and the amount of time (latency) spent
in the system: we can design architectures that use minimal
area but impose large latency overheads due to lower magic-
state output rate, or we can occupy larger amounts of area
dedicated to resource production aiming to maximally alle-
viate application latency. The two metrics, space and time,
are equally important as it is easy to build small devices with
more gates or large devices with few gates. This concept is
closely related to the idea of “Quantum Volume” [9], when
machine noise and topologies are taken into consideration.
To capture the equal importance of both of these metrics,
we use a space-time product cost model in which the two
metrics simply multiply together. This model has been used
elsewhere in similar analysis [7, 8, 10, 11].
Figure 1 illustrates the opposing trends for space and time
when we increase the magic-state production rate. Our goal
is to find the “sweet spot” on the combined space-time curve,
where the overall resource overhead is at its lowest.
arXiv:1904.11528v1 [quant-ph] 25 Apr 2019
[Figure 1 plot: latency, space, and space-time curves plotted
against a larger number of factories.]
Figure 1: Space and time tradeoffs exist for distributions
of resource generation factories within quantum computers.
These trends are shown assuming the same total factory output
capacity. By explicit overhead analysis, we can discover
optimal space-time volume design points.
In summary, this paper makes the following contributions:
1. We present precise resource estimates for implementing
different algorithms with magic-state distillation on a
surface code error corrected machine. We derive the
estimates from modeling and simulating the generation
and distribution of magic states to their target qubits in
the computation.
2. We quantify the space and time trade-offs of a number
of architectural configurations for magic-state produc-
tion, based on design parameters including the total
number of factories, total number of output states these
factories can produce, and the desired fidelity of the
output magic states.
3. We study different architectural designs of magic-state
distillation factories, and present an algorithm that finds
the configuration that minimizes the space-time volume
overhead.
4. We highlight the nontrivial interactions of factory fail-
ure rates and achievable output state fidelity, and how
they affect our design decisions. We analyze the sensi-
tivity of these optimized system configurations to fluc-
tuations in underlying input parameters.
5. We discover that dividing a single factory into multiple
smaller distributed factories can not only reduce over-
all space-time volume overhead but also build more
resilience into the system against factory failures and
output infidelity.
The rest of the paper is structured as follows. Section 2
gives basic background on quantum computation, error correction,
magic-state distillation, the Bravyi-Haah distillation protocol,
and the block-code state-distillation construction.
Section 3 describes previous work in this area.
Sections 4 and 5 discuss important space and time charac-
teristics of the distillation procedures that we consider, and
derive and highlight scaling behaviors that impact full system
overhead analysis. Section 6 describes in detail how these
characteristics interact, and shows how these interactions cre-
ate a design space with locally optimal design points. Section
7 details the system configurations we model, describes a
novel procedure for discovering the optimal design points,
and discusses the simulation techniques used to validate our
model derivations. Section 8 shows our results and explains
the impact of optimizing these designs. Sections 9
and 10 conclude and discuss ideas to be pursued as future
work.
2. BACKGROUND
2.1 Quantum Computation
The idea of quantum computation is to use quantum mechanics
to manipulate information stored in two-level physical
systems called quantum bits (qubits). In contrast to a
bit in a classical machine, each qubit can occupy two logical
states, denoted as |0〉 and |1〉, as well as a linear
combination (superposition) of them, which can be written as
|ψ〉 = α|0〉 + β|1〉, where α, β are complex coefficients
satisfying |α|^2 + |β|^2 = 1.
It is sometimes useful to visualize the state of a single
qubit as a vector on the Bloch sphere [12, 13], as we can
rewrite the state |ψ〉 in spherical coordinates as |ψ〉 =
cos(θ/2)|0〉 + exp(iφ) sin(θ/2)|1〉. Any operation (called a
quantum logic gate) performed on a single qubit can thus be
regarded as a rotation by an angle ϕ about some axis n̂, denoted
as R_n̂(ϕ). In this paper we focus on quantum gates
that are commonly used in algorithms, such as the Pauli-X
gate (X ≡ R_x(π)), Pauli-Z gate (Z ≡ R_z(π)), Hadamard
gate (H ≡ R_x(π)R_y(π/2)), S gate (S ≡ R_z(π/2)), and T gate
(T ≡ R_z(π/4)). For multi-qubit operations, we consider
the most common two-qubit gate, the controlled-NOT
(CNOT). It has been shown [14] that the above-mentioned
operations form a universal gate set, which implies that any
quantum operation can be decomposed into a sequence of the
above gates.
As quantum logic gates require extremely precise control
over the states of the qubits during execution, a slight per-
turbation of the quantum state or a minor imprecision in the
quantum operation could potentially result in performance
loss and, in many cases, failure to obtain the correct out-
comes. In order to maintain the advantage that quantum
computation offers while balancing the fragility of quantum
states, quantum error correction codes (QECC) are utilized to
procedurally encode and protect quantum states undergoing
a computation. One of the most prominent quantum error
correcting codes today is the surface code [2, 3].
2.2 Surface Code
In a typical surface code implementation, physical qubits
form a set of two-dimensional rectangular arrays (of logical
qubits), each of which performs a series of operations only
with its nearest neighbors. A logical qubit, under this con-
struction, is comprised of a tile of physical qubits, and these
tiles interact with each other differently according to different
logical operations. These interactions on the grid create the
potential for communication-imposed latency, as routing and
logical qubit motion on the lattice must be accomplished.
An important parameter of the surface code is the code
distance d. Larger code distance means a larger tile for each
logical qubit. The precise number of physical qubits required
in each tile also depends on the underlying surface code
implementation. Most common implementations assume a
logical qubit of distance d requires ∼ d^2 physical qubits [3,
15]. Code distance also determines how well we can protect
a logical qubit. The logical error rate P_L of a logical qubit
decays exponentially in d. More precisely:
$$P_L \sim d\,(100\,\epsilon_{in})^{\frac{d+1}{2}} \tag{1}$$
where ε_in is the underlying physical error rate of a system [7].
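To make the scaling in equation 1 concrete, the following Python sketch evaluates the logical error rate and searches for the smallest code distance that meets a target. It is only an illustration: the function names, the restriction to odd distances, and the search cap `d_max` are assumptions of this sketch, not part of the analysis in this paper.

```python
def logical_error_rate(d: int, eps_in: float) -> float:
    """Equation (1): P_L ~ d * (100 * eps_in) ** ((d + 1) / 2)."""
    return d * (100.0 * eps_in) ** ((d + 1) / 2)


def min_code_distance(target_p_l: float, eps_in: float, d_max: int = 201) -> int:
    """Smallest odd distance d (3 <= d <= d_max) with P_L <= target_p_l.

    Restricting to odd d and capping the search at d_max are assumptions
    made for this sketch only.
    """
    if not 0.0 < 100.0 * eps_in < 1.0:
        raise ValueError("equation (1) only decays with d when 100 * eps_in < 1")
    for d in range(3, d_max + 1, 2):
        if logical_error_rate(d, eps_in) <= target_p_l:
            return d
    raise ValueError("no distance up to d_max reaches the target error rate")
```

For example, with ε_in = 10^-3 the base 100 ε_in equals 0.1, so d = 5 gives roughly 5×10^-3 and d = 7 gives roughly 7×10^-4; a target of 10^-3 is therefore first met at d = 7.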
In particular, this work will focus on two relatively expen-
sive operations on surface code, namely the logical CNOT
gate and the logical T gate. Our overhead analysis will hold
regardless of the underlying technology, e.g. superconduct-
ing or ion-trap implementations. Earlier work [11] has also
performed such analysis with technology-independent frame-
works. Firstly, a logical CNOT between two qubits can be
expensive, because the two logical qubits can be located far
apart on the lattice and long-distance interaction is achieved
by the topological defect braiding methodology. Secondly,
a logical T gate can also be costly because it requires an
ancillary state to be prepared in advance through a procedure
called magic-state distillation.
2.2.1 CNOT Braiding
A braid is a path in the surface code lattice, that is, an area
where the error correction mechanisms have been temporarily
disabled and which no other operation is allowed to use. In
other words, braids are not allowed to cross. A logical qubit
can be entangled with another if the braid pathway encloses
both qubits, where enclosing means extending a pathway
from source qubit to target qubit and then contracting back
via a (possibly different) pathway. It is important to note
that these paths can extend up to arbitrary length in constant
time, simply by disabling all area covered by the path in the
same cycle. Furthermore, each path must remain open for
a constant number of surface code cycles to establish fault
tolerance. More precisely, one CNOT braid takes Tcnot =
2d+2 cycles to be performed fault tolerantly [3, 11].
2.2.2 T Magic-States
Now T (and S) gates, as described earlier, are necessary
for universal quantum computation, and yet are very costly
to implement on the surface code. For simplicity of analy-
sis, we assume all S gates will be decomposed into two T
gates, because of their rotation angle relationship. This is
potentially an overestimate of the actual gate requirements,
as it is also possible to perform an S gate via separate
distillation of a different type of magic state. We are also aware of
another surface code implementation that allows S gates
to be executed without distillation [16]. These techniques
have different architectural implications which are outside
the scope of the analysis of this work.
To execute these gates, an ancillary logical qubit must
first be prepared in a special state, known as the magic
state [17]. Once prepared, this magic state interacts
with the target qubit as in [3], via a probabilistic circuit
involving the magic state and either 1 or 3 CNOT braids,
each with probability 1/2. The extra 2 CNOTs are required
to perform a corrective S gate in the case that the probabilistic
circuit fails, which we assume consists of 2 CNOT
braids. This circuit is called the state injection circuit. We
can therefore write the expected latency of a T gate as
$$E[T_t] = T_{cnot} + \frac{1}{2}\,(2\,T_{cnot}) = 4d + 4 \tag{2}$$
where T_t denotes the latency of a T gate and T_cnot the
latency of a CNOT gate.
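The expectation in equation 2 can be checked numerically. The short Python sketch below averages over the two equally likely outcomes of the injection circuit (1 or 3 CNOT braids); the function names are illustrative and not taken from any library.

```python
def cnot_latency(d: int) -> int:
    """Cycles for one fault-tolerant CNOT braid: T_cnot = 2d + 2."""
    return 2 * d + 2


def expected_t_gate_latency(d: int) -> float:
    """Expected T gate latency under state injection.

    With probability 1/2 the circuit needs 1 CNOT braid; with probability
    1/2 it needs 3 (one braid plus a corrective S gate of 2 braids).
    """
    t_cnot = cnot_latency(d)
    return 0.5 * (1 * t_cnot) + 0.5 * (3 * t_cnot)
```

Averaging the two branches gives 2 T_cnot = 4d + 4, in agreement with equation 2.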
Since the task of preparing these states is a repetitive pro-
cess, it has been proposed that an efficient design would
dedicate specialized regions of the architecture to their prepa-
ration [18, 19]. These magic-state factories are responsible
for creating a steady supply of low-error magic states. The
error in each produced state is minimized through a process
called distillation [4], which we will introduce in detail in
section 2.4.
2.3 T-Gates in Quantum Algorithms
Among the different classes of quantum algorithms, quan-
tum simulation and quantum chemistry applications have
drawn significant attention in recent years due to the promises
they show in transforming our understanding of new and com-
plex materials, while still potentially remaining tractable in
near-term intermediate-size machines [20, 21, 22, 23, 24].
The benchmark algorithms studied in this work include the
Ground State Estimation (GSE) [23] of the Fe2S2 molecule
and the Ising Model (IM) [25] algorithms. They are
representative applications for the purpose of this study as they
present very different demand characteristics for T gate magic
states. A more detailed description of T gate distributions in
these two algorithms can be found in section 5.1. Table 1 lists
the two benchmarks along with some of their
T gate statistics, namely the number of qubits (n_qubits), total
T count (T_count), total schedule length (L), average T gates
per time step (T_avg), standard deviation of T gates per time
step (T_std), and maximum T gates per time step (T_peak).
Application  n_qubits  T_count   L       T_avg  T_std  T_peak
IM           500       9068348   20589   440    107    778
GSE          5         775522    546708  1.419  1.464  12
Table 1: T gate statistics in the Ising Model (IM) and Ground
State Estimation (GSE) benchmarks. For our analysis, we
consider a 500-qubit spin chain in our IM simulation, and
we simulate a small molecule in GSE comprised of 5 spin
orbital states. T_peak for IM can exceed the
number of qubits because in this calculation every S gate
in the application has been decomposed into 2 T gates.
The Ising Model and Ground State Estimation applications,
and others in the same application class, have a predictable
structure. Contemporary methods to simulate quantum me-
chanical systems employ Trotter decomposition [26] to dig-
itize the simulation, which involves large numbers of struc-
turally identical Jordan-Wigner Transformation circuits [27],
each of which involves a series of CNOT gates (called the
“CNOT staircase”) followed by a controlled rotation
operation. This arbitrary-angle rotation will often be decomposed
to sequences of H, S, and T operations in a procedure called
gate synthesis [28].
As an example, finding the molecular ground state energies
of Fe2S2 requires approximately 10^4 Trotter
steps for “sufficient” accuracy, each comprised of 7.4×10^6
rotations [29]. Each of these controlled rotations can be
decomposed to sufficient accuracy using approximately 50 T
gates per rotation [30]. This amounts to a total
number of T gates of order 10^12, which is also the number
of prepared magic states needed. In these types of
applications, magic-state distillation will be responsible for between
50% and 99% of the resource costs when executing an
error-corrected computation [8]. Because of this, the number of
T gates present in an algorithm is often used as a metric for
assessing the quality of a solution [31, 32].
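The order-of-magnitude estimate for Fe2S2 follows from multiplying the three quoted figures. The Python sketch below reproduces that arithmetic; the constant names are illustrative, and the inputs are the approximate values cited above rather than exact counts.

```python
import math

# Approximate figures quoted in the text for Fe2S2 ground state estimation.
TROTTER_STEPS = 10**4             # Trotter steps for "sufficient" accuracy
ROTATIONS_PER_STEP = 7_400_000    # 7.4 x 10^6 rotations per Trotter step
T_GATES_PER_ROTATION = 50         # approximate T gates per synthesized rotation

# One magic state is consumed per T gate, so this is also the magic-state count.
total_t_gates = TROTTER_STEPS * ROTATIONS_PER_STEP * T_GATES_PER_ROTATION
order_of_magnitude = math.floor(math.log10(total_t_gates))
```

The product is 3.7×10^12, which is consistent with the "order 10^12" figure.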
2.4 Bravyi-Haah Distillation Protocol
In order to execute T gates fault tolerantly, an interaction
is required between a target logical qubit and an ancillary
magic state qubit. The fidelity of the operation is then tied to
the fidelity of the magic state qubit itself, which requires that
magic states are able to be reliably produced at high fidelity.
This is achieved through procedures known as distillation
protocols.
Distillation protocols are circuits that accept as input a
number of potentially faulty raw magic states (n) and output
a smaller number of higher fidelity magic states (k). The
input-output ratio n→ k is generally used to assess the effi-
ciency of a protocol. Because many distillation protocols are
extremely resource-intensive, a key design issue of quantum
architectures is to optimize them.
In this work we restrict our focus to a popular low-overhead
distillation protocol known as the Bravyi-Haah distillation
protocol that has received much attention in the field recently
[6, 7, 33]. Here we describe in detail the process for preparing
and distilling the magic-states. Bravyi-Haah state distillation
circuits [4] take as input 3k+8 low-fidelity states, and output
k higher fidelity magic-states, and thus are denoted as the
3k+8 → k protocol. Notably, if the raw input (injected)
states are characterized by error rate ε_inject (which could be
different from the physical input error rate ε_in in equation
1, depending on the hardware implementation), the output state
error rate is reduced by this procedure to:
$$\epsilon_{output} = (1+3k)\,\epsilon_{inject}^{2}, \tag{3}$$
or in other words, a second-order suppression of error.
This imposes a tolerance threshold on the underlying input
error rate that can be written as:
$$\epsilon_{thresh} \approx \frac{1}{3k+1} \tag{4}$$
because when ε_inject ≥ ε_thresh, the output error rate is no better
than where we started before distillation.
Moreover, this process is imperfect. For any given
implementation of this circuitry, the true yield could be lower
than expected. The success probability of the protocol that
attempts to output k high fidelity states is, to leading order,
given by:
$$P_{success} \approx 1 - (8+3k)\,\epsilon_{inject}. \tag{5}$$
In performing a rigorous full system overhead analysis,
these effects will become extremely significant.
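As a rough illustration of equations 3 to 5, the Python sketch below evaluates the output error rate, the input threshold, and the leading-order success probability for a single round of the 3k+8 → k protocol. The function names are hypothetical, and the formulas are the leading-order approximations stated above, not an exact simulation of the circuit.

```python
def bh_output_error(k: int, eps_inject: float) -> float:
    """Equation (3): output error rate (1 + 3k) * eps_inject ** 2."""
    return (1 + 3 * k) * eps_inject ** 2


def bh_threshold(k: int) -> float:
    """Equation (4): input error rate at which distillation stops helping."""
    return 1.0 / (3 * k + 1)


def bh_success_probability(k: int, eps_inject: float) -> float:
    """Equation (5): leading-order success probability 1 - (8 + 3k) * eps_inject."""
    return 1.0 - (8 + 3 * k) * eps_inject
```

For k = 2 and ε_inject = 10^-3, these give an output error rate of 7×10^-6, a threshold of 1/7, and a success probability of 0.986. Evaluating equation 3 exactly at the threshold returns the threshold itself, which is the sense in which distillation no longer improves the state.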
[Figure 2 diagram: blocks labeled 14 arranged in two columns,
Round 1 and Round 2.]
Figure 2: The recursive structure of the block code protocol.
Each block represents a module for the Bravyi-Haah (3k+8) →
k protocol, lines indicate the magic-state qubits being
distilled, and dots indicate the extra 3k+6 ancillary qubits
used, totaling 6k+14. This figure shows an example of a
2-level block code with k = 2. This protocol takes in total
(3k+8)^2 = 14^2 states, and outputs k^2 = 4 states with higher
fidelity. The qubits (dots) in round 2 are drawn at a bigger
size, indicating the larger code distance d required to encode
the logical qubits, as they have a lower error rate than in the
previous round [33].
2.5 Block Codes
In certain types of applications, the second-order error
suppression achieved by a single round of Bravyi-Haah distillation
is not enough. To overcome this, multiple rounds (also
referred to as levels in our work) of the distillation protocol
can be concatenated to obtain higher and higher output state
fidelity.
To ensure successful execution of a program, systems must
be able to perform all of the gates in the computation with
an expected value of logical gate error rate less than 1. So
the success probability desired for a specific application (P_s)
relates to the required logical error rate per gate P_L as follows:
$$P_L \le \frac{P_s}{N_{gates}} \tag{6}$$
where N_gates is the number of logical gates in the computation.
P_L therefore sets a bound on the fidelity of generated magic
states. Many circuits contain of order 10^10 logical gates or
more [29], while physical error rates may scale as poorly as
10^-3 [7]. In these cases, clearly squaring the input error rate
will not achieve the required logical error rate to execute the
program. Instead, we can recursively apply the Bravyi-Haah
circuit ℓ times, with permutations of the intermediate output
states in between distillation rounds. Throughout this work,
we use the terminology “round” and “level” to both refer to a
single iteration of the Bravyi-Haah distillation protocol within
a factory. Constructing high fidelity states in this fashion is
known as Block Code State Distillation [6]. As shown in
Figure 2, realizing Bravyi-Haah block code protocols would
require 6k+14 total logical qubits [33].
2.5.1 Magic-State Factory Error and Yield Scaling
To perform a rigorous full system overhead analysis, it is
necessary to quantify the behavior of multi-level block code
factories in terms of output state fidelity and production rate.
By construction, the error rate of the produced magic states
will be squared after each round. So the error rate of the final
output states after ℓ rounds of distillation will be ∼ ε_inject^{2^ℓ}.
Since the output states from the previous round will be fed
into the next round, the success probability of a distillation
module at round r depends on the output error rate of the
previous round ε_{r−1}, i.e. P_success^{(r)} = 1 − (3k+8) ε_{r−1}. The
success probability for the entire ℓ-level factory will be explicitly
derived later in Section 4.
2.5.2 Magic-State Factory Area Scaling
Within any particular round r of an ℓ-round magic-state
factory (where 1 ≤ r ≤ ℓ), the required number of physical
qubits defines the space occupied by the factory during that
round. However, we will often use the logical qubit as the unit
of area, since translating to physical qubits simply picks up a d_r^2
multiplicative factor as shown in section 2.2.
In general, any particular round requires several modules
each comprised of several distillation protocol circuits. A
generic n → k protocol, under an ℓ-level block code construction,
will need a total number of protocols as follows:
$$N_{distill} = \sum_{r=1}^{\ell} N_r = \sum_{r=1}^{\ell} k^{r-1}\, n^{\ell-r} \tag{7}$$
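Equation 7 can be tabulated round by round. The following Python sketch, with illustrative function names, lists the number of protocols N_r in each round and their sum for a generic n → k protocol; for Bravyi-Haah one would pass n = 3k+8.

```python
from typing import List


def protocols_per_round(n: int, k: int, levels: int) -> List[int]:
    """N_r = k**(r-1) * n**(levels-r) for r = 1..levels (terms of equation 7)."""
    return [k ** (r - 1) * n ** (levels - r) for r in range(1, levels + 1)]


def total_protocols(n: int, k: int, levels: int) -> int:
    """N_distill: total number of protocol instances over all rounds."""
    return sum(protocols_per_round(n, k, levels))
```

For the two-level example with k = 2 and n = 14, this gives 14 protocols in round 1 and 2 in round 2, for a total of 16, matching the structure drawn in Figure 2.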
2.5.3 Magic-State Factory Time Overhead
Each round of distillation can be shown to require 11 d_r
surface code cycles [33], where d_r is the code
distance for round r (which depends upon the input and output
error rates). The total time to execute a full
distillation is therefore:
$$T_{distill} = 11 \sum_{r=1}^{\ell} d_r \tag{8}$$
A full assessment of the area and time costs under our
proposed architecture designs will be presented in more detail
in Section 4 and Section 5. Specifically, we discuss how
factory capacity, distillation rounds of each factory, and the
input physical error rate all affect the output state yield rate
and resulting space and time overhead.
3. RELATED WORK
A number of prior works have focused on designing
efficient magic-state distillation protocols [4, 34, 35, 17].
Other works aim to concatenate different
protocols together to reduce the overall cost or improve output
rate and fidelity [6, 36]. The problem of scheduling and
mapping the distillation circuit is tackled in [8] by
taking advantage of the internal structure of the distillation
protocol, and by minimizing CNOT-braid routing congestion.
The aim of that work is also to implement the distillation
process more efficiently, which is different from ours, as we
instead aim to optimize a full system architecture built around
these protocols and construct factory arrangements that
efficiently deliver output magic states to their intended target
qubits.
[Figure 3 panels: (a) Single unified factory with large
capacity; (b) A number of distributed factories,
each with smaller capacity.]
Figure 3: The concept of a unified versus distributed factory
architecture, embedding factories (green blocks) within
the computational surface code region (blue circles).
Prior work on this subject has often assumed either that
magic states will be prepared offline in advance [7, 19], or
that the production rate is set to keep up with the peak con-
sumption rate in any given quantum application, and any
excess states will be placed in a buffer [33, 37]. This pa-
per operates with the different assumption that magic-state
factories will be active during the computation, and states
will not be able to be prepared offline or in advance. We do
this to characterize the performance of the machine online,
and introduce the complexity of resource state distribution
throughout the machine, a problem that has been studied well
in classical computing systems but has received less of a
focus in this domain.
Other works closely related to architectural design opti-
mized the ancilla production factories that operate in different
error correcting codes [38, 39], or analyzed the overhead of
CNOT operations which dominate other classes of applica-
tions like quantum cryptography and search optimization
[11]. Our work focuses instead on quantum chemistry and
simulation applications that are likely to represent a large
proportion of quantum workloads in both the near and far
term.
4. FACTORY AREA OVERHEAD
To describe a magic-state distillation factory, we first make
a distinction between a factory cycle and a distillation round
or level. A distillation round refers to one iteration of the
distillation protocol, a subroutine that is repeated ℓ times for
a particular factory. A cycle refers to the total time required
for the factory to operate completely, taking n input states
and creating k^ℓ output states. All ℓ distillation rounds are
performed during a cycle.
A magic-state distillation factory architecture can be
characterized by three parameters: the total number of
magic states that can be produced per distillation cycle K,
the number of factories on the lattice X, and the total number
of distillation rounds that are performed per cycle ℓ. For
simplicity, we assume uniform designs where all K output
states are divided equally among X factories, all of which
operate with ℓ rounds of distillation. We now analyze the
relationships presented in Section 2 to derive full factory
scaling behaviors with respect to these architectural design
variables. These behaviors interact non-trivially, and lead to
space-time resource consumption functions that show optimal
design points.
Parameter        Description
K                Factory total capacity
X                Number of distributed factories
ℓ                Block-code levels of a factory
r                Distillation round, 1 ≤ r ≤ ℓ
d                Surface code distance
P_s, P_success   Target success probability
P_L              Logical fidelity
ε_inject         Physical error rate of raw magic-state
ε_in/target/r    Physical error rate at input, at output, or at round r
n                Number of input states in distillation protocol
k                Number of output states in distillation protocol
N_r              Number of protocols at round r under block-code
K_output         Number of effective output magic-states due to yield rate
T_distill        Time to execute one full iteration of distillation
T_t              Time to deliver magic-state to target qubit
n_distill        Distillation iterations to support one timestep of a program
A_factory        Total area of factories (in physical qubits)
Table 2: List of system parameters involved in the analysis and the optimization procedure.
4.1 Role of Fidelity and Yield in Area Overhead
First we examine the fidelity of the produced magic-states
that is attainable with a given factory configuration, along
with expected number of states that will in fact be made
available. Once again, we use the terminology “round” and
“level” to both refer to a single iteration of the Bravyi-Haah
distillation protocol within a factory. Applying the block
code error scaling relationship described by equation 3
recursively, the attainable output error rate scales
double-exponentially with the total number of rounds ℓ of a
magic-state factory. In fact, for a given round r (between 1 and ℓ)
of a factory, the explicit form of the output error rate can be written by
directly applying r copies of equation 3:
$$\epsilon_r = \left(1 + 3\,(K/X)^{1/\ell}\right)^{2^r - 1}\, \epsilon_{inject}^{2^r} \tag{9}$$
where (K/X) denotes the capacity of each factory on a lattice.
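The closed form in equation 9 can be cross-checked against repeated application of equation 3. The Python sketch below does both; the function names are illustrative, and it assumes ε_0 = ε_inject and a per-round output count k = (K/X)^{1/ℓ}, which need not be an integer for arbitrary K, X, and ℓ.

```python
def per_round_k(K: float, X: float, levels: int) -> float:
    """Per-round output count k = (K / X) ** (1 / levels)."""
    return (K / X) ** (1.0 / levels)


def round_error_recursive(K: float, X: float, levels: int, r: int,
                          eps_inject: float) -> float:
    """Apply equation (3) r times, starting from eps_0 = eps_inject."""
    k = per_round_k(K, X, levels)
    eps = eps_inject
    for _ in range(r):
        eps = (1 + 3 * k) * eps ** 2
    return eps


def round_error_closed_form(K: float, X: float, levels: int, r: int,
                            eps_inject: float) -> float:
    """Equation (9): (1 + 3k) ** (2**r - 1) * eps_inject ** (2**r)."""
    k = per_round_k(K, X, levels)
    return (1 + 3 * k) ** (2 ** r - 1) * eps_inject ** (2 ** r)
```

For K = 4, X = 1, ℓ = 2 (so k = 2) and ε_inject = 10^-3, both forms give ε_1 = 7×10^-6 and ε_2 = 3.43×10^-10.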
The yield rate of a particular factory can be expressed as a
product of the yield rate functions describing each individual
round, as in equation 5. The effective output capacity can
be written as the product of the success probabilities of all ℓ
rounds of a factory as:
$$K_{output} = K \cdot \prod_{r=1}^{\ell} \left[ 1 - \left(3\,(K/X)^{1/\ell} + 8\right) \epsilon_{r-1} \right] \tag{10}$$
Here Koutput refers to the realized number of produced
states after adjusting for yield effects, while K refers to the
desired or specified number of output states. Equation 10
actually imposes a yield threshold on the system. For a
given K, X, and ℓ, a system will have a maximum error rate
which, if exceeded, will cause the factory to malfunction and
stop producing states reliably. This threshold can be seen
by examining the product term, and noting that yield must
be positive in order to produce any states. The error terms in the
product of equation 10 are decreasing in magnitude, so the
threshold is determined by the leading term, which requires
1 − (3(K/X)^{1/ℓ} + 8) ε_inject > 0, and thus:
$$\epsilon_{thresh} < \frac{1}{3\,(K/X)^{1/\ell} + 8} \tag{11}$$
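The yield model of equations 10 and 11 can be sketched in a few lines of Python. The function names are illustrative. The sketch assumes ε_0 = ε_inject, propagates the per-round error rate with equation 3, and, as a choice of this sketch rather than something stated in the text, returns zero output as soon as any round's success probability is non-positive.

```python
def effective_output(K: float, X: float, levels: int, eps_inject: float) -> float:
    """Equation (10): K times the product of per-round success probabilities."""
    k = (K / X) ** (1.0 / levels)
    eps_prev = eps_inject          # eps_0, assumed equal to the injected error rate
    yield_rate = 1.0
    for _ in range(levels):
        p_round = 1.0 - (3 * k + 8) * eps_prev
        if p_round <= 0.0:
            return 0.0             # above the yield threshold: no reliable output
        yield_rate *= p_round
        eps_prev = (1 + 3 * k) * eps_prev ** 2   # equation (3) for the next round
    return K * yield_rate


def yield_threshold(K: float, X: float, levels: int) -> float:
    """Equation (11): bound on the injected error rate for positive yield."""
    k = (K / X) ** (1.0 / levels)
    return 1.0 / (3 * k + 8)
```

For K = 4, X = 1, ℓ = 2 and ε_inject = 10^-3, the two rounds succeed with probabilities 0.986 and about 0.9999, so roughly 3.94 of the 4 specified states are produced on average; the corresponding threshold is 1/14.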
Figure 4c shows the yield rate scaling behavior of single
factories consisting of ℓ = 1, 2, 3 rounds with fixed X = 1. In order
to reliably produce some fixed amount of states, the yield ef-
fects determine the required number of rounds of distillation
that must be performed. On the other side, any given number
of distillation rounds has a maximum output capacity K for
which the expected number of produced states becomes van-
ishingly small. Increasing the number of distillation rounds
will increase the maximum supportable factory capacity.
4.2 Full Area Costs
We now use these relationships to derive the true area
scaling of these factories. For all ℓ-level factories, the area of
the first round exceeds the area required for all other rounds.
Using this as an upper bound, we can write the area required
for a specific round explicitly in terms of physical qubits as:
$$\begin{aligned} A_r &= X \cdot k^{r-1} (3k+8)^{\ell-r} (6k+14) \cdot d_r^2 \qquad (12) \\ &\le X (3k+8)^{\ell-1} (6k+14) \cdot d_1^2 \qquad (13) \end{aligned}$$
where $k \equiv (K/X)^{1/\ell}$. The inequality in the last line arises due
to the fact that the first round always uses the largest area
by block-code construction, i.e. $A_r \le A_1$ for all $1 \le r \le \ell$.
Here we have used several relationships, namely that the total
number of protocols and modules scales as in equation 7, a
single protocol requires 6k+14 logical qubits [33], and the
area of a single logical surface code qubit scales as $d^2$ [15].
In an aggressively optimized factory design, one could
conceivably save space by using the area freed between successive
rounds of distillation for other computation. We assume
in this work that this cannot be done: the first-round
area of any given factory defines the area required by
that factory over the length of its entire operation, and that
region is locked out for distillation only. Figure 5a describes
the resulting scaling of factory area with both increasing output
capacity and increasing total number of factories.
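The area relations in Equations 12 and 13 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the per-round code distances are passed in as a list rather than derived, since their values depend on the error model.

```python
def round_area(r, K, X, levels, distances):
    """Equation 12: physical qubits used by round r (1-indexed).

    distances[r - 1] is the code distance d_r assumed for round r.
    """
    k = (K / X) ** (1.0 / levels)
    d_r = distances[r - 1]
    return (X * k ** (r - 1) * (3 * k + 8) ** (levels - r)
            * (6 * k + 14) * d_r ** 2)


def factory_area(K, X, levels, distances):
    """Equation 13: the first-round area, reserved for the whole run."""
    return round_area(1, K, X, levels, distances)
```

For example, with K = 4, X = 1, two levels, and assumed distances [5, 9], the first round needs 9100 physical qubits and the second 4212, so the first-round figure is the one that is locked out.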
5. FACTORY LATENCY OVERHEAD
This section presents a systematic study of the time over-
head of realizing magic-state distillation protocols. First, we
will examine the characteristics of the T gate demand in our
benchmark programs, by introducing the concept of the T
distribution. Next, we will study the latency overhead caused
by delivering magic states to wherever T gates are demanded
by looking at the contention and congestion factors. Finally,
we will arrive at an analytical model for the overall distil-
lation latency integrating the information from the program
distribution.
5.1 Program Distributions
[Figure 4: four panels. (a) Error rate attainable by number of factories: output error rate versus number of factories X, for L = 1, 2, 3. (b) Error rate tolerable by number of factories: input error threshold versus number of factories X, for L = 1, 2, 3. (c) Yield rate of L-level factory with capacity K: yield rate versus factory output capacity K, for L = 1, 2, 3. (d) Area scaling within each round of a 5-level factory: number of qubits versus factory output capacity K, for rounds r = 1 to 5.]
Figure 4: (a) Higher fidelity output states are achievable with increasing number of factories at a fixed output capacity. (b)
Increasing the number of factories in an architecture allows for higher tolerance of input physical error rates. (c) Increasing
factory output capacity puts pressure on the factory yield rate, and increasing the number of levels pushes the yield dropoff point.
(d) Maximum area to support multi-level factory is required of the lowest level of the factory, all higher levels require less area
support.
While the majority of prior work on this subject has
abstracted algorithm behavior into a single number, the
total T gate count, we argue that the distribution of T gates
throughout an algorithm has a significant impact on the
performance of the magic-state factory. For example, a program
with a bursty T distribution, where a large number of T gates
are scheduled in a few time steps, puts significant pressure on
the factory’s ability to produce a large number of high-fidelity
magic states quickly.
In order to quantify this behavior, we choose two quantum
chemistry algorithms that represent the two extremes of T
gate parallelism. On one hand, the Ground State Estimation
algorithm is an application with very low T gate parallelism.
It attempts to find the ground state energy of
a molecule of size m, and can be characterized
by a series of rotations on single qubits [23]. The Ising Model,
on the other hand, is a highly parallel application demanding
T gates at a much higher rate. This application simulates
the interaction of an Ising spin chain, and therefore requires
many parallelized operations on single qubits, along with
nearest neighbor qubit interactions [25]. To capture application
characteristics, we use the ScaffCC compiler toolchain,
which supports application synthesis from high-level quantum
algorithms to physical qubit layouts and circuits [40].
The majority of the time steps in the Ising Model algorithm
have a large number of parallel T gates, with a mean T load
of 440, whereas Ground State Estimation has no more than
12 T gates at each time step. Rather than using the
single T gate count to characterize algorithms, we will from
now on use the T load distribution.
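To make the notion of a T load distribution concrete, the sketch below builds one from a list of per-timestep T gate counts. The schedule shown is made-up illustrative data, not output from ScaffCC, and the function name is hypothetical.

```python
from collections import Counter


def t_load_distribution(t_gates_per_timestep):
    """Map each T load t to the number of timesteps that request t T gates."""
    return Counter(t_gates_per_timestep)


# Hypothetical schedule: number of parallel T gates at each timestep.
schedule = [0, 2, 2, 5, 0, 2]
D = t_load_distribution(schedule)

t_peak = max(D)  # highest number of parallel T gates in any timestep
mean_load = sum(t * n for t, n in D.items()) / sum(D.values())
```

A serial program concentrates its mass at small t, while a bursty or highly parallel program has mass at large t, even if both have the same total T count.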
5.2 T-Gate Contention and Congestion
In order to fully assess the space-time volume overhead
of the system, we require a low level description of how the
produced magic-states are being consumed by the program.
As discussed in Section 2, a T gate requires braiding
between the magic-state qubit in the factory and the target
qubit that the T gate operates on. Now suppose our factory is
able to produce K high-fidelity magic states per distillation
cycle, and at some time step the program requests t T
gates. If the program demands more than the factory can offer at
once (i.e. t > K), then only K of those requests can
be served, while the others have to wait for at least
another distillation cycle. We say that the network has
contention when the demand exceeds the supply capacity. By
contrast, we define network congestion to capture the latency
introduced by the fact that some braids may fail to route from
the target to the factory on the 2D surface code layout, due to
high braiding traffic.
[Figure 5: three panels, each plotted against factory capacity K (magic states) for X = 1, 4, 16, 64, 256. (a) Area Scaling: total area (qubits). (b) Latency Scaling: total latency (surface code cycles). (c) Yield Rate Scaling: yield rate.]
Figure 5: (a) Area required to implement 2-level factories for varying numbers of factories X. As the distribution intensity
increases, the total area increases significantly faster as factory output is scaled up. Notice that some regions are not feasible due
to the constraint X ≤ K. (b) Latency as it scales with factory output capacity. For factories of a fixed capacity, increasing
the number of factories on the lattice reduces latency overall and speeds up application execution time, thanks to reductions in
contention and congestion. The flat tails at high K values are due to the fact that the capacity has exceeded the amount that an
application ever demanded. (c) Yield as it scales with factory output capacity and number of factories. For a fixed capacity K0,
increasing the number of factories can significantly increase the success probability and yield rate of the factory.
[Figure 6: two panels of total latency (surface code cycles) versus factory capacity K (magic states), for L = 1, 2, 3. (a) Ising Model. (b) Ground State Estimation.]
Figure 6: (a)-(b) Total number of surface code cycles required by Ising Model and Ground State Estimation applications. Both
figures are plotted for three different factory block-code levels, i.e. X = 1 and L = 1, 2, and 3.
To estimate the overhead of network congestion, we will
perform an average case analysis without committing to a particular
routing algorithm. Ideally, in the contention free limit
where the number of requests t is less than K, all requests
could be scheduled and executed in parallel. In practice, however,
requests often congest due to limitations of routing
algorithms. We define a congestion factor $C_g$ that represents
the total latency required to execute all of the T gate requests
at any given time.
We model congestion as a factor that scales with
the number t of requests made at any given time, within
a particular region serviced by a factory. This assumes a
general topology in which a factory is placed in the center of
a region, and all of the surrounding data qubits are served by
this factory alone. Naturally, the center of the region is quite
dense with T gate request routes. In general, for a reasonable
routing algorithm, the number of routing options increases as
area available increases. However, because all of the routes
have their destination in the center of the region, increasing
area of the region has no such effect. In fact, the distance of a
T request source from the factory increases the likelihood of
congestion from a simple probabilistic argument. There may
be other T requests blocking available routes, and the number
of these possible requests that block pathways increases as
the distance between a request and the factory increases. The
combination of these effects interacts with the complexity
of a routing algorithm, and results in a scaling relationship
proportional to both the T request density t and the maximum
distance of any T request within any of these regions:
$$C_g \sim c \sqrt{t} \qquad (14)$$
for some constant c, depending upon the routing algorithm.
We validated this congestion model using the
simulation tools and compiler toolchains of [11], and find
that the model and the simulation do indeed agree. Section 7
discusses this in greater detail.
5.3 Resolving T-gate Requests
For any given program, characterized as a distribution D
of the T load, we denote by D[t] the number of timesteps in
the program in which t parallel T gates are to be executed. Then
the number of iterations that the factory needs to resolve the
t requests can be computed based on the following latency
analysis. In particular, in order to maximize the utilization
of the factory, we execute as many outstanding T
gate requests as possible in parallel. When the number of
requests t exceeds the factory yield K, we need to stall
the excess requests. We denote by $s = \lfloor t/K \rfloor$
the number of fully-utilized iterations. So, we serve
at full capacity s times, and each time a
congestion factor is multiplied in, as discussed in Section
5.2. It follows that the first sK requests are completed in
$s\sqrt{K}$ distillation cycles. Finally, the remaining
$(t - sK)$ outstanding requests are executed in $\sqrt{t - sK}$
cycles. Notice that the time it takes to execute a T gate is
typically shorter than the factory distillation cycle time. So
under the buffer assumption made earlier, we can stage the
execution of requests within a distillation cycle such that no
data dependencies are violated, as long as there are magic
states available in the factory. The time required to produce
some constant number k of states is $T_{\text{distill}}$, while the time
required to deliver k states in parallel is $T_t\sqrt{k}$ due to network
congestion. So the number of distillation cycles needed to
supply a single cycle of k T gate requests is given by the ratio
$T_t\sqrt{k}/T_{\text{distill}}$. Substituting $k = K/X$ and $k = (t - sK)/X$ as
described earlier, we can calculate the number of distillation
iterations we need to serve t T gates in a particular timestep,
as:
$$n_{\text{distill}} = \frac{T_t}{T_{\text{distill}}} \cdot \left( s \cdot \sqrt{\frac{K}{X}} + \sqrt{\frac{t - sK}{X}} \right) \qquad (15)$$
where K is again the yield of each iteration from Equation
10.
Putting it together, we obtain our final time overhead of an
application:
$$T_{\text{total}} = T_{\text{distill}} \cdot \left( \sum_{t=0}^{T_{\text{peak}}} n_{\text{distill}} \cdot D[t] \right) \qquad (16)$$
where $T_{\text{peak}}$ is the maximum number of parallel T gates scheduled
at one timestep. Notice that this is independent of $T_{\text{distill}}$,
as the distillation cycle time has been captured by the ratio
$T_t/T_{\text{distill}}$. The scaling of this function is shown in Figure 5b,
and is compared in Figure 6 across different applications.
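Equations 15 and 16 can be sketched directly in code. This is a minimal illustration with hypothetical function names; D is assumed to be a mapping from T load t to the number of timesteps with that load, and the numerical values in the usage lines are made up.

```python
import math


def n_distill(t, K, X, T_t, T_distill):
    """Equation 15: distillation iterations needed to serve t T gates."""
    s = math.floor(t / K)      # number of fully-utilized iterations
    remainder = t - s * K      # requests left for the final partial iteration
    return (T_t / T_distill) * (s * math.sqrt(K / X)
                                + math.sqrt(remainder / X))


def total_latency(D, K, X, T_t, T_distill):
    """Equation 16: total time overhead, summed over the T load distribution."""
    return T_distill * sum(n_distill(t, K, X, T_t, T_distill) * count
                           for t, count in D.items())


# Illustrative values only.
D = {0: 3, 4: 2, 9: 1}
latency = total_latency(D, K=4, X=1, T_t=10, T_distill=100)
```

Because $T_{\text{distill}}$ multiplies the sum and divides each term, changing it leaves `total_latency` unchanged, which matches the independence noted above.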
6. AREA AND LATENCY TRADE-OFFS
In this section, we will discuss some of the motivations of
our proposed algorithm for optimizing space-time resource
overhead, based on the area and latency analysis that we built
up in the previous sections.
The Bravyi-Haah protocol shows an area expansion when
a single factory is “divided” into many smaller factories, that
is, the total area of x number of factories each with some
capacity k is larger than the area of a factory with capacity
x · k. Figure 5a illustrates this trend, arising from the original
area law equation 12.
Why do we want a distributed factory architecture?
Although it might first seem undesirable to divide a single
factory into many factories due to the area expansion, there
are many advantages to doing so. One such advantage is
that smaller factories can produce states with higher fidelity.
So, for a fixed output capacity K, increasing the number
of factories that together produce those K states allows all of
them to have higher fidelity. The output error rate
scales inversely with the number of factories on the lattice
for a fixed output capacity K, as seen in Equation 9.
This provides us with the unique ability to actually ma-
nipulate the underlying physical error rate threshold. In
particular, substitution of K/X for K in all of the previous
equations shows that the yield threshold now also has inverse
dependence upon the number of factories used.
As Figure 4b shows, for a fixed output capacity and block
code level ℓ, increasing the number of factories on the lattice
can greatly increase the tolerable physical error rate under
which the factory architecture can operate.
With this knowledge, we are immediately presented with
architectural tradeoffs. Using the representation of programs
as distributions of T gate requests, any application can be
characterized by a Tpeak, again defined as the highest number
of parallel T gate requests in any timestep of an application.
For a “surplus” configuration, a system may set the factory
output rate K = Tpeak, so as to never incur any latency during
the program execution. However, as the threshold in equation
4 indicates, this sets an upper bound on the tolerable input
error rate εin. With a distributed factory architecture, this
provides a system parameter enabling systems to be designed
that will be able to tolerate higher error rates, and still achieve
the same output capacity K, at the expense of area as seen in
the area law relationship from Figure 5a. Conversely, systems
that are known to have low underlying
physical error rates may be able to reduce the overall area of
a surplus factory configuration by reducing the number of
individual factories to a certain point. These are the tradeoffs
in the design space that this work explores, and in fact we
find, for representative benchmarks, configurations that
are lower in capacity and save orders of magnitude in
space-time overhead overall.
7. EVALUATION METHODOLOGY
7.1 System Configuration
Here we lay out all of the assumptions made about the
underlying systems that we are studying.
Configuration            Description
Surplus                  One central factory that can produce enough states to always meet the demand at each timestep of the program, as in [39, 37, 33].
Singlet                  One central factory that uses minimal area and produces only one state per cycle.
Optimized-Unified        One central factory that outputs an optimized number of output states per distillation cycle.
Optimized-Distributed    An optimized set of factories that together output an optimized number of output states.
Table 3: List of architecture configurations explored in this
work.
First, we assume that the factories will be operated continuously.
This means that every $T_{\text{distill}}$, the factories will produce
another $K_{\text{output}}$ states. This abstracts away the time needed to
deliver these states to their destinations, which would have to
be performed in a real system before the next distillation itera-
tion begins. In such real systems, we imagine an architecture
that supports a limited, fixed size buffer region so that the
subsequent distillation cycle will not overwrite the previously
completed states. However, this is a small constant offset in
time that applies to all studied designs symmetrically, so it is
omitted. Because the factories are always online and produc-
ing magic states, the overall time overhead is then equal to
the number of distillation cycles required to execute all the
scheduled T gate requests, multiplied by the time taken to
perform a distillation iteration Tdistill from Equation 8.
Next, we assume three different levels of uniformity in
these designs: all distributed factories are laid out uniformly
on the surface code lattice as in Figure 3b (i.e. they are an
equal distance apart), all factories in a distributed architecture
are identical (i.e. they all operate with the same parameters
such as K and ℓ), and within each factory each block code
round is identical (i.e. they are composed of identical n → k
protocols). Note that Campbell et al. [33] allow k to vary
within a single factory, across different rounds.
In performing our evaluations we consider four different
system configurations: surplus architectures that minimize
application latency by setting the magic-state output capacity
to the peak T gate request count in an application, singlet
architectures that minimize required space for the factory by
producing only a single state per distillation cycle, optimized-unified
architectures that use one central factory with an optimized
choice of output capacity K and number of distillation
rounds ℓ, and optimized-distributed architectures that choose
an optimum output capacity K distributed into an optimum
number X of factories, each utilizing ℓ distillation rounds.
These architectures are summarized in Table 3.
7.2 Optimization Algorithm
As keen readers may have already observed from Figure
6 and Figure 4d, for fixed output capacity K, it costs us both
in time latency and in factory footprint to implement a high-ℓ
block-code factory. The only reason we design for high
ℓ is to achieve the desired target error rate. This relation is
best captured in the bottom half of Figure 7, where the L = 1
factory is not feasible for K ≥ 1 since its output error rate is
higher than the target error rate, while the L = 2 factory is
feasible for K ∈ [1,50], and the L = 3 factory is feasible for
the entire plotted range.
We combine all of the details of the explicit overhead esti-
mation derived above in order to find optimal design points
in the system configuration space. To do this, we must ensure
that designs are capable of producing the target logical error
rate for an application. Additionally, there exists a set of
constraints C that K, X ∈ Z+ have to satisfy: (i) 1 ≤ X ≤ K;
(ii) $K/X \le (1 - 8\varepsilon_{\text{inject}})/(3\varepsilon_{\text{inject}})$, due to the Bravyi-Haah
protocol error thresholds. With the feasible space mapped out,
standard nonlinear optimization techniques are employed to
explore the space and select the space-time optimal design
point.
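The constraint set C can be expressed as a simple predicate. The sketch below is illustrative only; the function names are hypothetical and the injection error rate used in the example is an arbitrary value.

```python
def max_per_factory_capacity(eps_inject):
    """Constraint (ii): largest K/X allowed by the Bravyi-Haah threshold."""
    return (1 - 8 * eps_inject) / (3 * eps_inject)


def satisfies_constraints(K, X, eps_inject):
    """Check constraints (i) 1 <= X <= K and (ii) on positive integers K, X."""
    if not (isinstance(K, int) and isinstance(X, int)):
        return False
    if not (1 <= X <= K):
        return False
    return K / X <= max_per_factory_capacity(eps_inject)


# With eps_inject = 0.01 the per-factory capacity limit is about 30.67,
# so K = 31 is infeasible for one factory but feasible when split in two.
single = satisfies_constraints(31, 1, 0.01)
split = satisfies_constraints(31, 2, 0.01)
```

This shows how distributing a fixed capacity across more factories can move a design point back inside the feasible region.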
Figure 7: Space-time volume minimization under error threshold
constraints imposed by target error rate for each block
code level. An application will set a target error rate (black)
that the factory must be able to achieve in output state fidelity.
On the lower plot, levels 2 and 3 are the only levels
available that can satisfy this. In the upper plot, we find
that the lowest volume in the feasible area is located on the
level 2 factory feasibility line. Recall the volume shapes are
explained earlier in section 5. Here the tails after K ≈ 800
show an increase in volume, as the added capacity grows the
factory areas while maintaining constant latency.
With these constraints in mind, we explore the space by
first selecting the lowest ℓ possible. As the area law and full
volume scaling trends of the previous sections indicate, if
there are any feasible design points with ℓ = ℓ_0, then any
feasible design points for systems with ℓ_i > ℓ_0 will be strictly
greater in overall volume. This is somewhat intuitive, as
concatenation of block code protocols is very costly.
With the lowest ℓ selected, we check to see if there exist
any feasible design points for this ℓ by checking for solutions
to the equation:
$$\left(1 + 3k^{1/\ell}\right)^{2^{\ell}-1} \varepsilon_{\text{inject}}^{2^{\ell}} \le \frac{P_s}{N_{\text{gates}}} \qquad (17)$$
If the K that solves this equation is greater than or equal to 1,
then there does exist feasible design space along this ℓ, and
the algorithm continues. Otherwise, ℓ is incremented.
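The level-selection step can be sketched by solving Equation 17 for the largest admissible capacity in closed form. This is an illustrative sketch: the function names are hypothetical, k in Equation 17 is assumed to be the per-factory capacity, and the target error rate is passed directly as $\varepsilon_{\text{target}} = P_s/N_{\text{gates}}$.

```python
def max_capacity_for_level(levels, eps_inject, eps_target):
    """Largest k with (1 + 3 k^(1/l))^(2^l - 1) * eps_inject^(2^l) <= eps_target."""
    exponent = 2 ** levels
    # Bound on (1 + 3 k^(1/l)) obtained by inverting Equation 17.
    base = (eps_target / eps_inject ** exponent) ** (1.0 / (exponent - 1))
    root = (base - 1) / 3  # bound on k^(1/l)
    return root ** levels if root > 0 else 0.0


def lowest_feasible_level(eps_inject, eps_target, max_levels=5):
    """Return the smallest level whose maximum capacity is at least 1."""
    for levels in range(1, max_levels + 1):
        if max_capacity_for_level(levels, eps_inject, eps_target) >= 1:
            return levels
    return None  # no level up to max_levels reaches the target
```

For instance, with an injection error rate of 1e-3 and a target of 1e-10, one level cannot reach the target but two levels can, so the search would start from ℓ = 2.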
Next, nonlinear optimization techniques are used to search
within the mapped feasible space for optimal design points
in both K and X .
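As a rough illustration of the search over K and X, the sketch below substitutes an exhaustive grid search for the nonlinear optimization techniques mentioned above. The `volume` callable is a hypothetical stand-in for the space-time volume (area times latency) of a design point, and the toy objective in the usage lines is not derived from the paper's models.

```python
def grid_search(volume, eps_inject, max_capacity):
    """Exhaustively search integer (K, X) under constraints (i) and (ii).

    volume(K, X) is a user-supplied (hypothetical) objective to minimize.
    Returns (best_volume, K, X), or None if no point is feasible.
    """
    limit = (1 - 8 * eps_inject) / (3 * eps_inject)
    best = None
    for K in range(1, max_capacity + 1):
        for X in range(1, K + 1):       # constraint (i): 1 <= X <= K
            if K / X > limit:           # constraint (ii)
                continue
            v = volume(K, X)
            if best is None or v < best[0]:
                best = (v, K, X)
    return best


# Toy objective with a known minimum at K = 6, X = 2.
toy_volume = lambda K, X: (K - 6) ** 2 + (X - 2) ** 2 + 1
result = grid_search(toy_volume, eps_inject=0.01, max_capacity=20)
```

An exhaustive search scales quadratically in the capacity bound, so it is only a baseline against which a nonlinear optimizer could be checked.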
Algorithm 1 Space-time Optimization Procedure
Input: P_s, N_gates, ε_inject, distribution D and constraints C
Output: K, X
procedure OPTIMIZE
    K ← 1, X ← 1, ℓ_max ← 5
    ε_target ← P_s / N_gates
    for ℓ ∈ [1, ℓ_max] do
        k_ℓ ← (K/X)^(1/ℓ)
        n_ℓ ← 3 k_ℓ + 8
        for r ∈ {1, ..., ℓ} do
            if r == ℓ then ε_r ← ε_target
            else ε_r ← (1 + 3 k_ℓ)^(2^r − 1) · ε_inject^(2^r)
            end if
            d_r ← Solve{ d_r · (100 ε_in)^((d_r + 1)/2) = ε_r, d_r }
        end for
        R ≡ K/X ← Solve{ ε_ℓ = ε_target, R }
        if R ≥ 1 then
            K_output ← K · ∏_{r=1}^{ℓ} [ 1 − n_ℓ · ε_{r−1} ]    (Equation 10, with ε_0 = ε_inject)
            s ← ⌊t / K_output⌋
            T_t ← 4 d_ℓ + 4
            T_distill ← 11 · Σ_{r=1}^{ℓ} d_r
            n_distill ← (T_t / T_distill) · ( s · √(K/X) + √((t − sK)/X) )
            T_total ← T_distill · ( Σ_{t=0}^{T_peak} n_distill · D[t] )
            A_factories ← X · n_ℓ^(ℓ−1) · (6 k_ℓ + 14) · d_1^2
            (K, X) ← argmin_{(K,X) : C} A_factories · T_total
        else
            ℓ ← ℓ + 1
        end if
    end for
    return K, X
end procedure
7.3 Simulation and Validation
This section explores the validity of our models through
empirical evaluation of the space-time resources. To do this,
we improve the surface code simulation tool from [11] to
accurately assess the latency and qubit cost of fully error-corrected
applications with various magic-state distillation
factory configurations. Specifically, we added support for
arbitrary factory layouts, which manifests as black boxed
regions dedicated to factories that cannot be routed through
during computation, combined with sets of locations of pro-
duced magic states. The result is a cycle precise simulator that
accurately performs production and consumption of magic
states, including all necessary routing.
One implementation detail that is supported is the ability
to dynamically reallocate specific magic-state assignments
during runtime. Statically, each T gate operation is prespeci-
fied with a particular magic-state resource, located along the
outer edge of a factory. During runtime, this can introduce
unnecessary contention, as two nearby logical qubits can po-
tentially request the same magic state. This is avoided by
implementing online magic-state resource shuffling, so that if
the particular state that was requested is unavailable, the sys-
tem selects the next nearest state that is available. If no such
states exist, this T gate is stalled until the next distillation
cycle is completed.
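The online reassignment policy described above can be sketched as a small selection routine. This is not the simulator's code: the function name and data layout are hypothetical, and "nearest" is assumed here to mean Manhattan distance from the requesting qubit on the lattice.

```python
def assign_magic_state(target, requested, states):
    """Pick a magic state for a T gate on the qubit at position `target`.

    states maps a lattice position (row, col) to True if the magic state
    there is still available. Returns the chosen position, or None if the
    gate must stall until the next distillation cycle.
    """
    if states.get(requested, False):
        choice = requested
    else:
        free = [pos for pos, available in states.items() if available]
        if not free:
            return None
        # Fall back to the nearest available state (assumed Manhattan metric).
        choice = min(free, key=lambda p: abs(p[0] - target[0])
                     + abs(p[1] - target[1]))
    states[choice] = False  # the state is consumed by this T gate
    return choice


# Two nearby qubits statically assigned the same state at (0, 0).
pool = {(0, 0): True, (0, 3): True, (5, 5): True}
first = assign_magic_state((1, 0), (0, 0), pool)
second = assign_magic_state((0, 1), (0, 0), pool)
```

Here the second request is redirected to another available state instead of stalling, which is the contention the reshuffling is meant to avoid.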
[Figure 8: latency (surface code cycles) versus capacity (magic states). Legend: Simulation Data, Upper Bound Model Prediction, Lower Bound Model Prediction.]
Figure 8: Model validation by simulation. The simulation
data (blue line) lies between the upper bound model prediction
that overestimates congestion (orange line), and the
congestion-free lower bound (green line).
Figure 8 shows simulation results superimposed on top of
those derived analytically. We can see that the model shows
the same trend as the simulation behavior (blue line), and thus
we will be able to show relative tradeoffs between capacity
and latency. For simplicity the validation is performed on
a single unified factory located at the center of the surface code
lattice. The results extend well to multiple factories, because
in the distributed case, each factory will be responsible for
magic-state requests in a sub-region of the lattice.
We can validate this by simulating optimal operating points
in the space-time trade-off spectrum and comparing them to
our expectation from the model. Using simulation data, we
re-plot our idealized tradeoff in Figure 1 for the Ising Model
application and show the results in Figure 9. We see that as
factory capacities increase, the application's execution time
improves at the expense of its qubit count. In this figure, the
space-time volume is sketched in green, and has two near-optimal points:
one with relatively few qubits but high latency, and one with
many qubits but low latency.
The worst performance occurs in the middle of this spectrum,
when the transition from level 1 to level 2 distillation needs to
occur (causing a sudden jump in qubits, but not much latency
improvement).
8. RESULTS
In this section we present the resource requirements of
various magic-state factory architectures, and show that by
considering the scaling behaviors that we have highlighted
and searching the design space with our optimization algo-
rithm, we can discover system configurations that save orders
of magnitude of quantum volume.
[Figure 9: number of cycles (latency), number of qubits (area), and space-time volume versus capacity K.]
Figure 9: Space-time tradeoff observed empirically in simulation
for varying factory capacities. A space-time volume
(green line) can be chosen at K ≈ 40, which is an optimal,
minimized value on this curve. It corresponds here to a
low-qubit, high-latency configuration. Notice that another
configuration (at K ≈ 300) could be chosen, corresponding
to a high-qubit, low-latency configuration. In this case, the
former of these choices is more resource optimized, as the
space-time cost is lower.
[Figure 10: number of surface code cycles versus -log(physical error rate), for the Surplus and Singlet designs.]
Figure 10: (Color online) Comparing surplus and singlet
designs. There are regions where each outperforms the other,
showing great sensitivity to the underlying physical error rate
and the corresponding required ℓ. Recall that the step-like
shape is due to level transitions explained in section 7.2.
We first compare the overheads of the surplus and singlet
architectures that represent baselines against which we compare
our optimized architectures. We then compare the surplus
architecture with the optimized-distributed design found
with our optimization algorithm. We look at two representative
benchmarks for the quantum chemistry and quantum
simulation fields, the Ising Model [25] and Ground State
Estimation [23] algorithms, as well as how performance of
these architectures changes as the benchmarks scale up in
size. Next, we detail the space and time trade-off that is made
in our resource optimized design choices, and show that the
latency induced by a design is a more dominant factor in these
applications. We then present a full design space comparison,
showing the performance of the surplus design against the
singlet design, as well as the optimized-unified factory design,
all compared to the performance of optimized-distributed de-
sign. Lastly, we analyze the sensitivity of these designs to
fluctuations in the underlying physical error rates, and show
that building out a distributed factory design adds robustness
that makes the architecture perform well for a wider range of
input parameters.
8.1 Comparing Surplus and Singlet Architectures
We begin with Figure 10 by comparing two architectures
that aim solely to minimize application latency or required
space. This comparison represents the range between two
ends of the design space spectrum for single factory archi-
tectures, and each shows a particular error rate range over
which it performs more optimally. Initially, at the highest in-
put error rate, the space optimal singlet design requires more
resources than the time optimal surplus design, as the applica-
tion suffers from excessive latency from magic-state factory
access time. Note the inflection points at $10^{-3.5}$ and $10^{-4.5}$
input error rates. At these points, the singlet factory is able to
reduce the number of rounds of distillation it must perform,
as input error rates are sufficiently low. Over this region, the
reduction in area compensates for the expansion in computation
time, and the design outperforms the much larger surplus
factory configuration. At $10^{-4.5}$, the surplus factory is able to
operate with fewer distillation rounds as well, enabling this
configuration to outperform the singlet design.
This behavior is surprising, as it indicates that with re-
spect to a high-parallelism application, there are input error
rate regions where intuitively conservative, space minimizing
designs are able to outperform what seem like aggressively
optimized designs. We see this because we are comparing
space and time simultaneously, which allows us to see that
the trade-off is asymmetric and these factors interact non-
trivially.
8.2 Optimized Design Performance
We now move to comparing the surplus design against the
optimized-distributed design discovered by our optimization
algorithm, that is allowed to subdivide factories across the
machine. Figures 11a and 11b depict the detailed results of
our optimization procedure on the Ising Model and Ground
State Estimation applications, respectively. Ising Model is
intrinsically very parallel, which leads to a higher optimal
capacity choice for the optimized-distributed factory. Note
however that it is able to choose a distribution level that saves
approximately 15x in space-time volume. Ground State Estimation
is very serial, yet for sufficiently low error rates the
optimized-distributed design is able to incorporate distribution
of factories into the lattice to lower the required block
code concatenation level ℓ, resulting in a 12x reduction in
volume across these points.
The reason that the distributed factory design is able to
outperform the surplus design is that the feasibility regions
of the two designs differ. Because the distributed factory
utilizes many small factories on the machine it can achieve a
higher output state fidelity than a single factory design, which
enables it to operate with a smaller number of distillation
rounds. The optimization algorithm respects this character-
istic, which is why it searches iteratively from the lowest
number of distillation rounds possible, one by one until it
discovers a feasible factory configuration.
8.2.1 Optimized Design Performance Scaling
Figures 11c and 11d detail these trends as larger and larger
quantum simulation applications are executed. For extremely
large simulations, we find that the volume reductions obtained
by optimizing a factory design become even more pronounced,
resulting in between a 15x and 18x full resource reduction.
These designs are also sensitive to the physical error rates
at which the block code distillation level must change.
8.3 Distributed Factory Characteristics
As Figure 12a describes, an optimized-distributed set of
factories is able to save between 1.2x and 4x in total
space-time volume over the optimized-unified factory. Large
volume jumps occur primarily between $10^{-3.5}$ and
$10^{-3.4}$ physical error rate, and this again corresponds to
a requirement by this application to increment to a higher
block code level $\ell$, which happens for both the unified and
distributed factory schemes.

[Figure 11 panels: (a) Ising Model N=500; (b) Ground State Estimation; (c) Ising Model N=1000; (d) Ising Model N=2000. Each panel plots space-time volume against -Log(Physical Error Rate), from 5.0 to 3.0, for the Optimized-Distributed and Surplus designs.]
Figure 11: (Color online) (a)-(b) Resource reductions of optimized-distributed designs over surplus designs for both Ising Model
and Ground State Estimation. While Ising Model is intrinsically more parallel, which leads to high choices of output capacity,
both applications still show between a 12x and 16x reduction in overall space-time volume. (c)-(d) Ising Model with varying
problem sizes, comparing time optimal factories against fully space-time optimized configurations. We see that the trend of
between 15x and 20x total volume reduction extends to larger molecular simulations.

[Figure 12 panels: (a) Space tradeoff (number of qubits); (b) Time tradeoff (number of surface code cycles); (c) Output capacities procedurally selected (capacity K). Each panel is plotted against -Log(Physical Error Rate) for the Optimized-Unified and Optimized-Distributed designs.]
Figure 12: (Color online) Space-time volume reduces by moving from an optimized-unified factory to an optimized-distributed
factory, as the designs trade space for time. Magic-state access latency is a dominating effect in these applications, as can be
seen by the large capacity values chosen by the optimized factory configuration.
These optimized designs trade space for time, as Figures
12a and 12b indicate, and the net effect is an overall volume
reduction. This indicates that, for these highly parallel
quantum chemistry applications, magic-state factory access
latency is a much more dominant effect than the number of
physical qubits required to run these factories.
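
As a small worked example of this trade, space-time volume can be taken as the product of physical qubits and surface code cycles. The numbers below are hypothetical, chosen only to lie within the axis ranges of Figures 12a and 12b; they are not data from the evaluation.

```python
def spacetime_volume(qubits, cycles):
    """Space-time volume as physical qubits times surface code cycles."""
    return qubits * cycles


# Hypothetical unified design: fewer qubits, many cycles.
unified = spacetime_volume(qubits=10**5, cycles=10**9)

# Hypothetical distributed design: 10x more qubits, 40x fewer cycles.
distributed = spacetime_volume(qubits=10**6, cycles=25 * 10**6)

# Ratio of the two volumes; a value above 1 favors the distributed design.
reduction = unified / distributed
```

Here a tenfold increase in qubits is more than repaid by a fortyfold reduction in cycles, so the volume still shrinks by a factor of four.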
Figure 12c depicts the output capacities chosen by the
optimization procedure, and how they differ when the system is
unified or distributed. Notably, at both ends of the input
error rate spectrum both factory architectures choose the same
output capacity. In the high error rate case this is driven by
the high $\ell$ requirement, while in the low error rate limit
both factory architectures can afford to be very large without
suffering any yield penalties. However, through the center of
the error rate spectrum the unified factory design must lower
the chosen output capacity, as supporting higher capacity would
require a very expensive increase in the number of distillation
rounds.

[Figure 13 plot: factory space-time volume against input error rate, from $10^{-6}$ to $10^{-2}$, for the Optimized-Distributed, Singlet, and Surplus designs.]
Figure 13: Factory architectures and their sensitivities to
fluctuations in underlying physical error rates
8.4 Full Design Space Comparison
Figure 14 depicts the full space-time volume required by
different factory architectures across the design space. Shown
are the four main configurations: a surplus factory configured
with output capacity $K = T_{peak}$, a singlet factory with
$K = 1$, an optimized-unified factory, and an
optimized-distributed factory.
Distinct volume phases are visible on the graph, due to the
different feasibility regions of the architectures. Sweeping
from high error rates to low error rates, large volume jumps
occur, as observed before, whenever a configuration can operate
with fewer rounds of distillation to convert the input error
rate to the target output error rate. Notice that this jump
occurs earliest for the singlet, optimized-unified, and
optimized-distributed designs, at an input error rate of
$10^{-3.5}$. All of these designs show an inflection point
here, where the configurations can achieve the target output
error rate with a smaller number of block code distillation
levels. This is not true of the surplus factory, which has the
largest output capacity of the set. Because its output capacity
is so high, its lowest achievable output error rate is much
higher than that of the other designs. This forces the block
code level to remain high until the input error rate becomes
sufficiently low, which occurs at $10^{-4.5}$.
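
One way to see why a high-capacity factory sheds a distillation round later than a small one is sketched below. It assumes the leading-order Bravyi-Haah relation eps_out = (1 + 3K) * eps_in^2 for a single round with capacity K, which is imported here as an assumption and is not restated in this section. Under that model, one round suffices only when eps_in is at most sqrt(eps_target / (1 + 3K)). The target and capacities are hypothetical and are not the values behind Figure 14.

```python
import math


def max_input_error_single_round(capacity, eps_target):
    """Largest input error rate for which one round meets eps_target.

    Assumes eps_out = (1 + 3 * capacity) * eps_in ** 2 (leading order).
    """
    return math.sqrt(eps_target / (1 + 3 * capacity))


# Hypothetical target output error rate and two hypothetical capacities.
singlet_bound = max_input_error_single_round(capacity=1, eps_target=1e-10)
surplus_bound = max_input_error_single_round(capacity=533, eps_target=1e-10)
```

With these inputs the high-capacity factory must wait for an input error rate about 20 times lower than the singlet before it can drop to a single round, mirroring the later jump of the surplus curve.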
8.5 Sensitivity Analysis
Now we turn to analyzing how these designs perform if the
environment for which they were designed changes. Suppose that
a design choice has been made specifying the desired factory
capacity $K$, number of factories $X$, and block code
distillation level $\ell$. Different types of architectures
show varying sensitivity to fluctuations in the underlying
design points around which they were constructed. Figure 13
details an instance of this. The figure shows the surplus,
singlet, and optimized-distributed factory designs, in this
case setting $K \sim 600$ and $X \sim 200$ for the distributed
architecture. All of these factories were designed under the
assumption that the physical machine will operate with a
$10^{-5}$ error rate.
We see that while these designs perform similarly over the
range from $10^{-5}$ to $10^{-4}$, just after this point the
surplus factory encounters a steep volume expansion due to the
yield threshold of Equation 11. For this design the threshold
on tolerable physical error rates is quite stringent,
significantly more so than that of the other designs. Because
of this, it can tolerate a smaller range of fluctuation in the
underlying error rate before it ceases to execute algorithms
correctly.
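
The steep expansion can be illustrated with a simple yield model. The sketch below assumes the first-order Bravyi-Haah success probability 1 - (8 + 3k) * eps for a single round with block size k, used here as a stand-in for the yield threshold of Equation 11, which is not restated in this section. It also reads K as a total capacity split evenly across X factories, so K = 600 and X = 200 give a block size of 3 per factory; that reading is an assumption.

```python
def expected_yield_fraction(block_size, eps_in):
    """Fraction of attempted states expected to survive one round.

    Assumes first-order success probability 1 - (8 + 3k) * eps, floored at 0.
    """
    return max(0.0, 1.0 - (8 + 3 * block_size) * eps_in)


def yield_threshold(block_size):
    """Input error rate at which the assumed first-order yield reaches zero."""
    return 1.0 / (8 + 3 * block_size)


# One factory holding all 600 outputs versus 200 factories of 3 outputs each.
unified_threshold = yield_threshold(600)      # about 5.5e-4
distributed_threshold = yield_threshold(3)    # about 5.9e-2
```

Under this model the single large factory loses all yield at an error rate roughly two orders of magnitude lower than the small factories do, which is consistent with the surplus curve diverging shortly after $10^{-4}$.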
9. CONCLUSION
We present methods for designing magic-state distillation
factory architectures that are optimized to execute
applications that exhibit a specific parallelism distribution.
By considering applications with different levels of
parallelism, we design architectures that take advantage of
these characteristics and execute the application with minimal
space and execution time overhead.
By carefully analyzing the interaction between various
magic-state factory characteristics, we find that choosing the
most resource-optimized magic-state distribution architecture
is a complex procedure. We derive and present these trade-offs,
and compare the architectures that have been commonly described
in the literature. These comparisons show a surprising picture:
even a modest factory capable of producing just a single
resource state per distillation cycle can outperform the more
commonly described surplus factory in particular input error
rate regimes. We also propose a method of dividing the total
number of magic states to be produced among several smaller
factories uniformly distributed on a machine. In doing this, we
see that these architectures are capable of achieving higher
output fidelities for their produced states, with added
resilience against fluctuations of the underlying error rate,
when compared to unified architectures composed of a single
factory. While these designs are tailored to specific
applications, we conjecture that distributed systems would in
fact be more flexible in their ability to execute applications
with different amounts of parallelism. Intrinsic to their
design is the ability to optionally compile smaller
applications to various subunits of the machine. Because of
this, these designs can be used to support a much wider range
of application types than those composed of a single factory.
These systems also show that the trade-off between space and
time is asymmetric. In quantum chemistry and simulation
applications, we notice that the resource-optimized designs can
require upwards of 2 orders of magnitude more physical qubits,
while they end up saving over 3 orders of magnitude in time. We
find that magic-state access time, or latency induced
specifically by stalls while magic states are produced, is a
dominating effect in the execution of these applications. To
mitigate these effects in a resource-aware fashion, a
distributed system of several factories allows for efficient
partitioning of the magic-state demand across the machine, at
the cost of physical area.
These conclusions can have physical impacts on near-term
designs as well. Specifically, the construction of a factory
architecture can imply the location of physical control signals
on an underlying device. What we show, then, is the effect of
several theoretical long-term designs, and the conclusion that
distributed sets of factories outperform other designs should
help motivate device fabrication teams as they decide which
physical locations should be occupied by rotation-generating
control signals. As a general principle, long-term
architectural design and analysis can help guide the study and
development of near-term devices, which ultimately will help
hasten the onset of the fault-tolerant era [1].

[Figure 14 plot: space-time volume against -Log(Physical Error Rate), from 5.0 to 3.0, for the Surplus, Singlet, Optimized-Unified, and Optimized-Distributed designs.]
Figure 14: (Color online) Full volume comparison across distillation factory architectures.
10. FUTURE WORK
There are a number of immediate extensions to this study:
• Comparing distributed factory topologies. Choosing
an optimal layout for a distributed factory design is
potentially very difficult, and requires an ability to es-
timate the overheads associated with different layouts.
Using architectural simulation tools and adapted net-
work simulation mechanisms, we can foresee evaluation
of two new architectures: peripheral and asymmetric-
mesh placement. Peripheral placement refers to facto-
ries surrounding a central computational region, while
asymmetric-mesh placement refers to embedding the
factories throughout the machine itself.
• Embedding data qubits within magic-state factories.
While the designs presented here assume that magic-
state factory regions are to be considered black boxes
that are not to be occupied by data qubits, because of
their massive size requirements we imagine a system
that embeds the relatively smaller number of data qubits
within the factories themselves. A study of the effect of
various embedding techniques on factory cycle latency
could determine the efficiency of such a design.
• Advanced factory pipeline hierarchy. We envision a
concatenation of clusters of magic-state factories,
targeting continuous output in time and hence a reduction
in the contention caused by distillation latency. In
particular, each sub-region in the mesh would contain
multiple small, identical factories that are turned on
asynchronously, so that at each time step there is always
a factory completing a distillation cycle and magic
states are served continuously.
• Generalization to other distillation protocols. Although
the Bravyi-Haah protocol studied in this paper is among
the best known protocols, little analysis has been done
on other techniques discovered recently [5].
• Optimizing the internal mapping and scheduling of
magic-state factories. This work has modeled factories
as black-boxed regions that continuously produce
resources. A realistic implementation of those factories
that optimizes for internal congestion would significantly
reduce factory overhead, in conjunction with the designs
proposed in this work that optimize for external
congestion. This was studied in [8].
• Flexibility of Distributed Magic-State Architectures.
While these designs are tailored to applications of a cer-
tain parallelism distribution, a study could analyze de-
signs that balance domain specific optimization against
general application compatibility.
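
The asynchronous pipeline idea in the list above can be made concrete with a toy schedule. The sketch assumes identical factories with a fixed distillation latency measured in cycles, started at evenly staggered integer offsets; the function name and parameters are hypothetical and no factory internals are modeled.

```python
def completion_times(num_factories, latency, horizon):
    """Sorted completion times (in cycles) up to and including `horizon`.

    Factory i starts at offset (i * latency) // num_factories and then
    finishes one distillation cycle every `latency` cycles.
    """
    times = []
    for i in range(num_factories):
        offset = (i * latency) // num_factories
        t = offset + latency
        while t <= horizon:
            times.append(t)
            t += latency
    return sorted(times)


# Four staggered factories with a 4-cycle latency, observed for 8 cycles.
staggered = completion_times(num_factories=4, latency=4, horizon=8)

# A single factory with the same latency, for comparison.
single = completion_times(num_factories=1, latency=4, horizon=8)
```

When the number of factories equals the latency, some factory completes at every cycle after the initial fill, whereas a single factory delivers only once per latency period.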
Acknowledgements
This work was funded in part by NSF Expeditions in Comput-
ing grant 1730449, Los Alamos National Laboratory and the
U.S. Department of Defense under subcontract 431682, by
NSF PHY grant 1660686, and by a research gift from Intel
Corporation.
11. REFERENCES
[1] J. Preskill, “Quantum computing in the NISQ era and beyond,” arXiv
preprint arXiv:1801.00862, 2018.
[2] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, “Topological
quantum memory,” Journal of Mathematical Physics, vol. 43, no. 9,
pp. 4452–4505, 2002.
[3] A. G. Fowler, “Surface codes: Towards practical large-scale quantum
computation,” Physical Review A, vol. 86, no. 3, 2012.
[4] S. Bravyi and J. Haah, “Magic-state distillation with low overhead,”
Phys. Rev. A, vol. 86, p. 052329, Nov 2012.
[5] J. Haah, M. B. Hastings, D. Poulin, and D. Wecker, “Magic state
distillation with low space overhead and optimal asymptotic input
count,” arXiv preprint arXiv:1703.07847, 2017.
[6] C. Jones, “Multilevel distillation of magic states for quantum
computing,” Physical Review A, vol. 87, no. 4, p. 042305, 2013.
[7] A. G. Fowler, S. J. Devitt, and C. Jones, “Surface code implementation
of block code state distillation,” Scientific Reports, vol. 3, p. 1939, jun
2013.
[8] Y. Ding, A. Holmes, A. Javadi-Abhari, D. Franklin, M. Martonosi, and
F. T. Chong, “Magic-state functional units: Mapping and scheduling
multi-level distillation circuits for fault-tolerant quantum architectures,”
arXiv preprint arXiv:1809.01302, 2018.
[9] L. S. Bishop, S. Bravyi, A. Cross, J. M. Gambetta, and J. Smolin,
“Quantum volume,” tech. rep., 2017. URL:
https://dal.objectstorage.open.softlayer.com/v1/AUTH_039c3bf6e6e54d76b8e66152e2f87877/community-documents/quatnum-volumehp08co1vbo0cc8fr.pdf.
[10] A. Paler, I. Polian, K. Nemoto, and S. J. Devitt, “Fault-tolerant,
high-level quantum circuits: form, compilation and description,”
Quantum Science and Technology, vol. 2, no. 2, p. 025003, 2017.
[11] A. Javadi-Abhari, P. Gokhale, A. Holmes, D. Franklin, K. R. Brown,
M. Martonosi, and F. T. Chong, “Optimized surface code
communication in superconducting quantum computers,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium
on Microarchitecture, pp. 692–705, ACM, 2017.
[12] F. Bloch, “Nuclear induction,” Physical review, vol. 70, no. 7-8, p. 460,
1946.
[13] M. A. Nielsen and I. L. Chuang, Quantum computation and quantum
information. Cambridge university press, 2010.
[14] A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo, N. Margolus,
P. Shor, T. Sleator, J. A. Smolin, and H. Weinfurter, “Elementary gates
for quantum computation,” Physical Review A, vol. 52, no. 5, p. 3457,
1995.
[15] C. Horsman, A. G. Fowler, S. Devitt, and R. Van Meter, “Surface code
quantum computing by lattice surgery,” New Journal of Physics,
vol. 14, no. 12, p. 123011, 2012.
[16] D. Litinski and F. von Oppen, “Lattice surgery with a twist:
simplifying Clifford gates of surface codes,” Quantum, vol. 2, p. 62,
2018.
[17] S. Bravyi and A. Kitaev, “Universal quantum computation with ideal
Clifford gates and noisy ancillas,” Physical Review A, vol. 71, no. 2,
2005.
[18] A. Steane, “Space, time, parallelism and noise requirements for
reliable quantum computing,” arXiv preprint quant-ph/9708021, 1997.
[19] N. C. Jones, R. Van Meter, A. G. Fowler, P. L. McMahon, J. Kim, T. D.
Ladd, and Y. Yamamoto, “Layered architecture for quantum
computing,” Physical Review X, vol. 2, no. 3, p. 031007, 2012.
[20] A. Montanaro, “Quantum algorithms: an overview,” npj Quantum
Information, vol. 2, p. npjqi201523, 2016.
[21] R. Babbush, N. Wiebe, J. McClean, J. McClain, H. Neven, and G. K.
Chan, “Low depth quantum simulation of electronic structure,” arXiv
preprint arXiv:1706.00023, 2017.
[22] I. D. Kivlichan, J. McClean, N. Wiebe, C. Gidney, A. Aspuru-Guzik,
G. K.-L. Chan, and R. Babbush, “Quantum simulation of electronic
structure with linear depth and connectivity,” Physical review letters,
vol. 120, no. 11, p. 110501, 2018.
[23] J. D. Whitfield, J. Biamonte, and A. Aspuru-Guzik, “Simulation of
electronic structure Hamiltonians using quantum computers,”
Molecular Physics, vol. 109, no. 5, pp. 735–750, 2011.
[24] N. C. Jones, J. D. Whitfield, P. L. McMahon, M.-H. Yung,
R. Van Meter, A. Aspuru-Guzik, and Y. Yamamoto, “Faster quantum
chemistry simulation on fault-tolerant quantum computers,” New
Journal of Physics, vol. 14, no. 11, p. 115023, 2012.
[25] R. Barends, A. Shabani, L. Lamata, J. Kelly, A. Mezzacapo,
U. Las Heras, R. Babbush, A. G. Fowler, B. Campbell, Y. Chen, et al.,
“Digitized adiabatic quantum computing with a superconducting
circuit,” Nature, vol. 534, no. 7606, pp. 222–226, 2016.
[26] H. F. Trotter, “On the product of semi-groups of operators,”
Proceedings of the American Mathematical Society, vol. 10, no. 4,
pp. 545–551, 1959.
[27] C. Batista and G. Ortiz, “Generalized Jordan-Wigner transformations,”
Physical review letters, vol. 86, no. 6, p. 1082, 2001.
[28] N. J. Ross and P. Selinger, “Optimal ancilla-free Clifford+T
approximation of z-rotations,” arXiv preprint arXiv:1403.2975, 2014.
[29] D. Wecker, B. Bauer, B. K. Clark, M. B. Hastings, and M. Troyer,
“Gate-count estimates for performing quantum chemistry on small
quantum computers,” Physical Review A, vol. 90, no. 2, p. 022305,
2014.
[30] V. Kliuchnikov, D. Maslov, and M. Mosca, “Fast and efficient exact
synthesis of single qubit unitaries generated by Clifford and T gates,”
arXiv preprint arXiv:1206.5236, 2012.
[31] P. Selinger, “Quantum circuits of T-depth one,” Physical Review A,
vol. 87, no. 4, 2013.
[32] M. Amy, D. Maslov, and M. Mosca, “Polynomial-time T-depth
optimization of Clifford+T circuits via matroid partitioning,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 33, no. 10, pp. 1476–1489, 2014.
[33] J. O’Gorman and E. T. Campbell, “Quantum computation with
realistic magic-state factories,” Phys. Rev. A, vol. 95, p. 032338, Mar
2017.
[34] H. Anwar, E. T. Campbell, and D. E. Browne, “Qutrit magic state
distillation,” New Journal of Physics, vol. 14, no. 6, p. 063006, 2012.
[35] A. M. Meier, B. Eastin, and E. Knill, “Magic-state distillation with the
four-qubit code,” arXiv preprint arXiv:1204.4221, 2012.
[36] E. T. Campbell, H. Anwar, and D. E. Browne, “Magic-state distillation
in all prime dimensions using quantum Reed-Muller codes,” Physical
Review X, vol. 2, no. 4, p. 041021, 2012.
[37] R. Van Meter and C. Horsman, “A blueprint for building a quantum
computer,” Communications of the ACM, vol. 56, no. 10, pp. 84–93,
2013.
[38] A. Paetznick and B. W. Reichardt, “Fault-tolerant ancilla preparation
and noise threshold lower bounds for the 23-qubit Golay code,” arXiv
preprint arXiv:1106.2190, 2011.
[39] N. Isailovic, M. Whitney, Y. Patel, and J. Kubiatowicz, “Running a
quantum circuit at the speed of data,” in ACM SIGARCH Computer
Architecture News, vol. 36, pp. 177–188, IEEE Computer Society,
2008.
[40] A. JavadiAbhari, S. Patil, D. Kudrow, J. Heckey, A. Lvov, F. T. Chong,
and M. Martonosi, “ScaffCC: a framework for compilation and analysis
of quantum computing programs,” in Proceedings of the 11th ACM
Conference on Computing Frontiers, p. 1, ACM, 2014.
