Parallelising the Queries in Bucket Brigade Quantum RAM
Alexandru Paler+,∗, Oumarou Oumarouπ, Robert Basmadjianπ
+University of Transilvania, B-dul Eroilor 29, 500036, Brașov, România
∗Johannes Kepler University, Altenberger Str. 69, 4040, Linz, Austria and
πDepartment of Informatics, Clausthal University of Technology, 38678 Clausthal-Zellerfeld, Germany
Quantum algorithms often use quantum RAMs (QRAM) for accessing information stored in a
database-like manner. QRAMs have to be fast, resource efficient and fault-tolerant. The latter
is often influenced by access speeds, because shorter times introduce less exposure of the stored
information to noise. The total execution time of an algorithm depends on the QRAM access
time which includes: 1) address translation time, and 2) effective query time. The bucket brigade
QRAMs were proposed to achieve faster addressing at the cost of exponentially many ancillae. We
illustrate a systematic method to significantly reduce the effective query time by using Clifford+T
gate parallelism. The method does not introduce any ancilla qubits. Our parallelisation method is
compatible with the surface code quantum error correction. We show that parallelisation is a result of
advantageous Toffoli gate decomposition in terms of Clifford+T gates, and after addresses have been
translated, we achieve theoretical O(1) parallelism for the effective queries. We conclude that, in
theory: 1) fault-tolerant bucket brigade quantum RAM queries can be performed approximately with
the speed of classical RAM; 2) the exponentially many ancillae from the bucket brigade addressing
scheme are a trade-off cost for achieving exponential query speedup compared to quantum read-only
memories whose queries are sequential by design. The methods to compile, parallelise and analyse
the presented QRAM circuits were implemented in software which is available online.
I. Introduction
Quantum random access memory (QRAM) was proposed
for querying (e.g. reading and writing) information during
quantum information processing. However, its practical
utility is still under study due to the exponential costs
associated with explicitly storing and retrieving
information [1]. Nevertheless, QRAMs and their read-only
variant, QROMs (quantum read-only memories) [2], are still
capturing the attention of researchers due to their
conceptual similarity to classical RAMs, and due to the
fact that they can be used in subroutines of relevant
quantum algorithms, such as the one for linear systems
of equations [3].
Whenever QRAM queries are frequent, the speed of the
QRAMs influences the total execution time of the quan-
tum algorithms. Moreover, the total execution time of
an algorithm impacts the amount of resources necessary
for achieving fault-tolerance [4]. Fast QRAMs may prove
useful in implementing large scale fault-tolerant quantum
computations. Implementations of QRAM circuits were
presented in [5] and are available online¹.
The access time of a QRAM includes: 1) the time
necessary to determine, based on an input set of ad-
dresses, the memory locations to query, and 2) the time
required to perform the effective queries (e.g. retrieve or
store information). The bucket brigade QRAM variant
[6] was proposed to reduce the address translation time
necessary for computing the memory pointers where the
queries should be executed.
∗ alexandrupaler@gmail.com
1 https://github.com/quantumresource/quantify
FANOUT QUERY FANIN
/---------------------\ /--------------\ /--------------------\
a1: ----------------@---@---------------------------------@---@-----------
| | | |
a0: --------@-------|---|---------------------------------|---|-------@---
| | | | | |
b_00: ------X---@---@---|----X-------------@--------X-----|---@---@---X---
| | | | | | | | |
b_01: ----------X---|---@----|X----@-------|--------|X----@---|---X-------
| | || | | || | |
b_10: --------------X---|----@|----|-------|---@----@|----|---X-----------
| | | | | | |
b_11: ------------------X-----@----|---@---|---|-----@----X---------------
| | | |
m11: ------------------------------|---|---|---@--------------------------
| | | |
m10: ------------------------------|---|---@---|--------------------------
| | | |
m01: ------------------------------|---@---|---|--------------------------
| | | |
m00: ------------------------------@---|---|---|--------------------------
| | | |
target: ---------------------------X---X---X---X--------------------------
\----------------------/ \-------------/ \----------------------/
FANOUT QUERY FANIN
FIG. 1. A bucket brigade circuit consisting of CNOTs and
Toffoli gates. The controlled gates in the diagram use @
for controls and X for target wires. The address wires are
named a_i, the memory cells holding single bits are m_i. The
classical bits stored on the wires b_i determine which memory
wires to read/write.
QRAMs can be described in the form of quantum cir-
cuits. Information is stored to and retrieved from the
wires/qubits (e.g. analogue to classical memory cell),
while quantum gates are used to implement address
translation and the queries (e.g. effective read and write
operations). In this work, we focus on retrieving informa-
tion from the bucket brigade QRAM because, due to their
structure, reading is more expensive than writing. Once
the addresses have been translated, the bucket brigade
requires Toffoli (double-controlled-X, or CCX) gates for
reading, but only CNOT (controlled-NOT, or controlled-
X) gates for writing. We assume the information is already
stored in the QRAM database, and that the quantum
algorithm queries the QRAM frequently.
arXiv:2002.09340v2 [quant-ph] 29 Jul 2020

For a QRAM with $2^q$ distinct memory wires (and an
exponential number of ancillae wires, each representing a
memory cell that stores a single bit, such as in [7]), there
are q bits necessary to address any of the wires. A QRAM
circuit takes as input a superposition of addresses |adr_j〉
to be queried (the α_j are complex amplitudes), and outputs
the contents of the memory cell m_j addressed by |adr_j〉:

$$\sum_{j=0}^{2^q-1} \alpha_j |adr_j\rangle|0\rangle \xrightarrow{\text{QRAM}} \sum_{j=0}^{2^q-1} \alpha_j |adr_j\rangle|m_j\rangle$$
A bucket brigade QRAM circuit consists of three
sub-circuits (regions): 1) FANOUT where an exponential
number of ancilla bits b_i is used to store control bits
for pointing to the corresponding memory cells m_i; 2)
QUERY where the memory bits m_i are queried and the
results are stored on the ancilla wire |target〉; 3) FANIN
(the uncomputation of FANOUT) in order to maintain
reversibility, where the b_i bits are returned to their initial
|0〉 state. The three sub-circuits can be easily recognised
in Fig. 1. The QUERY region includes only Toffoli gates.
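The three regions can be illustrated with a small classical simulation on a single basis-state address. This is only a sketch: the function name `bucket_brigade_read` and the particular FANOUT gate order (a one-hot decoder built from Toffoli and CNOT gates) are assumptions made for illustration and are not read off Fig. 1.

```python
from itertools import product


def bucket_brigade_read(address_bits, memory):
    """Classically simulate FANOUT, QUERY and FANIN for one basis-state address.

    address_bits[k] is address bit a_k (least significant first) and
    memory[i] is the bit stored in cell m_i.
    Returns (target, b) as they are after FANIN.
    """
    q = len(address_bits)
    assert len(memory) == 2 ** q
    b = [0] * (2 ** q)   # ancilla bits b_i, all initialised to 0
    target = 0

    # Hypothetical FANOUT gate list: turns b into a one-hot pointer.
    ops = [("X", 0)]                                   # b_0 is the initial pointer
    for k in range(q):
        for j in range(2 ** k):
            ops.append(("CCX", k, j, j + 2 ** k))      # b_{j+2^k} ^= a_k AND b_j
            ops.append(("CX", j + 2 ** k, j))          # b_j ^= b_{j+2^k}

    def apply(op):
        if op[0] == "X":
            b[op[1]] ^= 1
        elif op[0] == "CCX":
            _, k, ctrl, tgt = op
            b[tgt] ^= address_bits[k] & b[ctrl]
        else:  # "CX"
            _, ctrl, tgt = op
            b[tgt] ^= b[ctrl]

    for op in ops:                 # FANOUT
        apply(op)
    for i in range(2 ** q):        # QUERY: one Toffoli per (b_i, m_i) pair
        target ^= b[i] & memory[i]
    for op in reversed(ops):       # FANIN: each gate is its own inverse
        apply(op)
    return target, b


if __name__ == "__main__":
    memory = [1, 0, 0, 1]          # m_0 .. m_3
    for a1, a0 in product((0, 1), repeat=2):
        print(a1, a0, bucket_brigade_read([a0, a1], memory))
```

Every QUERY Toffoli in this sketch writes to the same target and reads a distinct (b_i, m_i) pair, which is the structure the parallelisation discussed later relies on.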
From the quantum circuit perspective: 1) address-
translation duration is equivalent to the circuit depth
of the FANOUT/FANIN regions; 2) the effective query
speed is analogous to the depth of the QUERY region.
The trade-off cost for speeding up the address translation
is the exponentially large number of b_i ancillae,
which translates to additional resource requirements and
hence an increase in the circuit's width. In the standard
bucket brigade QRAM model, the queries are considered strictly
sequential. Thus, the depth of the QUERY region is dictated
by the number of queries. In the worst case, where
the entire QRAM is queried, the depth of QUERY is
exponential in q [7, 8].
The state-of-the-art Clifford+T formulation of the
bucket brigade quantum circuits [7] was obtained from
the original formulation [6] by decomposing the Toffoli
gates into known Clifford+T gate sequences. The
authors of [7] mentioned that important resource sav-
ings could be possible if the CCZ (double-controlled-Z
gate)/Toffoli transformation is employed in an advanta-
geous manner for the compilation of QRAM. The CCZ
gate is obtained from a Toffoli and two Hadamard gates.
This work continues on that line of thought and
illustrates some of the possible improvements. First, the
depth of the QUERY region is exponentially reduced by
parallelising the Toffolis. Second, the herein presented
method achieves a linear width improvement compared
to the state-of-the-art from [7], where parallelisation was
obtained by introducing four ancillae for each Toffoli
gate. We show that our method is ancillae free, and
reduces the width by a factor of four. Third, the presented
QRAMs have a T-count which is almost double that of
the QROM [2], and approximately half that of the
QRAMs from [7] (cf. T_bbs, T_rom, T_bbp in Sec. II).
The Toffoli construction used in this paper can be
interpreted as the enabler of trading depth for width
between QROM and QRAM: it exponentially increases the
width of the QRAM circuit, while it exponentially
decreases the depth of the circuit. Fig. 2 illustrates the
improvements obtained by parallelising the QRAM.
FIG. 2. Comparison between different versions of QRAM.
The proposed QRAM is marked green. The parallel bucket
brigade (light red) [7] introduces an exponential number of
ancillae to achieve parallelism. This is not the case with our
decomposition. The original bucket brigade circuit (dark red)
is sequential and has the same number of wires as the one
proposed herein, but requires an exponential number of
control signals. The QROM [2] circuit was specifically
designed with sequential queries, but does not require the
control signals.
In the following, the Results section presents exact re-
source counts for the proposed bucket brigade QRAMs,
and includes a detailed comparison to other QRAM vari-
ant layouts. The Methods section discusses how quan-
tum gate level parallelism is achieved, and illustrates how
Clifford+T decompositions of Toffoli gates can be used
in an advantageous manner for each QRAM sub-circuit.
Finally, we present empiric evidence to conjecture that
the presented optimisation technique could be applica-
ble to many types of circuits based on multi-controlled-
operations [9], which are the common building blocks of
quantum adders and multipliers.
In order to achieve a fair comparison between QRAM
circuit implementations, we consider n≤q, with $2^n$ being
the number of memory queries, and $2^q$ denoting the number
of memory cells m_i, such that 0≤i<$2^q$. The worst
case for the number of queries, as used in [7], is for n=q.
II. Results
The advantage of bucket brigade QRAM queries is that
these can be parallelised. The presented method is
implemented in open-source software² [5].
The gate level parallelism used in this work is
conceptually very similar to classical instruction level
parallelism (see Sec. III A). In practice, the QRAM gate
parallelism was lost when translating the circuits to
Clifford+T, an often necessary step for estimating
the computational resources required for quantum
error-correction [4]. The width (number of wires, not including
the exponential number of memory cells m_i), T-count,
and the depth of the QRAM circuits [7] are:

$$Q_{bbs} = q + 2^q + 5$$
$$T_{bbs} = 21\cdot 2^q - 28$$
$$D_{bbs} = 21\cdot 2^q + 2q - 26$$

2 https://github.com/quantumresource/quantify
Since their initial formulation, a disadvantage of bucket
brigade QRAMs is the fact that an exponentially high
number of ancilla b_i bits has to be computed in order
to achieve FANOUT parallelism. The exponential overhead
is far from ideal, when considering that quantum
hardware resources are and will remain very scarce. For
this reason, the QROM was proposed in [2]. It does
not require the b_i bits, but queries are strictly
sequentialised (one after the other). By using optimised Toffoli
decompositions, the T-count of the QROM is linear in
the number of queries $2^n$ (the authors of [2] used L for
the number of queries). The formulas for width (Q_rom)
and depth (D_rom) do not refer to the memory cells m_i.

$$Q_{rom} = q + 1$$
$$T_{rom} = 4\cdot 2^n - 4$$
$$D_{rom} = 10\cdot 2^n + c\cdot O(2^n)$$
We present a bucket brigade construction that has a
reduced T-count, and a depth exponentially shallower
than the circuits from [7]. Our construction is achieved
using an advantageous parallelisable CCZ/Toffoli decom-
position.
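$$Q_{bbp} = q + 2^q + 1 \qquad (1)$$
$$T_{bbp} = T_{fanout} + T_{query} = (4\cdot 2^q) + (6\cdot 2^n) \qquad (2)$$
$$D_{bbp} = D_{fanout} + D_{query} + D_{fanin} = (10\cdot q) + 10 + (4\cdot q) \qquad (3)$$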
Our construction has a Q_bbp which is the same as the
one from [6, 7]. We do not modify the width of the
QRAM, but we optimise the depth of the QRAM. The
equivalent quantum volume of the circuit being protected
by the surface code is not evaluated in this work,
although the parallelism is achieved by using a
surface-code compliant parallel CNOT construction. The
hardware footprint of bucket brigade QRAM is prohibitively
high, which is dictated mostly by the width of the circuit.
Exponentially reducing the QRAM depth and lowering the
T-count still leaves the exponentially large width in place
(compared to QROM, Q_rom ≪ Q_bbp). For future work, it will
be reasonable to estimate the resources necessary for
surface code protection, especially when these QRAMs are
included in practical quantum algorithms.
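To make the comparison concrete, the closed-form T-counts and depths of this section can be evaluated numerically. The sketch below transcribes the formulas for T_bbs, D_bbs, T_rom and Eqs. 2 and 3; the function names are hypothetical, and reading the second term of D_bbs as 2·q (rather than a power of two) is an assumption.

```python
def t_count_bbs(q):
    """T-count of the sequential bucket brigade circuits from [7]."""
    return 21 * 2 ** q - 28


def depth_bbs(q):
    """Depth of the circuits from [7]; the '2q' term is read as 2*q (assumption)."""
    return 21 * 2 ** q + 2 * q - 26


def t_count_rom(n):
    """T-count of the QROM [2] for 2^n queries."""
    return 4 * 2 ** n - 4


def t_count_bbp(q, n):
    """Eq. 2: T-count of the parallel bucket brigade (FANOUT + QUERY)."""
    return 4 * 2 ** q + 6 * 2 ** n


def depth_bbp(q):
    """Eq. 3: depth of the parallel bucket brigade (FANOUT + QUERY + FANIN)."""
    return 10 * q + 10 + 4 * q


if __name__ == "__main__":
    # Worst case n = q, as used in [7].
    for q in (4, 8, 15):
        print(q, depth_bbs(q), depth_bbp(q),
              t_count_bbs(q), t_count_bbp(q, q), t_count_rom(q))
```

Under these formulas the parallel depth grows linearly in q while the sequential depth grows with $2^q$, which is the gap plotted in Fig. 3.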
III. Methods
This work assumes that, in general, two gates G1 and G2
can be parallelised if their commutator [G1,G2]=0. From
a functional perspective, it does not matter in which
order the gates are applied: in an ideal setting, the gates
can be executed without one depending on the output of
the other. This approach to quantum gate parallelism is
very similar to classical computing instruction-level
parallelism.
FIG. 3. Comparison between D_bbs and D_bbp. This plot
assumes the worst case where the number of queries $2^n$ equals
the number of memory cells $2^q$. The exponent q is plotted
along the horizontal axis.
FIG. 4. The T-count T_bbp. The number of queries ($2^n$) is
varied by the parameter n, and the number of memory cells
$2^q$ by the parameter q, such that n≤q. The T-count increases
exponentially with q, and the depth (cf. Fig. 3) is linear in q.
A. Parallel CNOTs
For the purpose of this work, two CNOTs are parallel if
they either share the controls, or no wires at all (see Fig.
6). CNOTs can also share the target, but this fact will
not be used herein. Most of the Clifford+T decompo-
sitions are compiled and optimised due to the inherent
necessity for quantum error-correction, and this kind of
CNOT parallelism is supported by the surface code [10]
(braided and lattice surgery variants), for example.
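The commutation criterion can be checked directly for CNOTs. Since a CNOT only permutes computational basis states, two CNOTs commute as matrices exactly when the corresponding permutations commute. The helper names `cnot` and `commute` below are hypothetical and serve only this illustration.

```python
from itertools import product


def cnot(control, target):
    """Return a function applying CNOT(control -> target) to a tuple of bits."""
    def gate(bits):
        out = list(bits)
        out[target] ^= out[control]
        return tuple(out)
    return gate


def commute(g1, g2, n_wires):
    """True if applying g1 and g2 in either order agrees on every basis state."""
    return all(g1(g2(s)) == g2(g1(s))
               for s in product((0, 1), repeat=n_wires))


if __name__ == "__main__":
    print(commute(cnot(0, 1), cnot(0, 2), 3))   # shared control
    print(commute(cnot(0, 1), cnot(2, 3), 4))   # no shared wires
    print(commute(cnot(0, 2), cnot(1, 2), 3))   # shared target
    print(commute(cnot(0, 1), cnot(1, 2), 3))   # target of one is control of the other
```

The last pair is the case excluded by the definition above: the target of the first CNOT feeds the control of the second, so the order matters.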
B. Parallelisable CCZ
The three qubit Toffoli gate is in general derived from
the CCZ gate, which is a controlled application of the
two-qubit CZ gate. The CCZ flips the sign of the phase of
a state if there is a |1〉 on all of the gate’s three input
wires (qubit and wire are used interchangeably). Thus,
the phase flip can be expressed as $(-1)^{xyz}$, where x,y,z
are the bit values of the inputs: the phase is multiplied
by −1 if all three bits are 1, and no phase flip is applied
otherwise ($(-1)^0=1$). It has been shown by [9], and more
recently in [11], that the following Boolean formula is
useful for expressing the CCZ through CNOTs and T gates.

$$4xyz = x+y+z-(x\oplus y)-(y\oplus z)-(x\oplus z)+(x\oplus y\oplus z) \qquad (4)$$

The T gate rotates the phase of a state by $\pi/4$, and
$-1=\omega^4$ with $\omega=e^{i\pi/4}$, such that $(-1)^{xyz}=\omega^{4xyz}$. Thus, seven T
gates are necessary, each conditioned on one of the parity
sums from Eq. 4.
FANOUT QUERY FANIN
/-------------------------------------------------------------\ /----------------------------------------------------------------\ /-----------------------------\
a1: -------------------@---------------------@------------------------------------------------------------------------------------------------------------@---------------
| | |
a0: -------@-----------|---------------------|------------------------------------------------------------------------------------------------------------|-----------@---
| | | | |
b_00: -----X---@-------|-----------@---------|-----------@----------X---------T------X-----Tˆ-1-------X---T-------------X-----Tˆ-1---X--------X-------H---X---H---@---X---
| | | | | | | | | | | | |
b_01: ---------X-------|-----------|@--------|-----------|@---------|X--------T----X-|-----Tˆ-1-------X---T-----------X-|-----Tˆ-1---X--------|X------H---X---H---X-------
| || | || || | | | | | | ||
b_10: ---------H---T---X---Tˆ-1----X|----T---X---Tˆ-1----X|----H----@|--------T----|-|X----Tˆ-1-------X---T-----------|-|X----Tˆ-1---X--------@|---HM---------------------
| | | | | | || | | || | |
b_11: ---------H---T---X---Tˆ-1-----X----T---X---Tˆ-1-----X----H-----@--------T----|X||----Tˆ-1-------X---T-----------|X||----Tˆ-1---X---------@---HM---------------------
|||| | |||| |
m11: -------------------------------------------------------------------------T----|||@---------------X---Tˆ-1---X----|||@-----------|------------------------------------
||| | | ||| |
m10: -------------------------------------------------------------------------T----||@----------------X---Tˆ-1---X----||@------------|------------------------------------
|| | | || |
m01: -------------------------------------------------------------------------T----|@-----------------X---Tˆ-1---X----|@-------------|------------------------------------
| | | | |
m00: -------------------------------------------------------------------------T----@------------------X---Tˆ-1---X----@--------------|------------------------------------
| | |
target: ------------------------------------------------------------------H---T----T-------T------T---@----------@-------------------@---H--------------------------------
\-------------------------------------------------------------/ \----------------------------------------------------------------/ \-----------------------------/
FANOUT QUERY FANIN
FIG. 5. The equivalent Clifford+T representation of the circuit from Fig. 1. For a fixed q, the depth of the FANOUT and FANIN
regions is constant, while QUERY has a constant depth irrespective of q or n. The circuit operates on $2^q$ memory cells
and executes $2^n$ queries with n≤q. The sequence of four T gates on the target qubit will be cancelled.
q0: ---@---@---@--- q0 : ---@---
| | | |
q1: ---X---|---|--- q1 : ---X---
| | |
q2: -------X---|--- q2 : ---X---
| |
q3: -----------X--- q3 : ---X---
FIG. 6. Parallel CNOTs are allowed to share the control wire.
Eq. 4 is a recipe for generating valid CCZ gate de-
compositions. Seven parity sums are computed by using
CNOTs and T gates, while ensuring that two conditions
are met. The first condition is for the T gates to be ap-
plied at the right moment, when the necessary bit parities
are stored on any of the wires. Second, the parities on
each of the three wires have to be uncomputed in order
to reflect a correct functionality. Ancillae used for parity
computations have to be uncomputed, too. Due to the
form of Eq. 4, containing seven parities, the number of
ancillae seems to be bounded by seven, but in practice
the maximum is four, because three of the parities are
formed by single bit values. Consequently, the literature
includes a large number of decompositions of the CCZ in
terms of Clifford+T gates.
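Eq. 4 can be verified exhaustively over the eight input combinations. The following check is a minimal sketch with hypothetical names (`parity_sum`, `check_eq4`); it confirms both the integer identity and that accumulating one T-gate phase per parity term reproduces the CCZ sign.

```python
import cmath
from itertools import product

# Phase applied by a single T gate to |1>: omega = exp(i*pi/4).
OMEGA = cmath.exp(1j * cmath.pi / 4)


def parity_sum(x, y, z):
    """Right-hand side of Eq. 4; ^ is XOR on the bit values."""
    return x + y + z - (x ^ y) - (y ^ z) - (x ^ z) + (x ^ y ^ z)


def check_eq4():
    """Check Eq. 4 and the resulting phase for all eight inputs."""
    for x, y, z in product((0, 1), repeat=3):
        s = parity_sum(x, y, z)
        if s != 4 * x * y * z:
            return False
        # omega^s must equal the CCZ phase (-1)^(xyz).
        if abs(OMEGA ** s - (-1) ** (x * y * z)) > 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(check_eq4())
```

The four positive terms correspond to T gates and the three negative terms to inverse T gates, which accounts for the seven T-type gates mentioned above.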
In the following, we use a CCZ decomposition that
maintains the Toffoli gate parallelism when decomposed
into Clifford+T. The decomposition is obtained after
making the observation that two Toffoli gates are par-
allel whenever they are arranged like in Fig. 7. There are
two non-trivial situations: a) one wire is shared; b) two
wires are shared. Whenever three wires are shared, the
gates cancel each other.
Another practical observation is that, whenever two
Toffoli gates are parallel, these can be formulated as CCZ
gates which share one or two controls out of the three.
q0: ---@---@--- q0 : ---@---@---
| | | |
q1: ---@---@--- q1 : ---@---|---
| | | |
q2: ---X---|--- q2 : ---|---@---
| | |
q3: -------X--- q3 : ---X---X---
FIG. 7. Toffoli gate parallelism. Two parallel Toffoli gates can
either share: left) the same two control qubits,
or right) the target qubit and one control. Note: In the
left diagram, the second Toffoli is not necessary and can be
replaced with two CNOT gates, one before and another one
after the first Toffoli gate.
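The two arrangements of Fig. 7, the cancellation of identical Toffolis, and the CNOT replacement mentioned in the caption can all be checked on classical basis states, because Toffoli and CNOT gates are permutations of those states. The helpers below (`toffoli`, `cnot`, `compose`, `same_action`) are hypothetical names for this sketch.

```python
from itertools import product


def toffoli(c1, c2, target):
    """Toffoli (CCX) acting on a tuple of bits."""
    def gate(bits):
        out = list(bits)
        out[target] ^= out[c1] & out[c2]
        return tuple(out)
    return gate


def cnot(control, target):
    """CNOT acting on a tuple of bits."""
    def gate(bits):
        out = list(bits)
        out[target] ^= out[control]
        return tuple(out)
    return gate


def compose(*gates):
    """Apply the gates from left to right, as read in a circuit diagram."""
    def circuit(bits):
        for g in gates:
            bits = g(bits)
        return bits
    return circuit


def same_action(g1, g2, n_wires=4):
    """True if both circuits map every basis state to the same output."""
    return all(g1(s) == g2(s) for s in product((0, 1), repeat=n_wires))


if __name__ == "__main__":
    left_a, left_b = toffoli(0, 1, 2), toffoli(0, 1, 3)      # Fig. 7, left
    right_a, right_b = toffoli(0, 1, 3), toffoli(0, 2, 3)    # Fig. 7, right
    print(same_action(compose(left_a, left_b), compose(left_b, left_a)))
    print(same_action(compose(right_a, right_b), compose(right_b, right_a)))
    # Caption note: CNOT, Toffoli, CNOT reproduces the two left Toffolis.
    print(same_action(compose(cnot(2, 3), left_a, cnot(2, 3)),
                      compose(left_a, left_b)))
```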
q0: -------@---@------- q0 : -------@-----------@-------
| | | |
q1: -------@---@------- q1 : -------@-----------|-------
| | | |
q2: ---H---X---|---H--- q2 : -------|-----------@-------
| | |
q3: ---H-------X---H--- q3 : ---H---X---H---H---X---H---
FIG. 8. CCZ gates are parallel whenever at least one
control is shared. In this figure two wires are shared. The two
Hadamards in the center of the right-most circuit will cancel.
In order to implement parallel Clifford+T decompo-
sitions of Toffoli gates, there has to exist a Clifford+T
parallelisable decomposition of CCZ. The decomposition
in Fig. 9 can be used whenever two CCZ/Toffoli gates
share a single wire. It can be noticed that the wire |q1〉
acts only as a control for the CNOTs, which are applied
to the wires |q0〉 and |q2〉, and the T gate commutes with
the control of the CNOTs.
q0: --T---X---Tˆ-1---X-------T----------X---Tˆ-1---X--
| | | |
q1: --T---|----------@---@----------@---|----------@--
| | | |
q2: --T---@--------------X---Tˆ-1---X---@-------------
FIG. 9. CCZ gate Clifford+T decomposition that maintains
Toffoli gate parallelism when a single wire is shared. The |q1〉
wire can be shared by two parallel Toffoli gates.
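Because the circuit of Fig. 9 contains only CNOT and T-type gates, it maps each basis state to a basis state times a phase, so it can be checked by tracking the bit values and the accumulated phase in units of π/4. The gate list below is a transcription of Fig. 9 read left to right; the list name `FIG9` and the helper `run_on_basis_state` are hypothetical.

```python
from itertools import product

# Gate sequence read from Fig. 9; wires are q0, q1, q2.
# ("CX", control, target), ("T", wire), ("Tdg", wire) for the inverse T gate.
FIG9 = [
    ("T", 0), ("T", 1), ("T", 2),
    ("CX", 2, 0), ("Tdg", 0), ("CX", 1, 0),
    ("T", 0), ("CX", 1, 2), ("Tdg", 2), ("CX", 1, 2),
    ("CX", 2, 0), ("Tdg", 0), ("CX", 1, 0),
]


def run_on_basis_state(gates, bits):
    """Return (output bits, phase exponent mod 8), phase in units of pi/4."""
    bits = list(bits)
    phase = 0
    for gate in gates:
        if gate[0] == "CX":
            _, control, target = gate
            bits[target] ^= bits[control]
        elif gate[0] == "T":
            phase += bits[gate[1]]      # T adds pi/4 when the wire holds 1
        else:  # "Tdg"
            phase -= bits[gate[1]]      # inverse T subtracts pi/4
    return tuple(bits), phase % 8


if __name__ == "__main__":
    for xyz in product((0, 1), repeat=3):
        print(xyz, run_on_basis_state(FIG9, xyz))
```

A phase exponent of 4 corresponds to a factor of −1, so the circuit acts as a CCZ if the exponent is 4 for input (1,1,1) and 0 otherwise, with all bits restored.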
The wire ordering in Fig. 9 does not play any role,
because the output will still reflect the $(-1)^{xyz}$ phase flip.
Effectively, a Toffoli gate can be obtained by surrounding
the CCZ with two Hadamards on any of the wires.
Advantageous configurations are those in which the Hadamards
are placed such that the wire corresponding to |q1〉 is
shared. The presented technique is similar to template
based quantum circuit optimisation [12], in the sense that
the most advantageous Toffoli rewrite rule is chosen from
the three possible decompositions based on the expected
gate parallelism. During the writing of the manuscript,
we found out that the decomposition from Fig. 9 has
also been presented in [13], but its effect on the
parallelisation of Toffoli/CCZ gates has not been described in
the subsequent literature.
c_1: ---@------- -------T---X---Tˆ-1---X-------T----------X---Tˆ-1---X----------------------------------------------------------------
| | | | |
c_2: ---|---@--- -----------|----------|------------------|----------|-----------T---X---Tˆ-1---X-------T----------X---Tˆ-1---X-------
| | | | | | | | | |
c_s: ---@---@--- -------T---|----------@---@----------@---|----------@-----------T---|----------@---@----------@---|----------@-------
| | | | | | | | | |
t_1: ---X---|--- ---H---T---@--------------X---Tˆ-1---X---@--------------H-----------|--------------|----------|---|------------------
| | | | |
t_2: -------X--- ------------------------------------------------------------H---T---@--------------X---Tˆ-1---X---@--------------H---
/--\ /--\
c_1: ---@------- ---T--------X-----Tˆ-1---X---T-----------X-----Tˆ-1---X---
| | | | |
c_2: ---|---@--- ---T--------|X----Tˆ-1---X---T-----------|X----Tˆ-1---X---
| | || | || |
c_s: ---@---@--- ---T---T----||-----------@----------@----||-----------@---
| | || | | ||
t_1: ---X---|--- ---H---T----@|-----------X---Tˆ-1---X----@|----H----------
| | | | |
t_2: -------X--- ---H---T-----@-----------X---Tˆ-1---X-----@----H----------
\--/ \--/
FIG. 10. Two Toffoli gates sharing a control wire. Parallelisation with canonical Toffoli decomposition.
C. FANOUT: Shared Control
Whenever two Toffoli gates share a control, the CCZ gate
is decomposed such that the shared wire is the one that
enables parallelism (i.e. |q1〉 in Fig. 9). This scenario
appears in the FANOUT region of the QRAM.
D. QUERY: Shared Target
The QUERY is formed by a sequence of Toffoli gates
conditioned by distinct pairs of (bi,mi) wires. The only
wire shared is the target. Therefore, the decomposition
from Fig. 9 can be used, but by making |q1〉 correspond
to the target. For two consecutive Toffolis the H gates
on the target wire will cancel, as illustrated in Fig. 10.
T-count optimisation is a side-effect of this kind of par-
allelism because along the shared target wire the T gates
can pairwise be transformed into S gates.
E. FANOUT: Compute Logical AND
Approximate gates have been discussed since [9], but
their relevance for quantum circuit design was recognised
once T-count optimisation became urgent. This is the
case for the approximate CCZ/Toffoli which uses four
T gates like in Fig. 12. The approximate Toffoli is also
called a logical AND, when the target is an ancilla ini-
tialised to |0〉. However, after a logical AND the state
is left with a phase shift, which has to be reversed, once
the computed AND bit is not necessary anymore in the
circuit (see following section).
Due to their lower T-count it is preferable, whenever
possible, to use logical AND gates instead of more general
CCZ/Toffoli gates. Two logical ANDs sharing a control
can be decomposed like in Fig. 13.
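The phase left behind by the four-T logical AND can be seen with a two-amplitude simulation of the target wire. The sketch assumes the gate order H, T, CNOT(a0), T⁻¹, CNOT(a1), T, CNOT(a0), T⁻¹, CNOT(a1), H as read from Fig. 12, and treats a0 and a1 as classical bit values, which is valid on basis states because they act only as controls. All function names are hypothetical.

```python
import cmath
import math

OMEGA = cmath.exp(1j * math.pi / 4)   # phase of a single T gate


def h(state):
    """Hadamard on a single-qubit state [amp0, amp1]."""
    a, b = state
    r = 1 / math.sqrt(2)
    return [(a + b) * r, (a - b) * r]


def t(state):
    return [state[0], state[1] * OMEGA]


def tdg(state):
    return [state[0], state[1] / OMEGA]


def cx(control_bit, state):
    """CNOT onto the target wire for a classical control value."""
    return [state[1], state[0]] if control_bit else list(state)


def logical_and(a0, a1):
    """Final state of the target wire, initialised to |0>."""
    s = [1, 0]
    s = h(s)
    s = t(s)
    s = cx(a0, s)
    s = tdg(s)
    s = cx(a1, s)
    s = t(s)
    s = cx(a0, s)
    s = tdg(s)
    s = cx(a1, s)
    return h(s)


if __name__ == "__main__":
    for a0 in (0, 1):
        for a1 in (0, 1):
            print(a0, a1, logical_and(a0, a1))
```

For each input pair the target ends in the basis state |a0·a1〉, and only the input (1,1) carries the extra factor −i, which is the phase shift that has to be reversed later.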
F. FANIN: Uncompute Logical AND
The inverse of the logical AND is its uncomputation. Be-
cause the bi qubits are ancillary, their usage is not nec-
essary after the queries have been executed. Thus, these
can be measured and a correctional CZ gate can be ap-
plied conditionally on the measurement result.
G. The Parallel Bucket Brigade QRAM
The parallelised QRAM circuit is obtained by
concatenating the parallelised circuits for FANOUT, QUERY
and FANIN (e.g. Fig. 5). One of the surprising results is
that the depth of the FANOUT is linear in q
(cf. Eq. 3): the depth is reduced from O(Q log Q),
where $Q=2^q$, to just O(log Q). This indicates that there
is a significant speedup in computing Q sums of the form
$b_j=\sum_{i=0}^{q}(a_i\cdot 2^i)$, where a_i are the bits of the address state
vectors used as input to the QRAM (even in superposition).
Another interesting observation is that the T-count
T_bbp can be reduced by an additional $2^n$ T gates if, after
applying the QUERY, the memory is not entangled to the
rest of the computation. In that case, the T gates on the
m_i wires will affect only the global phase of the m_i states.
This can be seen if the decomposition from Fig. 9 is used
in reversed gate order, such that, for example, in Fig. 11
half of the T gates applied to ci in the leftmost column
appear at the end of the circuit. Thus, by using parallel
Toffoli gate decompositions one could obtain a T-count
comparable to logical AND formulations. However, the
major disadvantage of logical ANDs is their sequential
nature, and being uncomputed through measurements.
IV. Discussion: Practical Speedups
The presented exponential speedups in Sec. II are
theoretical. In practice, quantum circuits need to be mapped
to hardware, or be compiled to error-corrected structures.
The underlying physical architecture determines
the achievable speedup, which can be less than the ideally
observed (e.g. theoretical) quantum gate level parallelism.
Consequently, the available level of parallelism
in a circuit may not result in exactly the same execution
speedup.
c_1: ---@------- -------T---X---Tˆ-1---X-------T----------X---Tˆ-1---X----------------------------------------------------------------
| | | | |
c_2: ---@------- -------T---@----------|---X---Tˆ-1---X---@----------|----------------------------------------------------------------
| | | | |
c_3: ---|---@--- ----------------------|---|----------|--------------|-----------T---X---Tˆ-1---X-------T----------X---Tˆ-1---X-------
| | | | | | | | | |
t_4: ---|---@--- ----------------------|---|----------|--------------|-----------T---@----------|---X---Tˆ-1---X---@----------|-------
| | | | | | | | | |
t_s: ---X---X--- ---H---T--------------@---@----------@--------------@---H---H---T--------------@---@----------@--------------@---H---
c_1: ---@------- ---T-------X---Tˆ-1---X---T----------X---Tˆ-1---X--------
| | | | |
c_2: ---@------- ---T-------@----------X---Tˆ-1---X---@----------|--------
| | | |
c_3: ---|---@--- ---T-------X---Tˆ-1---X---T------|---X---Tˆ-1---X--------
| | | | | | |
t_4: ---|---@--- ---T-------@----------X---Tˆ-1---X---@----------|--------
| | | | |
t_s: ---X---X--- -------H---T---T------@----------@--------------@---H----
FIG. 11. Two Toffoli gates sharing a target wire. Two of the T gates on the target wire can be reduced to an S gate.
Simplification is not illustrated. Furthermore, half of the column of T gates at the beginning of the circuit could be eliminated
in some circumstances (see Sec. III G).
a0: -----------@------------------@------------------
| |
a1: -----------|----------@-------|----------@-------
| | | |
a2: ---H---T---X---Tˆ-1---X---T---X---Tˆ-1---X---H---
FIG. 12. The parallelisable logical AND has two wires acting
always as controls. This circuit is similar to the one from [2].
The target wire cannot be shared for parallelisation. If the
qubit |a2〉 is initialised into |0〉, at the end of this computation
it will be in the state $(-i)^{a_0 a_1}|a_0 a_1\rangle$, which represents, up to a
phase shift, the correct value of the Boolean AND operation
between a0 and a1.
A concern could be that the physical realisations
of QRAM will not be able to achieve an exponential
speedup using the CNOT parallelisation scheme we pre-
sented. The question is if it is practically feasible to par-
allelise a single control multiple target CNOT, called par-
allel CNOT in Sec. III A, while maintaining the speedup
in practice.
Similar to the theoretical speedup, the practical one is
exponential as well. However, the speedup is scaled by
some factor (e.g. overhead). Independent of the design
choices of the QRAM circuit per se (main focus of this pa-
per), the physical realisation (e.g. implementation which
is not the main focus of this paper) of the QRAM in prac-
tice requires the usage of additional ancillae. Although
we focus on the theoretical speedup, we demonstrate that
CNOT parallelism is practically feasible by considering
two perspectives: 1) fault-tolerant QRAM protected by
the surface code; 2) un-error-corrected QRAM.
A. Fault-tolerant QRAM Speedup
This work and the herein presented QUERY parallelisa-
tion was originally formulated with the surface code in
mind. The surface code (in all its known variants and
implementations) supports parallel CNOTs (Sec. III A).
The achieved QUERY speedups are O(1) in the case
of surface code protected quantum circuits. Moreover,
CNOT logical gates are transversal in the surface code,
and the classical post-processing time of these logical
gates is constant (see [14]).
The physical realisation of the surface code requires
nearest neighbour physical connectivity. Mapping sur-
face code circuits to a two-dimensional lattice or 3D clus-
ter state [15] is an efficient procedure both in terms of
resulting number of gates (from logical to physical im-
plementation - the number of qubits and gates to imple-
ment a logical operation is approximated by a polynomial
function), as well as algorithmic (the mapping procedure
has polynomial complexity).
The remaining concern could be related to the arrangement
of the surface code protected qubits such that the
exponential parallel CNOT complexity is achieved. However,
this can be solved without introducing any overhead
by ordering the logical qubits in pairs, similarly to the
ordering presented in [6]:
a_0, ..., a_q, ..., b_i, m_i, b_{i+1}, m_{i+1}, ..., target.
With respect to the surface code implementation of the
bucket brigade QRAM, it should be noted that QRAM
query time is exponentially shorter after exponentially
reducing the depth of the circuits. The estimations in
[7] mention 0.35 ms for querying a 4KB QRAM (15 bit
addresses) that is protected by the surface code. The
presented optimisation would reduce the query time of
the surface code error corrected QRAM by a factor of
$2^{15}$ to approximately 11 ns.
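As a quick check of this estimate, assuming the 0.35 ms figure from [7] and a reduction by a factor of exactly $2^{15}=32768$:

$$\frac{0.35\ \text{ms}}{2^{15}} = \frac{3.5\times 10^{-4}\ \text{s}}{32768} \approx 1.07\times 10^{-8}\ \text{s} \approx 11\ \text{ns}$$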
B. Un-Error Corrected QRAM Speedup
If the QRAM circuit were executed without error
correction directly on a quantum chip, the connectivity of
the chip would determine the exact achievable speedup.
However, the practical speedup would still be exponential,
under the strong assumption that sufficient ancillae
are available.
The theoretical exponential speedups are scaled only
by a factor on the order of q, which is the number of
address qubits (see Sec. I). In particular, for a given
address length q, the value of q is a constant: once the
memory size is known and configured for an algorithm,
it can no longer be changed.
c_1: ---@------- -----------@------------------@----------------------------------------------------------------
        |                   |                  |
c_2: ---|---@--- -----------|------------------|--------------------------@------------------@------------------
        |   |               |                  |                          |                  |
c_s: ---@---@--- -----------|----------@-------|----------@---------------|----------@-------|----------@-------
        |   |               |          |       |          |               |          |       |          |
t_1: ---X---|--- ---H---T---X---T^-1---X---T---X---T^-1---X---H-----------|----------|-------|----------|-------
            |                                                             |          |       |          |
t_2: -------X--- -------------------------------------------------H---T---X---T^-1---X---T---X---T^-1---X---H---
                            /--\                  /--\
c_1: ---@------- ------------@---------------------@----------------------------
        |                    |                     |
c_2: ---|---@--- ------------|@--------------------|@---------------------------
        |   |                ||                    ||
c_s: ---@---@--- ------------||-----------@--------||-----------@-------@-------
        |   |                ||           |        ||           |       |
t_1: ---X---|--- ---H---T----X|----T^-1---X---T----X|----T^-1---X-------|---H---
            |                 |           |         |                   |
t_2: -------X--- ---H---T-----X----T^-1---X---T-----X----T^-1-----------X---H---
                            \--/                  \--/
FIG. 13. Two Toffoli gates sharing a control wire. Parallelisation with logical AND versions of the Toffoli gate.
c_1: ---@------- ---H---X---H--------------- ---H---X----H---
        |               |                           |
c_2: ---|---@--- -------|-------H---X---H--- ---H---X----H---
        |   |           |           |               |
c_s: ---@---@--- -------@-----------@------- -------@--------
        |   |
t_1: ---X---|--- --HM----------------------- --HM------------
            |
t_2: -------X--- --------------HM----------- --HM------------
FIG. 14. Logical AND uncompute. HM represents the
measurement in the X basis (a Hadamard gate followed by a
measurement in the computational basis). Depending on the
measurement result, a CZ gate correction is applied (shown
here unconditionally, rather than conditioned on the
measurement result). The CZ gate is decomposed with the
target being on the wire that is not shared. All
measurements can be parallelised and the correction can be
applied depending on the individual measurement results.
In this figure, a parallel CNOT is applied for the case
when both corrections would be required.
A parallel CNOT can be implemented using a binary
tree where the CNOT control is the root, and the CNOT
targets are the leaves. The new ancillae are initialised in
|+⟩ and used to construct, in log_2(2^q) = q steps, a large
GHZ (Greenberger–Horne–Zeilinger) state that encodes the
state of the control qubit. Afterwards, the parallel CNOT
is a transversal application of CNOTs between the GHZ
state and the original targets. This naive construction
illustrates that the theoretical O(1) speedups: 1) are
delayed by a factor of q, which is the time necessary to
construct the trees; 2) are not impacted by the number
of additional ancillae, although the resource efficiency of
this scheme is reduced.
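The tree construction can be illustrated with a small classical simulation. The sketch below is not the authors' implementation: the function names are hypothetical, and it assumes a CNOT fan-out variant in which the ancillae start in 0 and receive copies of the control bit, instead of the |+⟩-initialised GHZ preparation mentioned above. Because circuits made only of CNOTs permute computational basis states, checking basis states is enough to confirm the action of the construction.

```python
def fanout_layers(n_copies):
    """Return CNOT layers that spread qubit 0 onto qubits 1..n_copies-1.

    Each layer is a list of (control, target) pairs acting on disjoint
    qubits, so all CNOTs within a layer can run in parallel.
    """
    layers, have = [], 1
    while have < n_copies:
        new = min(have, n_copies - have)   # every existing copy feeds one new copy
        layers.append([(c, have + c) for c in range(new)])
        have += new
    return layers


def parallel_cnot(control_bit, target_bits):
    """Classically simulate the tree-based parallel CNOT on a basis state."""
    n = len(target_bits)
    copies = [control_bit] + [0] * (n - 1)   # assumption: ancillae start in 0
    layers = fanout_layers(n)
    for layer in layers:                     # build the tree
        for c, t in layer:
            copies[t] ^= copies[c]
    out = [t ^ c for t, c in zip(target_bits, copies)]   # transversal CNOTs
    for layer in reversed(layers):           # uncompute the tree
        for c, t in layer:
            copies[t] ^= copies[c]
    depth = 2 * len(layers) + 1              # build + one CNOT layer + uncompute
    return out, copies, depth


q = 4
targets = [0, 1] * (2 ** q // 2)
out, copies, depth = parallel_cnot(1, targets)
print(len(fanout_layers(2 ** q)), depth)
```

For 2^q targets the tree has q layers, so the single layer of transversal CNOTs is preceded and followed by q layers of tree construction and uncomputation, which is the delay by a factor of q described above.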
It should be noted that the GHZ construction doubles
the number of qubits in the QRAM, from Q_bbp to 2·Q_bpp.
It does not introduce an overhead of the form 2^(Q_bpp).
The naive tree construction is structurally very similar
to the FANOUT region of the bucket brigade QRAM.
The difference is that the FANOUT region is performing
logical-AND operations (Toffoli gates) which are more
complex than the logical-XOR operations necessary for
the parallel CNOT.
It is possible to implement the parallel CNOT using
fewer ancillae than the naive construction, by
using non-fault-tolerant cluster states [14]. The cluster
state and measurement-based formalism are another
proof that the exponential speedup is scaled by a con-
stant factor, because “gates from the Clifford group do
not contribute to the complexity of a quantum algo-
rithm” [14]. The parallel CNOT is a Clifford gate.
V. Conclusion
Quite a few quantum algorithms require access to information
stored in a database-like manner. To this end,
QRAMs were proposed in general, and the bucket brigade
QRAM model in particular.
A bucket brigade QRAM includes three stages: 1)
FANOUT, where the input addresses are used to com-
pute exponentially many memory pointers to all possible
memory cells; 2) QUERY, where the memory cells are
queried using Toffoli gates, and in this process the memory
pointers are used to control the queries; 3) FANIN,
where the memory pointers are uncomputed, returning the
ancillae wires to the state they had before FANOUT. The
exponentially many memory pointer wires increase the
speed of the addressing but, as shown in this paper,
also have another advantage: they can be used to massively
parallelise the queries when considering the Clifford+T
gate set decomposition of the Toffoli gates.
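The three stages can be mimicked on classical basis states. The following is a minimal sketch with hypothetical function names, assuming a little-endian address and one-bit memory cells. It models only the logical effect of each stage, not the gate-level circuits or their Clifford+T decomposition.

```python
def fanout(address_bits):
    """FANOUT: turn q address bits into 2^q one-hot memory pointers."""
    pointers = [1]
    for a in address_bits:   # one tree level per address bit
        pointers = [p & (a ^ 1) for p in pointers] + [p & a for p in pointers]
    return pointers


def query(pointers, memory, target=0):
    """QUERY: one Toffoli per cell, target ^= pointer_i AND memory_i."""
    for p, m in zip(pointers, memory):
        target ^= p & m
    return target


def fanin(pointers, address_bits):
    """FANIN: uncompute the pointers by XOR-ing the same values back in."""
    return [p ^ r for p, r in zip(pointers, fanout(address_bits))]


def bucket_brigade_read(address_bits, memory):
    pointers = fanout(address_bits)            # stage 1
    target = query(pointers, memory)           # stage 2
    ancillae = fanin(pointers, address_bits)   # stage 3
    return target, ancillae


memory = [0, 1, 1, 0, 1, 0, 0, 1]              # 2^3 one-bit cells
address = [1, 0, 1]                            # little-endian encoding of 5
print(bucket_brigade_read(address, memory))
```

In this model each Toffoli in QUERY is controlled by a different pointer and memory cell, and all of them share only the target, which is the structure the parallelisation exploits.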
We construct bucket brigade QRAM circuits having
a constant QUERY depth. We achieve this by using ad-
vantageous Toffoli gate decompositions, and do not intro-
duce any additional ancilla qubits into the QRAM circuit
(except the already available memory pointers/control
signal wires). The depth, when formulated with Clif-
ford+T gates, is independent of the number of queries,
and it depends solely on the number of bits necessary
to address the memory. Compared to state-of-the-art
bucket brigade implementations, we reduced the depth
of QUERY from O(2^q) to O(q) in the worst case when
any of the 2^q memory cells of the QRAM are being queried.
Incidentally, the presented construction has a T-count
that is less than half of that of the existing Clifford+T
formulations. Our construction significantly reduces
the depth of the topological assembly [10] representing
the bucket brigade circuit and, thus, also the
distance of the surface code needed to protect the assembly.
The parallel QUERY construction shows that, if quantum
hardware were not a scarce resource, exponential
query speedups would be possible compared to
state-of-the-art QROM designs (Sec. IV). The QROM uses
exponentially fewer wires than bucket brigade, but has
exponentially slower querying in the worst case. The QROM
circuits execute queries sequentially, and to the best of
our knowledge no parallel construction seems possible.
q0: -------@------- ---@-----@----- q0: -------@--------- ---@-----@---
           |           |     |                 |             |     |
q1: ---@---|---@--- ---@-----(0)--- q1: ---@---|-----@--- ---(0)---@---
       |   |   |       |     |             |   |     |       |     |
q2: ---X---@---X--- ---(0)---@----- q2: ---X---(0)---X--- ---(0)---@---
           |           |     |                 |             |     |
q3: -------X------- ---X-----X----- q3: -------X--------- ---X-----X---
FIG. 15. Parallelising operations by introducing Toffolis ((0)
represents a negative control) reduces computational depth.
In the worst case, where no other optimisations are available,
it increases T-count. These circuit identities are inverse to
the ones from [17].
The parallelisation of bucket brigade was achieved
without introducing any additional ancillae, and future
work will parallelise quantum circuits using templates
like the ones presented in Fig. 15. It may seem
counter-intuitive that it can be more resource efficient to
include T gates while not introducing ancillae: in Fig. 15 a
single Toffoli is transformed into two Toffoli gates, thus
the number of T gates increases from 7 to 14. However,
the introduced T gates may cancel, due to efficient
Toffoli gate parallelism in the overall circuit. We argue that
the cost of large scale error-corrected quantum
circuits is not necessarily related to T-counts and state
distillations, but to the error-corrected Clifford
operations [16]. This includes
identity operations on unused wires and ancillae, as for
example in carry-save adders.
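The two identities of Fig. 15 can be verified by exhaustive classical simulation, since every gate involved permutes computational basis states. The sketch below encodes one reading of the diagrams: on the left of each identity, two CNOTs surround a single two-control gate; on the right, two gates with three controls of mixed polarity act on q3. The helper names are hypothetical.

```python
from itertools import product


def cnot(state, control, target):
    s = list(state)
    s[target] ^= s[control]
    return tuple(s)


def mcx(state, controls, target):
    """Multi-controlled X; controls is a list of (wire, required_value)."""
    s = list(state)
    if all(s[w] == v for w, v in controls):
        s[target] ^= 1
    return tuple(s)


def left_three_gates(s):    # CNOT, Toffoli with positive controls, CNOT
    s = cnot(s, 1, 2)
    s = mcx(s, [(0, 1), (2, 1)], 3)
    return cnot(s, 1, 2)


def left_two_gates(s):      # two gates, one negative control each
    s = mcx(s, [(0, 1), (1, 1), (2, 0)], 3)
    return mcx(s, [(0, 1), (1, 0), (2, 1)], 3)


def right_three_gates(s):   # CNOT, Toffoli with a negative control on q2, CNOT
    s = cnot(s, 1, 2)
    s = mcx(s, [(0, 1), (2, 0)], 3)
    return cnot(s, 1, 2)


def right_two_gates(s):     # both-negative and both-positive controls on q1, q2
    s = mcx(s, [(0, 1), (1, 0), (2, 0)], 3)
    return mcx(s, [(0, 1), (1, 1), (2, 1)], 3)


states = list(product((0, 1), repeat=4))
left_ok = all(left_three_gates(s) == left_two_gates(s) for s in states)
right_ok = all(right_three_gates(s) == right_two_gates(s) for s in states)
print(left_ok, right_ok)
```

Both checks pass for all 16 basis states under this reading, which confirms that the depth-3 sequences and the depth-2 sequences compute the same function; the T-count trade-off discussed above is not modelled here.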
Acknowledgments
A.P. was supported by a Google Faculty Research Award,
and the NUQAT project funded by the University Tran-
silvania Brasov. We thank Olivia Di Matteo for her very
valuable feedback during the preparation of the circuits
and the writing of the manuscript, and Simon Devitt for
feedback on the final manuscript.
[1] M. Schuld and F. Petruccione, Supervised learning with
quantum computers, Vol. 17 (Springer, 2018).
[2] R. Babbush, C. Gidney, D. W. Berry, N. Wiebe, J. Mc-
Clean, A. Paler, A. Fowler, and H. Neven, Encoding elec-
tronic spectra in quantum circuits with linear t complex-
ity, Physical Review X 8, 041015 (2018).
[3] P. J. Coles, S. Eidenbenz, S. Pakin, A. Adedoyin,
J. Ambrosiano, P. Anisimov, W. Casper, G. Chen-
nupati, C. Coffrin, H. Djidjev, et al., Quantum al-
gorithm implementations for beginners, arXiv preprint
arXiv:1804.03719 (2018).
[4] A. Paler, D. Herr, and S. J. Devitt, Really small shoe
boxes: On realistic quantum resource estimation, Com-
puter 52, 27 (2019).
[5] O. Oumarou, A. Paler, and R. Basmadjian, QUANTIFY:
A framework for resource analysis and design verification
of quantum circuits, in Computer Society Annual Sym-
posium on VLSI, Vol. 1 (IEEE, 2020) pp. 1–13.
[6] S. Arunachalam, V. Gheorghiu, T. Jochym-O’Connor,
M. Mosca, and P. V. Srinivasan, On the robustness of
bucket brigade quantum ram, New Journal of Physics
17, 123010 (2015).
[7] O. D. Matteo, V. Gheorghiu, and M. Mosca, Fault-
tolerant resource estimation of quantum random-access
memories, IEEE Transactions on Quantum Engineering
1, 1 (2020).
[8] J. Bang, A. Dutta, S.-W. Lee, and J. Kim, Optimal usage
of quantum random access memory in quantum machine
learning, Physical Review A 99, 012326 (2019).
[9] A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo,
N. Margolus, P. Shor, T. Sleator, J. A. Smolin, and
H. Weinfurter, Elementary gates for quantum computa-
tion, Physical review A 52, 3457 (1995).
[10] A. Paler, A. G. Fowler, and R. Wille, Synthesis of arbi-
trary quantum circuits to topological assembly: System-
atic, online and compact, Scientific reports 7, 1 (2017).
[11] C. Gidney, Halving the cost of quantum addition, Quan-
tum 2, 74 (2018).
[12] D. Maslov, C. Young, D. M. Miller, and G. W. Dueck,
Quantum circuit simplification using templates, in De-
sign, Automation and Test in Europe (IEEE, 2005) pp.
1208–1213.
[13] M. Amy, D. Maslov, and M. Mosca, Polynomial-time t-
depth optimization of clifford+ t circuits via matroid par-
titioning, IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems 33, 1476 (2014).
[14] R. Raussendorf, D. E. Browne, and H. J. Briegel,
Measurement-based quantum computation on cluster
states, Physical review A 68, 022312 (2003).
[15] A. Paler, S. J. Devitt, K. Nemoto, and I. Polian, Mapping
of topological quantum circuits to physical hardware, Sci-
entific reports 4, 1 (2014).
[16] A. Paler and R. Basmadjian, Clifford gate optimisa-
tion and t gate scheduling: Using queueing models for
topological assemblies, in 2019 IEEE/ACM International
Symposium on Nanoscale Architectures (NANOARCH)
(IEEE, 2019) pp. 1–5.
[17] M. Z. Rahman and J. E. Rice, Templates for positive and
negative control toffoli networks, in International Con-
ference on Reversible Computation (Springer, 2014) pp.
125–136.
