Efficient ancilla-free reversible and quantum circuits for the Hidden
  Weighted Bit function by Bravyi, Sergey et al.
Sergey Bravyi, Theodore J. Yoder, and Dmitri Maslov
IBM Quantum, IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
July 13, 2020
arXiv:2007.05469v1 [quant-ph] 10 Jul 2020
Abstract
The Hidden Weighted Bit function plays an important role in the study of classical models of
computation. A common belief is that this function is exponentially hard for the implementation
by reversible ancilla-free circuits, even though introducing a small number of ancillae allows a
very efficient implementation. In this paper we refute the exponential hardness conjecture by
developing a polynomial-size reversible ancilla-free circuit computing the Hidden Weighted Bit
function. Our circuit has size $O(n^{6.42})$, where n is the number of input bits. We also show
that the Hidden Weighted Bit function can be computed by a quantum ancilla-free circuit of
size $O(n^2)$. The technical tools employed come from a combination of Theoretical Computer
Science (Barrington's theorem) and Physics (simulation of fermionic Hamiltonians) techniques.
1 Introduction
The origins of the Hidden Weighted Bit function go back to the study of models of classical
computation. This function, denoted HWB, takes as input an n-bit string x and outputs the k-th
bit of x, where k is the Hamming weight of x; if the input weight is 0, the output is 0. It is best
known for combining the ease of algorithmic description and implementation by classical Boolean
circuits with the hardness of representation by Ordered Binary Decision Diagrams (OBDDs)
[1], a popular tool in VLSI [2]. The difference between logarithmic-depth implementations of
HWB by circuits (recall that $\mathrm{HWB} \in \mathrm{NC}^1$ but $\mathrm{HWB} \notin \mathrm{AC}^0$) and an exponential lower bound
for the size of the OBDD [3] is a startling two exponentials. Relaxing the constraints on the type
of Binary Decision Diagram considered or restricting the computations by circuits enables a
multitude of implementations with polynomial cost [4].
The Hidden Weighted Bit function was first introduced in the context of reversible and
quantum computations about 15 years ago by I. L. Markov and K. N. Patel (unpublished), and
the earliest explicit mention dates to the year 2005 [5]. The original specification is irreversible,
and required a slight modification to comply with the restrictions of reversible and quantum
computations. Specifically, the Hidden Weighted Bit function was redefined to become the
cyclic shift to the right by the input weight. We denote this reversible specification as hwb.
Formally, hwb(x) is defined as the cyclic shift of its input x to the right by W positions,
where $W = x_1 + x_2 + \dots + x_n$ is the Hamming weight of x. The following shows the truth table
of 3-input hwb:
x       000  100  010  110  001  101  011  111
hwb(x)  000  010  001  101  100  011  110  111
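The definition above can be checked directly. The following is a minimal sketch, assuming bit strings are written as x1 x2 ... xn from left to right; the names `hwb` and `truth_table` are illustrative and not taken from the paper.

```python
from itertools import product

def hwb(x):
    """Reversible hwb: cyclically shift the bit tuple x to the right by its Hamming weight."""
    n = len(x)
    w = sum(x)
    y = [0] * n
    for i, bit in enumerate(x):
        y[(i + w) % n] = bit  # the bit at position i moves w places to the right
    return tuple(y)

def truth_table(n):
    """Map every n-bit input string to its hwb image, both written as x1 x2 ... xn."""
    fmt = lambda bits: "".join(map(str, bits))
    return {fmt(x): fmt(hwb(x)) for x in product((0, 1), repeat=n)}

table = truth_table(3)
for x, y in table.items():
    print(x, "->", y)
```

Because the shift amount depends only on the weight, and the weight is preserved by the shift, the map is a bijection on n-bit strings.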
Since its introduction, hwb was used by numerous authors focusing on the synthesis and opti-
mization of reversible and quantum circuits as a test case.
Despite a stream of improvements in the respective circuit sizes by various research groups
[6, 7, 8, 9], the best known ancilla-free reversible circuits exhibit exponential scaling in the
number of gates. The synthesis algorithms benefiting from the inclusion of additional gates,
such as multiple-control multiple-target Toffoli, Fredkin, and Peres gates [5, 8, 10] also failed
to find an efficient implementation without ancillae. In 2013, this culminated with the hwb
receiving the designation of a “hard” benchmark function [11]. A recent asymptotically optimal
synthesis algorithm over the library with NOT, CNOT, and Toffoli gates [12], introduced
in the year 2015, was also unable to find an efficient ancilla-free implementation. An ancilla-
free quantum circuit can be obtained by employing an asymptotically optimal quantum circuit
synthesis algorithm such as [13], but the quantum gate count appears to remain exponential
and larger than what is possible to obtain through the application of the asymptotically optimal
reversible logic synthesis algorithm [12].
The introduction of even a small number of ancillae changes the picture dramatically. Just
$O(\log(n))$ ancillary (qu)bits suffice to develop a reversible circuit with $O(n \log^2(n))$ gates [14].
Barrington’s theorem [15] allows one to obtain a polynomial-size reversible circuit using three
ancillae. This polynomial-size three-ancilla reversible circuit can be obtained by computing
the individual bits of the input weight through Barrington’s theorem, and using such bits
logarithmically many times to control-SWAP the respective input (qu)bits into their desired
positions. Finally, the existence of a polynomial-size quantum circuit using a single ancilla
follows from [16].
State of the art, in both the classical reversible and quantum settings, thus points to an
exponential difference in the gate count between circuits with no ancillae and circuits with
a constant number of ancillae. In this paper, we demonstrate efficient implementations of the
hwb function by ancilla-free reversible and quantum circuits, thereby reducing these exponential
differences to polynomial. Specifically, our reversible ancilla-free circuit requires $O(n^{6.42})$ gates
and our quantum ancilla-free circuit requires $O(n^2)$ gates. These results refute the exponential
hardness belief and remove hwb from the class of hard benchmarks.
We next sketch the main ideas behind our ancilla-free circuits. We begin with the reversible
circuit. Our construction works as follows. First, we show that the n-bit hwb function can be
decomposed into a product of $O(n \log(n))$ gates denoted $C_5(f(x);B)$, where f(x) is a symmetric
Boolean function and $B \subset x$ is a subset of 5 input bits. The gate $C_5(f;B)$ cyclically shifts
the 5-bit register B if f(x)=1, and does nothing when f(x)=0. To implement $C_5(f;B)$, we first
break it down into a product of 6 gates of the form $C_5|_{M_i}(f(x \setminus B);B)$, where $i \in \{1, 2, 3, 4, 5, 6\}$,
each $M_i$ is a fixed set of Boolean 5-tuples, and the f are symmetric Boolean functions. The gate
$C_5|_{M_i}$ restricts the operation of the corresponding gate $C_5$ onto the set $M_i$ and simultaneously
separates the set B of bits being cyclically shifted from the set $x \setminus B$ controlling these shifts. This
allows us to employ Barrington's theorem [15] to implement the gates $C_5|_{M_i}(f(x \setminus B);B)$ in an
ancilla-free fashion by expressing them as polynomial-size branching programs with the input
$x \setminus B$ and computing into B. Each instruction in such a program realizes a permutation of 5-bit
strings controlled by a single bit, and it can thus be mapped into a reversible circuit over 6 = 5+1
wires.
Next we introduce our quantum ancilla-free circuit. Let $U_{\mathrm{hwb}}$ be the n-qubit unitary operator
implementing the hwb function. By definition, $U_{\mathrm{hwb}}|x\rangle = C^{x_1+x_2+\dots+x_n}|x\rangle$, where C is the
cyclic shift of n qubits. Suppose we can find an n-qubit Hamiltonian H such that $C = e^{iH}$ and
H commutes with the Hamming weight operator $W = \sum_{j=1}^{n} |1\rangle\langle 1|_j$. Then $U_{\mathrm{hwb}} = e^{iHW}$. Thus
it suffices to construct a quantum circuit simulating the time evolution under the Hamiltonian
HW. Since the cyclic shift C is analogous to the translation operator for a particle moving on a
circle, the Hamiltonian H generating the cyclic shift C is analogous to the particle's momentum
operator. This observation suggests that H can be diagonalized by a suitable Fourier transform.
We formalize this intuition using the language of fermions and the fermionic Fourier transform,
which is routinely used in Physics and quantum simulation algorithms [17, 18]. The desired
Hamiltonian H such that $C = e^{iH}$ is shown to have the form $H = V^\dagger H' V$, where V is a (modified)
fermionic Fourier transform and $H'$ is a simple diagonal Hamiltonian. We also show that
V commutes with the Hamming weight operator W, so that $U_{\mathrm{hwb}} = e^{iHW} = V^\dagger e^{iH'W} V$. We
demonstrate that each layer in this decomposition of $U_{\mathrm{hwb}}$ can be implemented by a quantum
circuit of size $O(n^2)$.
The rest of the paper is organized as follows. Section 2 introduces a simple modification of
the known $O(n \log^2(n))$-gate $O(\log(n))$-ancilla reversible circuit that requires $O(n \log(n))$ gates
and $O(\log(n))$ ancillary bits. Section 3 describes an $O(n^{6.42})$-gate ancilla-free reversible circuit.
Section 4 reports an ancilla-free $O(n^2)$-gate quantum circuit. These sections are independent of
each other and can be read in any order. Appendices A and B prove technical lemmas stated
in Section 4.
2 Reversible circuit of size O(n log n) using ancillas
We start with the description of a modification of the previously reported classical/reversible cir-
cuit that implements hwb with O(n log(n)) gates and O(log(n)) ancillae [14]. Compared to [14],
our circuit features favorable asymptotics. However, it uses twice the computational/ancillary
space.
Similarly to [14], we break down the computation into three stages:
1. Compute the input weight W = x1+x2+ . . .+xn.
2. Apply controlled-SWAP gates to SWAP inputs into their correct position as specified by
the hwb.
3. Restore the value of the ancillary register to $|0\rangle$ by appending the inverse of stage 1.
Note that stage 3 is omitted in [14], allowing a direct comparison to our circuit illustrated in
Fig. 1. The difference between our construction and [14] is how we compute the input weight.
Specifically, we use the same "plus-one" approach to calculate the weight into the ancillary
register; however, we implement the integer increment function differently. Given the input $x_i$,
$1 \le i \le n$, the register $w_1, w_2, \dots, w_{\lfloor\log(i)\rfloor+1}$ into which the input weight is being computed, and
temporary storage $t_1, t_2, \dots, t_{\lfloor\log(i)\rfloor-1}$, "increment by one" works as follows. If i=1, apply
CNOT($x_1$; $w_1$). For i>1:
Figure 1: 10-stage reversible circuit applying the 7-bit hwb to $|x_1x_2x_3x_4x_5x_6x_7w_1t_1w_2w_3\rangle$. Each of
the first 7 CNOT/Toffoli gate stages increments $|w_1w_2w_3\rangle$ by one depending on the value of the input
variable; the next 3 Fredkin gate stages perform the controlled-SWAPs. Vertical red lines separate these
10 stages. Not shown is the garbage uncomputation, which can be performed by appending the inverse
of the weight calculation circuit (the CNOT/Toffoli gate part).
1. if i>3: apply the Toffoli gate to $|x_i, w_1, t_1\rangle$; for j from 2 to $\lfloor\log(i)\rfloor-1$ apply the Toffoli gate
Toffoli($t_{j-1}$, $w_j$; $t_j$);
2. if i=2 or i=3: apply Toffoli($x_i$, $w_1$; $w_2$)CNOT($x_i$; $w_1$);
else apply Toffoli($t_{\lfloor\log(i)\rfloor-1}$, $w_{\lfloor\log(i)\rfloor}$; $w_{\lfloor\log(i)\rfloor+1}$)CNOT($t_{\lfloor\log(i)\rfloor-1}$; $w_{\lfloor\log(i)\rfloor}$);
3. if i>3: for j from $\lfloor\log(i)\rfloor-1$ down to 2 apply the half adder, computed by the circuit
Toffoli($t_{j-1}$, $w_j$; $t_j$)CNOT($t_{j-1}$; $w_j$). Then apply Toffoli($x_i$, $w_1$; $t_1$)CNOT($x_i$; $w_1$).
In our implementation, the t register is used to store the necessary digit shifts (carries). The advertised
asymptotics follow by inspection of the above construction. We furthermore illustrate our circuit in
Fig. 1 for n=7.
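The increment procedure above can be simulated classically. The sketch below is an illustration under stated assumptions: logarithms are base 2, gate products such as Toffoli(...)CNOT(...) are applied left to right in time (Toffoli first), the control in step 2 is taken to be the current input bit $x_i$, and $w_1$ is the least significant weight bit. The helper names (`increment`, `weight_register`) are hypothetical.

```python
from collections import defaultdict
from itertools import product

def toffoli(s, a, b, c):
    s[c] ^= s[a] & s[b]          # target c flips iff both controls are 1

def cnot(s, a, b):
    s[b] ^= s[a]                 # target b flips iff control a is 1

def increment(s, i):
    """Add input bit x_i into the weight register w, using temporaries t."""
    x = ("x", i)
    w = lambda j: ("w", j)
    t = lambda j: ("t", j)
    if i == 1:
        cnot(s, x, w(1))
        return
    L = i.bit_length() - 1       # floor(log2(i))
    if i > 3:                    # step 1: ripple the carries into t
        toffoli(s, x, w(1), t(1))
        for j in range(2, L):
            toffoli(s, t(j - 1), w(j), t(j))
    if i in (2, 3):              # step 2: update the top bits
        toffoli(s, x, w(1), w(2))
        cnot(s, x, w(1))
    else:
        toffoli(s, t(L - 1), w(L), w(L + 1))
        cnot(s, t(L - 1), w(L))
    if i > 3:                    # step 3: uncompute carries, update lower bits
        for j in range(L - 1, 1, -1):
            toffoli(s, t(j - 1), w(j), t(j))
            cnot(s, t(j - 1), w(j))
        toffoli(s, x, w(1), t(1))
        cnot(s, x, w(1))

def weight_register(x_bits):
    """Run stage 1 on x_bits; return (value held in w, whether all t bits are 0)."""
    n = len(x_bits)
    s = defaultdict(int)
    for i, b in enumerate(x_bits, start=1):
        s[("x", i)] = b
    for i in range(1, n + 1):
        increment(s, i)
    W = sum(s[("w", j)] << (j - 1) for j in range(1, n.bit_length() + 1))
    temps_clean = all(v == 0 for (kind, _), v in list(s.items()) if kind == "t")
    return W, temps_clean

print(weight_register((1, 0, 1, 1, 0, 1, 1)))
```

Each call to `increment` uses O(log i) gates, which is where the O(n log(n)) total for stage 1 comes from in this reading.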
3 Ancilla-free reversible circuit of size $O(n^{6.42})$
In this section we show how to construct an ancilla-free classical reversible circuit of size poly(n)
implementing hwb. We focus on n≥ 5, noting that optimal circuits with n up to 4 are already
known.
Let n be the total number of bits, and $x = (x_1, x_2, \dots, x_n) \in \{0,1\}^n$ be the input. In
some discussions where it is convenient, we label these bits by the integers $\{0, 1, \dots, n-1\} = \mathbb{Z}_n$.
Suppose $B \subseteq \mathbb{Z}_n$ is a subset of 5 bits and $f : \{0,1\}^n \to \{0,1\}$ is a symmetric Boolean function
(that is, f(x) depends only on the Hamming weight of x). Define a reversible gate
$$C_5(f;B) : \{0,1\}^n \to \{0,1\}^n,$$
where the output is obtained from the input x by applying the cyclic shift to the register B if
f(x)=1. Otherwise, when f(x)=0, the gate does nothing. Note that, because the symmetric
function f does not depend on the order of the bits, $C_5(f;B)$ is a permutation of the set $\{0,1\}^n$.
Moreover, $C_5(f;B)$ is an even permutation, since it is a product of length-5 cycles and each
length-5 cycle is an even permutation.
Define $C(f; (i_0, i_1, \dots, i_{t-1}))$ to be a reversible gate that applies the cyclic shift of some t bits
defined by the cycle $(i_0, i_1, \dots, i_{t-1})$ (where $i_0, i_1, \dots, i_{t-1} \in \mathbb{Z}_n$ are all distinct) if the symmetric
function f evaluates to one and does nothing otherwise. We call $i_0, i_1, \dots, i_{t-1}$ the targets. We
call a collection of C-type gates a layer when the sets of their targets do not overlap.
Figure 2: Implementation of the 9-bit cyclic shift $C(f; (0, 1, 2, 3, 4, 5, 6, 7, 8))$ using the gates
$C_5(f; (4, 5, 6, 7, 8))$ and $C_5(f; (0, 1, 2, 3, 4))$.
We next construct hwb by first expressing it as a circuit with the C-type gates, then breaking
down the C-type gates into elementary reversible gates and C5-type gates, and finally expressing
the C5-type gates in terms of the elementary reversible gates.
Lemma 1. The n-bit hwb function can be implemented by an ancilla-free circuit with $\lfloor\log(n)\rfloor+1$
layers of C-type gates.
Proof. We will create a circuit with $\lfloor\log(n)\rfloor+1$ layers numbered $k = 0, 1, \dots, \lfloor\log(n)\rfloor$. At layer k, the C
gates take the form $C(f_k; *)$. Select the symmetric functions $f_k$ as follows: let $f_k(x) = 1$ iff the
coefficient of $2^k$ in the binary expansion of the weight $W = x_1+x_2+\dots+x_n$ equals one. Note that
the $f_k$ are symmetric functions since the calculation of the weight does not depend on the order in which the
bits are added. The function hwb can now be expressed as
$$\mathrm{hwb} = C^{2^0}((0, 1, \dots, n-1); f_0)\, C^{2^1}((0, 1, \dots, n-1); f_1) \cdots C^{2^{\lfloor\log(n)\rfloor}}((0, 1, \dots, n-1); f_{\lfloor\log(n)\rfloor}). \tag{1}$$
For any $k = 0, 1, \dots, \lfloor\log(n)\rfloor$, let $g := \mathrm{GCD}(n, 2^k)$ and $C_i := C(f_k; (i,\ i+2^k \bmod n,\ \dots,\ i+(\tfrac{n}{g}-1)2^k \bmod n))$. Then by elementary modular arithmetic,
$$C^{2^k}((0, 1, \dots, n-1); f_k) = C_0 C_1 \cdots C_{g-1},$$
and the targets of any two distinct $C_i$ in this product do not overlap. This shows that each of
the $\lfloor\log(n)\rfloor+1$ factors in Eq. (1) can be written as a layer of C-type gates.
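The decomposition in Lemma 1 can be sanity-checked by brute force. The sketch below is an illustration only: it assumes bits are labeled 0, ..., n-1, that a cycle $(i_0, i_1, \dots)$ moves the bit at position $i_m$ to position $i_{m+1}$, and that logarithms are base 2; the function names are not from the paper.

```python
from math import gcd
from itertools import product

def cycles_of_shift(n, k):
    """Disjoint cycles C_0..C_{g-1} of the shift by 2**k on Z_n, with g = gcd(n, 2**k)."""
    step = 2 ** k
    g = gcd(n, step)
    return [[(i + m * step) % n for m in range(n // g)] for i in range(g)]

def apply_cycle(x, cycle):
    """Move the bit at cycle[m] to cycle[m+1], cyclically."""
    y = list(x)
    for m, src in enumerate(cycle):
        y[cycle[(m + 1) % len(cycle)]] = x[src]
    return y

def hwb_by_layers(x):
    """Apply layer k (shift by 2**k) whenever bit k of the weight is set, i.e. f_k(x) = 1."""
    n, W = len(x), sum(x)
    y = list(x)
    for k in range(n.bit_length()):          # k = 0 .. floor(log2(n))
        if (W >> k) & 1:
            for cyc in cycles_of_shift(n, k):
                y = apply_cycle(y, cyc)
    return tuple(y)

def hwb_direct(x):
    """Cyclic shift to the right by the Hamming weight."""
    n, W = len(x), sum(x)
    return tuple(x[(j - W) % n] for j in range(n))

print(cycles_of_shift(6, 1))   # two disjoint 3-cycles for n = 6, shift by 2
```

The weight W is computed once because every layer is a permutation of bit positions and therefore leaves the weight, and hence every $f_k$, unchanged.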
We next implement each of the $\lfloor\log(n)\rfloor+1$ layers of cyclic shift gates in Lemma 1 as circuits with
O(n) $C_5$-type gates by expressing the cycles $(i_0, i_1, \dots, i_{t-1})$ as products of length-5 cycles. Note
that a length-5 cycle is always an even permutation and $(i_0, i_1, \dots, i_{t-1})$ is an odd permutation
when t is even. It is not possible to implement an odd permutation as a product of even
permutations. However, with one exception, the C-type gates $C_i$ come in pairs (recall that
their number, g, is a power of two) and thus they can usually be paired up to form an even
permutation that can then be decomposed into a product of length-5 cycles. The one exception
is the leftmost gate in Eq. (1), $C(f_0; (0, 1, \dots, n-1))$, when n is even. We handle this case first.
Lemma 2. C(f0; (0, 1, ..., n−1)) can be implemented by a reversible circuit with O(n) elementary
gates.
Figure 3: Implementation of C(f0; (x1, x2, x3, x4, x5, x6, x7)), where f0(x) = x1 ⊕ x2 ⊕ x3 ⊕ x4 ⊕ x5 ⊕
x6 ⊕ x7.
Proof. The Boolean function f0(x) = x1 ⊕ x2 ⊕ ... ⊕ xn can be implemented on the top bit to
control all bit SWAPs on the bottom bits, and it can be implemented on the bottom bit to
control all bit SWAPs on the top bits. The number of controlled-SWAP gates required is n−1,
and the total number of the CNOT gates required to compute/uncompute the control register
is 4(n−1). We illustrate this construction in Fig. 3 for n=7.
Lemma 3. For n≥ 5:
1. for $t \le 4$, pairs of two $C(f; (i_0, i_1, \dots, i_{t-1}))$ gates can be implemented by an ancilla-free
circuit using constantly many gates $C_5(f;B)$;
2. for odd $t > 4$, the $C(f; (i_0, i_1, \dots, i_{t-1}))$ gate can be implemented by an ancilla-free circuit
using O(t) gates $C_5(f;B)$;
3. for even $t > 4$, pairs of $C(f; (i_0, i_1, \dots, i_{t-1}))$ gates can be implemented by an ancilla-free
circuit using O(t) gates $C_5(f;B)$.
Proof. 1. There are three cases to consider: t= 2, t= 3, and t= 4.
t= 2. C(f ; (x1, x2)) and C(f ; (y1, y2)) can be implemented simultaneously by the circuit
C5(f ; (y1, x1, y2, a, x2))C5(f ; (a, y1, x1, y2, x2)). This is equivalent to saying that the fol-
lowing permutation equality holds: (x1, x2)(y1, y2) = (y1, x1, y2, a, x2)(a, y1, x1, y2, x2).
Note that the bit ‘a’ can be found since n≥ 5. We will show only the permutation equalities
in the rest of the proof, since it is trivial to translate those to circuits.
t= 3. To implement a pair of gates C(f ; (x1, x2, x3)) and C(f ; (y1, y2, y3)) rely on the cycle
product equality (x1, x2, x3)(y1, y2, y3) = (x1, y1, x2, y2, y3)(x3, x1, y1, x2, y2).
t= 4. Cycles (x1, x2, x3, x4) and (y1, y2, y3, y4) can be obtained by the equality
(x1, x2, x3, x4)(y1, y2, y3, y4)
= (x1, x2)(x1, x3, x4) · (y1, y2)(y1, y3, y4)
= (x1, x2)(y1, y2) · (x1, x3, x4)(y1, y3, y4),
where the first and second parts require two $C_5$ gates each, as described in the cases t=2 and
t=3, for a total of four $C_5$ gates.
2. The goal is to develop a circuit with C5 gates implementing the gate C(f ; (0, 1, ..., t−1)),
where t is odd. There are two cases to consider, t= 4p+1 and t= 4p+3.
Case 1: t= 4p+1, p ≥ 1. We want to implement the integer permutation given by the cyclic
shift (0, 1, ..., 4p) by the cyclic shifts of length 5. This can be done as follows,
(0, 1, ..., 4p) = (4p−4, 4p−3, 4p−2, 4p−1, 4p)(4p−8, 4p−7, 4p−6, 4p−5, 4p−4) · · · (0, 1, 2, 3, 4).
This decomposition uses p length-5 cycles, resulting in the ability to implement the $C(f; (0, 1, \dots, t-1))$
gate using $p = \frac{t-1}{4}$ gates $C_5(f;B)$. This construction is illustrated in Fig. 2 for n=9.
Case 2: t= 4p+3, p ≥ 1. Use the formula
(0, 1, ..., 4p+2) = (4p, 4p+1, 4p+2) · (0, 1, ..., 4p)
= (4p+2, 4p, 2, 1, 0)(4p+1, 4p+2, 0, 1, 2) · (0, 1, ..., 4p).
Since we already implemented $(0, 1, \dots, 4p)$ with p $C_5$ gates in Case 1 above, this implementation
requires $p + 2 = \frac{t+5}{4}$ $C_5$ gates.
3. The goal is to implement a pair of C(f ; (x1, x2, ..., xt)) and C(f ; (y1, y2, ..., yt)) where t> 4
is even. Write
(x1, x2, ..., xt) · (y1, y2, ..., yt)
= (x1, x2)(x1, x3, x4, ..., xt) · (y1, y2)(y1, y3, y4, ..., yt)
= (x1, x2)(y1, y2) · (x1, x3, x4, ..., xt) · (y1, y3, y4, ..., yt)
Here, (x1, x2)(y1, y2) requires two C5 gates per item 1. case t= 2, and each of (x1, x3, x4, ..., xt)
and (y1, y3, y4, ..., yt) requires O(t) gates per item 2.
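The cycle identities used in this proof can be verified mechanically. The sketch below assumes that a product of cycles is read left to right (the leftmost cycle is applied first) and that a cycle $(a_1, a_2, \dots)$ sends $a_1$ to $a_2$; under this reading all of the identities hold. The helper names are illustrative.

```python
def cycle_to_map(cycle, points):
    """Permutation (as a dict) that sends each cycle entry to the next one, cyclically."""
    m = {p: p for p in points}
    cycle = list(cycle)
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        m[a] = b
    return m

def compose(cycles, points):
    """Product of cycles, applying the leftmost cycle first."""
    m = {p: p for p in points}
    for c in cycles:
        step = cycle_to_map(c, points)
        m = {p: step[m[p]] for p in points}
    return m

def fives(p):
    """Length-5 cycles of Case 1 for t = 4p+1, highest block first."""
    return [tuple(range(4 * j, 4 * j + 5)) for j in range(p - 1, -1, -1)]

# Item 1, t = 2 and t = 3
pts2 = ["x1", "x2", "y1", "y2", "a"]
lhs2 = compose([("x1", "x2"), ("y1", "y2")], pts2)
rhs2 = compose([("y1", "x1", "y2", "a", "x2"), ("a", "y1", "x1", "y2", "x2")], pts2)

pts3 = ["x1", "x2", "x3", "y1", "y2", "y3"]
lhs3 = compose([("x1", "x2", "x3"), ("y1", "y2", "y3")], pts3)
rhs3 = compose([("x1", "y1", "x2", "y2", "y3"), ("x3", "x1", "y1", "x2", "y2")], pts3)

def case1_holds(p):
    pts = list(range(4 * p + 1))
    return compose(fives(p), pts) == compose([tuple(pts)], pts)

def case2_holds(p):
    pts = list(range(4 * p + 3))
    extra = [(4 * p + 2, 4 * p, 2, 1, 0), (4 * p + 1, 4 * p + 2, 0, 1, 2)]
    return compose(extra + fives(p), pts) == compose([tuple(pts)], pts)

print(lhs2 == rhs2, lhs3 == rhs3, case1_holds(2), case2_holds(2))
```

The gate counts match the proof: Case 1 uses p five-cycles and Case 2 uses p + 2.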
Observe how the above proof implies that the number of $C_5$ gates required to implement
each of the $C^{2^k}((0, 1, \dots, n-1); f_k)$ stages in Eq. (1) for $k = 1, 2, \dots, \lfloor\log(n)\rfloor$ is between $\frac{n}{4} + \mathrm{Const}$
and $\frac{n}{2} + \mathrm{Const}$. Thus, per Lemma 2, the total number of elementary and $C_5$ gates required to
implement hwb over n bits is between $\frac{n \log(n)}{4} + O(n)$ and $\frac{n \log(n)}{2} + O(n)$.
We next show how to implement $C_5(f_k;B)$ as a branching program, using Barrington's
theorem [15], by closely following the original proof. In preparation for using Barrington's
theorem, we first remove the dependence of the functions $f_k$ in $C_5(f_k;B)$ on the variables inside
the set B, to allow the desired cyclic shift to be controlled by the values of the n−5 variables outside
the set B itself. To accomplish this, note that $C_5(f_k;B)$ acts trivially on the strings 00000 and
11111; those can be ignored. This leaves 30 5-bit strings that are not fixed by the operation; they can be
partitioned into six disjoint subsets $M_1, M_2, M_3, M_4, M_5$, and $M_6$, with 5 strings each. Every
subset $M_i$ contains the 5 cyclic shifts of some fixed 5-bit string, and is defined as follows:
M1 :={10000, 01000, 00100, 00010, 00001}, (2)
M2 :={01111, 10111, 11011, 11101, 11110},
M3 :={11000, 01100, 00110, 00011, 10001},
M4 :={10100, 01010, 00101, 10010, 01001},
M5 :={00111, 10011, 11001, 11100, 01110},
M6 :={01011, 10101, 11010, 01101, 10110}.
We implement $C_5(f_k;B)$ by performing the cyclic shifts of a single subset $M_i$ at a time.
First, let us introduce some more notation. Given a bit string $x \in \{0,1\}^n$, write x = (y, b),
where $b \in \{0,1\}^5$ is the restriction of x onto the register B and $y \in \{0,1\}^{n-5}$ is the rest of x. Let
$w_i \in \{1, 2, 3, 4\}$ be the Hamming weight of bit strings in $M_i$ (note that all strings in the same
subset $M_i$ have the same weight). Define a Boolean function $f_{k,i} : \{0,1\}^{n-5} \to \{0,1\}$ such that
$f_{k,i}(y) = 1$ iff $2^k$ appears in the binary expansion of $|y| + w_i$. Then
$$f_k(x) = f_k(y, b) = f_{k,i}(y) \quad \text{for any } b \in M_i.$$
Define a gate
$$C_5|_{M_i}(f_k;B) : \{0,1\}^n \to \{0,1\}^n$$
that maps an input x = (y, b) to an output x′ = (y, b′) according to the following rules:
• if $f_{k,i}(y) = 0$ then b′ = b;
• if $f_{k,i}(y) = 1$ and $b \notin M_i$ then b′ = b;
• if $f_{k,i}(y) = 1$ and $b \in M_i$ then $b' \in M_i$ is obtained from b by cyclically shifting the elements
of $M_i$.
By definition, the cyclic shift of bits in the register B can be realized by cyclically shifting
elements of each subset $M_i$ for i = 1, 2, 3, 4, 5, 6. Thus
$$C_5(f_k(x);B) = \prod_{i=1}^{6} C_5|_{M_i}(f_k(y);B). \tag{3}$$
Here the order in the product does not matter because the gates $C_5|_{M_i}(f_k;B)$ pairwise commute.
Note that the dependence of function fk on the variables inside the set B has now been removed,
and we can proceed to implementing C5|Mi(fk;B) as a branching program, and finally mapping
the instructions used by the branching program into reversible gates.
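The partition into the subsets $M_i$ and the product structure of Eq. (3) can be reproduced by enumerating orbits of the 5-bit cyclic shift. This is a small sketch; the shift direction and the names `orbits`, `restricted_shift`, and `product_of_restricted` are assumptions made for illustration, and the control condition $f_{k,i}(y) = 1$ is taken to hold throughout.

```python
from itertools import product

def shift_right(b):
    """Cyclic shift of a bit tuple by one position to the right."""
    return b[-1:] + b[:-1]

def orbits():
    """Orbits of all 5-bit strings under the cyclic shift."""
    seen, result = set(), []
    for b in product((0, 1), repeat=5):
        if b in seen:
            continue
        orbit, c = [], b
        while c not in orbit:
            orbit.append(c)
            c = shift_right(c)
        seen.update(orbit)
        result.append(orbit)
    return result

all_orbits = orbits()
fixed = [o for o in all_orbits if len(o) == 1]   # 00000 and 11111
M = [o for o in all_orbits if len(o) == 5]       # the six subsets M_1..M_6 (in some order)

def restricted_shift(b, orbit):
    """Shift restricted to one orbit: acts only on strings inside that orbit."""
    return shift_right(b) if b in orbit else b

def product_of_restricted(b):
    """Apply the six restricted shifts one after another, as in Eq. (3)."""
    for orbit in M:
        b = restricted_shift(b, orbit)
    return b

print(len(fixed), len(M), sorted({sum(o[0]) for o in M}))
```

Since each orbit is closed under the shift, a string is moved exactly once regardless of the order in which the six restricted shifts are applied, which is the commutation property used in the text.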
Recall some relevant notation used in Barrington’s paper [15]. Let S5 be the group of
permutations of 5 numbers, {1, 2, 3, 4, 5}. Given a 5-tuple of distinct integers a1, a2, a3, a4,
and a5, we write (a1, a2, a3, a4, a5) to denote the 5-cycle. Let e be the identity permutation.
A branching program of length L with m Boolean input variables y1, y2, ..., ym is a list of
instructions 〈yi, σi, τi〉 with i = 1, 2, ..., L and σi, τi ∈ S5, such that σi is applied if yi=1, and
τi is executed when yi=0. Given a permutation σ ∈ S5, the branching program is said to σ-
compute a Boolean function f(y) if executing the list of all instructions in the program results
in e (the identity permutation) for all inputs y such that f(y)=0 and permutation σ for all
inputs y such that f(y)=1.
Barrington's theorem asserts that any function in the class $\mathrm{NC}^1$ can be (1, 2, 3, 4, 5)-computed
by a branching program of polynomial size [15]. We next specialize the proof of the theorem to
explicitly develop a short branching program that (1, 2, 3, 4, 5)-computes the Boolean function
$f_{k,i}(y)$. Recall that $f_{k,i}(y) = 1$ iff $2^k$ appears in the binary expansion of $y_1+y_2+\dots+y_{n-5}+w_i$
with $w_i \in \{1, 2, 3, 4\}$ being the weight of bit strings in $M_i$. It suffices to develop a branching
program computing the Boolean function $f_k(y)$ with $y \in \{0,1\}^m$ and m=n−5 by appending
at most two constant binary variables 1 encoding $w_i$ to the bit string y.
While the original proof [15] explored the mapping of logarithmic-depth classical circuits
over the {AND, OR} library, we focus on classical circuits over the 3-input 1-output gates
$\mathrm{MAJ}(a, b, c) := ab \oplus bc \oplus ac$ and $\mathrm{XOR}(a, b, c) := a \oplus b \oplus c$. Recall that the library {MAJ, XOR} is universal
for classical computations if constant inputs are allowed.
Lemma 4. Suppose y is an m-bit string and $f_k(y)$ is the k-th bit in the binary representation of
$W = y_1+y_2+\dots+y_m$. The function $f_k(y)$ can be (1, 2, 3, 4, 5)-computed by a branching program
of size $O(m^{5.42})$.
Proof. First, we describe a logarithmic-depth classical circuit that computes functions fk(y) for
the range of applicable values k, and second, report expressions for MAJ and XOR in the form
of a branching program that can be used in the recursion [15, Proof of Theorem 1]. The length
of the branching program computing fk(y) is upper bounded by taking the maximal length of
the program implementing MAJ or XOR to the power of the circuit depth.
First, construct a classical circuit with MAJ and XOR gates that implements $f_k(y)$. To do
so, we develop a circuit that computes all bits of W(y), and for the purpose of implementing
a given single Boolean component, discard all gates that compute the bits we are not interested
in. Such an operation does not increase the depth of the circuit, and may, in fact, decrease it
slightly.
To find W(y), we employ a circuit consisting of two stages. First, compose a circuit of depth
$\log_{3/2}(m)+O(1)$ with 3-input 2-output Full Adder gates $\mathrm{FA}(a, b, c) := (\mathrm{MAJ}(a, b, c), \mathrm{XOR}(a, b, c))$
by grouping as many triples of digits of the same significance at each step as possible (note that
MAJ and XOR are implemented in parallel). We finish this first stage when the output contains
two log(m)-digit integer numbers u and v such that W = u + v. To analyze this circuit, it is
convenient to group all bits needing to be added into the smallest set of integer numbers, and
count the reduction in the number of integers left to be added by treating layers of FA gates
as Carry-Save Adders [19, 20]. A Carry-Save Adder is defined as the 3-integer into 2-integer
adder, which is implemented by applying the Full Adders to the individual components of the
three integer numbers at the input. Since the number of integers left to be added changes by
a factor of $\frac{2}{3}$ at each step, and every step is implemented by a depth-1 MAJ/XOR circuit, the
depth of the first stage is $\log_{3/2}(m) + O(1)$. To find the individual components of W(y), the
second stage adds the two log(m)-digit integer numbers u and v. This can be accomplished by any
logarithmic-depth integer addition circuit in depth $O(\log\log(m))$, such as [21]. The total depth
is thus $\log_{3/2}(m) + O(\log\log(m))$.
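The first (carry-save) stage can be illustrated with a short simulation. The sketch below greedily groups triples within each column of equal significance, which is one possible reading of the grouping rule in the text and not necessarily the exact circuit analyzed there; the function name `csa_reduce` is hypothetical.

```python
from collections import defaultdict

def csa_reduce(bits):
    """Reduce m weight-1 bits with layers of full adders until every column
    (set of bits of significance 2**s) holds at most two bits.
    Returns (columns, depth), where depth counts the full-adder layers used."""
    columns = {0: list(bits)}
    depth = 0
    while any(len(col) > 2 for col in columns.values()):
        new_cols = defaultdict(list)
        for s, col in columns.items():
            n3 = len(col) // 3 * 3
            for j in range(0, n3, 3):
                a, b, c = col[j:j + 3]
                new_cols[s].append(a ^ b ^ c)                        # XOR stays in column s
                new_cols[s + 1].append((a & b) ^ (b & c) ^ (a & c))  # MAJ carries to column s+1
            new_cols[s].extend(col[n3:])                             # leftover bits pass through
        columns = new_cols
        depth += 1
    return columns, depth

def value(columns):
    """Integer represented by the columns; equals u + v after the reduction."""
    return sum(bit << s for s, col in columns.items() for bit in col)

cols, depth = csa_reduce([1] * 100)
print(value(cols), depth)
```

Each full adder preserves the represented value because a + b + c = XOR(a, b, c) + 2 MAJ(a, b, c), so the two remaining numbers always sum to the Hamming weight.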
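Next, construct $S_5$-programs computing the MAJ and XOR functions:
$$\begin{aligned}
&\langle z_1, (1,4,3,2,5), e\rangle\ \langle z_2, (1,3,5,4,2), e\rangle\ \langle z_3, (1,2,5,3,4), e\rangle\ \langle z_1, (1,2,3,4,5), e\rangle\\
&\langle z_2, (1,2,4,5,3), e\rangle\ \langle z_3, (1,4,3,5,2), e\rangle\ \langle z_1, (1,5,4,3,2), e\rangle\ \langle z_1, (1,5,2,3,4), e\rangle\\
&\qquad = \begin{cases} e & \text{if } \mathrm{MAJ}(z_1, z_2, z_3) = 0,\\ (1,2,3,4,5) & \text{if } \mathrm{MAJ}(z_1, z_2, z_3) = 1, \end{cases}
\end{aligned} \tag{4}$$
$$\begin{aligned}
&\langle z_2, (1,2,3,5,4), e\rangle\ \langle z_3, (1,2,4,5,3), e\rangle\ \langle z_2, (1,3,5,4,2), e\rangle\ \langle z_3, (1,4,5,3,2), e\rangle\\
&\langle z_1, (1,2,3,4,5), e\rangle\ \langle z_2, (1,3,4,2,5), e\rangle\ \langle z_2, (1,3,2,4,5), e\rangle\ \langle z_3, (1,3,4,2,5), e\rangle\\
&\langle z_3, (1,3,2,4,5), e\rangle\\
&\qquad = \begin{cases} e & \text{if } \mathrm{XOR}(z_1, z_2, z_3) = 0,\\ (1,2,3,4,5) & \text{if } \mathrm{XOR}(z_1, z_2, z_3) = 1. \end{cases}
\end{aligned} \tag{5}$$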
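Both programs can be evaluated by brute force over the eight inputs. The sketch below assumes that instructions are executed left to right, that the corresponding permutations are composed in that order (the earlier permutation acts first), and that a cycle $(a_1, \dots, a_5)$ sends $a_1$ to $a_2$; the Python names are illustrative.

```python
from itertools import product

E = (1, 2, 3, 4, 5)  # identity as an image tuple: E[j-1] is the image of j

def cycle(*c):
    """Image tuple of the 5-cycle (c1, c2, c3, c4, c5)."""
    img = list(E)
    for a, b in zip(c, c[1:] + c[:1]):
        img[a - 1] = b
    return tuple(img)

def then(p, q):
    """Permutation obtained by applying p first and q second."""
    return tuple(q[p[j] - 1] for j in range(5))

def run(program, z):
    """Execute instructions <z_var, sigma, e>: apply sigma when z_var = 1, else identity."""
    acc = E
    for var, sigma in program:
        if z[var - 1]:
            acc = then(acc, cycle(*sigma))
    return acc

MAJ_PROGRAM = [(1, (1, 4, 3, 2, 5)), (2, (1, 3, 5, 4, 2)), (3, (1, 2, 5, 3, 4)),
               (1, (1, 2, 3, 4, 5)), (2, (1, 2, 4, 5, 3)), (3, (1, 4, 3, 5, 2)),
               (1, (1, 5, 4, 3, 2)), (1, (1, 5, 2, 3, 4))]
XOR_PROGRAM = [(2, (1, 2, 3, 5, 4)), (3, (1, 2, 4, 5, 3)), (2, (1, 3, 5, 4, 2)),
               (3, (1, 4, 5, 3, 2)), (1, (1, 2, 3, 4, 5)), (2, (1, 3, 4, 2, 5)),
               (2, (1, 3, 2, 4, 5)), (3, (1, 3, 4, 2, 5)), (3, (1, 3, 2, 4, 5))]

TARGET = cycle(1, 2, 3, 4, 5)
for z in product((0, 1), repeat=3):
    print(z, run(MAJ_PROGRAM, z) == TARGET, run(XOR_PROGRAM, z) == TARGET)
```

The longer of the two programs has 9 instructions, which is the base of the exponential in the length bound that follows.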
The branching program that (1, 2, 3, 4, 5)-computes $f_k(y)$ is created by recursively replacing
the gates MAJ and XOR in the circuit constructed above with the branching programs Eq. (4)
and Eq. (5), where each $z_i$ is either one of the primary input variables $y_1, y_2, \dots, y_m$ or one of
the intermediate variables in the circuit computing $f_k(y)$, until all instructions are controlled
by constants and primary variables $y_1, y_2, \dots, y_m$. The recoding of branches of the program
τ-computing a desired intermediate variable $z_*$ when $\tau \neq (1, 2, 3, 4, 5)$ (note how Eq. (4) and
Eq. (5) (1, 2, 3, 4, 5)-compute the gates, but do not τ-compute them for arbitrary τ) is accomplished
in accordance with [15, Lemma 1]. The total length of the branching program is thus upper
bounded by the size of the longest branching program implementation of the basic gates used (MAJ
and XOR) raised to the power of the depth of the circuit it encodes,
$$9^{\log_{3/2}(m)+O(\log\log(m))} = O\left(m^{5.4190225\ldots} \log(m)^{O(1)}\right) = O(m^{5.42}).$$
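For reference, the exponent in this bound follows from a change of base; the short derivation below only spells out that arithmetic and adds no new assumption beyond the bound itself.

```latex
9^{\log_{3/2} m} = \left( (3/2)^{\log_{3/2} 9} \right)^{\log_{3/2} m} = m^{\log_{3/2} 9},
\qquad
\log_{3/2} 9 = \frac{2 \ln 3}{\ln 3 - \ln 2} \approx 5.419,
\qquad
9^{O(\log\log m)} = (\log m)^{O(1)}.
```

The polylogarithmic factor is absorbed by rounding the exponent up from 5.419... to 5.42.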
We conclude this section by summarizing the main result in a Theorem.
Theorem 1. The n-bit hwb function can be implemented by an ancilla-free reversible circuit
of size $O(n^{6.42})$.
Proof. First, implement each instruction $\langle z_*, (a_1, a_2, a_3, a_4, a_5), e\rangle$, where $z_*$ is either a primary
variable or a constant and the sets $\{a_1, a_2, a_3, a_4, a_5\}$ are defined per Eq. (2), using constantly
many basic reversible gates. This can be accomplished by employing a reversible logic synthesis
algorithm, e.g., [9]. Next, use Lemma 4 with m = n−5 and $x = y \sqcup B$ to implement all necessary
$C_5|_{M_i}(f_k(y);B)$ gates, using a branching program with
$$O\left(9^{\log_{3/2}(n-5) + O(\log\log(n-5))}\right) = O\left(9^{\log_{3/2}(n) + O(\log\log(n))}\right)$$
instructions. Each such branching program requires $O(9^{\log_{3/2}(n)+O(\log\log(n))})$ basic reversible
gates since every instruction requires constantly many basic reversible gates. Use six
$C_5|_{M_i}(f_k(y);B)$ gates to implement one $C_5(f_k(x);B)$ gate, using Eq. (3). Each $C_5(f_k(x);B)$
thus costs $O(9^{\log_{3/2}(n)+O(\log\log(n))})$ basic reversible gates. Combine Lemma 1, Lemma 2, and
Lemma 3 to implement hwb using $O(n \log(n))$ $C_5(f_k(x);B)$ gates, implying the total basic
reversible gate count of
$$O\left(9^{\log_{3/2}(n) + O(\log\log(n))} \cdot n \log(n)\right) = O\left(n^{6.4190225\ldots} \log(n)^{O(1)}\right) = O(n^{6.42}).$$
4 Ancilla-free quantum circuit of size $O(n^2)$
Consider a register of n qubits and let C be the cyclic shift operator,
$$C|x_1, x_2, \dots, x_{n-1}, x_n\rangle = |x_2, x_3, \dots, x_n, x_1\rangle.$$
The hidden weighted bit function $U_{\mathrm{hwb}}$ may be written as
$$U_{\mathrm{hwb}}|x\rangle = C^{x_1+x_2+\dots+x_n}|x\rangle \quad \text{for all } x \in \{0,1\}^n. \tag{6}$$
In other words, $U_{\mathrm{hwb}}$ implements the k-th power of C on the subspace with the Hamming weight
k. Here we show that $U_{\mathrm{hwb}}$ can be implemented by an ancilla-free quantum circuit of size
$O(n^2)$. The circuit is expressed using Clifford gates and single-qubit Z-rotations.
Let
$$W = \sum_{j=0}^{n-1} |1\rangle\langle 1|_j \tag{7}$$
be the Hamming weight operator. Our starting point is
Lemma 5. Suppose $C = e^{iH}$ for some n-qubit Hamiltonian H that commutes with W. Then
$$U_{\mathrm{hwb}} = e^{iHW}. \tag{8}$$
Proof. Indeed, let $L_k$ be the subspace spanned by all basis states $|x\rangle$ with the Hamming weight
k. The full Hilbert space of n qubits is the direct sum $L_0 \oplus L_1 \oplus \dots \oplus L_n$. Let us say that
an operator O is block-diagonal if O maps each subspace $L_k$ into itself. Since H commutes
with W, we infer that H is block-diagonal. Therefore HW and $e^{iHW}$ are also block-diagonal.
Note that HW and kH have the same restriction onto $L_k$, since W acts as multiplication by k on $L_k$.
Thus $e^{iHW}$ and $e^{ikH}$ have the same restriction onto $L_k$. By assumption, $e^{iH} = C$. Thus $e^{iHW}$ and $C^k$ have the same restriction onto
$L_k$. Likewise, $U_{\mathrm{hwb}}$ is block-diagonal and the restriction of $U_{\mathrm{hwb}}$ onto $L_k$ is $C^k$. We conclude
that $U_{\mathrm{hwb}}$ and $e^{iHW}$ have the same restriction onto $L_k$ for all k. Since both operators are
block-diagonal, one has $U_{\mathrm{hwb}} = e^{iHW}$.
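We will construct a Hamiltonian H satisfying the conditions of Lemma 5 using the language
of fermions and the fermionic Fourier transform [17, 18]. First, define fermionic creation and
annihilation operators $a_p^\dagger$ and $a_p$ with $p \in \mathbb{Z}_n \equiv \{0, 1, \dots, n-1\}$ as
$$a_p^\dagger = \underbrace{Z \otimes Z \otimes \cdots \otimes Z}_{p} \otimes\, |1\rangle\langle 0| \otimes \underbrace{I \otimes I \otimes \cdots \otimes I}_{n-p-1},$$
$$a_p = \underbrace{Z \otimes Z \otimes \cdots \otimes Z}_{p} \otimes\, |0\rangle\langle 1| \otimes \underbrace{I \otimes I \otimes \cdots \otimes I}_{n-p-1}.$$
Here $Z = |0\rangle\langle 0| - |1\rangle\langle 1|$ is the Pauli-Z operator.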
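Definition 1. A Fermionic Fourier Transform is a unitary n-qubit operator F such that $F|0^n\rangle = |0^n\rangle$ and
$$F a_p F^\dagger = \frac{1}{\sqrt{n}} \sum_{q \in \mathbb{Z}_n} e^{2\pi i p q/n}\, a_q \quad \text{for all } p \in \mathbb{Z}_n. \tag{9}$$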
Note that Eq. (9) uniquely specifies F. Indeed, suppose $x \in \{0,1\}^n$ is a weight-k basis state
with ones at qubits $p_1 < p_2 < \dots < p_k$. Then
$$F|x\rangle = F a_{p_1}^\dagger a_{p_2}^\dagger \cdots a_{p_k}^\dagger |0^n\rangle = F a_{p_1}^\dagger a_{p_2}^\dagger \cdots a_{p_k}^\dagger F^\dagger |0^n\rangle = \prod_{i=1}^{k} F a_{p_i}^\dagger F^\dagger\, |0^n\rangle. \tag{10}$$
Since each operator $F a_{p_i}^\dagger F^\dagger = (F a_{p_i} F^\dagger)^\dagger$ is determined by Eq. (9), this uniquely specifies the
action of F on the basis vectors $|x\rangle$. It will be important that F commutes with the Hamming
weight operator W,
$$FW = WF. \tag{11}$$
Indeed, from Eqs. (9,10) one can see that $F|x\rangle$ is a linear combination of states $a_{q_1}^\dagger a_{q_2}^\dagger \cdots a_{q_k}^\dagger |0^n\rangle$.
Since $(a_q^\dagger)^2 = 0$, the state $a_{q_1}^\dagger a_{q_2}^\dagger \cdots a_{q_k}^\dagger |0^n\rangle$ is non-zero only if all indices $q_1, q_2, \dots, q_k$ are distinct.
Such a state has weight k. Thus F maps weight-k states to linear combinations of weight-k states,
proving Eq. (11).
We will use the following fact established by Kivlichan et al. [18].
Lemma 6. The fermionic Fourier transform F on n qubits can be implemented by a quantum
circuit of size O(n2). The circuit requires no ancillary qubits.
For completeness, we provide a simplified proof of Lemma 6 and an explicit construction of
the quantum circuit realizing F in Appendix A. Now we are ready to define a Hamiltonian H
satisfying the conditions of Lemma 5. Let
$$E = \frac{1}{2}\left(I + Z^{\otimes n}\right)$$
be the projector onto the even-weight subspace. Define n-qubit Hamiltonians
$$H_0 = \frac{2\pi}{n} \sum_{p \in \mathbb{Z}_n} p\, |1\rangle\langle 1|_p, \qquad H' = H_0 + \frac{\pi}{n} W E, \tag{12}$$
and
$$H = V^\dagger H' V \quad \text{where } V = F^\dagger e^{iH_0E/2}. \tag{13}$$
Lemma 7. The Hamiltonian H defined in Eq. (13) satisfies $C = e^{iH}$.
A proof of this lemma is given in Appendix B. A high-level intuition behind the definition of
H comes from the fact that $F H_0 F^\dagger$ is the fermionic momentum operator. Note that $H = F H_0 F^\dagger$
in the odd-weight subspace where E=0. The extra terms in the definition of H are needed
to change integer momenta (periodic boundary conditions) in the odd-weight subspace to
half-integer momenta (anti-periodic boundary conditions) in the even-weight subspace. This
accounts for the difference between the qubit cyclic shift and its fermionic analogue, as detailed
in Appendix B.
From Eq. (11) one can see that HW = WH. Thus H satisfies the conditions of Lemma 5.
Combining Lemma 5, Lemma 7, and noting that VW =WV one arrives at
$$U_{\rm hwb} = e^{iHW} = e^{iV^\dagger H' V W} = V^\dagger e^{iH'W} V = e^{-iH_0E/2} F e^{iH'W} F^\dagger e^{iH_0E/2}. \tag{14}$$
Here we used the well-known fact that $e^{iV^\dagger O V} = V^\dagger e^{iO} V$ for any Hermitian operator $O$ and
any unitary $V$ (which can be verified by expanding the exponent using the Taylor series and
noting that $(V^\dagger O V)^p = V^\dagger O^p V$ for all $p \ge 1$). We claim that each term in Eq. (14) can be
implemented using $O(n^2)$ two-qubit gates without ancillary qubits. By Lemma 6, the layers $F$
and $F^\dagger$ have gate cost $O(n^2)$.
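The conjugation identity used for Eq. (14) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `expi`, the dimension `d`, and the random test matrices are illustrative choices and not part of the construction.

```python
import numpy as np

def expi(h):
    """Return exp(i*h) for a Hermitian matrix h via its eigendecomposition."""
    w, u = np.linalg.eigh(h)
    return (u * np.exp(1j * w)) @ u.conj().T

def conjugation_identity_holds(d=8, seed=0):
    """Compare exp(i V^dag O V) with V^dag exp(i O) V for random O and V."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    O = (a + a.conj().T) / 2                      # random Hermitian operator
    b = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    V, _ = np.linalg.qr(b)                        # random unitary from a QR factorization
    lhs = expi(V.conj().T @ O @ V)
    rhs = V.conj().T @ expi(O) @ V
    return np.allclose(lhs, rhs)

print(conjugation_identity_holds())
```

The eigendecomposition route is used only to keep the sketch free of dependencies beyond NumPy; any matrix exponential routine would serve equally well.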
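For the term $e^{iH_0E/2}$ and its inverse, we have the following lemma.
Lemma 8. The operator $e^{iH_0E/2}$ can be implemented by a quantum circuit of size $O(n)$ without
using ancillary qubits.
Proof. If we set $\theta_p = p\pi/n$, then
$$e^{iH_0E/2} = R_1 R_2 \cdots R_{n-1}, \quad \text{where} \quad R_p = e^{i\theta_p |1\rangle\langle 1|_p E}. \tag{15}$$
The operator $|1\rangle\langle 1|_p E$ projects the subset of qubits $\mathbb{Z}_n \setminus \{p\}$ onto the odd-weight subspace. Note that
$p \neq 0$, and let $C_p$ be a CNOT circuit that computes the parity of $\mathbb{Z}_n \setminus \{p\}$ into the qubit 0,
$$C_p = \prod_{j \in \mathbb{Z}_n \setminus \{0,p\}} \mathrm{CNOT}_{j,0}.$$
Then $|1\rangle\langle 1|_p E = C_p^\dagger\, |11\rangle\langle 11|_{0p}\, C_p$ and thus
$$R_p = C_p^\dagger\, e^{i\theta_p |11\rangle\langle 11|_{0p}}\, C_p.$$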
Therefore, an individual $R_p$ is implemented with $O(n)$ gates, which suggests $e^{iH_0E/2}$ can be
implemented with $O(n^2)$ gates. However, we can improve this count by noting that for $p \neq q$
$$C_p C_q^\dagger = \mathrm{CNOT}_{p,0}\, \mathrm{CNOT}_{q,0}. \tag{16}$$
Thus, in fact, the product in Eq. (15) can be implemented with just $O(n)$ gates.
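The cancellation in Eq. (16) can be illustrated by tracking a computational basis state through the circuit. The sketch below is illustrative only: the function names are hypothetical, the factors $R_p$ are applied in increasing order of $p$ (allowed because they are diagonal and commute), and each two-qubit phase gate is counted as one gate.

```python
import math
from itertools import product

def direct_phase(x, n):
    """Phase of exp(i*H0*E/2) on basis state x: sum_p (p*pi/n)*x_p on even weight, else 0."""
    if sum(x) % 2:
        return 0.0
    return sum(math.pi * p / n * x[p] for p in range(n))

def circuit_phase(x, n):
    """Run the CNOT/phase-gate circuit on basis state x; return (phase, gate count)."""
    bits, phase, gates = list(x), 0.0, 0

    def cnot(j):                      # CNOT with control j and target qubit 0
        nonlocal gates
        bits[0] ^= bits[j]
        gates += 1

    for j in range(2, n):             # C_1: parity of Z_n \ {1} into qubit 0
        cnot(j)
    for p in range(1, n):
        if bits[0] and bits[p]:       # exp(i*theta_p*|11><11|_{0,p})
            phase += math.pi * p / n
        gates += 1
        if p < n - 1:                 # C_{p+1} C_p^dagger = CNOT_{p,0} CNOT_{p+1,0}
            cnot(p)
            cnot(p + 1)
    for j in range(1, n - 1):         # undo C_{n-1}
        cnot(j)
    assert bits == list(x)            # the basis state itself is unchanged
    return phase, gates

n = 5
ok = all(math.isclose(circuit_phase(x, n)[0], direct_phase(x, n), abs_tol=1e-9)
         for x in product((0, 1), repeat=n))
print(ok, circuit_phase((0,) * n, n)[1])
```

With this ordering the gate count is linear in $n$, in line with Lemma 8; other orderings of the commuting factors give the same operator.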
We still need to implement the term $e^{iH'W} = e^{iH_0W} e^{i(\pi/n)W^2E}$. The operator $e^{iH_0W}$ is a
product of $O(n^2)$ rotations $e^{i\theta|11\rangle\langle 11|}$ and $e^{i\theta|1\rangle\langle 1|}$. Although a naive implementation of $e^{i(\pi/n)W^2E}$
requires $O(n^3)$ gates, we next show that a better implementation exists.
Lemma 9. The operator $e^{i(\pi/n)W^2E}$ can be implemented by a quantum circuit of size $O(n^2)$
without using ancillary qubits.
Proof. First, note that
$$W^2E = 2 \sum_{\substack{p,p' \in \mathbb{Z}_n \\ 0<p<p'}} |11\rangle\langle 11|_{pp'} E + 2 \sum_{\substack{p \in \mathbb{Z}_n \\ 0<p}} |11\rangle\langle 11|_{0p} E + \sum_{p \in \mathbb{Z}_n} |1\rangle\langle 1|_p E. \tag{17}$$
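For reference, Eq. (17) follows from expanding the square of the Hamming weight operator. The short derivation below uses the shorthand $n_p := |1\rangle\langle 1|_p$, introduced here only for illustration, and assumes $W = \sum_{p} n_p$.

```latex
\begin{align*}
W^2 &= \Big(\sum_{p \in \mathbb{Z}_n} n_p\Big)^2
     = \sum_{p \in \mathbb{Z}_n} n_p^2 + 2 \sum_{p < p'} n_p n_{p'} \\
    &= \sum_{p \in \mathbb{Z}_n} n_p
     + 2 \sum_{0 < p < p'} n_p n_{p'}
     + 2 \sum_{0 < p} n_0 n_p ,
\end{align*}
% using n_p^2 = n_p and n_p n_{p'} = |11><11|_{pp'} for p \neq p'.
% Multiplying both sides by E on the right reproduces Eq. (17).
```

The pairs containing qubit 0 are separated out because qubit 0 later serves as the target of the parity-computing CNOT circuits.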
The terms in Eq. (17) commute. Therefore, we have, with arbitrary order within the products,
$$e^{i(\pi/n)W^2E} = \prod_{\substack{p,p' \in \mathbb{Z}_n \\ 0<p<p'}} U_{pp'} \prod_{\substack{p \in \mathbb{Z}_n \\ 0<p}} U_{0p} \prod_{p \in \mathbb{Z}_n} U_p, \tag{18}$$
where, for $p<p'$,
$$U_{pp'} = e^{i(2\pi/n)|11\rangle\langle 11|_{pp'}E} \quad \text{and} \quad U_p = e^{i(\pi/n)|1\rangle\langle 1|_p E}.$$
The second and third products in Eq. (18) can be implemented with $O(n)$ gates using
arguments similar to those in Lemma 8. In the rest of this proof we focus on the first product
and show that it can be implemented with $O(n^2)$ gates.
Notice that $|11\rangle\langle 11|_{pp'}E$ projects the subset of qubits $\mathbb{Z}_n \setminus \{p,p'\}$ onto the even-weight
subspace while projecting qubits $p$ and $p'$ to $|11\rangle_{pp'}$. Therefore, if $0<p<p'$, we can define
$S_{pp'} := \mathbb{Z}_n \setminus \{0,p,p'\}$ and
$$C_{pp'} := \prod_{j \in S_{pp'}} \mathrm{CNOT}_{j,0},$$
such that
$$U_{pp'} = C_{pp'}^\dagger\, e^{i(2\pi/n)|011\rangle\langle 011|_{0pp'}}\, C_{pp'}.$$
This implementation of $U_{pp'}$ takes $O(n)$ gates, which suggests $O(n^3)$ gates might be needed to
implement all $n(n-1)/2$ factors in the first product in Eq. (18). However, we can order the
factors in such a way as to allow massive cancellation between consecutive CNOT circuits $C_{pp'}$
and implement the first product with just $O(n^2)$ total gates.
Notice that
$$C_{pp'} C_{qq'}^\dagger = \prod_{j \in S_{pp'} \Delta S_{qq'}} \mathrm{CNOT}_{j,0}$$
is a circuit of at most four CNOT gates. In fact, it is a circuit with just two CNOT gates when
$|\{p,p'\} \cap \{q,q'\}| = 1$. Thus, the following two products can be implemented with $O(n)$ gates:
$$U_{x\uparrow} = U_{x,x+1} U_{x,x+2} \cdots U_{x,n-1},$$
$$U_{y\downarrow} = U_{y,n-1} U_{y,n-2} \cdots U_{y,y+1},$$
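where $x, y \in \mathbb{Z}_n \setminus \{n-1\}$. Hence, the first product in Eq. (18) can be implemented with $O(n^2)$
gates because
$$\prod_{\substack{p,p' \in \mathbb{Z}_n \\ 0<p<p'}} U_{pp'} = U_{1\uparrow} U_{2\downarrow} U_{3\uparrow} \cdots U_{n-2,n-1}.$$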
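The alternating ordering above can be explored with a small basis-state simulation. This is a sketch under stated assumptions: the function names are hypothetical, only the first product in Eq. (18) is simulated, and the cost is measured in CNOT gates (the three-qubit phase gates are not counted).

```python
import math
from itertools import product

def snake_order(n):
    """Pairs (p, q) with 0 < p < q < n; rows alternate ascending and descending in q."""
    order = []
    for p in range(1, n - 1):
        row = [(p, q) for q in range(p + 1, n)]
        order += row if p % 2 else row[::-1]
    return order

def simulate_first_product(x, n):
    """Apply every U_{pq} in snake order to basis state x; return (phase, CNOT count)."""
    bits, phase, cnots = list(x), 0.0, 0
    active = set()                                   # controls of the CNOT circuit currently applied
    for p, q in snake_order(n):
        target = set(range(1, n)) - {p, q}           # S_{pq} = Z_n \ {0, p, q}
        for j in active ^ target:                    # C_{pq} C_{prev}^dagger on the symmetric difference
            bits[0] ^= bits[j]
            cnots += 1
        active = target
        if bits[0] == 0 and bits[p] and bits[q]:     # phase gate on |011>_{0,p,q}
            phase += 2 * math.pi / n
    for j in active:                                 # undo the last CNOT circuit
        bits[0] ^= bits[j]
        cnots += 1
    assert bits == list(x)
    return phase, cnots

def expected_phase(x, n):
    """(2*pi/n) * #{0 < p < q : x_p = x_q = 1} on even-weight states, else 0."""
    if sum(x) % 2:
        return 0.0
    pairs = sum(x[p] * x[q] for p in range(1, n) for q in range(p + 1, n))
    return 2 * math.pi * pairs / n

n = 6
print(len(snake_order(n)), simulate_first_product((1,) * n, n))
```

Consecutive pairs in this ordering always share one index, so each transition costs two CNOT gates and the total CNOT count grows quadratically in $n$.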
The above implementation of $e^{i(\pi/n)W^2E}$ requires three-qubit gates of the form $e^{i\theta|011\rangle\langle 011|}$.
The latter can be decomposed into a sequence of $O(1)$ two-qubit Clifford gates and single-qubit
Z-rotations using the standard methods [22]. We summarize the main result of this section in the
following theorem.
Theorem 2. Eq. (14) reports an ancilla-free quantum circuit of size $O(n^2)$ implementing $U_{\rm hwb}$.
5 Conclusion
In this paper, we introduced two ancilla-free circuits implementing the Hidden Weighted Bit
function: an $O(n^{6.42})$-gate reversible circuit and an $O(n^2)$-gate quantum circuit. Our circuits
replace the best previously known exponential-size reversible and quantum ancilla-free circuits with
polynomial-size ones. Our results demote HWB by removing it from the class of “hard”
benchmarks [11]. Our ancilla-free reversible implementation marks a new point in the study of the ancilla
vs. gate count (space-time) tradeoff. Noting the high exponent in the reversible circuit complexity
and a more-than-cubic difference between the complexities of our best quantum and reversible
circuit implementations, we suggest that a further line of inquiry may target improving the
reversible implementation.
Acknowledgements
SB and TY are partially supported by the IBM Research Frontiers Institute.
References
[1] Randal E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE
Transactions on Computers, 100(8):677–691, 1986.
[2] Christoph Meinel and Thorsten Theobald. Algorithms and Data Structures in VLSI Design:
OBDD-foundations and applications. Springer Science & Business Media, 2012.
[3] Randal E. Bryant. On the complexity of VLSI implementations and graph representations
of Boolean functions with application to integer multiplication. IEEE Transactions on
Computers, (2):205–213, 1991.
[4] Beate Bollig, Martin Löbbing, Martin Sauerhoff, and Ingo Wegener. On the complexity of
the hidden weighted bit function for various BDD models. RAIRO-Theoretical Informatics
and Applications, 33(2):103–115, 1999.
[5] Dmitri Maslov, Gerhard W. Dueck, and D. Michael Miller. Toffoli network synthesis with
templates. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
24(6):807–817, 2005.
[6] Aditya K. Prasad, Vivek V. Shende, Igor L. Markov, John P. Hayes, and Ketan N. Patel.
Data structures and algorithms for simplifying reversible circuits. ACM Journal on
Emerging Technologies in Computing Systems (JETC), 2(4):277–293, 2006.
[7] Dmitri Maslov, Gerhard W. Dueck, and D. Michael Miller. Techniques for the synthesis
of reversible Toffoli networks. ACM Transactions on Design Automation of Electronic
Systems (TODAES), 12(4):42.1–28, 2007.
[8] James Donald and Niraj K. Jha. Reversible logic synthesis with Fredkin and Peres gates.
ACM Journal on Emerging Technologies in Computing Systems (JETC), 4(1):1–19, 2008.
[9] Mehdi Saeedi, Morteza Saheb Zamani, Mehdi Sedighi, and Zahra Sasanian. Reversible
circuit synthesis using a cycle-based approach. ACM Journal on Emerging Technologies in
Computing Systems (JETC), 6(4):1–26, 2010.
[10] Dmitri Maslov, Gerhard W. Dueck, and D. Michael Miller. Synthesis of Fredkin-Toffoli
reversible networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
13(6):765–769, 2005.
[11] Mehdi Saeedi and Igor L. Markov. Synthesis and optimization of reversible circuits - a survey.
ACM Computing Surveys (CSUR), 45(2):1–34, 2013.
[12] Dmitry V. Zakablukov. Application of permutation group theory in reversible logic
synthesis. arXiv:1507.04309, 2015.
[13] Vivek V. Shende, Stephen S. Bullock, and Igor L. Markov. Synthesis of quantum-logic
circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
25(6):1000–1010, 2006.
[14] Dmitri Maslov. Reversible logic synthesis benchmarks page.
https://webhome.cs.uvic.ca/~dmaslov/hwbpoly.html, 2005.
[15] David A. Barrington. Bounded-width polynomial-size branching programs recognize
exactly those languages in NC1. Journal of Computer and System Sciences, 38(1):150–164,
1989.
[16] Farid Ablayev, Aida Gainutdinova, Marek Karpinski, Cristopher Moore, and Christopher
Pollett. On the computational power of probabilistic and quantum branching program.
Information and Computation, 203(2):145–162, 2005.
[17] Ryan Babbush, Nathan Wiebe, Jarrod McClean, James McClain, Hartmut Neven, and
Garnet Kin Chan. Low depth quantum simulation of electronic structure. Physical Review
X, 8:011044, 2018.
[18] Ian D. Kivlichan, Jarrod McClean, Nathan Wiebe, Craig Gidney, Alán Aspuru-Guzik,
Garnet Kin-Lic Chan, and Ryan Babbush. Quantum simulation of electronic structure
with linear depth and connectivity. Physical Review Letters, 120(11):110501, 2018.
[19] Algirdas Avizienis. Signed-digit number representations for fast parallel arithmetic. IRE
Transactions on Electronic Computers, (3):389–400, 1961.
[20] Ingo Wegener. The complexity of Boolean functions. BG Teubner, 1987.
[21] Valerii M. Krapchenko. Asymptotic estimation of addition time of parallel adder. Syst.
Theory Res., 19:105–122, 1970.
[22] Michael A. Nielsen and Isaac Chuang. Quantum computation and quantum information,
2002.
[23] Frank Verstraete, J. Ignacio Cirac, and José I. Latorre. Quantum circuits for strongly
correlated quantum systems. Physical Review A, 79(3):032316, 2009.
Appendix A
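In this Appendix we construct a quantum circuit implementing the fermionic Fourier transform
$F$ on $n$ qubits and illustrate it for $n=3$. The circuit is expressed using $O(n^2)$ single-qubit and
two-qubit gates
$$S(\gamma) = e^{i\gamma|1\rangle\langle 1|} = \begin{bmatrix} 1 & 0 \\ 0 & e^{i\gamma} \end{bmatrix}$$
and
$$R(\alpha,\beta) = e^{\alpha e^{i\beta}|10\rangle\langle 01| - \alpha e^{-i\beta}|01\rangle\langle 10|} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(\alpha) & -e^{-i\beta}\sin(\alpha) & 0 \\ 0 & e^{i\beta}\sin(\alpha) & \cos(\alpha) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$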
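Here $\alpha, \beta, \gamma$ are real parameters. We use subscripts $p, q \in \mathbb{Z}_n$ to indicate qubits acted upon by
each gate. In the fermionic language, $R_{p,p+1}(\alpha,\beta)$ implements a Givens rotation in the
two-dimensional subspace spanned by operators $a_p$ and $a_{p+1}$. Namely, let $R_{p,p+1} = R_{p,p+1}(\alpha,\beta)$.
Then
$$R_{p,p+1}\, a_p\, R_{p,p+1}^\dagger = \cos(\alpha)\, a_p - \sin(\alpha) e^{i\beta} a_{p+1}, \tag{19}$$
$$R_{p,p+1}\, a_{p+1}\, R_{p,p+1}^\dagger = \sin(\alpha) e^{-i\beta} a_p + \cos(\alpha)\, a_{p+1}. \tag{20}$$
We also need a fermionic SWAP gate [18, 23] defined as
$$\mathrm{fSWAP} = \mathrm{CZ} \cdot \mathrm{SWAP} = R(\pi/2, \pi/2)\, S(-\pi/2)^{\otimes 2}.$$
One can easily check that
$$(\mathrm{fSWAP}_{p,p+1})\, a_p\, (\mathrm{fSWAP}_{p,p+1})^\dagger = a_{p+1} \quad \text{and} \quad (\mathrm{fSWAP}_{p,p+1})\, a_{p+1}\, (\mathrm{fSWAP}_{p,p+1})^\dagger = a_p.$$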
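The gate definitions above can be sanity-checked with small matrices. The sketch below assumes NumPy and the basis ordering $|00\rangle, |01\rangle, |10\rangle, |11\rangle$; the function names are illustrative, and the matrix exponential is computed through an eigendecomposition rather than a library routine.

```python
import numpy as np

def S(gamma):
    """Single-qubit phase gate exp(i*gamma*|1><1|)."""
    return np.diag([1, np.exp(1j * gamma)])

def R(alpha, beta):
    """Two-qubit gate R(alpha, beta) in its explicit matrix form."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1, 0, 0, 0],
                     [0, c, -np.exp(-1j * beta) * s, 0],
                     [0, np.exp(1j * beta) * s, c, 0],
                     [0, 0, 0, 1]], dtype=complex)

def R_from_generator(alpha, beta):
    """exp(A) with A = alpha e^{i beta}|10><01| - alpha e^{-i beta}|01><10|."""
    A = np.zeros((4, 4), dtype=complex)
    A[2, 1] = alpha * np.exp(1j * beta)       # |10><01|
    A[1, 2] = -alpha * np.exp(-1j * beta)     # |01><10|
    w, u = np.linalg.eigh(-1j * A)            # A is anti-Hermitian, so -iA is Hermitian
    return (u * np.exp(1j * w)) @ u.conj().T

CZ = np.diag([1, 1, 1, -1]).astype(complex)
SWAP = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=complex)
fSWAP = CZ @ SWAP

print(np.allclose(R(0.3, 1.1), R_from_generator(0.3, 1.1)))
print(np.allclose(fSWAP, R(np.pi / 2, np.pi / 2) @ np.kron(S(-np.pi / 2), S(-np.pi / 2))))
```

The same helpers also confirm that $R(\alpha,\beta)$ and $R(-\alpha,\beta)$ are mutually inverse, which is convenient when inverting a circuit built from these gates.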
Define a unitary $n \times n$ matrix $f$ with matrix elements
$$f_{p,q} = n^{-1/2} e^{2\pi i pq/n}, \quad \text{where } p, q \in \mathbb{Z}_n. \tag{21}$$
We will write row$(f, p)$ for the $p$-th row of $f$. Below we define a function ColumnReduce$(f, m, U)$
that takes as input a unitary $n \times n$ matrix $f$, an integer $m \in \mathbb{Z}_n$, and a quantum circuit $U$ acting
on $n$ qubits. The function returns a modified unitary matrix $f'$ and a modified quantum circuit
$U'$. A quantum circuit realizing the fermionic Fourier transform $F$ on $n$ qubits is generated by
the following algorithm.
Algorithm 1 FermionicFourierTransform
1: Let f be the n × n unitary matrix defined in Eq. (21)
2: U ← I                                  ▷ Empty quantum circuit
3: for m = n − 1 down to 0 do
4:     (f, U) ← ColumnReduce(f, m, U)
5: end for
6: return F = U^{−1}
Algorithm 2 ColumnReduce(f, m, U)
1: for p = 0 to m − 1 do
2:     if f_{p,m} ≠ 0 or f_{p+1,m} ≠ 0 then
3:         if f_{p+1,m} = 0 then
4:             Swap row(f, p) and row(f, p + 1)
5:             U ← fSWAP_{p,p+1} · U          ▷ Add fSWAP gate
6:         end if                             ▷ Now f_{p+1,m} ≠ 0
7:         Choose angles α, β such that tan(α) e^{−iβ} = −f_{p,m}/f_{p+1,m}
8:         v ← row(f, p)
9:         row(f, p) ← cos(α) row(f, p) + sin(α) e^{−iβ} row(f, p + 1)      ▷ Now f_{p,m} = 0
10:        row(f, p + 1) ← cos(α) row(f, p + 1) − sin(α) e^{iβ} v
11:        U ← R_{p,p+1}(α, β) · U            ▷ Add R gate
12:    end if
13: end for                                   ▷ Now f_{m,m} is the only nonzero in the m-th column of f
14: γ ← phase(f_{m,m})                        ▷ Now f_{m,m} = e^{iγ}
15: f_{m,m} ← 1
16: U ← S_m(γ) · U                            ▷ Add S gate
17: return (f, U)
We claim that the quantum circuit $U$ and the unitary matrix $f$ obtained after each call to
the function ColumnReduce have the property
$$(UF)\, a_p\, (UF)^\dagger = \sum_{q \in \mathbb{Z}_n} f_{p,q}\, a_q \quad \text{for all } p \in \mathbb{Z}_n. \tag{22}$$
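Indeed, Eq. (22) is trivially true initially when $U=I$ and $f$ is defined by Eq. (21). Lines 4
and 7-10 of Algorithm 2 apply a sequence of Givens rotations to the matrix $f$, setting to zero all
matrix elements $f_{p,m}$ with $0 \le p < m$ and setting $f_{m,m}=1$. The order in which matrix elements
of $f$ are set to 0 or 1 is illustrated for $n=3$ below (asterisks indicate matrix elements of $f$).
$$\begin{bmatrix} * & * & * \\ * & * & * \\ * & * & * \end{bmatrix} \to \begin{bmatrix} * & * & 0 \\ * & * & * \\ * & * & * \end{bmatrix} \to \begin{bmatrix} * & * & 0 \\ * & * & 0 \\ * & * & * \end{bmatrix} \to \begin{bmatrix} * & * & 0 \\ * & * & 0 \\ * & * & 1 \end{bmatrix} \to \begin{bmatrix} * & 0 & 0 \\ * & * & 0 \\ * & * & 1 \end{bmatrix} \to \begin{bmatrix} * & 0 & 0 \\ * & 1 & 0 \\ * & * & 1 \end{bmatrix} \to \begin{bmatrix} 1 & 0 & 0 \\ * & 1 & 0 \\ * & * & 1 \end{bmatrix}$$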
Since $f$ remains unitary at each step, the final unit-diagonal lower-triangular matrix is the identity,
i.e. $f=I$ after the last iteration of Algorithm 1. Each time a Givens rotation is applied to some
rows $p, p+1$ of the matrix $f$, the corresponding Givens rotation of fermionic operators $a_p, a_{p+1}$
is added to the quantum circuit $U$, see Eqs. (19,20). More precisely, the angles $\alpha, \beta$ at Line 7
are chosen such that the operator
$$R_{p,p+1}(\alpha,\beta)\, (f_{p,m} a_p + f_{p+1,m} a_{p+1})\, R_{p,p+1}(\alpha,\beta)^\dagger$$
is proportional to $a_{p+1}$, see Eqs. (19,20). Thus the property Eq. (22) is maintained at each step.
After the last iteration of Algorithm 1 one has $f = I$ and Eq. (22) gives $(UF)\, a_p\, (UF)^\dagger = a_p$ for
all $p$. Furthermore $U|0^n\rangle = |0^n\rangle$ since all gates added to $U$ map $|0^n\rangle$ to itself. We conclude that
$U = F^{-1}$ after the last iteration of Algorithm 1. Thus the algorithm returns a quantum circuit
realizing $F$. The inverse circuit $U^{-1}$ can be obtained from $U$ using the identities $R(\alpha,\beta)^{-1} =
R(-\alpha,\beta)$ and $S(\gamma)^{-1} = S(-\gamma)$. Direct inspection shows that the total number of gates
fSWAP, $R$ and $S$ added to $U$ is $O(n^2)$. We implemented Algorithms 1,2 in Matlab obtaining the
following circuit in the case $n=3$.
[Figure 4: circuit diagram composed of $S$ and $R$ gates.]
Figure 4: Quantum circuit realizing the 3-qubit fermionic Fourier transform $F$. The circuit was
generated using Algorithm 1. Here $\alpha = -(1/2)\arccos(-1/3) \approx -0.9553$.
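Algorithms 1 and 2 can be sketched in Python as follows. This is an illustrative transcription, not the Matlab implementation mentioned above: the function names, the gate-list encoding (tuples tagged "fSWAP", "R", "S"), the tolerance `tol`, and the choice $\alpha = \arctan|t|$, $\beta = -\arg t$ at Line 7 are assumptions, so the angles it produces need not coincide with those in Figure 4.

```python
import numpy as np

def column_reduce(f, m, gates, tol=1e-12):
    """Algorithm 2: zero out column m of f above the diagonal using row rotations."""
    for p in range(m):
        if abs(f[p, m]) > tol or abs(f[p + 1, m]) > tol:
            if abs(f[p + 1, m]) <= tol:
                f[[p, p + 1]] = f[[p + 1, p]]             # swap rows p and p+1
                gates.append(("fSWAP", p, p + 1))
            t = -f[p, m] / f[p + 1, m]                    # tan(alpha) * exp(-i*beta)
            alpha, beta = np.arctan(abs(t)), -np.angle(t)
            v = f[p].copy()
            f[p] = np.cos(alpha) * f[p] + np.sin(alpha) * np.exp(-1j * beta) * f[p + 1]
            f[p + 1] = np.cos(alpha) * f[p + 1] - np.sin(alpha) * np.exp(1j * beta) * v
            gates.append(("R", p, p + 1, alpha, beta))
    gamma = np.angle(f[m, m])                             # f[m, m] = exp(i*gamma)
    f[m, m] = 1.0
    gates.append(("S", m, gamma))
    return f, gates

def fermionic_fourier_gates(n):
    """Algorithm 1: return the reduced matrix f and the gate list of U = F^{-1}."""
    k = np.arange(n)
    f = np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)   # Eq. (21)
    gates = []
    for m in range(n - 1, -1, -1):
        f, gates = column_reduce(f, m, gates)
    return f, gates

f, gates = fermionic_fourier_gates(3)
print(np.allclose(f, np.eye(3)), [g[0] for g in gates])
```

The gate list is stored in the order in which gates are added to $U$; the circuit for $F$ is obtained by reversing the list and inverting each gate. The sketch only tracks the $n \times n$ matrix $f$, so it checks the reduction to the identity and the gate count, not the action on the $2^n$-dimensional Hilbert space.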
Appendix B
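Here we prove Lemma 7. First note that
$$C = \mathrm{SWAP}_{0,1}\, \mathrm{SWAP}_{1,2} \cdots \mathrm{SWAP}_{n-2,n-1}.$$
Define a fermionic SWAP operator [18, 23]
$$\mathrm{fSWAP} = \mathrm{CZ} \cdot \mathrm{SWAP} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix} \tag{23}$$
and a fermionic cyclic shift
$$\mathrm{fC} = \mathrm{fSWAP}_{0,1}\, \mathrm{fSWAP}_{1,2} \cdots \mathrm{fSWAP}_{n-2,n-1}. \tag{24}$$
Simple algebra shows that
$$C|x\rangle = (-1)^{x_{n-1}(x_0 + x_1 + \ldots + x_{n-2})}\, \mathrm{fC}|x\rangle. \tag{25}$$
Let $k = x_0 + x_1 + \ldots + x_{n-1}$ be the Hamming weight of $x$. Then
$$(-1)^{x_{n-1}(x_0 + x_1 + \ldots + x_{n-2})} = (-1)^{x_{n-1}(k - x_{n-1})} = (-1)^{x_{n-1}k + x_{n-1}} = (-1)^{x_{n-1}(k+1)}. \tag{26}$$
Thus $C = \mathrm{fC}$ on the odd-weight subspace and $C = \mathrm{fC}\, Z_{n-1}$ on the even-weight subspace, i.e.
$$C = E\, \mathrm{fC}\, Z_{n-1} + (I - E)\, \mathrm{fC}. \tag{27}$$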
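Eqs. (25) and (27) can be verified exhaustively on computational basis states for small $n$. The sketch below represents a signed basis state as a pair (sign, bitstring); the function names are hypothetical.

```python
from itertools import product

def cyclic_shift(x):
    """C|x>: SWAP_{0,1} ... SWAP_{n-2,n-1} moves the last bit to the front."""
    return (x[-1],) + x[:-1]

def fermionic_cyclic_shift(x):
    """fC|x> as (sign, bitstring); each fSWAP = CZ.SWAP swaps, then adds -1 on |11>."""
    bits, sign = list(x), 1
    for p in range(len(bits) - 2, -1, -1):    # the rightmost factor fSWAP_{n-2,n-1} acts first
        bits[p], bits[p + 1] = bits[p + 1], bits[p]
        if bits[p] and bits[p + 1]:
            sign = -sign
    return sign, tuple(bits)

def check_eqs_25_and_27(n):
    """Return True if Eqs. (25) and (27) hold on every n-bit basis state."""
    for x in product((0, 1), repeat=n):
        sign, y = fermionic_cyclic_shift(x)
        if y != cyclic_shift(x):
            return False
        if sign != (-1) ** (x[-1] * sum(x[:-1])):         # Eq. (25)
            return False
        z_sign = (-1) ** x[-1] if sum(x) % 2 == 0 else 1  # Z_{n-1} acts only on even weight
        if sign * z_sign != 1:                            # Eq. (27)
            return False
    return True

print(all(check_eqs_25_and_27(n) for n in range(2, 8)))
```

Because $\mathrm{fC}$ preserves the Hamming weight, applying the projector $E$ after $\mathrm{fC}$ is the same as branching on the parity of the input weight, which is what the sketch does.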
We claim that
$$\mathrm{fC} = F e^{iH_0} F^\dagger. \tag{28}$$
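Indeed, let
$$G \equiv F e^{iH_0} F^\dagger.$$
First note that $\mathrm{fC}|0^n\rangle = G|0^n\rangle = |0^n\rangle$. Since any state can be obtained from $|0^n\rangle$ by applying
the creation operators $a_p^\dagger$, it suffices to check that
$$\mathrm{fC}^\dagger a_p\, \mathrm{fC} = G^\dagger a_p\, G$$
for all $p$. Simple algebra shows that
$$(\mathrm{fSWAP}_{p,p+1})\, a_p\, (\mathrm{fSWAP}_{p,p+1})^\dagger = a_{p+1} \quad \text{and} \quad (\mathrm{fSWAP}_{p,p+1})\, a_{p+1}\, (\mathrm{fSWAP}_{p,p+1})^\dagger = a_p.$$
Combining this and Eq. (24) one gets
$$\mathrm{fC}^\dagger a_p\, \mathrm{fC} = a_{p-1},$$
where the indices of fermionic operators are evaluated modulo $n$. Using the identities
$$e^{-iH_0} a_q e^{iH_0} = e^{2\pi i q/n} a_q,$$
$$F a_p F^\dagger = \frac{1}{\sqrt{n}} \sum_{q \in \mathbb{Z}_n} e^{2\pi i pq/n} a_q, \quad \text{and} \quad F^\dagger a_q F = \frac{1}{\sqrt{n}} \sum_{r \in \mathbb{Z}_n} e^{-2\pi i qr/n} a_r,$$
one gets
$$G^\dagger a_p G = n^{-1/2} \sum_{q \in \mathbb{Z}_n} e^{2\pi i (1-p)q/n} F a_q F^\dagger = n^{-1} \sum_{q,r \in \mathbb{Z}_n} e^{2\pi i (1-p+r)q/n} a_r = a_{p-1}.$$
Thus $G^\dagger a_p G = \mathrm{fC}^\dagger a_p\, \mathrm{fC} = a_{p-1}$, proving Eq. (28).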
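Next we claim that
$$\mathrm{fC}\, Z_{n-1} = e^{-iH_0/2}\, \mathrm{fC}\, e^{iH_0/2} e^{i(\pi/n)W}. \tag{29}$$
Indeed, let
$$L := \mathrm{fC}\, Z_{n-1} \quad \text{and} \quad R := e^{-iH_0/2}\, \mathrm{fC}\, e^{iH_0/2} e^{i(\pi/n)W}.$$
Since $L|0^n\rangle = R|0^n\rangle = |0^n\rangle$, it suffices to check that $L^\dagger a_p L = R^\dagger a_p R$ for all $p \in \mathbb{Z}_n$. Simple
algebra gives
$$e^{-i(\pi/n)W} a_p e^{i(\pi/n)W} = e^{i(\pi/n)} a_p,$$
$$Z_{n-1}\, a_p\, Z_{n-1} = \begin{cases} a_p & \text{if } 0 \le p \le n-2 \\ -a_p & \text{if } p = n-1, \end{cases}$$
$$e^{iH_0/2} a_p e^{-iH_0/2} = e^{-i\pi p/n} a_p, \quad \text{and} \quad e^{-iH_0/2} a_p e^{iH_0/2} = e^{i\pi p/n} a_p$$
for all $p \in \mathbb{Z}_n$. Recall that $\mathrm{fC}^\dagger a_p\, \mathrm{fC} = a_{p-1}$. Using the above identities one gets
$$L^\dagger a_p L = \begin{cases} a_{p-1} & \text{if } 1 \le p \le n-1 \\ -a_{p-1} & \text{if } p = 0 \end{cases}$$
and
$$R^\dagger a_p R = e^{-i\pi p/n} e^{i\pi p'/n} e^{i(\pi/n)} a_{p-1},$$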
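where $p' \equiv p - 1 \pmod{n}$. Note that
$$e^{-i\pi p/n} e^{i\pi p'/n} = \begin{cases} e^{-i\pi/n} & \text{if } 1 \le p \le n-1 \\ -e^{-i\pi/n} & \text{if } p = 0. \end{cases}$$
Thus $L^\dagger a_p L = R^\dagger a_p R$, that is, $L=R$, proving Eq. (29).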
Combining Eqs. (27,28,29) one infers that the restrictions of $C$ onto the odd-weight and
even-weight subspaces coincide with the operators
$$C_{\rm odd} = F e^{iH_0} F^\dagger$$
and
$$C_{\rm even} = e^{-iH_0/2} (F e^{iH_0} F^\dagger) e^{iH_0/2} e^{i(\pi/n)W}$$
respectively. Thus
$$C = e^{-iH_0E/2} (F e^{iH_0} F^\dagger) e^{iH_0E/2} e^{i(\pi/n)WE}$$
on the full Hilbert space. Recall that the fermionic Fourier transform $F$ preserves the Hamming
weight. Thus $F^\dagger$ commutes with $e^{i(\pi/n)WE}$. Commuting the term $e^{i(\pi/n)WE}$ to the left gives
$$C = e^{-iH_0E/2} F \left( e^{iH_0} e^{i(\pi/n)WE} \right) F^\dagger e^{iH_0E/2} = V^\dagger e^{iH'} V,$$
where $V = F^\dagger e^{iH_0E/2}$ and $H' = H_0 + (\pi/n)WE$. Thus $C = e^{iV^\dagger H' V}$, proving Lemma 7.
